Data Retrieving in Data Science (Best Tutorial 2019)


The Retrieve superstep in data science is a practical method for importing a data lake, consisting of various external data sources, completely into the processing ecosystem. The Retrieve superstep is the first contact between your data science and the source systems.

 

I will guide you through a methodology for handling this discovery of the data, up to the point where you have all the data you need to evaluate the system you are working with by deploying your data science skills.

 

Data Lakes


The Pentaho CTO James Dixon is credited with coining the term data lake. If you think of a datamart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. 

 

The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.

 

Since that original blog post, a significant amount of research and commercial activity has made the term famous and even notorious. Unfortunately, the term is so all-encompassing that you could describe any data that is on the edge of the ecosystem and your business as part of your data lake.

 

I have spent hours, and even days, in think tanks and in theoretical and scientific debates on the complexities and nuances of the data lake.

 

So, for the practices and procedures of this blog, I describe the data lake as follows: a company’s data lake covers all data that your business is authorized to process, to attain an improved profitability of your business’s core accomplishments.

 

The data lake is the complete data world your company interacts with during its business lifespan. In simple terms, if you generate data or consume data to perform your business tasks, that data is in your company’s data lake.

 

I will use terms related to a natural lake to explain the specific data lake concepts. So, as a lake needs rivers and streams to feed it, the data lake will consume an unavoidable deluge of data sources from upstream and deliver it to downstream partners.

 

Tip Do not fail to appreciate the worth of an unprocessed data source.

 

I had a business customer who received the entire quarterly shipping manifests of his logistics company on a DVD but merely collected them in a plastic container for five years because he assumed they were superfluous to requirements, as he had the electronic e-mails for each shipment online.

 

Fortunately, one of my interns took the time to load a single DVD and noticed that the complete supply chain routing path and coding procedure were also on the DVD.

 

Adding this new data to the company’s existing data lake enriched it by more than 30% in real-value logistics costs. Do not limit the catchment area of your data lake, as the most lucrative knowledge is usually found by merging data you never before accepted as important.

 

Note There are numerous snippets of code that I use to demonstrate concepts throughout this blog. I am presuming that you have an R- and Python-enabled environment ready to try these examples. Performing the processes enriches knowledge transfer.

 

Data Swamps


Having spoken of data lakes and the importance of loading business data to uncover unknown value, I must also explain the data swamp. I have heard and seen various demonstrations asserting that data lakes are evil.

 

I have a different point of view. Any unmanaged data structure is evil. On the other hand, I am comfortable admitting that I have billed for many hours for repairing structures containing these evil data ecosystems.

 

Data swamps are simply data lakes that are not managed. They are not to be feared. They need to be tamed. Following are my four critical steps to avoid a data swamp.

 

Start with Concrete Business Questions

Simply dumping a hoard of data into a data lake with no tangible purpose in mind creates a big business risk. Countless times, I have seen data-loading agreements that simply state that the business wants to load all the data.

 

My recommendation is to perform a comprehensive analysis of the entire set of data you have and then apply a metadata classification for the data, stating full data lineage for allowing it into the data lake. The data lake must be enabled to collect the data required to answer your business questions.

 

If, for example, you wanted to calculate the shipping patterns for the last five years, you would require all the data for the last five years related to shipping to be available in the data lake.

 

You would also need to start a process of collecting the relevant data in the future, at agreed intervals, to ensure that you could use the same data science in a year’s time, to evaluate the previous five years’ data. The data lineage ensures that you can reference the data sources once you have your results.

 

Tip At a minimum, assert the true source of the data and a full timestamp of when it entered the data lake. The more data lineage you can provide, the better. Ensure that you have accredited sources.

Example:

It is better to say “I have found that your shipping pattern is good, and I have ensured that we have all the data for the last five years, by getting it signed off by our logistics managers as true and complete.” That statement gives your data science a solid foundation and accredited lineage.

 

Last year, I had a junior data scientist try to prove his data science results by convincing a CFO that DVDs of unknown source and date, containing shipping invoices that he found in the archives, were good data lineage for his four weeks of data science.

 

Sadly, I must report that the results document, plus the DVDs, is now part of my “How not to perform data science” presentation!

 

The same data scientist, ten days later, with the same data science algorithms but accredited lineage, presented a major cost saving to my customer.

 

Data Governance

The role of data governance, data access, and data security does not go away as the volume of data in the data lake grows. If not managed, the problem simply compounds into a worse one.

 

Warning Here be dragons! Finding a piece of unmanaged data in a petabyte, or even terabyte, data lake will cost you hours of time.

 

Spend the time on data governance up front, as recovery is not easy. I typically spend 70%+ of the project’s time on governance and classification of data sources. 

 

Data Source Catalog


Metadata that links ingested data to its data sources is a must-have for any data lake. I suggest you note the following as general rules for the data you process; a minimal sketch of a catalog entry follows the list.

  • Unique data catalog number: I normally use YYYYMMDD/NNNNNNNNN/NNN, e.g., 20171230/000000001/001 for data first registered into the metadata registers on December 30, 2017, as data source 1 of data type 1. This is a critical requirement.
  • Short description (keep it under 100 characters): Country codes and country names (Country Codes—ISO 3166)
  • Long description (keep it as complete as possible): Country codes and country names used by VKHC as standard for country entries
  • Contact information for external data source: ISO 3166-1:2013 code lists from ISO 3166 Country Codes
  • Expected frequency: Irregular (i.e., no fixed frequency, also known as ad hoc). Other options are near-real-time, every 5 seconds, every minute, hourly, daily, weekly, monthly, or yearly.
  • Internal business purpose: Validate country codes and names.
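
To make these rules concrete, here is a minimal sketch in R of what a single catalog entry could look like. The field names are my own choice, not a prescribed schema.

# A minimal sketch: one data source catalog entry as a single-row data frame.
# Field names and values are illustrative, not a prescribed standard.
catalog_entry <- data.frame(
  CatalogNumber     = "20171230/000000001/001",
  ShortDescription  = "Country codes and country names (Country Codes - ISO 3166)",
  LongDescription   = "Country codes and country names used by VKHC as standard for country entries",
  ExternalContact   = "ISO 3166-1:2013 code lists from ISO 3166 Country Codes",
  ExpectedFrequency = "Irregular",
  BusinessPurpose   = "Validate country codes and names.",
  stringsAsFactors  = FALSE
)
catalog_entry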

I have found that if your data source catalog is up to date, the rest of the processing is stress-free.

 

Business Glossary

The business glossary maps the data-source fields and classifies them into respective lines of business. This glossary is a must-have for any good data lake.

 

Create a data-mapping registry with the following information (a minimal sketch of a registry row follows the list):

  • Unique data catalog number: I normally use YYYYMMDD/NNNNNNNNN/NNN.
  • Unique data mapping number: I normally use NNNNNNN/NNNNNNNNN. E.g., 0000001/000000001 for field 1 mapped to internal field 1
  • External data source field name: States the field as found in the raw data source
  • External data source field type: Records the full set of the field’s data types when loading the data lake
  • Internal data source field name: Records every internal data field name to use once loaded from the data lake
  • Internal data source field type: Records the full set of the field’s types to use internally once loaded
  • Timestamp of last verification of the data mapping: I normally use YYYYMMDD-HHMMSS-SSS, which supports timestamps down to a thousandth of a second.
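
As a minimal sketch, reusing the illustrative values from the catalog example above, a single data-mapping registry row could be captured in R as follows.

# A minimal sketch: one business glossary mapping row. Values are illustrative.
mapping_entry <- data.frame(
  CatalogNumber     = "20171230/000000001/001",
  MappingNumber     = "0000001/000000001",
  ExternalFieldName = "Post Code",
  ExternalFieldType = "character",
  InternalFieldName = "Post.Code",
  InternalFieldType = "character",
  # Approximates the YYYYMMDD-HHMMSS-SSS convention (fractional seconds after a dot).
  LastVerified      = format(Sys.time(), "%Y%m%d-%H%M%OS3"),
  stringsAsFactors  = FALSE
)
mapping_entry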

 

Note There can be a many-to-many relationship between external and internal mappings. Keep the mappings to a controlled level of consolidation. However, balance the time you spend resolving them against the gain in efficiency.

 

The business glossary records the data sources ready for the retrieve processing to load the data. In several of my customers’ systems, I have performed this classification process using machine learning, with success.

 

The classification bots are simply pointed at the data lake and find any new or changed metadata. Once the newly identified metadata is collected, the human operators only have to deal with the exceptions.

 

I have several bots that can perform these discovery tasks, but I will not spend more time in this blog on how I accomplished this, as, in general, every system requires a custom build for each customer’s ecosystem.

 

I recommend that you invest time automating your data lake, as this allows you more time to perform the stimulating data analytics part of the data science.

 

Analytical Model Usage

Data tagged in the respective analytical models defines the profile of the data that requires loading and guides the data scientist as to what additional processing is required. I perform the following data analytical models on every data set in my data lake by default.

 

Let’s start by loading a sample data set.

Note I will use a sample data set, to show each of the steps of this process. So, you will require either R or Python, to perform some data processing.

 

Disclaimer As I introduce you to many processing R and Python packages in the course of this blog, I will provide you with only a basic introduction to these tools. I suggest that you investigate the other details of these packages, to improve your overall understanding and capabilities.

 

For R, perform the following commands:


Start your RStudio editor.

Note For the rest of this blog, I will use setwd("C:/VKHCG") for Windows and setwd("/home/<youruser>/VKHCG") for Linux ecosystems.

 

Execute the following to import a data loader package: library(readr)

Note See https://cran.r-project.org/web/packages/readr/index.html for more details.

This should load an amazing R data-loading package. So, let’s load the data for the examples for the Vermeulen company.

Let’s load a table named IP_DATA_ALL.csv.

Base=getwd()
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
IP_DATA_ALL <- read_csv(FileName)
You should have successfully loaded a data set with eight columns.
Using the spec(IP_DATA_ALL) command should yield the following results:
cols(
ID = col_integer(),
Country = col_character(),
'Place Name' = col_character(),
'Post Code' = col_character(),
Latitude = col_double(),
Longitude = col_double(),
'First IP Number' = col_integer(),
'Last IP Number' = col_integer()
)

This informs you that you have the following eight columns:

  • The ID of type integer
  • The Country of type character
  • Place name of type character
  • Postcode of type character
  • Latitude of type numeric double
  • Longitude of type numeric double
  • First IP number of type integer
  • Last IP number of type integer

Now you can look at the data by using View(IP_DATA_ALL). You will see a data grid appear with your data.

 

Data Field Name Verification


I use this to validate and verify the data fields’ names during the retrieve processing in an easy manner. To add an extra library, you will have to run the following commands:

 

Warning Ensure that you have the latest package, by performing install.packages('tibble'). If the set_tidy_names() command is not present, note that I use R 3.4.1 and tibble version 1.3.4.

library(tibble)
set_tidy_names(IP_DATA_ALL, syntactic = TRUE, quiet = FALSE)
You will detect that some field names are not easy to use.
New names:
Place Name -> Place.Name
Post Code -> Post.Code
First IP Number -> First.IP.Number
Last IP Number -> Last.IP.Number

This informs you that four of the field names are not valid and suggests new field names that are valid.

 

Tip I always use existing tools that are available, as this saves you the time of verifying that they work as expected. Good data scientists learn from others.

You can fix any detected invalid column names by executing 

IP_DATA_ALL_FIX=set_tidy_names(IP_DATA_ALL, syntactic = TRUE, quiet = TRUE)

By using command View(IP_DATA_ALL_FIX), you can check that you have fixed the columns.

 

Well done! You just fixed the data field names. So now, let’s store your progress.

Base=getwd()
FileDir=paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS')
dir.create(FileDir)
FileDir= paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS/01-R')
dir.create(FileDir)
FileName=paste0(FileDir,'/IP_DATA_ALL_FIX.csv')
write.csv(IP_DATA_ALL_FIX, FileName)
You should now have a file named IP_DATA_ALL_FIX.csv.

 

Unique Identifier of Each Data Entry


Allocate a unique identifier within the system that is independent of the given file name. This ensures that the system can handle different files from different paths and keep track of all data entries in an effective manner.
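
As a minimal sketch of such a file-level identifier, assuming you are happy to derive it from the file’s content rather than its name, you could fingerprint the raw file with an MD5 hash from the base tools package.

library(tools)
Base=getwd()
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
# Content hash: the identifier stays the same even if the file is renamed or moved.
FileID=as.character(md5sum(FileName))
FileID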

 

Then allocate a unique identifier for each record or data element in the files that are retrieved.

 

Tip I have found that the extra time spent to perform this task prevents major data referential issues. At scale, the probability of getting similar or even the same data loaded into the data lake more than once is high.

Now, investigate your sample data.

 

For R, see the following.

To add the unique identifier, run the following command:

IP_DATA_ALL_with_ID=rowid_to_column(IP_DATA_ALL_FIX, var = "RowID")

Check if you successfully added the unique identifier, as follows:

View(IP_DATA_ALL_with_ID)

 

This will show you that every record has a unique ID named “RowID.” So, let’s store your progress.

Base=getwd()

FileName=paste0(FileDir,'/IP_DATA_ALL_with_ID.csv')

write.csv(IP_DATA_ALL_with_ID,FileName)

You should now have a file named IP_DATA_ALL_with_ID.csv.

 

Data Type of Each Data Column

Determine the best data type for each column, to assist you in completing the business glossary, to ensure that you record the correct import processing rules.

Use the following R command:

sapply(IP_DATA_ALL_with_ID, typeof)

 

You should see a complete data type analysis of the data you’ve loaded.

RowID    ID       Country    Place.Name  Post.Code  Latitude  Longitude  First.IP.Number  Last.IP.Number
integer  integer  character  character   character  double    double     integer          integer

 

Warning Keep your eye on data types, such as account numbers or identification numbers, that naturally appear to be numeric but are, in reality, text columns.

 

Histograms of Each Column


I always generate a histogram across every column, to determine the spread of the data value. In the practical section of this blog, I will generate a histogram against the sample data set, to demonstrate how to perform this action.

 

Now we translate the data set to a data table, which will assist us to investigate the data in an enhanced manner. I will show you in R how to find the histogram for the country data, as follows:

library(data.table)
hist_country=data.table(Country=unique(IP_DATA_ALL_with_ID[is.na(IP_DATA_ALL_with_ID['Country']) == 0, ]$Country))
setorder(hist_country,'Country')
hist_country_with_id=rowid_to_column(hist_country, var = "RowIDCountry")
View(hist_country_with_id)
IP_DATA_COUNTRY_FREQ=data.table(with(IP_DATA_ALL_with_ID, table(Country)))
View(IP_DATA_COUNTRY_FREQ)
Top-Two Country Codes Frequency
US 512714
GB 127069

 

Business Insights:

  • The two biggest subset volumes are from the US and GB.
  • The US has just over four times the data as GB.

 

This information is useful for designing parallel processing of loaded data. During parallel processing, the skew of data can result in unwanted wait times or overloading of processors, owing to specific split criteria for the parallel processing.

 

In the case just discussed, if you spread the data, for example, for one processor per country code, you would have a result of US and GB running longer than the other country codes, owing to the volumes. Even the mismatch in volumes between GB and US would result in more skew processing.
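
As a minimal sketch, reusing the IP_DATA_COUNTRY_FREQ table created above, you can quantify that skew before deciding on a split. The Share column is my own addition, not part of the original table.

library(data.table)
setorder(IP_DATA_COUNTRY_FREQ,-N)
# Percentage of all rows each country-code partition would receive.
IP_DATA_COUNTRY_FREQ[, Share := round(100 * N / sum(N), 2)]
View(IP_DATA_COUNTRY_FREQ[1:3])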

 

Caution Parallel processing is an essential requirement for data science, and it is important that you understand the behaviors of your data model and/or algorithms when dealing with nonconforming data spreads.

 

Sometimes, you will be better off not going parallel, as the split process may take longer than the savings it brings to the overall process.

 

So, let’s store your progress.

Base=getwd()

FileName=paste0(FileDir,'/IP_DATA_COUNTRY_FREQ.csv')

write.csv(IP_DATA_COUNTRY_FREQ, FileName)

You should now have a file named IP_DATA_COUNTRY_FREQ.csv.

 

Tip As you start to understand the analytic characteristics of the data lake, you will start to formulate your plans on how you would process it. I normally keep a few key notes on the data profile of the data lake.

 

Profiling the data lake is normally a structured approach. My team and I keep a simple cheat sheet on each data source on which we note what the profile of the data is. We share this via our GitHub source control, to ensure that any new discoveries are circulated to everyone involved.

 

Let’s look at some more key characteristics you may want to investigate. Histograms explain how the complete data set is spread over a set of values. I will show you in R how to find the histogram for the latitude data field.

 

hist_latitude=data.table(Latitude=unique(IP_DATA_ALL_with_ID[is.na(IP_DATA_ALL_with_ID['Latitude']) == 0, ]$Latitude))
setkeyv(hist_latitude, 'Latitude')
setorder(hist_latitude)
hist_latitude_with_id=rowid_to_column(hist_latitude, var = "RowID")
View(hist_latitude_with_id)
IP_DATA_Latitude_FREQ=data.table(with(IP_DATA_ALL_with_ID, table(Latitude)))
View(IP_DATA_Latitude_FREQ)
Top-Three Latitudes Frequency
35.6427 4838
51.5092 4442
48.8628 3230

 

Business Insights:

  • The two biggest data volumes are from latitudes 35.6427 and 51.5092.
  • The spread appears to be nearly equal between the top-two latitudes.

So, let’s store your progress.

Base=getwd()
FileName=paste0(FileDir,'/IP_DATA_Latitude_FREQ.csv')
write.csv(IP_DATA_Latitude_FREQ, FileName)
You should now have a file named IP_DATA_Latitude_FREQ.csv.
I will show you in R how to find the histogram for the longitude data.
hist_longitude=data.table(Longitude=unique(IP_DATA_ALL_with_ID[is.na(IP_DATA_ALL_with_ID['Longitude']) == 0, ]$Longitude))
setkeyv(hist_longitude, 'Longitude')
setorder(hist_longitude,'Longitude')
hist_longitude_with_id=rowid_to_column(hist_longitude, var = "RowID")
View(hist_longitude_with_id)
IP_DATA_Longitude_FREQ=data.table(with(IP_DATA_ALL_with_ID, table(Longitude)))
View(IP_DATA_Longitude_FREQ)
Top-Three Longitudes Frequency
139.7677 4837
-0.0955 4436
2.3292 3230
So, let’s store your progress.
Base=getwd()
FileName=paste0(FileDir,'/IP_DATA_Longitude_FREQ.csv')
write.csv(IP_DATA_Longitude_FREQ, FileName)

You should now have a file named IP_DATA_Longitude_FREQ.csv.

 

Minimum Value

Determine the minimum value in a specific column.

In R, use the following. I suggest that you inspect the country first:

min(hist_country$Country)

You should return AD. AD is the country code for Andorra (an independent principality situated between France and Spain in the Pyrenees Mountains).

 

Tip I will be using a common processing function in R named sapply().

If you have not used it before, I suggest you look at the R-bloggers post on apply, sapply, and lapply in R (apply-sapply-lapply-in-r).

In simple terms, it applies a function to a data set. Please read up on it, as you will use it frequently in your data science.

 

I will now apply the function min to the data set in hist_country’s Country field. You can also use

sapply(hist_country[,'Country'], min, na.rm=TRUE)

You should return AD.

 

I suggest you examine latitude next.

In R, execute

sapply(hist_latitude_with_id[,'Latitude'], min, na.rm=TRUE)

The result is -54.2767. What does this tell you?

 

Fact: The range of latitude for the Southern Hemisphere is from -90 to 0. So, if you do not have any latitudes farther south than -54.2767, you can improve your retrieve routine.

 

Calculate the following in R:

round(1-(-54.2767/-90),4)*100

The result is 39.69%. A 39% minimum saving in processing can be achieved using this discovery.

I suggest you examine longitude next.

 

In R, execute

sapply(hist_longitude_with_id[,'Longitude'], min, na.rm=TRUE)

The result is -176.5. What does this tell you?

 

Fact: The range of longitude for the Western Hemisphere is from -180 to 0.

So, if you do not have any longitudes more westerly than -176.5, you can improve your retrieve routine.

Calculate the following in R:

round(1-(-176.5/-180),4)*100

The result is 1.94%. A small saving of almost 2% in processing can be achieved using this discovery.

 

Business Insights:


  • The insight on Andorra is interesting but does not impact the processing plan for retrieval of the data.

 

  • The range in latitude for the Southern Hemisphere is from -90 to 0, with the minimum latitude confirmed at greater than -55. Knowing this, you do not have to process values for roughly 39% of the Southern Hemisphere. That is a major gain in efficiency.

 

  • The range in longitude for the Western Hemisphere is -180 to 0, with the minimum longitude confirmed at greater than -177. Therefore, you will not achieve a similar gain in efficiency, as you can avoid loading less than 2% of the total range of the Western Hemisphere.

 

You have discovered that if you had planned for 180 processes to load the individual bands of latitude, you have already removed the need for about 35 of those processes. You also discovered that if you had loaded via bands based on longitude, the improvements would not be as remarkable.

 

Maximum Value

Determine the maximum value in a specific column. In R, use

max(hist_country$Country)

You should return ZW. You can also use

sapply(hist_country[,'Country'], max, na.rm=TRUE)

Again, you should return ZW. I suggest that you investigate latitude next.

In R, execute

sapply(hist_latitude_with_id[,'Latitude'], max, na.rm=TRUE)

The result is 78.2167. What does this tell you?

 

Fact: The range in latitude for the Northern Hemisphere is from 0 to 90. So, if you do not have any latitudes more northerly than 78.2167, you can improve your retrieve routine.

 

Calculate the following in R:

round(1-(78.2167/90),4)*100

The result is 13.09%. A 13% minimum saving in processing can be achieved using this discovery.

 

In R, execute

sapply(hist_longitude_with_id[,'Longitude'], max, na.rm=TRUE)

The result is 179.2167. What does this tell you?

Fact: The range in longitude for the Eastern Hemisphere is from 0 to 180. So, if you do not have any longitudes more easterly than 179.2167, you can improve your retrieve routine.

 

Calculate the following in R:

round(1-(179.2167/180),4)*100

The result is 0.44%. A small 0.4% saving in processing can be achieved using this discovery.

 

Mean

If the column is numeric in nature, determine the average value in a specific column.

Let’s examine latitude.

In R, use sapply(hist_latitude_with_id[,'Latitude'], mean, na.rm=TRUE)

The result is 38.23803.

Let’s examine longitude.

In R, use sapply(hist_longitude_with_id[,'Longitude'], mean, na.rm=TRUE)

The result is -13.87398.

 

Median

Determine the value that splits the data set into two parts in a specific column.

In R, use the following command against the latitude column: sapply(hist_latitude_with_id[,'Latitude'], median, na.rm=TRUE)

The result is 43.2126.

In R, use the following command against the longitude column: sapply(hist_longitude_with_id[,'Longitude'], median, na.rm=TRUE)

The result is 1.8181.

 

Mode

Determine the value that appears most in a specific column. In R, use the following command against the country column:

IP_DATA_COUNTRY_FREQ=data.table(with(IP_DATA_ALL_with_ID, table(Country)))
setorder(IP_DATA_COUNTRY_FREQ,-N)
IP_DATA_COUNTRY_FREQ[1,'Country']
The result is US.
In R, use the following command against the latitude column:
IP_DATA_Latitude_FREQ=data.table(with(IP_DATA_ALL_with_ID,
table(Latitude)))
setorder(IP_DATA_Latitude_FREQ,-N)
IP_DATA_Latitude_FREQ[1,'Latitude']
The result is 35.6427.
In R, use the following command against the longitude column:
IP_DATA_Longitude_FREQ=data.table(with(IP_DATA_ALL_with_ID, table(Longitude)))
setorder(IP_DATA_Longitude_FREQ,-N)
IP_DATA_Longitude_FREQ[1,'Longitude']
The result is 139.7677.

 

Range

For numeric values, you determine the range of the values by taking the maximum value and subtracting the minimum value.

In R, use the following command against the latitude column: sapply(hist_latitude_with_id[,'Latitude'], range, na.rm=TRUE)


The range in latitude is -54.2767 to 78.2167.

In R, use the following command against the longitude column: sapply(hist_longitude_with_id[,'Longitude'], range, na.rm=TRUE)

The range in longitude is -176.5000 to 179.2167.

In R, use the following command against the country column: sapply(hist_country_with_id[,'Country'], range, na.rm=TRUE)

The country range is "AD" to "ZW".

 

Quartiles

Quartiles are the base values dividing a data set into quarters. Simply sort the data column and split it into four groups of equal size.

 

In R, use the following command against the latitude column: sapply(hist_latitude_with_id[,'Latitude'], quantile, na.rm=TRUE)

The results are as follows:

Latitude
0%    -54.2767
25%    36.4151
50%    43.2126
75%    48.6219
100%   78.2167

Knowledge Checkpoint Could you perform the process for the longitude? Is there any other insight you could achieve? How would you perform the task?

Here is my attempt:

In R, use the following command against the longitude column: sapply(hist_longitude_with_id[,'Longitude'], quantile, na.rm=TRUE)

The results are as follows:

Longitude

0% -176.50000
25% -80.20185
50% 1.81810
75% 12.69480
100% 179.21670

 

Standard Deviation

In R, use the following command against the latitude column: sapply(hist_latitude_with_id[,'Latitude'], sd, na.rm=TRUE)

The result is 19.86071.

In R, use the following command against the longitude column: sapply(hist_longitude_with_id[,'Longitude'], sd, na.rm=TRUE)

The result is 68.981.

 

Skewness

Skewness describes the shape or profile of the distribution of the data in the column.

Remember This is the characteristic that gives you a profile for doing parallel processing.

 

In R, use the following command against the latitude column:

Warning You will install a new package now: install.packages('e1071'). This is a set of useful functions known as e1071, developed by the Probability Theory Group at the Vienna University of Technology’s Department of Statistics.

 

There are thousands of R research groups worldwide that are achieving extremely useful progress in data processing. I am simply enriching our processing capability by adding their already proven processing capability.

library(e1071)

skewness(hist_latitude_with_id$Latitude, na.rm = FALSE, type = 2)

The result is -2.504929.

 

Negative skew: The mass of the distribution is concentrated on the right, and the distribution is said to be left-skewed. The result is only an indicator of how the data is skewed. By combining it with a histogram, you can gather good insights into how the data is spread.

 

Example:

If you look at the distribution of the latitudes sorted in ascending order and find the skew to be negative, it is improbable that you will find many locations at the lower (left) end of the scale; most locations will be concentrated toward the right-hand side, with a long, thin tail stretching to the left.

 

This is just a quick estimate. The histogram of the values will give you a precise count of each value.

 

Knowledge Checkpoint Is this true of the longitude also? Using View(hist_longitude_with_id), can you predict what you should find? Can you guess or calculate the skew yourself?

 

In R, use the following command against the longitude column: 

skewness(hist_longitude_with_id$Longitude,na.rm = FALSE, type = 2)

The result is 0.4852374.

Positive skew: The mass of the distribution is concentrated on the left, and the distribution is said to be right-skewed.

Warning Missing data or unknown values are a common and unavoidable fact of data science.

 

Example:

If you have data only about customers in the United States and Germany, does that mean you have no customers in the UK? What if your data lineage implies that you have the data, but it is simply not 100% filled in on the data lake? What do you process?

 

Missing or Unknown Values


Identify if you have missing or unknown values in the data sets.

In R, use the following command against the country column:

missing_country=data.table(Country=unique(IP_DATA_ALL_with_ID[is.na(IP_DATA_ALL_with_ID['Country']) == 1, ]))

View(missing_country)

 

Data Pattern

I have used the following process for years, to determine a pattern of the data values themselves. Here is my standard version:

 

Replace all alphabetic characters with an uppercase A, all numbers with an uppercase N, any spaces with a lowercase b, and all other unknown characters with a lowercase u.

 

As a result, “Good blog 101!” becomes “AAAAbAAAAbNNNu”. This pattern creation is beneficial for designing specific assess rules later in the process, as unwanted patterns can be processed to remove or repair the invalid pattern.

 

This pattern view of data is a quick way to identify common patterns or determine standard layouts.

 

Example 1:

If you have two patterns—NNNNuNNuNN and uuNNuNNuNN—for a date column, it is easy to see that you have an issue with the uuNNuNNuNN pattern.

It is also easy to create a quality-matching grid using regular expressions (regexpr()).

See https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html.
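
As a minimal sketch with illustrative values of my own, you could use a regular expression to flag which strings fit the expected NNNNuNNuNN shape:

dates=c("2017/12/30", "xx/12/30", "2017-05-08")  # illustrative values
# TRUE where the string is four digits, a separator, two digits, a separator, two digits.
grepl("^[0-9]{4}[^0-9A-Za-z ][0-9]{2}[^0-9A-Za-z ][0-9]{2}$", dates)
# Returns TRUE FALSE TRUE, so only the second value needs repair.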

 

Example 2:

It also helps with determining how to load the data in a parallel manner, as each pattern can be loaded in separate retrieve procedures.

 

If the same two patterns, NNNNuNNuNN and uuNNuNNuNN, are found, you can send NNNNuNNuNN directly to be converted into a date, while uuNNuNNuNN goes through a quality-improvement process to then route back to the same queue as NNNNuNNuNN, once it complies. 

This is a common use of patterns to separate common standards and structures.

 

So, let’s see how you can achieve this. Let’s pattern our first column. I suggest the country column.

In R, use the following commands:
library(readr)
library(data.table)
Base=getwd()
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
IP_DATA_ALL <- read_csv(FileName)
hist_country=data.table(Country=unique(IP_DATA_ALL$Country))
pattern_country=data.table(Country=hist_country$Country,
PatternCountry=hist_country$Country)
oldchar=c(letters,LETTERS)
newchar=replicate(length(oldchar),"A")
for (r in seq(nrow(pattern_country))){
s=pattern_country[r,]$PatternCountry;
for (c in seq(length(oldchar))){
s=chartr(oldchar[c],newchar[c],s)
};
for (n in seq(0,9,1)){
s=chartr(as.character(n),"N",s)
};
s=chartr(" ","b",s)
s=chartr(".","u",s)
pattern_country[r,]$PatternCountry=s;
};
View(pattern_country)
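
If you prefer a more compact variant, here is a minimal sketch of my own (not the loop used above) that builds the same style of pattern with vectorized gsub() calls:

pattern_of <- function(x) {
  p=gsub("[A-Za-z]","A",x)  # letters become A
  p=gsub("[0-9]","N",p)     # digits become N
  p=gsub(" ","b",p)         # spaces become b
  gsub("[^ANb]","u",p)      # everything else becomes u
}
pattern_of("Good blog 101!")  # returns "AAAAbAAAAbNNNu"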

 

So, let’s store your progress.

Base=getwd()
FileName=paste0(FileDir,'/pattern_country.csv')
write.csv(pattern_country, FileName)
You should now have a file named pattern_country.csv.
Let’s pattern another column. I suggest the latitude column.
In R, use the following commands:
library(readr)
library(data.table)
Base=getwd()
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
IP_DATA_ALL <- read_csv(FileName)
hist_latitude=data.table(Latitude=unique(IP_DATA_ALL$Latitude))
pattern_latitude=data.table(latitude=hist_latitude$Latitude,
Patternlatitude=as.character(hist_latitude$Latitude))
oldchar=c(letters,LETTERS)
newchar=replicate(length(oldchar),"A")
for (r in seq(nrow(pattern_latitude))){
s=pattern_latitude[r,]$Patternlatitude;
for (c in seq(length(oldchar))){
s=chartr(oldchar[c],newchar[c],s)
};
for (n in seq(0,9,1)){
s=chartr(as.character(n),"N",s)
};
s=chartr(" ","b",s)
s=chartr("+","u",s)
s=chartr("-","u",s)
s=chartr(".","u",s)
pattern_latitude[r,]$Patternlatitude=s;
};
setorder(pattern_latitude,latitude)
View(pattern_latitude[1:3])
The results are as follows:
latitude Patternlatitude
1 -54.2767 uNNuNNNN
2 -54.1561 uNNuNNNN
3 -51.7 uNNuN
So, let’s store your progress.
Base=getwd()
FileName=paste0(FileDir,'/pattern_country_asc.csv')
write.csv(pattern_country, FileName)
You should now have a file named pattern_country_asc.csv.
setorder(pattern_latitude,-latitude)
View(pattern_latitude[1:3])
The results are as follows:
latitude Patternlatitude
1 78.2167 NNuNNNN
2 77.3025 NNuNNNN
3 72.7 NNuN

 

So let’s store your progress.

Base=getwd()
FileName=paste0(FileDir,'/pattern_country_desc.csv')
write.csv(pattern_country, FileName)

You should now have a file named pattern_country_desc.csv.

Note At this point, you can save all your work, as I will now discuss various non-programming information.

 

Data Quality


More data points do not mean that data quality is less relevant. Poor data quality can invalidate a complete data set if not dealt with correctly.

 

Warning Bad-quality data will damage even the best-designed data science model. So, spend the time to ensure that you understand any data issues early in the process. I will discuss in detail how to resolve any data quality issues.

 

Audit and Version Management

Up to now, I have simply allowed you to save and work with data in a use-once method.

That is not the correct way to perform data science. You must always report the following (a minimal sketch of an audit record follows the list):

  • Who used the process?
  • When was it used?
  • Which version of code was used?
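
As a minimal sketch of such an audit record, using my own file location and version tag (both assumptions, not a prescribed standard), you could append one row per run to a CSV file:

AuditFile=paste0(getwd(),'/audit_log.csv')  # assumed location
AuditRecord=data.frame(
  UserName='Practical Data Scientist',
  RunTime=format(Sys.time(), "%Y/%d/%m %H:%M:%S"),
  CodeVersion='Retrieve-IP_DATA_ALL.r v0.1',  # hypothetical version tag
  stringsAsFactors=FALSE
)
# Write the header only the first time; afterward, append one row per run.
if (!file.exists(AuditFile)) {
  write.table(AuditRecord, AuditFile, sep=",", row.names=FALSE, col.names=TRUE)
} else {
  write.table(AuditRecord, AuditFile, sep=",", row.names=FALSE, col.names=FALSE, append=TRUE)
}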

 

Training the Trainer Model


To prevent a data swamp, it is essential that you train your team also. Data science is a team effort.

People, process, and technology are the three cornerstones to ensure that data is curated and protected. You are responsible for your people; share the knowledge you acquire from this blog. The process I teach you, you need to teach them. Alone, you cannot achieve success.

 

Technology requires that you invest time to understand it fully. We are only at the dawn of major developments in the field of data engineering and data science.

Remember: 

A big part of this process is to ensure that business users and data scientists understand the need to start small, have concrete questions in mind, and realize that there is work to be done with all data to achieve success.

 

Understanding the Business Dynamics of the Data Lake

You now have the basic techniques for handling the retrieve step in your data science process. Let’s now process the examples I supplied, so you can measure your own capability against them.

 

R Retrieve Solution

First, I will guide you through an R-based solution for the Retrieve superstep.

Vermeulen PLC

I will guide you through the first company. Start your RStudio editor.

Note For the rest of the blog, I will use setwd("C:/VKHCG") for Windows and setwd("/home/<youruser>/VKHCG") for Linux ecosystems. Load IP_DATA_ALL.

Now, execute the following command via RStudio:

getwd()

Test if your workspace directory is correct. You should see the directory you set.

Loading IP_DATA_ALL

This data set contains all the IP address allocations in the world. It will help you to locate your customers when interacting with them online.

Warning I am using default directory references: for Windows, C:/VKHCG; for Linux, /home/<youruser>/VKHCG.

 

If you have code and data at neither of these locations, you will have to make code changes for every example in this blog.

 

Let’s start our first load for the data process.

Create a new R script file and save it as Retrieve-IP_DATA_ALL.r in directory C:\VKHCG\01-Vermeulen\01-Retrieve.

Now, start copying the following code into the file:
rm(list=ls()) #will remove ALL objects
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir=paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS/01-R')
dir.create(FileDir)
FileDirLog=paste0(FileDir,'/log')
dir.create(FileDirLog)
FileDirRun=paste0(FileDirLog,'/Run0001')
dir.create(FileDirRun)
StartTime=Sys.time()
# Set up logging
debugLog=paste0(FileDirRun,'/debug.Log')
infoLog=paste0(FileDirRun,'/info.Log')
errorLog=paste0(FileDirRun,'/error.Log')
write(paste0('Start Debug Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = FALSE)
write(paste0('Start Information Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = FALSE)
write(paste0('Start Error Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = FALSE)
UserName='Practical Data Scientist'
write(paste0(UserName,' Load library: ', 'readr'), infoLog,append = TRUE)
library(readr)
write(paste0(UserName,' Load library: ', 'data.table'), infoLog,append = TRUE)
library(data.table)
write(paste0(UserName,' Load library: ', 'tibble'), infoLog,append = TRUE)
library(tibble)
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
write(paste0(UserName,' Retrieve data file: ', FileName),
file=infoLog,append = TRUE)
IP_DATA_ALL <- read_csv(FileName,
col_types = cols(
Country = col_character(),
ID = col_skip(),
'First IP Number' = col_skip(),
'Last IP Number' = col_skip(),
Latitude = col_double(),
Longitude = col_double(),
'Place Name' = col_character(),
'Post Code' = col_character()
), na = "empty")
IP_DATA_ALL_FIX=set_tidy_names(IP_DATA_ALL, syntactic = TRUE, quiet = TRUE)
IP_DATA_ALL_with_ID=rowid_to_column(IP_DATA_ALL_FIX, var = "RowID")
FileNameOut=paste0(FileDir,'/Retrieve_IP_DATA.csv')
fwrite(IP_DATA_ALL_with_ID, FileNameOut)
write(paste0(UserName,' Stores Retrieve data file: ', FileNameOut),
file=infoLog,append = TRUE)
StopTime=Sys.time()
write(paste0('Stop Debug Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = TRUE)
write(paste0('Stop Information Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = TRUE)
write(paste0('Stop Error Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = TRUE)
View(IP_DATA_ALL_with_ID)

Save and execute the script.

 

If you completed the code correctly, you should see four outcomes, as follows:

1. In directory C:\VKHCG\01-Vermeulen\01-Retrieve, you should have a file named Retrieve-IP_DATA_ALL.r.

2. In directory C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\01-R\log\Run0001, you should have three files: debug.Log, error.Log, and info.Log. If you open info.Log in a text editor, you should see the following:

Start Information Log File 2017/05/08 03:35:54
Practical Data Scientist Load library: readr
Practical Data Scientist Load library: data.table
Practical Data Scientist Load library: tibble
Practical Data Scientist Retrieve data file: C:/VKHCG/01-Vermeulen/00-RawData/IP_DATA_ALL.csv
Practical Data Scientist Stores Retrieve data file: C:/VKHCG/01-Vermeulen/01-Retrieve/01-EDS/01-R/Retrieve_IP_DATA.csv
Stop Information Log File 2017/05/08 03:35:57

3. In directory C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\01-R, you should have a file named Retrieve_IP_DATA.csv.

4. You will also see a display of the file you loaded, as follows:

 

RowID Country Place.Name Post.Code Latitude Longitude
1     BW      Gaborone   NA        -24.6464 25.9119
2     BW      Gaborone   NA        -24.6464 25.9119
3     BW      Gaborone   NA        -24.6464 25.9119
4     BW      Gaborone   NA        -24.6464 25.9119

 

So, what have you achieved? You’ve

  • Loaded the IP_DATA_ALL.csv into R
  • Removed three columns, by using col_skip(): ID, First IP Number, and Last IP Number
  • Set the data type to a string of characters, by using col_character(), for three columns: Country, Place Name, and Post Code
  • Converted all missing data to empty
  • Changed Place Name and Post Code to Place.Name and Post.Code
  • Learned how to log your progress throughout the program

Wow, you have taken the first steps to become a practical data scientist. Now, I suggest that you close Retrieve-IP_DATA_ALL.r, and we will move on to the next file to load.

 

Loading IP_DATA_C_VKHCG

This is a specific data load, as it contains only the class C network addresses that VKHCG is using for its data systems.

Create a new R script file and save it as Retrieve-IP_DATA_C_VKHCG.r in directory C:\VKHCG\01-Vermeulen\01-Retrieve.

 

Now, you can start copying the following code into the script file:

rm(list=ls()) #will remove ALL objects
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS/01-R')
dir.create(FileDir)
FileDirLog=paste0(FileDir,'/log')
dir.create(FileDirLog)
FileDirRun=paste0(FileDirLog,'/Run0002')
dir.create(FileDirRun)
StartTime=Sys.time()
# Set up logging
debugLog=paste0(FileDirRun,'/debug.Log')
infoLog=paste0(FileDirRun,'/info.Log')
errorLog=paste0(FileDirRun,'/error.Log')
write(paste0('Start Debug Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = FALSE)
write(paste0('Start Information Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = FALSE)
write(paste0('Start Error Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = FALSE)
UserName='Practical Data Scientist'
write(paste0(UserName,' Load library: ', 'readr'), infoLog,append = TRUE)
library(readr)
write(paste0(UserName,' Load library: ', 'data.table'), infoLog,append = TRUE)
library(data.table)
write(paste0(UserName,' Load library: ', 'tibble'), infoLog,append = TRUE)
library(tibble)
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_C_VKHCG.csv')
write(paste0(UserName,' Retrieve data file: ', FileName),
file=infoLog,append = TRUE)
IP_DATA_C_VKHCG <- read_csv (FileName, col_types = cols(
`IP Address` = col_character(),
`IP Number` = col_double(),
w = col_integer(),
x = col_integer(),
y = col_integer(),
z = col_integer()))
IP_DATA_C_VKHCG_FIX=set_tidy_names(IP_DATA_C_VKHCG, syntactic = TRUE, quiet = TRUE)
IP_DATA_C_VKHCG_with_ID=rowid_to_column(IP_DATA_C_VKHCG_FIX, var = "RowID")
setorderv(IP_DATA_C_VKHCG_with_ID, 'IP.Number', order= -1L, na.last=FALSE)
FileNameOut=paste0(FileDir,'/Retrieve_IP_C_VKHCG.csv')
fwrite(IP_DATA_C_VKHCG_with_ID, FileNameOut)
write(paste0(UserName,' Stores Retrieve data file: ', FileNameOut),
file=infoLog,append = TRUE)
StopTime=Sys.time()
write(paste0('Stop Debug Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = TRUE)
write(paste0('Stop Information Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = TRUE)
write(paste0('Stop Error Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = TRUE)
View(IP_DATA_C_VKHCG_with_ID)

Save the script file Retrieve-IP_DATA_C_VKHCG.r. Now source it to execute.

If completed as required, you should have a new file named Retrieve_IP_C_VKHCG.csv.

 

Loading IP_DATA_CORE

The next data source supplies all the core router’s addresses within VKHCG. You will now load IP_DATA_CORE.csv into Retrieve_IP_DATA_CORE.csv.

 

Create a new R script file and save it as Retrieve-IP_DATA_CORE.r in directory C:\VKHCG\01-Vermeulen\01-Retrieve. Now, you can start to copy the following code into the script file:

rm(list=ls()) #will remove ALL objects
setwd("C:/VKHCG")
Base=getwd()
FileDir= paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS/01-R')
dir.create(FileDir)
FileDirLog=paste0(FileDir,'/log')
dir.create(FileDirLog)
FileDirRun=paste0(FileDirLog,'/Run0003')
dir.create(FileDirRun)
StartTime=Sys.time()
# Set up logging
debugLog=paste0(FileDirRun,'/debug.Log')
infoLog=paste0(FileDirRun,'/info.Log')
errorLog=paste0(FileDirRun,'/error.Log')
write(paste0('Start Debug Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = FALSE)
write(paste0('Start Information Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = FALSE)
write(paste0('Start Error Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = FALSE)
UserName='Practical Data Scientist'
write(paste0(UserName,' Load library: ', 'readr'), infoLog,append = TRUE)
library(readr)
write(paste0(UserName,' Load library: ', 'data.table'), infoLog,append = TRUE)
library(data.table)
write(paste0(UserName,' Load library: ', 'tibble'), infoLog,append = TRUE)
library(tibble)
FileName=paste0(Base,'/01-Vermeulen/00-RawData/IP_DATA_CORE.csv')
write(paste0(UserName,' Retrieve data file: ', FileName),
file=infoLog,append = TRUE)
IP_DATA_CORE <- read_csv (FileName,
col_types = cols(
Country = col_character(),
`First IP Number` = col_double(),
ID = col_integer(),
`Last IP Number` = col_double(),
Latitude = col_double(),
Longitude = col_double(),
`Place Name` = col_character(),
`Post Code` = col_character()
),
na = "empty"
)
IP_DATA_CORE_FIX=set_tidy_names(IP_DATA_CORE, syntactic = TRUE, quiet = TRUE)
IP_DATA_CORE_with_ID=rowid_to_column(IP_DATA_CORE_FIX, var = "RowID")
setorderv(IP_DATA_CORE_with_ID, c('Country','Place.Name','Post.Code'), order= 1L, na.last=FALSE)
FileNameOut=paste0(FileDir,'/Retrieve_IP_DATA_CORE.csv')
fwrite(IP_DATA_CORE_with_ID, FileNameOut)
write(paste0(UserName,' Stores Retrieve data file: ',
FileNameOut),
file=infoLog,append = TRUE)
StopTime=Sys.time()
write(paste0('Stop Debug Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = TRUE)
write(paste0('Stop Information Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = TRUE)
write(paste0('Stop Error Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = TRUE)
View(IP_DATA_CORE_with_ID)

Save the Retrieve-IP_DATA_CORE.r file and source it to execute.


Now that you have the file Retrieve_IP_DATA_CORE.csv completed, you should be able to follow the basic principles of retrieving files.

 

Loading COUNTRY-CODES


Now we add a reference file to the main data set, to assist with requirements for the assess data processing later in the blog. For now, it is only another data file, but I will guide you through the process of how to use this reference data.

 

Create a new R script file and save it as Retrieve-Country_Code.r in directory C:\VKHCG\01-Vermeulen\01-Retrieve. Now, you can start to copy the following code into the script file:

rm(list=ls()) #will remove ALL objects
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/01-Vermeulen/01-Retrieve/01-EDS/01-R')
dir.create(FileDir)
FileDirLog=paste0(FileDir,'/log')
dir.create(FileDirLog)
FileDirRun=paste0(FileDirLog,'/Run0004')
dir.create(FileDirRun)
StartTime=Sys.time()
# Set up logging
debugLog=paste0(FileDirRun,'/debug.Log')
infoLog=paste0(FileDirRun,'/info.Log')
errorLog=paste0(FileDirRun,'/error.Log')
write(paste0('Start Debug Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = FALSE)
write(paste0('Start Information Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = FALSE)
write(paste0('Start Error Log File ', format(StartTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = FALSE)
UserName='Practical Data Scientist'
write(paste0(UserName,' Load library: ', 'readr'), infoLog,append = TRUE)
library(readr)
write(paste0(UserName,' Load library: ', 'data.table'), infoLog,append = TRUE)
library(data.table)
write(paste0(UserName,' Load library: ', 'tibble'), infoLog,append = TRUE)
library(tibble)
FileName=paste0(Base,'/01-Vermeulen/00-RawData/Country_Code.csv')
write(paste0(UserName,' Retrieve data file: ', FileName),
file=infoLog,append = TRUE)
Country_Code <- read_csv (FileName,
col_types = cols(Country = col_character(), 'ISO-2-CODE' = col_character(), 'ISO-3-Code' = col_character(), 'ISO-M49' = col_integer() ),
na = "empty"
)
Country_Code_with_ID=rowid_to_column(Country_Code, var = "RowID")
setorderv(Country_Code_with_ID, 'ISO-2-CODE',order= 1L, na.last=FALSE)
FileNameOut=paste0(FileDir,'/Retrieve_Country_Code.csv')
fwrite(Country_Code_with_ID, FileNameOut)
write(paste0(UserName,' Stores Retrieve data file: ', FileNameOut),
file=infoLog,append = TRUE)
StopTime=Sys.time()
write(paste0('Stop Debug Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=debugLog,append = TRUE)
write(paste0('Stop Information Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=infoLog,append = TRUE)
write(paste0('Stop Error Log File ', format(StopTime, "%Y/%d/%m %H:%M:%S")), file=errorLog,append = TRUE)
View(Country_Code_with_ID)

Save Retrieve-Country_Code.r and source it to execute.

Note From this point, I have removed the logging code, as I want you to get to the core data retrieve process. Logging must be implemented when you deploy your new skills against real data lakes.

 

Krennwallner AG

Remember that this company markets our customers’ products on billboards across the country.

Note I am loading from a new location named C:\VKHCG\02-Krennwallner\00-RawData.

 

Loading DE_Billboard_Locations

I will next guide you to retrieve the billboard locations for Germany.

First, load DE_Billboard_Locations.csv into Retrieve_DE_Billboard_Locations.csv.


Create a new R script file and save it as Retrieve-DE_Billboard_Locations.r in directory C:\VKHCG\02-Krennwallner\01-Retrieve. Now, you can start copying the following code into the script file:
rm(list=ls()) #will remove ALL objects
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/02-Krennwallner/01-Retrieve/01-EDS/01-R')
library(readr)
library(data.table)
FileName=paste0(Base,'/02-Krennwallner/00-RawData/DE_Billboard_Locations.csv')
DE_Billboard_Locations <- read_csv(FileName, col_types = cols(
Country = col_character(),
ID = col_integer(),
Latitude = col_double(),
Longitude = col_double(),
PlaceName = col_character()
), na = "empty")
setnames(DE_Billboard_Locations,'PlaceName','Place.Name')
setorder(DE_Billboard_Locations,Latitude,Longitude)
FileNameOut=paste0(FileDir,'/Retrieve_DE_Billboard_Locations.csv')
fwrite(DE_Billboard_Locations, FileNameOut)
View(DE_Billboard_Locations)

Save Retrieve-DE_Billboard_Locations.r and source it to execute.

 

Tip Did you spot the data issues in column Place.Name?

The data error you spotted is owing to the special characters in German names. The export program that delivered the data to the data lake has caused an issue.

 

This is a common problem that a data scientist must handle to translate the data lake into business knowledge.
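
As a minimal sketch, assuming the exporter wrote the file in Latin-1 rather than UTF-8 (a common cause of mangled German umlauts), you could re-encode the affected column with iconv() and inspect the result before overwriting anything:

# Assumption: the source encoding is latin1; verify this against the raw file first.
PlaceFixed=iconv(DE_Billboard_Locations$Place.Name, from="latin1", to="UTF-8")
View(data.frame(Original=DE_Billboard_Locations$Place.Name, Repaired=PlaceFixed))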

The Krennwallner company’s data is loaded, and we can move on to the next company.

 

Hillman Ltd

This is our logistics company.

Note I am loading from a new location: ..\VKHCG\03-Hillman\00-RawData.

Loading GB_Postcode_Full

This data source contains all the postal code information for the United Kingdom.

 

First, load GB_Postcode_Full.csv into Retrieve_GB_Postcode_Full.csv.

Create a new R script file and save it as Retrieve-GB_Postcode_Full.r in directory C:/VKHCG/03-Hillman/01-Retrieve/01-EDS/01-R.
rm(list=ls()) #will remove ALL objects
library(readr)
library(data.table)
library(tibble)
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG") setwd(BaseSet)
}
Base=getwd()
FileDir=paste0(Base,'/03-Hillman/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/03-Hillman/00-RawData/GB_Postcode_Full.csv')
GB_Postcode_Full <- read_csv(FileName,
col_types = cols(
AreaName = col_character(),
Country = col_character(),
ID = col_integer(),
PlaceName = col_character(),
PostCode = col_character(),
Region = col_character(),
RegionCode = col_character()
),
na = "empty")
FileNameOut=paste0(FileDir,'/Retrieve_GB_Postcode_Full.csv')
fwrite(GB_Postcode_Full, FileNameOut)
View(GB_Postcode_Full)
Save Retrieve-GB_Postcode_Full.r and source it to execute.

 

Loading GB_Postcode_Warehouse

The following data holds all the locations for the warehouses. First, load GB_Postcode_Warehouse.csv into Retrieve_GB_Postcode_Warehouse.csv. I will then guide you to create and read back the data in JSON format.

 

I will also show you how to use library(jsonlite) to generate two files: Retrieve_GB_Postcode_Warehouse_A.json and Retrieve_GB_Postcode_Warehouse_B.json.

 

Warning I am assuming that you know the JSON data format. If you are not familiar with it, please read the next paragraph.

JSON (JavaScript Object Notation) is a lightweight data-interchange format that source systems use to make data easier to transport.

 

Create a new R script file and save it as Retrieve-GB_Postcode_Warehouse.r in directory ../VKHCG/03-Hillman/01-Retrieve/01-EDS/01-R.

rm(list=ls()) #will remove ALL objects
library(readr)
library(data.table)
library(jsonlite)
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG") setwd(BaseSet)
}
Base=getwd()
FileDir=paste0(Base,'/03-Hillman/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/03-Hillman/00-RawData/GB_Postcode_Warehouse.csv')
GB_Postcode_Warehouse <- read_csv(FileName,
col_types = cols(
id = col_integer(),
latitude = col_double(),
longitude = col_double(),
postcode = col_character()
),
na = "empty")
FileNameOut=paste0(FileDir,'/Retrieve_GB_Postcode_Warehouse.csv')
fwrite(GB_Postcode_Warehouse, FileNameOut)
GB_Postcode_Warehouse_JSON_A=toJSON(GB_Postcode_Warehouse, pretty=TRUE)
FileNameJSONA=paste0(FileDir,'/Retrieve_GB_Postcode_Warehouse_A.json')
write(GB_Postcode_Warehouse_JSON_A, FileNameJSONA)
GB_Postcode_Warehouse_JSON_B=toJSON(GB_Postcode_Warehouse, pretty=FALSE)
FileNameJSONB=paste0(FileDir,'/Retrieve_GB_Postcode_Warehouse_B.json')
write(GB_Postcode_Warehouse_JSON_B, FileNameJSONB)
View(GB_Postcode_Warehouse)
GB_Postcode_Warehouse_A=json_data <- fromJSON(paste(readLines(FileNameJSONA), collapse=""))
View(GB_Postcode_Warehouse_A)
GB_Postcode_Warehouse_B=json_data <- fromJSON(paste(readLines(FileNameJSONB), collapse=""))
View(GB_Postcode_Warehouse_B)
Save Retrieve-GB_Postcode_Warehouse.r and source it to execute.

You can create and load JSON files. This is another skill you will find useful in your interactions with a data lake.

 

Loading GB_Postcode_Shops

The following data set holds the locations of all the shops. First, load GB_Postcodes_Shops.csv into Retrieve_GB_Postcodes_Shops.csv.

I will also guide you to create and read back the data for five shops in XML format.

I will also show you how to use library(XML) to generate the file Retrieve_GB_Postcodes_Shops.xml.

 

Warning I am assuming that you know the XML data format. If you are not familiar with it, please read the next paragraph.

 

Note XML stands for EXtensible Markup Language. XML is a markup language similar to HTML, but XML is used to transport data across the Internet. It is a protocol that is used to transport data using standard HTTP-capable technology.

 

Tip Generating XML files is time-consuming, and the format should be used as little as possible inside the data lake ecosystem. Convert these files to less complex and less verbose formats as soon as possible.

Create a new R script file and save it as Retrieve-GB_Postcode_Shops.r in directory ../VKHCG/03-Hillman/01-Retrieve/01-EDS/01-R.
rm(list=ls()) #will remove ALL objects
library(readr)
library(data.table)
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/03-Hillman/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/03-Hillman/00-RawData/GB_Postcodes_Shops.csv')
GB_Postcodes_Shops <- read_csv(FileName,
col_types = cols(
id = col_integer(),
latitude = col_double(),
longitude = col_double(),
postcode = col_character()
),
na = "empty")
FileNameOut=paste0(FileDir,'/Retrieve_GB_Postcodes_Shops.csv')
fwrite(GB_Postcodes_Shops, FileNameOut)
View(GB_Postcodes_Shops)
GB_Five_Shops=GB_Postcodes_Shops[1:5,]
library(XML)
xml <- xmlTree()
xml$addTag("document", close=FALSE)
for (i in 1:nrow(GB_Five_Shops)) {
xml$addTag("row", close=FALSE)
for (j in 1:length(names(GB_Five_Shops))) {
xml$addTag(names(GB_Five_Shops)[j], GB_Five_Shops[i, j])
}
xml$closeTag()
}
xml$closeTag()
GB_Postcodes_Shops_XML=saveXML(xml)
FileNameXML=paste0(FileDir,'/Retrieve_GB_Postcodes_Shops.xml')
write(GB_Postcodes_Shops_XML, FileNameXML)
xmlFile<-xmlTreeParse(FileNameXML)
class(xmlFile)
xmlTop = xmlRoot(xmlFile)
xmlText<- xmlSApply(xmlTop, function(x) xmlSApply(x, xmlValue))
GB_Postcodes_Shops_XML_A <- data.frame(t(xmlText),row.names=NULL)
View(GB_Postcodes_Shops_XML_A)

Save Retrieve-GB_Postcode_Shops.r and source it to execute.

You have now generated an XML document and reloaded it into a data frame. This is a skill you will use several times in your work as a data scientist.

 

Clark Ltd

Clark is the source of the financial information in the group.

Note I am loading from a new location: ../VKHCG/04-Clark/00-RawData.

 

Loading Euro_ExchangeRates

First, load Euro_ExchangeRates.csv into Retrieve_Euro_ExchangeRates_Pivot.csv.


Create a new R script file and save it as Retrieve-Euro_ExchangeRates_Pivot.r in directory C:/VKHCG/04-Clark/01-Retrieve/01-EDS/01-R.
rm(list=ls()) #will remove ALL objects
library(readr)
library(data.table)
library(stringi)
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG") setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/04-Clark/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/04-Clark/00-RawData/Euro_ExchangeRates.csv')
Euro_ExchangeRates <- read_csv(FileName,
col_types = cols(
.default = col_double(),
Date = col_date(format = "%d/%m/%Y")
),
locale = locale(asciify = TRUE), na = "empty")
FileNameOut=paste0(FileDir,'/Retrieve_Euro_ExchangeRates_Pivot.csv')
fwrite(Euro_ExchangeRates, FileNameOut)
View(Euro_ExchangeRates)

The data structure in this file is known as a pivot table. This is a common output that financial staff generate. It is, however, not the format in which data scientists want to load the data, so I will guide you through a process that you can use to resolve this pivot structure.
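If you want to see the unpivot idea in isolation before working through the R script below, here is a minimal pandas sketch (my own illustration, with made-up rates) that melts a wide, pivoted exchange-rate table into Date/Code/Rate rows, which is essentially what the following script achieves.

# Illustration only: melt a small pivoted exchange-rate table into long form.
import pandas as pd
wide = pd.DataFrame({
    'Date': ['2019-01-02', '2019-01-03'],
    'USD': [1.1397, 1.1393],   # made-up EUR-to-USD rates
    'GBP': [0.9045, 0.9034],   # made-up EUR-to-GBP rates
})
long = wide.melt(id_vars='Date', var_name='Code', value_name='Rate').dropna()
print(long.sort_values(['Date', 'Code']))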

 

Reload Euro_ExchangeRates.csv into Retrieve_Euro_ExchangeRates.csv,

create a new R script file, and save it as Retrieve-Euro_ExchangeRates.r in directory C:/VKHCG/04-Clark/01-Retrieve/01-EDS/01-R.
rm(list=ls()) #will remove ALL objects
library("data.table")
library("readr")
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG") setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/04-Clark/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/04-Clark/00-RawData/Euro_ExchangeRates.csv')
Euro_ExchangeRates <- read_csv(FileName,
col_types=cols(
Date=col_date(format="%d/%m/%Y"),
.default=col_character()
),
locale=locale(encoding="ASCII", asciify=TRUE)
)
### Get the list of headings
CA=as.vector(names(Euro_ExchangeRates))
### Remove Date from the vector to get Exchange Codes
C=CA[! CA %in% "Date"]
### Create a default table structure
Exchange=data.table(Date="1900-01-01",Code="Code",Rate=0,Base="Base")
### Add the data for Exchange Code Euro pairs
for (R in C) {
ExchangeRates=data.table(subset(Euro_ExchangeRates[c('Date',R)],
is.na(Euro_ExchangeRates[R])==FALSE),R,Base="EUR")
colnames(ExchangeRates)[colnames(ExchangeRates)==R] <- "Rate"
colnames(ExchangeRates)[colnames(ExchangeRates)=="R"] <- "Code"
if(nrow(Exchange)==1) Exchange=ExchangeRates
if(nrow(Exchange)>1) Exchange=data.table(rbind(Exchange,ExchangeRates))
}
Exchange2=Exchange
colnames(Exchange)[colnames(Exchange)=="Rate"] <- "RateIn"
colnames(Exchange)[colnames(Exchange)=="Code"] <- "CodeIn"
colnames(Exchange2)[colnames(Exchange2)=="Rate"] <- "RateOut"
colnames(Exchange2)[colnames(Exchange2)=="Code"] <- "CodeOut"
ExchangeRate=merge(Exchange, Exchange2, by=c("Base","Date"), all=TRUE,
sort=FALSE,allow.cartesian=TRUE)
ExchangeRates <- data.table(ExchangeRate, Rate=with(
ExchangeRate,
round((as.numeric(RateOut) / as.numeric(RateIn)),9)
))
### Remove work columns
ExchangeRates$Base <-NULL
ExchangeRates$RateIn <-NULL
ExchangeRates$RateOut <-NULL
### Make entries unique
ExchangeRate=unique(ExchangeRates)
### Sort the results
setorderv(ExchangeRate, c('Date','CodeIn','CodeOut'), order= c(-1L,1L,1L), na.last=FALSE)
### Write Results
FileNameOut=paste0(FileDir,'/Retrieve_Euro_ExchangeRates.csv')
fwrite(ExchangeRate, FileNameOut)
### View Results
View(ExchangeRate)

Save the script Retrieve-Euro_ExchangeRates.r and source it to execute. You have resolved the pivot by converting it to a relationship between two exchange rates for a specific date. 
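To make the cross-rate logic concrete: for a given date, the script divides RateOut by RateIn, which converts between the two currencies via their common euro base. Here is a quick illustration with made-up numbers (not taken from the data set):

# Illustration only: derive a GBP-to-USD cross rate from two EUR-based rates.
rate_eur_to_usd = 1.1397   # RateOut: EUR -> USD (made up)
rate_eur_to_gbp = 0.9045   # RateIn:  EUR -> GBP (made up)
gbp_to_usd = round(rate_eur_to_usd / rate_eur_to_gbp, 9)
print(gbp_to_usd)          # roughly 1.26, i.e., 1 GBP buys about 1.26 USD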

 

Loading Profit_And_Loss

First, load Profit_And_Loss.csv into Retrieve_Profit_And_Loss.csv.

Create a new R script file and save it as Retrieve-Profit_And_Loss.r in directory ../VKHCG/04-Clark/01-Retrieve/01-EDS/01-R.
rm(list=ls()) #will remove ALL objects
library(readr)
library(data.table)
library(tibble)
if (Sys.info()['sysname'] == "Windows")
{
BaseSet = "C:/VKHCG"
setwd(BaseSet)
} else {
BaseSet = paste0("/home/", http://Sys.info()['user'], "/VKHCG")
setwd(BaseSet)
}
Base=getwd()
FileDir= paste0(Base,'/04-Clark/01-Retrieve/01-EDS/01-R')
FileName=paste0(Base,'/04-Clark/00-RawData/Profit_And_Loss.csv')
Profit_And_Loss <- read_csv(FileName,
col_types = cols(
Amount = col_double(),
ProductClass1 = col_character(),
ProductClass2 = col_character(),
ProductClass3 = col_character(),
QTR = col_character(),
QTY = col_double(),
TypeOfEntry = col_character()
), na = "empty")
### Sort the results
keyList=c('QTR','TypeOfEntry','ProductClass1','ProductClass2','ProductClass3')
setorderv(Profit_And_Loss, keyList, order= c(-1L,1L,1L,1L,1L), na.last=FALSE)
FileNameOut=paste0(FileDir,'/Retrieve_Profit_And_Loss.csv')
fwrite(Profit_And_Loss, FileNameOut)
View(Profit_And_Loss)

Save Retrieve-Profit_And_Loss.r and source it to execute.

 

Python Retrieve Solution


I will now guide you through the Python solution for the Retrieve superstep.

Note I will repeat, in Python, a step already completed in R, to illustrate that you can interchange technology stacks and still apply the same good data science principles.

 

Vermeulen PLC

I will guide you through the first company using Python. I just want to show you that with Python, you can achieve the same results as with R.

 

Note I am only showing the first data load, to make my point. Start your Python editor.

Create a text file named Retrieve-IP_DATA_ALL.py in directory ..\VKHCG\01-Vermeulen\01-Retrieve.

Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False)
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
print('Rows:', IP_DATA_ALL.shape[0])
print('Columns:', IP_DATA_ALL.shape[1])
print('### Raw Data Set #####################################')
for i in range(0,len(IP_DATA_ALL.columns)):
    print(IP_DATA_ALL.columns[i],type(IP_DATA_ALL.columns[i]))
print('### Fixed Data Set ###################################')
IP_DATA_ALL_FIX=IP_DATA_ALL
for i in range(0,len(IP_DATA_ALL.columns)):
    cNameOld=IP_DATA_ALL_FIX.columns[i] + ' '
    cNameNew=cNameOld.strip().replace(" ", ".")
    IP_DATA_ALL_FIX.columns.values[i] = cNameNew
    print(IP_DATA_ALL.columns[i],type(IP_DATA_ALL.columns[i]))
#print(IP_DATA_ALL_FIX.head())
print('Fixed Data Set with ID')
IP_DATA_ALL_with_ID=IP_DATA_ALL_FIX
IP_DATA_ALL_with_ID.index.names = ['RowID']
#print(IP_DATA_ALL_with_ID.head())
sFileName2=sFileDir + '/Retrieve_IP_DATA.csv'
IP_DATA_ALL_with_ID.to_csv(sFileName2, index = True)
print('### Done!! #')

Now, save the text file. Execute the Retrieve-IP_DATA_ALL.py with your preferred Python compiler.

Note I use the Spyder 3 environment from Anaconda and Python 3. On completion of this task, you have proven that with different tools and a good data scientist, you can process most data with ease.

 

Actionable Business Knowledge from Data Lakes

I will guide you through several actionable business processes that you can formulate directly from the data in the sample data set.

The steps are as follows:

  • Identify the data sources required.
  • Identify the source data format (CSV, XML, JSON, or database).
  • Profile the data distribution (skew, histogram, min, max); a short profiling sketch follows this list.
  • Identify any loading characteristics (column names, data types, volumes).
  • Determine the delivery format (CSV, XML, JSON, or database).
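Here is a minimal profiling sketch (my own illustration; it assumes a pandas DataFrame already loaded from one of the retrieve files, such as Retrieve_IP_DATA.csv) for the distribution checks listed above.

# Illustration only: quick distribution profile of a retrieved data set.
import pandas as pd
def profile(df):
    numeric = df.select_dtypes(include='number')
    print('Rows    :', df.shape[0])
    print('Columns :', list(df.columns))
    print('Types   :', dict(df.dtypes.astype(str)))
    print('Min     :', dict(numeric.min()))
    print('Max     :', dict(numeric.max()))
    print('Skew    :', dict(numeric.skew()))
    # A coarse histogram of the first numeric column, as ten equal-width bins.
    print(numeric.iloc[:, 0].value_counts(bins=10).sort_index())

# Example (assuming the file created earlier in this blog exists):
# profile(pd.read_csv('C:/VKHCG/01-Vermeulen/01-Retrieve/01-EDS/02-Python/Retrieve_IP_DATA.csv'))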

Warning I will be using Python as the processing engine for the rest of this blog and will add and explain new libraries and functions as we progress. I am using an Anaconda ecosystem, with a Spyder 3 editor and Python 3 compiler.

 

Vermeulen PLC

The company has two main jobs on which to focus your attention:

  • Designing a routing diagram for the company
  • Planning a schedule of jobs to be performed for the router network

To perform these tasks, I will guide you through two separate data retrieve actions.

For this practical data science, you will need to have your Python editor ready.

 

Designing a Routing Diagram for the Company

A network routing diagram is a map of all potential routes through the company’s network, like a roadmap. I will help you to create a set of start and end locations, using longitude and latitude. I will implement feature extraction, by calculating the distance between locations in kilometers and miles.

 

Start your Python editor and create a text file named Retrieve-IP_Routing.py in the directory ..\VKHCG\01-Vermeulen\01-Retrieve. Following is the Python code you must copy into the file.

 

Let’s retrieve the core IP data structure for the company.

# -*- coding: utf-8 -*-
import os
import pandas as pd
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2, stype):
    ### convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    ### haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    ### Type of Distance (Kilometers/Miles)
    if stype == 'km':
        r = 6371 # Radius of earth in kilometers
    else:
        r = 3956 # Radius of earth in miles
    d = round(c * r, 3)
    return d
Base='C:/VKHCG'
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_CORE.csv'
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False,
    usecols=['Country','Place Name','Latitude','Longitude'])
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
IP_DATA = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
IP_DATA.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA1 = IP_DATA
IP_DATA1.insert(0, 'K', 1)
IP_DATA2 = IP_DATA1
print(IP_DATA1.shape)
IP_CROSS=pd.merge(right=IP_DATA1,left=IP_DATA2,on='K')
IP_CROSS.drop('K', axis=1, inplace=True)
IP_CROSS.rename(columns={'Longitude_x': 'Longitude_from', 'Longitude_y': 'Longitude_to'}, inplace=True)
IP_CROSS.rename(columns={'Latitude_x': 'Latitude_from', 'Latitude_y': 'Latitude_to'}, inplace=True)
IP_CROSS.rename(columns={'Place_Name_x': 'Place_Name_from', 'Place_Name_y': 'Place_Name_to'}, inplace=True)
IP_CROSS.rename(columns={'Country_x': 'Country_from', 'Country_y': 'Country_to'}, inplace=True)
IP_CROSS['DistanceBetweenKilometers'] = IP_CROSS.apply(lambda row:
    haversine(
        row['Longitude_from'],
        row['Latitude_from'],
        row['Longitude_to'],
        row['Latitude_to'],
        'km')
    ,axis=1)
IP_CROSS['DistanceBetweenMiles'] = IP_CROSS.apply(lambda row:
    haversine(
        row['Longitude_from'],
        row['Latitude_from'],
        row['Longitude_to'],
        row['Latitude_to'],
        'miles')
    ,axis=1)
print(IP_CROSS.shape)
sFileName2=sFileDir + '/Retrieve_IP_Routing.csv'
IP_CROSS.to_csv(sFileName2, index = False)
print('### Done!! #')

Now, save the text file.

 

Execute the Retrieve-IP_Routing.py with your preferred Python compiler.

You will see a file named Retrieve_IP_Routing.csv in C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\02-Python.

 

Open this file, and you will see a new data set similar in layout to the following:

Country_from Place_Name_from Latitude_from Longitude_from Country_to Place_Name_to Latitude_to Longitude_to DistanceBetweenKilometers DistanceBetweenMiles

US New York 40.7528 -73.9725 US New York 40.7528 -73.9725 0 0

US New York 40.7528 -73.9725 US New York 40.7214 -74.0052 4.448 2.762

So, the distance from New York (40.7528, -73.9725) to New York (40.7214, -74.0052) is about 4.45 kilometers, or 2.76 miles.

 

Well done! You have taken a data set of IP addresses and extracted a hidden feature: the distance between the locations of those IP addresses.

 

Building a Diagram for the Scheduling of Jobs

You can extract the core routers' locations to schedule maintenance jobs. Now that we know where the routers are, we must set up a schedule for the maintenance jobs that have to be completed every month.

 

To accomplish this, we must retrieve the location of each router, to prepare a schedule for the staff maintaining them.

 

Start your Python editor and create a text file named Retrieve-Router-Location.py in directory ..\VKHCG\01-Vermeulen\01-Retrieve. Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
InputFileName='IP_DATA_CORE.csv'
OutputFileName='Retrieve_Router_Location.csv'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
sFileName=Base + '/01-Vermeulen/00-RawData/' + InputFileName
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False,
    usecols=['Country','Place Name','Latitude','Longitude'])
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
ROUTERLOC = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
print('Rows :',ROUTERLOC.shape[0])
print('Columns :',ROUTERLOC.shape[1])
sFileName2=sFileDir + '/' + OutputFileName
ROUTERLOC.to_csv(sFileName2, index = False)
print('### Done!! #')
Now save the Python file. Execute Retrieve-Router-Location.py with your preferred Python compiler. You will see a file named Retrieve_Router_Location.csv in C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\02-Python.
Open this file, and you should see a data set like the following:
Country Place_Name Latitude Longitude
US New York 40.7528 -73.9725
US New York 40.7214 -74.0052
US New York 40.7662 -73.9862

 

You have now successfully retrieved the location of 150 routers.

There is another set of data you can try your newfound skills against. First, modify the code you just created, by changing the following code values:

InputFileName='IP_DATA_ALL.csv'
OutputFileName='Retrieve_All_Router_Location.csv'


Now, save the Python file as Retrieve-Router-All-Location.py and execute the new code with your preferred Python compiler.


You will see a file named Retrieve_All_Router_Location.csv in C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\02-Python.

 

Let’s review what you have achieved.

  • You can now use Python to extract data from a CSV file.
  • You know how to extract hidden features from a data set, i.e., the distance between two locations.
  • You can save programming time with minor code changes, using common variables. You have now succeeded in generating super data science code.

 

Krennwallner AG

The company has two main jobs in need of your attention:

  • Picking content for billboards: I will guide you through the data science required to pick advertisements for each billboard in the company.
  • Understanding your online visitor data: I will guide you through the evaluation of the web traffic to the billboard’s online web servers.

 

Picking Content for Billboards


Let’s retrieve the billboard data we have for Germany.

 

Start your Python editor and create a text file named Retrieve-DE-Billboard-Locations.py in directory ..\VKHCG\02-Krennwallner\01-Retrieve.

Following is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
InputFileName='DE_Billboard_Locations.csv'
OutputFileName='Retrieve_DE_Billboard_Locations.csv'
Company='02-Krennwallner'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False,
    usecols=['Country','PlaceName','Latitude','Longitude'])
IP_DATA_ALL.rename(columns={'PlaceName': 'Place_Name'}, inplace=True)
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
ROUTERLOC = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
print('Rows :',ROUTERLOC.shape[0])
print('Columns :',ROUTERLOC.shape[1])
sFileName2=sFileDir + '/' + OutputFileName
ROUTERLOC.to_csv(sFileName2, index = False)
print('### Done!! #')

 

Now save the Python file and execute Retrieve-DE-Billboard-Locations.py with your preferred Python compiler.

 

You will see a file named Retrieve_DE_Billboard_Locations.csv in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python.

Open this file, and you should see a data set like this:

Country Place_Name Latitude Longitude

DE Lake 51.7833 8.5667
DE Horb 48.4333 8.6833
DE Hardenberg 51.1 7.7333
DE Horn-bad Meinberg 51.9833 8.9667

 

Understanding Your Online Visitor Data


Let's retrieve the visitor data for the billboards we have in Germany.

Issue: We have no visitor data in Krennwallner. I suggest that you look at all the available data sources.

Resolved: The data is embedded in Vermeulen's data sources. See IP_DATA_ALL.csv.

 

I have found that if you are careful about conducting your business information gathering, you will have an entry on the data mapping matrix for each data source. This cross use of data sources can be a reliable solution in your data science.

 

Several times in my engagements with customers, I find that common and important information is buried somewhere in the company’s various data sources. I recommend investigating any direct suppliers or consumers’ upstream or downstream data sources attached to the specific business process.

 

That is part of your skills that you are applying to data science. I have personally found numerous insightful fragments of information in the data sources surrounding a customer’s business processes.

 

Now that we know where the visitor data is, I will guide you in loading it. Start your Python editor and create a text file named Retrieve-Online-Visitor.py in the directory .\VKHCG\02-Krennwallner\01-Retrieve.

 

Following is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import gzip as gz
InputFileName='IP_DATA_ALL.csv'
OutputFileName='Retrieve_Online_Visitor'
CompanyIn= '01-Vermeulen'
CompanyOut= '02-Krennwallner'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileName=Base + '/' + CompanyIn + '/00-RawData/' + InputFileName
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False,
    usecols=['Country','Place Name','Latitude','Longitude','First IP Number','Last IP Number'])
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA_ALL.rename(columns={'First IP Number': 'First_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Last IP Number': 'Last_IP_Number'}, inplace=True)
sFileDir=Base + '/' + CompanyOut + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
visitordata = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
visitordata10=visitordata.head(10)
print('Rows :',visitordata.shape[0])
print('Columns :',visitordata.shape[1])
print('Export CSV')
sFileName2=sFileDir + '/' + OutputFileName + '.csv'
visitordata.to_csv(sFileName2, index = False)
print('Store All:',sFileName2)
sFileName3=sFileDir + '/' + OutputFileName + '_10.csv'
visitordata10.to_csv(sFileName3, index = False)
print('Store 10:',sFileName3)
for z in ['gzip', 'bz2', 'xz']:
    if z == 'gzip':
        sFileName4=sFileName2 + '.gz'
    else:
        sFileName4=sFileName2 + '.' + z
    visitordata.to_csv(sFileName4, index = False, compression=z)
    print('Store :',sFileName4)
print('Export JSON')
for sOrient in ['split','records','index', 'columns','values','table']:
    sFileName2=sFileDir + '/' + OutputFileName + '_' + sOrient + '.json'
    visitordata.to_json(sFileName2,orient=sOrient,force_ascii=True)
    print('Store All:',sFileName2)
    sFileName3=sFileDir + '/' + OutputFileName + '_10_' + sOrient + '.json'
    visitordata10.to_json(sFileName3,orient=sOrient,force_ascii=True)
    print('Store 10:',sFileName3)
    sFileName4=sFileName2 + '.gz'
    file_in = open(sFileName2, 'rb')
    file_out = gz.open(sFileName4, 'wb')
    file_out.writelines(file_in)
    file_in.close()
    file_out.close()
    print('Store GZIP All:',sFileName4)
    sFileName5=sFileDir + '/' + OutputFileName + '_' + sOrient + '_UnGZip.json'
    file_in = gz.open(sFileName4, 'rb')
    file_out = open(sFileName5, 'wb')
    file_out.writelines(file_in)
    file_in.close()
    file_out.close()
    print('Store UnGZIP All:',sFileName5)
print('### Done!! #')
Now save the Python file. Execute Retrieve-Online-Visitor.py with your preferred Python compiler. You will see a file named Retrieve_Online_Visitor.csv in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:
Country Place_Name Latitude Longitude First_IP_Number Last_IP_Number
BW Gaborone -24.6464 25.9119 692781056 692781567
BW Gaborone -24.6464 25.9119 692781824 692783103
BW Gaborone -24.6464 25.9119 692909056 692909311
BW Gaborone -24.6464 25.9119 692909568 692910079

I have demonstrated that you can export the same data in various formats of JSON.

Warning Remember: Being JSON-compliant is not sufficient to read the files. The orientation formatting is vital to processing the files. If you use a different layout orientation than the source system’s, the data file will not load.
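To see why the orientation matters when the files are read back, here is a minimal sketch (my own illustration; it assumes the ten-record files generated above already exist in the Krennwallner Python retrieve directory). Each file must be read with the same orient that was used to write it; passing a different orient will fail or deliver a mis-shaped data frame.

# Illustration only: read the JSON files back with their matching orientations.
import pandas as pd
sFileDir='C:/VKHCG/02-Krennwallner/01-Retrieve/01-EDS/02-Python'
df_split   = pd.read_json(sFileDir + '/Retrieve_Online_Visitor_10_split.json',   orient='split')
df_records = pd.read_json(sFileDir + '/Retrieve_Online_Visitor_10_records.json', orient='records')
df_table   = pd.read_json(sFileDir + '/Retrieve_Online_Visitor_10_table.json',   orient='table')
print(df_split.shape, df_records.shape, df_table.shape)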

 

You can also see the following JSON files of only ten records. Open them in a text editor, as follows:

split : uses the format {index -> [index], columns -> [columns], data -> [values]}.
See Retrieve_Online_Visitor_10_split.json.
{
"columns":
"Country","Place_Name","Latitude","Longitude","First_IP_ Number","Last_IP_Number"
],
"index":
[
0,1
],
"data":[
["BW","Gaborone",-24.6464,25.9119,692781056,692781567],
["BW","Gaborone",-24.6464,25.9119,692781824,692783103]
]
}
records : uses the format [{column -> value}, . . ., {column -> value}]. See Retrieve_Online_Visitor_10_records.json.
[
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
}
,
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
]
index : uses the format {index -> {column -> value}}. See Retrieve_Online_Visitor_10_index.json.
{
"0":
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
}
,
"1":
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
}
columns : uses the format {column -> {index -> value}}. See Retrieve_Online_Visitor_10_columns.json.
{
"Country":
{"0":"BW","1":"BW"},
"Place_Name":
{"0":"Gaborone","1":"Gaborone"},
"Latitude":
{"0":-24.6464,"1":-24.6464},
"Longitude":
{"0":25.9119,"1":25.9119},
"First_IP_Number":
{"0":692781056,"1":692781824},
"Last_IP_Number":
{"0":692781567,"1":692783103}
}
values : uses the format of a simple values array. See Retrieve_Online_Visitor_10_values.json.
[
["BW","Gaborone",-24.6464,25.9119,692781056,692781567]
,
["BW","Gaborone",-24.6464,25.9119,692781824,692783103]
]
Warning This JSON format causes you to lose all the metadata, i.e., no column name and data types are passed to the next part of the processing. You will have to remap and rename all the data items at each step in the process.
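A small sketch of the remapping the warning describes (my own illustration, assuming the ten-record values file generated above exists): because the file carries no metadata, you must supply the column names yourself when you read it back, and the data types are re-inferred rather than carried over.

# Illustration only: restore column names after loading a values-oriented file.
import pandas as pd
sFileDir='C:/VKHCG/02-Krennwallner/01-Retrieve/01-EDS/02-Python'
df_values = pd.read_json(sFileDir + '/Retrieve_Online_Visitor_10_values.json', orient='values')
df_values.columns = ['Country','Place_Name','Latitude','Longitude','First_IP_Number','Last_IP_Number']
print(df_values.dtypes)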
table : uses the format {'schema': {schema}, 'data': {data}}. See Retrieve_Online_Visitor_10_table.json.
{"schema":
{
"fields":
[
{"name":"index","type":"integer"}, {"name":"Country","type":"string"}, {"name":"Place_Name","type":"string"}, {"name":"Latitude","type":"number"}, {"name":"Longitude","type":"number"}, {"name":"First_IP_Number","type":"integer"}, {"name":"Last_IP_Number","type":"integer"}
]
,
"primaryKey":["index"],
"pandas_version":"0.20.0"},
"data":
[
{
"index":0,
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
}
,
{
"index":1,
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
]
}

Note Table JSON formatting passes full metadata to the next data process. If you have a choice, I suggest that you always support this JSON format. For visitor data, I suggest we also look at using XML formats for the data retrieving step.

 

Warning XML can have complex structures in real-world applications. The following code can be modified to work on more complex structures, but that is normally a time-consuming process of discovery. I suggest you experiment with Python's xml.etree.ElementTree package, which we use for this task.

 

Following is a simple but effective introduction to XML processing. Start your Python editor and create a text file named Retrieve-Online-Visitor-XML.py in directory .\VKHCG\02-Krennwallner\01-Retrieve.

 

Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import xml.etree.ElementTree as ET
def df2xml(data):
    header = data.columns
    root = ET.Element('root')
    for row in range(data.shape[0]):
        entry = ET.SubElement(root,'entry')
        for index in range(data.shape[1]):
            schild=str(header[index])
            # SubElement creates the child and attaches it to entry in one step.
            child = ET.SubElement(entry, schild)
            if str(data[schild][row]) != 'nan':
                child.text = str(data[schild][row])
            else:
                child.text = 'n/a'
    result = ET.tostring(root)
    return result
def xml2df(xml_data):
    root = ET.XML(xml_data)
    all_records = []
    for i, child in enumerate(root):
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        all_records.append(record)
    return pd.DataFrame(all_records)
InputFileName='IP_DATA_ALL.csv'
OutputFileName='Retrieve_Online_Visitor.xml'
CompanyIn= '01-Vermeulen'
CompanyOut= '02-Krennwallner'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileName=Base + '/' + CompanyIn + '/00-RawData/' + InputFileName
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False)
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA_ALL.rename(columns={'First IP Number': 'First_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Last IP Number': 'Last_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Post Code': 'Post_Code'}, inplace=True)
sFileDir=Base + '/' + CompanyOut + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
visitordata = IP_DATA_ALL.head(10000)
print('Original Subset Data Frame')
print('Rows :',visitordata.shape[0])
print('Columns :',visitordata.shape[1])
print(visitordata)
print('Export XML')
sXML=df2xml(visitordata)
sFileName=sFileDir + '/' + OutputFileName
file_out = open(sFileName, 'wb')
file_out.write(sXML)
file_out.close()
print('Store XML:',sFileName)
xml_data = open(sFileName).read()
unxmlrawdata=xml2df(xml_data)
print('Raw XML Data Frame')
print('Rows :',unxmlrawdata.shape[0])
print('Columns :',unxmlrawdata.shape[1])
print(unxmlrawdata)
unxmldata = unxmlrawdata.drop_duplicates(subset=None, keep='first', inplace=False)
print('Deduplicated XML Data Frame')
print('Rows :',unxmldata.shape[0])
print('Columns :',unxmldata.shape[1])
print(unxmldata)
print('### Done!! #')

Now save the Python file and execute Retrieve-Online-Visitor-XML.py with your preferred Python compiler.

 

You will see a file named Retrieve_Online_Visitor.xml in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python.

This enables you to deliver XML format data as part of the retrieving step.

 

I have assisted you with many formats to process and pass data between steps in your processing pipeline. For the rest of this blog, I am only demonstrating a smaller subset. Remember:

 

The processing methods and techniques can be used on all the data formats. So, let’s see what our Hillman data brings to our knowledge base.

 

Hillman Ltd

The company has four main jobs requiring your attention:

  • Planning the locations of the warehouses: Hillman has countless UK warehouses, but owing to financial hardships, the business wants to shrink the number of warehouses by 20%.

 

  • Planning the shipping rules for best-fit international logistics: At Hillman Global Logistics' expense, the company has shipped goods from its international warehouses to its UK shops. This model is no longer sustainable. The co-owned shops now want more flexibility regarding shipping options.

 

  • Adopting the best packing option for shipping in containers: Hillman has introduced a new three-size-shipping-container solution. It needs a packing solution encompassing the warehouses, shops, and customers.

 

  • Creating a delivery route: Hillman needs to preplan a delivery route for each of its warehouses to shops, to realize a 30% savings in shipping costs.

Plan the locations of the warehouses. I will assist you in retrieving the warehouse locations.

 

Warning Stop! This data is already loaded.

Earlier in the blog, you have created the load, as part of your R processing.

 

The file is named Retrieve_GB_Postcode_Warehouse.csv in directory C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\01-R.

 

I suggest that you update the data mapping matrix on a regular basis. An up-to-date data mapping matrix will save you from having to code a new retrieve solution for data sources you have already loaded.
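This blog does not prescribe a fixed layout for the data mapping matrix, but a minimal sketch of the kind of record I keep per data source (illustrative field names only, populated with files already created in this blog) looks like this:

# Illustration only: a tiny data mapping matrix with one row per retrieved source.
import pandas as pd
data_mapping_matrix = pd.DataFrame([
    {'Company': '03-Hillman',   'SourceFile': 'GB_Postcode_Warehouse.csv', 'SourceFormat': 'CSV',
     'RetrieveFile': 'Retrieve_GB_Postcode_Warehouse.csv', 'RetrieveTool': 'R'},
    {'Company': '01-Vermeulen', 'SourceFile': 'IP_DATA_ALL.csv',           'SourceFormat': 'CSV',
     'RetrieveFile': 'Retrieve_IP_DATA.csv',               'RetrieveTool': 'Python'},
])
print(data_mapping_matrix)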

 

Planning Shipping Rules for Best-Fit International Logistics

I will now guide you through the retrieve process for shipping rules. Yes, you must understand the business to map its business processes into your data discovery.

 

Note I do not expect you to grasp all the rules on the first read. However, as a data scientist, you will have to understand the business fully, to create the data sets that are required.

 

Important Shipping Information


In the world of shipping, there is a short list of important terms you must understand first.

 

Shipping Terms

These determine the rules of the shipment, the conditions under which it is made. Normally, these are stated on the shipping manifest. Currently, Hillman is shipping everything as DDP. Not what we want! I will discuss this shipping term in detail within the next pages.

 

Seller

The person/company sending the products on the shipping manifest is the seller. In our case, there will be warehouses, shops, and customers. Note that this is not a location but a legal entity sending the products.

 

Carrier

The person/company that physically carries the products on the shipping manifest is the carrier. Note that this is not a location but a legal entity transporting the products.

 

Port

A port is any point from which you have to exit or enter a country. Normally, these are shipping ports or airports but can also include border crossings via road. Note that there are two ports in the complete process. This is important. There is a port of exit and a port of entry.

 

Ship

Ship is the general term for the physical transport method used for the goods. This can refer to a cargo ship, airplane, truck, or even person, but it must be identified by a unique allocation number.

 

Terminal

A terminal is a physical point at which the goods are handed off for the next phase of the physical shipping.

 

Named Place

This is the location where the owner is legally changed from seller to buyer. This is a specific location in the overall process. Remember this point, as it causes many legal disputes in the logistics industry.

 

Buyer

The person/company receiving the products on the shipping manifest is the buyer. In our case, there will be warehouses, shops, and customers. Note that this is not a location but a legal entity receiving the products.

 

In general, the complete end-to-end process in a real shipping route consists of several chains of shipping manifests. Yes, you read correctly, there is more than one shipping manifest per shipping route!

 

Incoterm 2010

Let’s first tackle the standard shipping terms. Following is a summary of the basic options, as determined by a standard board and published as Incoterm 2010.

 

Here they are in a simple grid:

Shipping Term  Seller  Carrier  Port       Ship       Port       Terminal   Named Place  Buyer
EXW            Seller  Buyer    Buyer      Buyer      Buyer      Buyer      Buyer        Buyer
FCA            Seller  Seller   Buyer      Buyer      Buyer      Buyer      Buyer        Buyer
CPT            Seller  Seller   Buyer      Buyer      Buyer      Buyer      Buyer        Buyer
CIP            Seller  Seller   Insurance  Insurance  Insurance  Insurance  Insurance    Buyer
DAT            Seller  Seller   Seller     Seller     Seller     Seller     Buyer        Buyer
DAP            Seller  Seller   Seller     Seller     Seller     Seller     Seller       Buyer
DDP            Seller  Seller   Seller     Seller     Seller     Seller     Seller       Seller

If right now you feel the same as I did the first time I saw this grid—lost, lonely, in need of coffee—don’t close this blog. I will quickly guide you from beginning to end. The first item of discussion is the shipping terms.

 

EXW—Ex Works (Named Place of Delivery)


By this term, the seller makes the goods available at its premises or at another named place. This term places the maximum obligation on the buyer and minimum obligations on the seller.

 

Here is the data science version: If I were to buy Practical Data Science at a local bookshop and take it home, and the shop has shipped it EXW (Ex Works), the moment I pay at the register, the ownership is transferred to me. If anything happens to the book, I would have to pay to replace it.

 

So, let's see what the data science reveals, using our sample company. Start your Python editor and create a text file named Retrieve-Incoterm-EXW.py in the directory .\VKHCG\03-Hillman\01-Retrieve. Following is the Python code that you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='EXW'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

Now save the Python file and execute the Retrieve-Incoterm-EXW.py file with your preferred Python compiler.

 

You will see a file named Retrieve_Incoterm_EXW_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

EXW Seller Buyer Buyer Buyer Buyer Buyer Buyer Buyer

 

Note The ownership changes prematurely in the supply chain. All the costs and risks are with the buyer.

FCA—Free Carrier (Named Place of Delivery)

Under this condition, the seller delivers the goods, cleared for export, at a named place.

 

Following is the data science version. If I were to buy Practical Data Science at an overseas duty-free shop, to be picked up at the duty-free desk before taking it home, and the shop shipped it FCA (Free Carrier) to the duty-free desk, then the moment I pay at the register, the ownership is transferred to me; but if anything happens to the book between the shop and the duty-free desk, the shop will have to pay.

 

It is only once I pick it up at the desk that I will have to pay if anything happens. So, the moment I take the book, the transaction becomes EXW, and I have to pay any necessary import duties on arrival in my home country.

 

Let's see what the data science finds. Start your Python editor and create a text file named Retrieve-Incoterm-FCA.py in directory .\VKHCG\03-Hillman\01-Retrieve. Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='FCA'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

Now save the Python file and execute the Retrieve-Incoterm-FCA.py file with your preferred Python compiler.

 

You will see a file named Retrieve_Incoterm_FCA_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

FCA Seller Seller Buyer Buyer Buyer Buyer Buyer Buyer

 

Note The ownership changes prematurely, at the port in the supply chain. Most of the risks and costs are again with the buyer.

CPT—Carriage Paid To (Named Place of Destination)

 

The seller, under this term, pays for the carriage of the goods up to the named place of destination. However, the goods are considered to be delivered when they have been handed over to the first carrier so that the risk transfers to the buyer upon handing the goods over to the carrier at the place of shipment in the country of export.

 

The seller is responsible for origin costs, including export clearance and freight costs for carriage to the named place of destination. (This is either the final destination, such as the buyer’s facilities, or a port of destination. This must be agreed upon by both seller and buyer, however.)

 

Now, here is the data science version:

If I were to buy Practical Data Science at an overseas bookshop and then pick it up at the export desk before taking it home and the shop shipped it CPT—Carriage Paid To—the duty desk for free, the moment I pay at the register, the ownership is transferred to me.

 

But if anything happens to the book between the shop and the duty desk of the shop, I will have to pay.

 

It is only once I have picked up the book at the desk that I have to pay if anything happens. So, the moment I take the book, the transaction becomes EXW, so I must pay any required export and import duties on arrival in my home country.

 

Let’s see what the data science finds.

Start your Python editor and create a text file named Retrieve-Incoterm-CPT.py in directory .\VKHCG\03-Hillman\01-Retrieve.

Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='CPT'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

 

Now save the Python file and execute the Retrieve-Incoterm-CPT.py file with your preferred Python compiler.



You will see a file named Retrieve_Incoterm_CPT_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

CPT Seller Seller Buyer Buyer Buyer Buyer Buyer Buyer

 

Note The ownership changes prematurely, at the port in the supply chain. Most of the risks and cost are with the buyer. This is bad if we are buying but good if we are selling.

 

CIP—Carriage and Insurance Paid To (Named Place of Destination)


This term is generally similar to the preceding CPT, with the exception that the seller is required to obtain insurance for the goods while in transit. Following is the data science version.

 

If I buy Practical Data Science in an overseas bookshop and then pick it up at the export desk before taking it home, and the shop has shipped it CIP (Carriage and Insurance Paid To) to the duty desk for free, the moment I pay at the register, the ownership is transferred to me.

 

However, if anything happens to the book between the shop and the duty desk at the shop, I have to take out insurance to pay for the damage.

 

It is only once I have picked it up at the desk that I have to pay if anything happens. So, the moment I take the book, the transaction becomes EXW, so I have to pay any export and import duties on arrival in my home country. Note that insurance only covers the portion of the transaction between the shop and the duty desk.

 

Let’s see what the data science finds. Start your Python editor and create a text file named

 

Retrieve-Incoterm-CIP.py in directory .\VKHCG\03-Hillman\01-Retrieve. Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='CIP'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

 

Now save the Python file and execute the Retrieve-Incoterm-CIP.py file with your preferred Python compiler.

You will see a file named Retrieve_Incoterm_CIP_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

CIP Seller Seller Insurance Insurance Insurance Insurance Insurance Buyer

 

Note The ownership changes prematurely, at the port in the supply chain. Most of the risks and cost are with the buyer, with some of it covered by insurance. The risk is lower, compared to that calculated for a house or car insurance. This is not so bad if you are buying but not so good if you are selling!

 

DAT—Delivered at Terminal (Named Terminal at Port or Place of Destination)


This Incoterm requires that the seller deliver the goods, unloaded, at the named terminal. The seller covers all the costs of transport (export fees, carriage, unloading from the main carrier at the destination port, and destination port charges) and assumes all risks until arrival at the destination port or terminal.

 

The terminal can be a port, airport, or inland freight interchange, but it must be a facility with the capability to receive the shipment.

If the seller is not able to organize unloading, it should consider shipping under DAP terms instead. All charges after unloading (for example, import duty, taxes, customs, and on-carriage costs) are to be borne by the buyer.

 

Following is the data science version. If I were to buy Practical Data Science overseas and then pick it up at a local bookshop before taking it home, and the overseas shop shipped it—Delivered At Terminal (Local Shop)—the moment I pay at the register, the ownership is transferred to me.

 

However, if anything happens to the book between the payment and the pickup, the local shop pays. It is only once I have picked it up at the local shop that I have to pay if anything happens.

 

So, the moment I take it, the transaction becomes EXW, so I have to pay any import duties on arrival at my home.

 

Let's see what the data science finds. Start your Python editor and create a text file named Retrieve-Incoterm-DAT.py in directory .\VKHCG\03-Hillman\01-Retrieve. Following is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='DAT'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

Now save the Python file and execute the Retrieve-Incoterm-DAT.py file with your preferred Python compiler.

 

You will see a file named Retrieve_Incoterm_DAT_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

DAT Seller Seller Seller Seller Seller Seller Buyer Buyer

 

Note Now the ownership changes later, at the named place in the supply chain. Most of the risk and cost are borne by the seller. This is positive if you are buying, less so if you are selling.

 

DAP—Delivered at Place (Named Place of Destination)


According to Incoterm 2010’s definition, DAP—Delivered at Place—means that, at the disposal of the buyer, the seller delivers when the goods are placed on the arriving means of transport, ready for unloading at the named place of destination. Under DAP terms, the risk passes from seller to buyer from the point of destination mentioned in the contract of delivery.

 

Once goods are ready for shipment, the necessary packing is carried out by the seller at his own cost, so that the goods reach their final destination safely. All necessary legal formalities in the exporting country are completed by the seller at his own cost and risk to clear the goods for export.

 

After the arrival of the goods in the country of destination, the customs clearance in the importing country must be completed by the buyer at his own cost and risk, including all customs duties and taxes. However, as with DAT terms, any delay or demurrage charges are to be borne by the seller.

 

Under DAP terms, all carriage expenses with any terminal expenses are paid by the seller, up to the agreed destination point. The required unloading cost at the final destination has to be accepted by the buyer under DAP terms.

 

Here is the data science version. If I were to buy 100 copies of Practical Data Science from an overseas website and then pick up the copies at a local bookshop before taking them home, and the shop shipped the copies DAP—Delivered At Place (Local Shop)— the moment I paid at the register, the ownership would be transferred to me.

 

However, if anything happened to the books between the payment and the pickup, the website owner pays. Once the copies are picked up at the local shop, I have to pay to unpack them at the bookshop. So, the moment I take the copies, the transaction becomes EXW, and I will have to pay any costs after I take the copies.

 

Let's see what the data science finds. Start your Python editor and create a text file named Retrieve-Incoterm-DAP.py in directory .\VKHCG\03-Hillman\01-Retrieve. Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='DAP'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

Now save the Python file.

Execute the Retrieve-Incoterm-DAP.py file with your preferred Python compiler.

You will see a file named Retrieve_Incoterm_DAP_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

DAP Seller Seller Seller Seller Seller Seller Seller Buyer

 

Note Now, the ownership changes later, at the buyer’s place in the supply chain. All of the risks and cost are with the seller. This is very good if you are buying but really bad if you are selling. You would be responsible for any mishaps with the shipping. The only portion you would not need to cover is offloading.

 

DDP—Delivered Duty Paid (Named Place of Destination)


By this term, the seller is responsible for delivering the goods to the named place in the country of the buyer and pays all costs in bringing the goods to the destination, including import duties and taxes. The seller is not responsible for unloading.

 

This term places the maximum obligations on the seller and minimum obligations on the buyer. No risk or responsibility is transferred to the buyer until delivery of the goods at the named place of destination.

 

The most important consideration regarding DDP terms is that the seller is responsible for clearing the goods through customs in the buyer’s country, including both paying the duties and taxes and obtaining the necessary authorizations and registrations from the authorities in that country.

 

Here is the data science version. If I were to buy 100 copies of Practical Data Science on an overseas website and then pick them up at a local bookshop before taking them home, and the shop shipped them DDP (Delivered Duty Paid) to my home, the moment I pay at the till, the ownership is transferred to me.

 

However, if anything were to happen to the books between the payment and the delivery at my house, the bookshop must replace the books as the term covers the delivery to my house.

 

Let's see what the data science finds. Start your Python editor and create a text file named Retrieve-Incoterm-DDP.py in directory .\VKHCG\03-Hillman\01-Retrieve. Following is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
IncoTerm='DDP'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Import Incoterms
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! #')

Now save the Python file and execute the Retrieve-Incoterm-DDP.py file with your preferred Python compiler.

You will see a file named Retrieve_Incoterm_DDP_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

 

Shipping Term Seller Carrier Port Ship Port Terminal Named Place Buyer

DDP Seller Seller Seller Seller Seller Seller Seller Seller

Note Now, the ownership changes later, at the buyer’s place in the supply chain. All risks and cost are with the seller. This is very good if you are buying but really bad if you are selling. You would be responsible for any mishaps with the shipping.

 

Shipping Chains

The code works between any two points on a shipping route. These chains can contain numerous different Incoterm options, depending on the agreements between buyer and seller.

Warning There can be several shipping chains. Numerous combinations and permutations are legally conceivable.
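To get a feel for how quickly these combinations grow, here is a minimal sketch (my own illustration, not one of the Hillman scripts) that enumerates every Incoterm assignment along a hypothetical three-leg chain. The leg names and the short Incoterm list are assumptions for illustration only.

import itertools

# Hypothetical shipping legs and a small set of Incoterms allowed on each (illustrative only)
legs = ['Warehouse-to-Warehouse', 'Warehouse-to-Shop', 'Shop-to-Customer']
incoterms = ['EXW', 'CIP', 'DDP']

# Every conceivable assignment of one Incoterm per leg
chains = list(itertools.product(incoterms, repeat=len(legs)))
print('Possible chains:', len(chains))  # 3 ** 3 = 27
for chain in chains[:5]:
    print(dict(zip(legs, chain)))

Even with three legs and three Incoterms, you already get 27 chains, which is why the data science, rather than a person, should evaluate them.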

 

Call an Expert

Call an Expert

So, here is my important proven advice when dealing with a complex business process: find an expert in the field. It is essential to get experts with domain knowledge as involved in the process as you can.

 

In the following scenario, I will assume the role of the expert Mr. Maximillian Logistics. Max suggests that you

  • Stop shipping DDP everywhere.
  • Ship CIP between warehouses only.
  • Ship CIP or DDP between warehouses and shops.
  • Ship CIP or EXW between shops.
  • Ship EXW to customers.

 

Max confirms Hillman's current policy and states the following (a short costing sketch appears after this list):

  • Shipping costs are £10 per kilometer.
  • Insurance costs are £5 per kilometer.
  • Import and export duties are a flat rate of £2 per item.
  • You can only ship to a shop from the closest three warehouses.
  • You can only ship between in-country shops, i.e., shops in the same country.
  • You ship to a customer only from the closest in-country shop.
  • You cannot ship to a customer farther than 20 kilometers away.
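To make these rates concrete, here is a minimal costing sketch, assuming the rates above and a hypothetical 12-kilometer, 40-item shipment. The function name, the inputs, and the cross-border flag are my own illustrations, not part of the Hillman scripts.

def shipping_cost(distance_km, items, cross_border=True):
    """Apply Max's stated rates to a single shipment (illustrative only)."""
    shipping = 10 * distance_km                 # £10 per kilometer
    insurance = 5 * distance_km                 # £5 per kilometer
    duties = 2 * items if cross_border else 0   # flat £2 per item
    return shipping + insurance + duties

# Hypothetical shipment: 12 km, 40 items, crossing a border
print(shipping_cost(12, 40))  # (10 + 5) * 12 + 2 * 40 = 260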

 

What does this mean to you as a data scientist at the retrieve superstep level?

• You must know the locations of the warehouses. Earlier in the blog, you created the load as part of your R processing.

This is the file named Retrieve_GB_Postcode_Warehouse.csv in directory ..\VKHCG\03-Hillman\01-Retrieve\01-EDS\01-R. So, that data is ready.

 

• You must know the locations of the UK shops. Earlier in the blog, you created the load as part of your R processing.

 

The file is named Retrieve_GB_Postcodes_Shops.csv in directory ..\VKHCG\03-Hillman\01-Retrieve\01-EDS\01-R. So that data is ready.

 

• You must know where the customers are located. Earlier in the blog, you created the load as part of your R processing.

 

This is the file named Retrieve_IP_DATA.csv in directory ..\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\01-R. So the required data is ready.

 

Possible Shipping Routes

There are numerous potential shipping routes available to the company. The Retrieve superstep can generate the potential set by using a route combination generator.

 

This will give you a set of routes, but it is highly unlikely that you will ship along all of them. It is simply a population of routes that can be used by the data science to find the optimum solution.

 

Start your Python editor and create a text file named Retrieve-Warehouse-Incoterm-Chains.py in directory .\VKHCG\03-Hillman\01-Retrieve. Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
from math import radians, cos, sin, asin, sqrt

In navigation, the haversine formula determines the great-circle distance between two points on a sphere, given their longitudes and latitudes. Here is my implementation:

def haversine(lon1, lat1, lon2, lat2, stype):
    """
    Calculate the great-circle distance between two given GPS points
    on the surface of the earth (specified in decimal degrees).
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    if stype == 'km':
        r = 6371  # Radius of earth in kilometers
    else:
        r = 3956  # Radius of earth in miles
    d = round(c * r, 3)
    return d
Now that you have the formula, I suggest you load the following data:
InputFileName='GB_Postcode_Warehouse.csv'
OutputFileName='Retrieve_Incoterm_Chain_GB_Warehouse.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)

We need the warehouse locations, as they determine from where to where you should ship products.

### Import Warehouses
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False)

 

You now select only the valid warehouses, that is, those with known locations. Remember the discussion about handling missing data? In this case, we are simply removing the unknown data.

WarehouseGood=Warehouse[Warehouse.latitude != 0]

Your warehouse is named WH- plus the post code. This is a standard pattern, so impose it on the data, as follows:

WarehouseGood['Warehouse_Name']=WarehouseGood.apply(lambda row:
    'WH-' + row['postcode']
    ,axis=1)

You have irrelevant data in the data set. Remove it now, with the following code:

WarehouseGood.drop('id', axis=1, inplace=True)
WarehouseGood.drop('postcode', axis=1, inplace=True)

 

You will now limit the warehouses to 100. This is purely to speed up the progress of the example. You can experiment with different numbers of warehouses, to see the impact of the selection.

WarehouseFrom=WarehouseGood.head(100)

You will now construct two data sets to model the shipping between warehouses. Any warehouse can ship to any other warehouse.

for i in range(WarehouseFrom.shape[1]):
    oldColumn=WarehouseFrom.columns[i]
    newColumn=oldColumn + '_from'
    WarehouseFrom.rename(columns={oldColumn: newColumn}, inplace=True)
WarehouseFrom.insert(3,'Keys', 1)

WarehouseTo=WarehouseGood.head(100)
for i in range(WarehouseTo.shape[1]):
    oldColumn=WarehouseTo.columns[i]
    newColumn=oldColumn + '_to'
    WarehouseTo.rename(columns={oldColumn: newColumn}, inplace=True)
WarehouseTo.insert(3,'Keys', 1)

WarehouseCross=pd.merge(right=WarehouseFrom,
                        left=WarehouseTo,
                        how='outer',
                        on='Keys')

 

Congratulations! You have just successfully built a cross-join matrix for the intra-warehouse shipping. You should now remove the excess key you used for the join.

WarehouseCross.drop('Keys', axis=1, inplace=True) 

 

For these shipments, your Incoterm is DDP. Can you still remember what DDP shipping means? It means that the seller’s warehouse pays everything.

 

WarehouseCross.insert(0,'Incoterm', 'DDP')

You will now use the distance formula to calculate a distance in kilometers between the two warehouses involved in the shipping.

WarehouseCross['DistanceBetweenKilometers'] = WarehouseCross.apply(lambda row:
    haversine(row['longitude_from'], row['latitude_from'],
              row['longitude_to'], row['latitude_to'], 'km')
    ,axis=1)

 

You must also know the distance in miles. Having both distances allows the option of using either distance, depending on the cost models. Always supply more features, as this enriches your results.

WarehouseCross['DistanceBetweenMiles'] = WarehouseCross.apply(lambda row:
    haversine(row['longitude_from'], row['latitude_from'],
              row['longitude_to'], row['latitude_to'], 'miles')
    ,axis=1)

Clean up the now-redundant fields in the model:
WarehouseCross.drop('longitude_from', axis=1, inplace=True)
WarehouseCross.drop('latitude_from', axis=1, inplace=True)
WarehouseCross.drop('longitude_to', axis=1, inplace=True)
WarehouseCross.drop('latitude_to', axis=1, inplace=True)

 

On a validation check, we found that the model includes same-warehouse shipments (a warehouse shipping to itself), which do not occur in the real-world ecosystem. So, remove them from your simulated model.

WarehouseCrossClean=WarehouseCross[WarehouseCross.DistanceBetweenKilometers !=0]

 

Congratulations! You have successfully built a model for the shipping of products. Now, you can run some quick high-level checks on the model and save it to disk.

print('###########')
print('Rows :',WarehouseCrossClean.shape[0])
print('Columns :',WarehouseCrossClean.shape[1])
print('###########')
sFileName=sFileDir + '/' + OutputFileName
WarehouseCrossClean.to_csv(sFileName, index = False)
print('### Done!! #')

 

Now save the Python file.

Execute the Retrieve-Warehouse-Incoterm-Chains.py file with your preferred Python interpreter. You have just run your model end-to-end.

Well done! You will see a file named Retrieve_Incoterm_Chain_GB_Warehouse.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set similar to the following:

Incoterm  Warehouse_Name_to  Warehouse_Name_from  DistanceBetweenKilometers  DistanceBetweenMiles
DDP       WH-AB10            WH-AB11              1.644                      1.021
DDP       WH-AB10            WH-AB12              3.818                      2.371
DDP       WH-AB10            WH-AB13              7.872                      4.888
DDP       WH-AB10            WH-AB14              10.02                      6.222

I will point out that the current data still ships the product as DDP. You will use your new data science skills to resolve this, as I guide you through the blog.

 

Adopt New Shipping Containers

Adopting the best packing option for shipping in containers requires that I introduce a new concept. Container shipping is based on reducing the packaging you use to an optimum set of sizes that meet the following requirements:

 

  • The product must fit within the box, i.e., inside a rectangular packing box.
  • The product can be secured using packing foam, which will fill any void volume in the packaging.
  • Packaging must fit in shipping containers with zero space gaps.
  • Containers can only hold product that is shipped to a single warehouse, shop, or customer.

 

Note As this is a new concept for Hillman, you will have to generate data for use by your data science.

 

So, I suggest a well-designed set of generated data as an acceptable stand-in for real data, to build the model as you expect it to work in the future. However, once you start collecting the results of the business process, you should always swap the generated data for real production data.

 

I will now guide you to formulate a set of seed data for your data science. Start your Python editor and create a text file named Retrieve-Container-Plan.py in directory .\VKHCG\03-Hillman\01-Retrieve. Here is the Python code you must copy into the file.

 

Note Please follow the subsequent code entry. I will discuss important parts of the coding, as required.

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd

ContainerFileName='Retrieve_Container.csv'
BoxFileName='Retrieve_Box.csv'
ProductFileName='Retrieve_Product.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
### Create the Containers
The business requires a model that simulates shipping containers with dimensions from 1 meter × 1 meter × 1 meter up to 20 meters × 9 meters × 5 meters. So, let's simulate these.
containerLength=range(1,21)
containerWidth=range(1,10)
containerHeigth=range(1,6)
containerStep=1
c=0
for l in containerLength:
    for w in containerWidth:
        for h in containerHeigth:
            containerVolume=(l/containerStep)*(w/containerStep)*(h/containerStep)
            c=c+1
            ContainerLine=[('ShipType', ['Container']),
                           ('UnitNumber', ('C'+format(c,"06d"))),
                           ('Length',(format(round(l,3),".4f"))),
                           ('Width',(format(round(w,3),".4f"))),
                           ('Height',(format(round(h,3),".4f"))),
                           ('ContainerVolume',(format(round(containerVolume,6),".6f")))]
            if c==1:
                ContainerFrame = pd.DataFrame.from_items(ContainerLine)
            else:
                ContainerRow = pd.DataFrame.from_items(ContainerLine)
                ContainerFrame = ContainerFrame.append(ContainerRow)
ContainerFrame.index.name = 'IDNumber'
print('################')
print('## Container')
print('################')
print('Rows :',ContainerFrame.shape[0])
print('Columns :',ContainerFrame.shape[1])
print('################')
sFileContainerName=sFileDir + '/' + ContainerFileName
ContainerFrame.to_csv(sFileContainerName, index = False)

Well done! You have just completed a simulated model, based on the requirement of a future system. So, when the “real” containers arrive, you simply replace this simulated dataset with the “real” data and none of the downstream data science needs changing.

 

Your second simulation covers the cardboard boxes for packing the products. The requirement is for boxes from 10 centimeters × 10 centimeters × 10 centimeters up to 2 meters × 2 meters × 2 meters. You can also add packing foam (zero to five thickness steps in the code below) to secure any product in the box.

 

You can simulate these requirements as follows:

## Create valid Boxes with packing foam
boxLength=range(1,21)
boxWidth=range(1,21)
boxHeigth=range(1,21)
packThick=range(0,6)
boxStep=10
b=0
for l in boxLength:
    for w in boxWidth:
        for h in boxHeigth:
            for t in packThick:
                boxVolume=round((l/boxStep)*(w/boxStep)*(h/boxStep),6)
                productVolume=round(((l-t)/boxStep)*((w-t)/boxStep)*((h-t)/boxStep),6)
                if productVolume > 0:
                    b=b+1
                    BoxLine=[('ShipType', ['Box']),
                             ('UnitNumber', ('B'+format(b,"06d"))),
                             ('Length',(format(round(l/10,6),".6f"))),
                             ('Width',(format(round(w/10,6),".6f"))),
                             ('Height',(format(round(h/10,6),".6f"))),
                             ('Thickness',(format(round(t/5,6),".6f"))),
                             ('BoxVolume',(format(round(boxVolume,9),".9f"))),
                             ('ProductVolume',(format(round(productVolume,9),".9f")))]
                    if b==1:
                        BoxFrame = pd.DataFrame.from_items(BoxLine)
                    else:
                        BoxRow = pd.DataFrame.from_items(BoxLine)
                        BoxFrame = BoxFrame.append(BoxRow)
BoxFrame.index.name = 'IDNumber'
print('#################')
print('## Box')
print('#################')
print('Rows :',BoxFrame.shape[0])
print('Columns :',BoxFrame.shape[1])
print('#################')
sFileBoxName=sFileDir + '/' + BoxFileName
BoxFrame.to_csv(sFileBoxName, index = False)
Two simulations are done, and you are making great progress. Now, your third model is for the products. The requirement is for products from 10 centimeters × 10 centimeters × 10 centimeters up to 2 meters × 2 meters × 2 meters. You can build this as follows:
## Create valid Product
productLength=range(1,21)
productWidth=range(1,21)
productHeigth=range(1,21)
productStep=10
p=0
for l in productLength:
    for w in productWidth:
        for h in productHeigth:
            productVolume=round((l/productStep)*(w/productStep)*(h/productStep),6)
            if productVolume > 0:
                p=p+1
                ProductLine=[('ShipType', ['Product']),
                             ('UnitNumber', ('P'+format(p,"06d"))),
                             ('Length',(format(round(l/10,6),".6f"))),
                             ('Width',(format(round(w/10,6),".6f"))),
                             ('Height',(format(round(h/10,6),".6f"))),
                             ('ProductVolume',(format(round(productVolume,9),".9f")))]
                if p==1:
                    ProductFrame = pd.DataFrame.from_items(ProductLine)
                else:
                    ProductRow = pd.DataFrame.from_items(ProductLine)
                    ProductFrame = ProductFrame.append(ProductRow)
ProductFrame.index.name = 'IDNumber'
print('#################')
print('## Product')
print('#################')
print('Rows :',ProductFrame.shape[0])
print('Columns :',ProductFrame.shape[1])
print('#################')
sFileProductName=sFileDir + '/' + ProductFileName
ProductFrame.to_csv(sFileProductName, index = False)
print('### Done!! #')

Now save the Python file. Execute the Retrieve-Container-Plan.py file with your preferred Python interpreter.

 

You have just simulated a complex packing solution from simple requirements, and you are ready to apply data science to the model. You will see a container data file in a format similar to the following:

 

Retrieve_Container.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:

 

ShipType   UnitNumber  Length  Width  Height  ContainerVolume
Container  C000001     1       1      1       1
Container  C000002     1       1      2       2
Container  C000003     1       1      3       3

You will see a box packing file in a format similar to Retrieve_Box.csv, in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python.

 

ShipType  UnitNumber  Length  Width  Height  Thickness  BoxVolume  ProductVolume
Box       B000001     0.1     0.1    0.1     0          0.001      0.001
Box       B000002     0.1     0.1    0.2     0          0.002      0.002
Box       B000003     0.1     0.1    0.3     0          0.003      0.003
You will see a product file similar in format to Retrieve_Product.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python.

ShipType  UnitNumber  Length  Width  Height  ProductVolume
Product   P034485     0.1     0.1    0.1     0.001
Product   P034485     0.1     0.1    0.2     0.002
Product   P034485     0.1     0.1    0.3     0.003

I have now successfully assisted you in creating a data set with all the supported packing seed data for your data science. Now, I will point out that if you change any of the following parameters in the code

containerLength=range(1,21)
containerWidth=range(1,10)
containerHeigth=range(1,6)
containerStep=1
boxLength=range(1,21)
boxWidth=range(1,21)
boxHeigth=range(1,21)
packThick=range(0,6)
boxStep=10
productLength=range(1,21)
productWidth=range(1,21)
productHeigth=range(1,21)
productStep=10

you will generate a different packing solution that yields interesting results during the data science investigations. I will leave that for you to experiment with, once you are comfortable with your new skills. Hillman now has data on its containers. Well done.

 

Create a Delivery Route

Create a Delivery Route

I will guide you through the process of creating a delivery route for a customer. Doing so requires that I lead you through a remarkable model that I discovered for shipping items.

 

I want to introduce you to the useful Python package named geopy. This package is the Python Geocoding Toolbox. It supports most location-related queries.
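As a quick taste of the package, here is a minimal sketch that measures the distance between two illustrative GPS points (approximately Aberdeen and Edinburgh). Note that newer geopy releases expose the distance function as geodesic, while older releases, as used in the script below, expose vincenty; the coordinates are assumptions for illustration only.

from geopy.distance import geodesic  # older geopy versions use vincenty instead

# Two illustrative GPS points (approximate city centres)
aberdeen = (57.1497, -2.0943)
edinburgh = (55.9533, -3.1883)

d = geodesic(aberdeen, edinburgh)
print(round(d.kilometers, 3), 'km')
print(round(d.miles, 3), 'miles')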

 

The model enables you to generate a complex routing plan for the shipping routes of the company.

 

Start your Python editor and create a text file named Retrieve-Route-Plan.py in directory .\VKHCG\03-Hillman\01-Retrieve. Following is the Python code you must copy into the file:
# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
from geopy.distance import vincenty

InputFileName='GB_Postcode_Warehouse.csv'
OutputFileName='Retrieve_GB_Warehouse.csv'
Company='03-Hillman'
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False)
WarehouseClean=Warehouse[Warehouse.latitude != 0]
WarehouseGood=WarehouseClean[WarehouseClean.longitude != 0]
WarehouseGood.drop_duplicates(subset='postcode', keep='first', inplace=True)
WarehouseGood.sort_values(by='postcode', ascending=1)
sFileName=sFileDir + '/' + OutputFileName
WarehouseGood.to_csv(sFileName, index = False)
WarehouseLoop = WarehouseGood.head(20)
for i in range(0,WarehouseLoop.shape[0]):
    print('Run :',i,' =======>>>>>>>>>>',WarehouseLoop['postcode'][i])
    WarehouseHold = WarehouseGood.head(10000)
    WarehouseHold['Transaction']=WarehouseHold.apply(lambda row:
        'WH-to-WH'
        ,axis=1)
    OutputLoopName='Retrieve_Route_' + 'WH-' + WarehouseLoop['postcode'][i] + '_Route.csv'
    WarehouseHold['Seller']=WarehouseHold.apply(lambda row:
        'WH-' + WarehouseLoop['postcode'][i]
        ,axis=1)
    WarehouseHold['Seller_Latitude']=WarehouseHold.apply(lambda row:
        WarehouseLoop['latitude'][i]
        ,axis=1)
    WarehouseHold['Seller_Longitude']=WarehouseHold.apply(lambda row:
        WarehouseLoop['longitude'][i]
        ,axis=1)
    WarehouseHold['Buyer']=WarehouseHold.apply(lambda row:
        'WH-' + row['postcode']
        ,axis=1)
    WarehouseHold['Buyer_Latitude']=WarehouseHold.apply(lambda row:
        row['latitude']
        ,axis=1)
    WarehouseHold['Buyer_Longitude']=WarehouseHold.apply(lambda row:
        row['longitude']
        ,axis=1)
    WarehouseHold['Distance']=WarehouseHold.apply(lambda row:
        round(vincenty((WarehouseLoop['latitude'][i],WarehouseLoop['longitude'][i]),
                       (row['latitude'],row['longitude'])).miles,6)
        ,axis=1)
    WarehouseHold.drop('id', axis=1, inplace=True)
    WarehouseHold.drop('postcode', axis=1, inplace=True)
    WarehouseHold.drop('latitude', axis=1, inplace=True)
    WarehouseHold.drop('longitude', axis=1, inplace=True)
    sFileLoopName=sFileDir + '/' + OutputLoopName
    WarehouseHold.to_csv(sFileLoopName, index = False)
print('### Done!! #')
Now save the Python file. Execute the Retrieve-Route-Plan.py file with your preferred Python interpreter.
You will see a collection of files similar in format to Retrieve_Route_WH-AB10_Route.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file, and you should see a data set like this:
Transaction  Seller   Seller_Latitude  Seller_Longitude  Buyer    Buyer_Latitude  Buyer_Longitude  Distance
WH-to-WH     WH-AB10  57.13514         -2.11731          WH-AB10  57.13514        -2.11731         0
WH-to-WH     WH-AB10  57.13514         -2.11731          WH-AB11  57.13875        -2.09089         1.024914
WH-to-WH     WH-AB10  57.13514         -2.11731          WH-AB12  57.101          -2.1106          2.375854
WH-to-WH     WH-AB10  57.13514         -2.11731          WH-AB13  57.10801        -2.23776         4.906919

The preceding code enables you to produce numerous routing plans for the business.

 

For example, changing the WarehouseLoop = WarehouseGood.head(20) to WarehouseGood.head(200) will produce 200 new route plans. 

 

The code will produce a route map for every warehouse, shop, or customer you load.

 

Global Post Codes

I will guide you through a global postcode dataset that will form the base for most of the non-company address and shipping requirements.

 

Tip Normally, I find the best postcode information at the local post office or shipping agent. Remember what I said about experts. The Post Office experts ship items every minute of the day. They are the people to contact for postcode information.

Warning You must perform some data science in your R environment.

Note If you have the code ready to perform a task, use it!

 

Now, open your RStudio and use R to process the following R script: Retrieve-Postcode-Global.r.

library(readr)
All_Countries <- read_delim("C:/VKHCG/03-Hillman/00-RawData/All_Countries.txt",
  "\t", col_names = FALSE,
  col_types = cols(
    X12 = col_skip(),
    X6 = col_skip(),
    X7 = col_skip(),
    X8 = col_skip(),
    X9 = col_skip()),
  na = "null", trim_ws = TRUE)
write.csv(All_Countries,
  file = "C:/VKHCG/03-Hillman/01-Retrieve/01-EDS/01-R/Retrieve_All_Countries.csv")

You have just successfully created a new file named Retrieve_All_Countries.csv. That was the last of the Hillman retrieve tasks.

Clark Ltd

Clark is the financial powerhouse of the group. It must process all the money-related data sources.

Forex

The first financial duty of the company is to perform any foreign exchange trading.

Forex Base Data

Previously, you found a single data source (Euro_ExchangeRates.csv) for forex rates in Clark. Earlier in the blog, I helped you to create the load, as part of your R processing.

 

The relevant file is Retrieve_Euro_ExchangeRates.csv in directory C:\VKHCG\04-Clark\01-Retrieve\01-EDS\01-R. So, that data is ready.

 

Financials

Clark generates the financial statements for all the group’s companies.

 

Financial Base Data

You found a single data source (Profit_And_Loss.csv) in Clark for financials and, as mentioned previously, a single data source (Euro_ExchangeRates.csv) for forex rates.

Earlier in the blog, I helped you to create the load, as part of your R processing.

The relevant file is Retrieve_Profit_And_Loss.csv in directory C:\VKHCG\04-Clark\01-Retrieve\01-EDS\01-R.

 

Person Base Data

Now, I will explain how to extract personal information from three compressed data structures and then join the separate files together to form the base for personal information.

 

Start your Python editor and create a text file named Retrieve-PersonData.py in directory .\VKHCG\04-Clark\01-Retrieve. Following is the Python code you must copy into the file:

 

# -*- coding: utf-8 -*-
import sys
import os
import shutil
import zipfile
import pandas as pd

if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='04-Clark'
ZIPFiles=['Data_female-names','Data_male-names','Data_last-names']
for ZIPFile in ZIPFiles:
    InputZIPFile=Base+'/'+Company+'/00-RawData/' + ZIPFile + '.zip'
    OutputDir=Base+'/'+Company+'/01-Retrieve/01-EDS/02-Python/' + ZIPFile
    OutputFile=Base+'/'+Company+'/01-Retrieve/01-EDS/02-Python/Retrieve-'+ZIPFile+'.csv'
    zip_file = zipfile.ZipFile(InputZIPFile, 'r')
    zip_file.extractall(OutputDir)
    zip_file.close()
    t=0
    for dirname, dirnames, filenames in os.walk(OutputDir):
        for filename in filenames:
            sCSVFile = dirname + '/' + filename
            t=t+1
            if t==1:
                NameRawData=pd.read_csv(sCSVFile,header=None,low_memory=False)
                NameData=NameRawData
            else:
                NameRawData=pd.read_csv(sCSVFile,header=None,low_memory=False)
                NameData=NameData.append(NameRawData)
    NameData.rename(columns={0 : 'Name'},inplace=True)
    NameData.to_csv(OutputFile, index = False)
    shutil.rmtree(OutputDir)
    print('Process: ',InputZIPFile)
print('### Done!! #')

Now save the Python file and execute the Retrieve-PersonData.py file with your preferred Python interpreter.

 

This generates three files named Retrieve-Data_female-names.csv, Retrieve-Data_male-names.csv, and Retrieve-Data_last-names.csv. You have now completed all the retrieve tasks. Clark's retrieve data is ready.

 

Connecting to Other Data Sources

Data Sources

I will now offer a high-level explanation of how to handle non-CSV data sources. The basic data flow and processing stay the same; hence, the data science stays the same. It is only the connection methods that change.

 

The following sections showcase a few common data stores I have used on a regular basis.

 

SQLite

This requires a package named sqlite3. Let's load the IP_DATA_ALL.csv data into an SQLite database named vermeulen.db at .\VKHCG\01-Vermeulen\00-RawData\SQLite\.

Open your Python editor and create a text file named Retrieve-IP_DATA_ALL_2_SQLite.py in directory .\VKHCG\03-Hillman\01-Retrieve.

Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import sqlite3 as sq
import pandas as pd

Base='C:/VKHCG'
sDatabaseName=Base + '/01-Vermeulen/00-RawData/SQLite/vermeulen.db'
conn = sq.connect(sDatabaseName)
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :',sFileName)
IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False)
IP_DATA_ALL.index.names = ['RowIDCSV']
sTable='IP_DATA_ALL'
print('Storing :',sDatabaseName,' Table:',sTable)
IP_DATA_ALL.to_sql(sTable, conn, if_exists="replace")
print('Loading :',sDatabaseName,' Table:',sTable)
TestData=pd.read_sql_query("select * from IP_DATA_ALL;", conn)
print('################')
print('## Data Values')
print('################')
print(TestData)
print('################')
print('## Data Profile')
print('################')
print('Rows :',TestData.shape[0])
print('Columns :',TestData.shape[1])
print('################')
print('### Done!! #')

Now save the Python file and execute the Retrieve-IP_DATA_ALL_2_SQLite.py file with your preferred Python interpreter.

The data will now be loaded into SQLite.

 

Now, to transfer data within SQLite, you must perform the following.

Open your Python editor and create a text file named Retrieve-IP_DATA_ALL_SQLite.py in directory .\VKHCG\03-Hillman\01-Retrieve.

Following is the Python code you must copy into the file:

import pandas as pd
import sqlite3 as sq

Base='C:/VKHCG'
sDatabaseName=Base + '/01-Vermeulen/00-RawData/SQLite/vermeulen.db'
conn = sq.connect(sDatabaseName)
print('Loading :',sDatabaseName)
IP_DATA_ALL=pd.read_sql_query("select * from IP_DATA_ALL;", conn)
print('Rows:', IP_DATA_ALL.shape[0])
print('Columns:', IP_DATA_ALL.shape[1])
print('### Raw Data Set #####################################')
for i in range(0,len(IP_DATA_ALL.columns)):
    print(IP_DATA_ALL.columns[i],type(IP_DATA_ALL.columns[i]))
print('### Fixed Data Set ###################################')
IP_DATA_ALL_FIX=IP_DATA_ALL
for i in range(0,len(IP_DATA_ALL.columns)):
    cNameOld=IP_DATA_ALL_FIX.columns[i] + ' '
    cNameNew=cNameOld.strip().replace(" ", ".")
    IP_DATA_ALL_FIX.columns.values[i] = cNameNew
    print(IP_DATA_ALL.columns[i],type(IP_DATA_ALL.columns[i]))
#print(IP_DATA_ALL_FIX.head())
print('################')
print('Fixed Data Set with ID')
print('################')
IP_DATA_ALL_with_ID=IP_DATA_ALL_FIX
print('################')
print(IP_DATA_ALL_with_ID.head())
print('################')
sTable2='Retrieve_IP_DATA'
IP_DATA_ALL_with_ID.to_sql(sTable2, conn, index_label="RowID", if_exists="replace")
print('### Done!! #')

Now save the Python file.

Execute the Retrieve-IP_DATA_ALL_SQLite.py file with your preferred Python interpreter.

The data will now be loaded between two SQLite tables.
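If you want to verify the copy, here is a small sketch of my own (not one of the numbered scripts) that reads back a record count from the new Retrieve_IP_DATA table:

import sqlite3 as sq
import pandas as pd

# Same database the two scripts above used
conn = sq.connect('C:/VKHCG/01-Vermeulen/00-RawData/SQLite/vermeulen.db')
check = pd.read_sql_query("select count(*) as RecCount from Retrieve_IP_DATA;", conn)
print(check)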

 

Date and Time

To generate data and then store the data directly into the SQLite database, do the following. Open your Python editor and create a text file named Retrieve-Calendar.py in directory .\VKHCG\04-Clark\01-Retrieve.

Here is the Python code you must copy into the file:

import sys
import os
import sqlite3 as sq
import pandas as pd
import datetime

if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='04-Clark'
sDataBaseDir=Base + '/' + Company + '/01-Retrieve/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/clark.db'
conn = sq.connect(sDatabaseName)

start = pd.datetime(1900, 1, 1)
end = pd.datetime(2018, 1, 1)
date1Data= pd.DataFrame(pd.date_range(start, end))
start = pd.datetime(2000, 1, 1)
end = pd.datetime(2020, 1, 1)
date2Data= pd.DataFrame(pd.date_range(start, end))
dateData=date1Data.append(date2Data)
dateData.rename(columns={0 : 'FinDate'},inplace=True)
print('################')
sTable='Retrieve_Date'
print('Storing :',sDatabaseName,' Table:',sTable)
dateData.to_sql(sTable, conn, if_exists="replace")
print('################')
print('################################')
print('Rows : ',dateData.shape[0], ' records')
print('################################')

t=0
for h in range(24):
    for m in range(60):
        for s in range(5):
            nTime=[str(datetime.timedelta(hours=h,minutes=m,seconds=30))]
            if h == 0 and m == 0:
                timeData= pd.DataFrame(nTime)
            else:
                time1Data= pd.DataFrame(nTime)
                timeData= timeData.append(time1Data)
timeData.rename(columns={0 : 'FinTime'},inplace=True)
print('################')
sTable='Retrieve_Time'
print('Storing :',sDatabaseName,' Table:',sTable)
timeData.to_sql(sTable, conn, if_exists="replace")
print('################')
print('################################')
print('Rows : ',timeData.shape[0], ' records')
print('################################')
print('### Done!! #')
Now save the Python file. Execute the Retrieve-Calendar.py file with your preferred Python interpreter.

The data will now be loaded as two new SQLite tables named Retrieve_Date and Retrieve_Time.

 

Other Databases

The important item to understand is that you need only change the connection library and connection string in the code that I have already shown to work for the data science:

import sqlite3 as sq

The following will work with the same code.
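To make that concrete, here is a minimal sketch of my own (not one of this blog's numbered scripts) showing that the pandas retrieve logic stays identical; only the create_engine() connection string changes. The SQLite connection string below points at the vermeulen.db file used earlier and is an assumption for illustration.

from sqlalchemy import create_engine
import pandas as pd

# Swap this connection string for any of the databases listed below
engine = create_engine('sqlite:///C:/VKHCG/01-Vermeulen/00-RawData/SQLite/vermeulen.db')

# The retrieve logic itself does not change
IP_DATA_ALL = pd.read_sql_query("select * from IP_DATA_ALL;", engine)
print('Rows :', IP_DATA_ALL.shape[0])
print('Columns :', IP_DATA_ALL.shape[1])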

 

PostgreSQL

PostgreSQL is used in numerous companies, and it enables connections to the database. There are two options: a direct connection using

 style="margin: 0px; width: 980px; height: 85px;">from sqlalchemy import create_engine
engine = create_engine('http://postgresql://scott:tiger@localhost:5432/vermeulen')
and connection via the psycopg2 interface
from sqlalchemy import create_engine
engine = create_engine('postgresql+http://psycopg2://scott:tiger@localhost/ mydatabase')

 

Microsoft SQL Server

Microsoft SQL Server is common in companies, and this connector supports your connection to the database.

There are two options. Via the ODBC connection interface, use

from sqlalchemy import create_engine
engine = create_engine('mssql+pyodbc://scott:tiger@mydsnvermeulen')

Via the direct connection, use

from sqlalchemy import create_engine
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/vermeulen')

 

MySQL

MySQL is widely used by many companies for storing data. This opens that data to your data science with the change of a simple connection string.

There are two options. For a direct connection to the database, use

from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/vermeulen')

For a connection via the DSN service, use

from sqlalchemy import create_engine
engine = create_engine('mysql+pyodbc://mydsnvermeulen')

 

Oracle

Oracle is a common database storage option in bigger companies. It enables you to load data from the following data source with ease:

from sqlalchemy import create_engine

engine = create_engine('oracle://andre:vermeulen@127.0.0.1:1521/vermeulen')

 

Microsoft Excel

Excel is common in the data-sharing ecosystem, and this connector enables you to load files in this format with ease. Open your Python editor and create a text file named Retrieve-Country-Currency.py in directory .\VKHCG\04-Clark\01-Retrieve.

 

Here is the Python code you must copy into the file:

# -*- coding: utf-8 -*-
import os
import pandas as pd

Base='C:/VKHCG'
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
CurrencyRawData = pd.read_excel('C:/VKHCG/01-Vermeulen/00-RawData/Country_Currency.xlsx')
sColumns = ['Country or territory', 'Currency', 'ISO-4217']
CurrencyData = CurrencyRawData[sColumns]
CurrencyData.rename(columns={'Country or territory': 'Country', 'ISO-4217': 'CurrencyCode'}, inplace=True)
CurrencyData.dropna(subset=['Currency'],inplace=True)
CurrencyData['Country'] = CurrencyData['Country'].map(lambda x: x.strip())
CurrencyData['Currency'] = CurrencyData['Currency'].map(lambda x: x.strip())
CurrencyData['CurrencyCode'] = CurrencyData['CurrencyCode'].map(lambda x: x.strip())
print(CurrencyData)
sFileName=sFileDir + '/Retrieve-Country-Currency.csv'
CurrencyData.to_csv(sFileName, index = False)

Now save the Python file. Execute the Retrieve-Country-Currency.py file with your preferred Python interpreter.

The code produces Retrieve-Country-Currency.csv.

 

Apache Spark

Apache Spark is fast becoming the standard for distributed data processing. The universal acceptance and support of this processing ecosystem are turning mastery of the technology into a must-have skill.

 

Use package pyspark.sql. The connection is set up by using the following code:

import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
      .format("csv")
      .option("header", "true")
      .load("/path/to/IP_DATA_ALL.csv"))

 

Apache Cassandra

Cassandra is becoming a widely used distributed database engine in the corporate world. The advantage of this technology and ecosystem is that it can scale massively across many nodes.

 

This database can include and exclude nodes into the database ecosystem, without disruption to the data processing.

To access it, use the Python package cassandra (installed as cassandra-driver).

from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('vermeulen')
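As a short usage sketch, the following runs a CQL query against a hypothetical ip_data_all table in the vermeulen keyspace and loads the rows into pandas. The table name and the LIMIT clause are my own assumptions for illustration.

import pandas as pd
from cassandra.cluster import Cluster

cluster = Cluster()                      # defaults to localhost:9042
session = cluster.connect('vermeulen')   # keyspace from the snippet above

# Hypothetical table; any CQL select works the same way
rows = session.execute('SELECT * FROM ip_data_all LIMIT 10')
df = pd.DataFrame(list(rows))
print(df.head())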
Apache Hive
Access to Hive opens its highly distributed ecosystem for use by data scientists. To achieve this, I suggest using the following:
import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='andre',
                   password='vermeulen',
                   database='vermeulen') as conn:
    with conn.cursor() as cur:
        # Execute query
        cur.execute("select * from IP_DATA_ALL")
        for i in cur.fetch():
            print(i)

This enables you to read data from a data lake stored in Hive on top of Hadoop.

 

Apache Hadoop

Hadoop is one of the most successful data lake ecosystems in highly distributed data science. I suggest that you understand how to access data in this data structure if you want to perform data science at scale.

 

PyDoop

The pydoop package includes a Python MapReduce and HDFS API for Hadoop. It is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.

See https://pypi.python.org/pypi/pydoop.
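Here is a minimal sketch, assuming a running Hadoop cluster and a hypothetical HDFS path, that uses pydoop's hdfs module to read a data lake file straight into pandas:

import pandas as pd
import pydoop.hdfs as hdfs

# Hypothetical HDFS location of the raw data
sFileName = '/data/vkhcg/00-RawData/IP_DATA_ALL.csv'

# Open the HDFS file and hand the file object to pandas
with hdfs.open(sFileName) as f:
    IP_DATA_ALL = pd.read_csv(f, header=0, low_memory=False)
print('Rows :', IP_DATA_ALL.shape[0])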

 

Luigi

Luigi provides a set of Python features that enable you to build complex pipelines of batch jobs. It handles dependency resolution and workflow management as part of the package.

This will save you from performing complex programming while enabling good-quality processing.

See https://pypi.python.org/pypi/luigi.
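Here is a minimal sketch of a Luigi task, assuming a hypothetical retrieve step that copies a raw CSV into the Retrieve directory; the paths and task name are illustrative only, not part of this blog's scripts.

import shutil
import luigi

class RetrieveIPData(luigi.Task):
    """Hypothetical retrieve step: copy a raw CSV into the Retrieve zone."""

    def output(self):
        # Luigi uses this target to decide whether the task still needs to run
        return luigi.LocalTarget('C:/VKHCG/01-Vermeulen/01-Retrieve/01-EDS/02-Python/Retrieve_IP_DATA_ALL.csv')

    def run(self):
        shutil.copyfile('C:/VKHCG/01-Vermeulen/00-RawData/IP_DATA_ALL.csv',
                        self.output().path)

if __name__ == '__main__':
    luigi.build([RetrieveIPData()], local_scheduler=True)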

 

Amazon S3 Storage

S3, or Amazon Simple Storage Service (Amazon S3), provides simple and practical methods to collect, store, and analyze data, irrespective of format, at massive scale. I store most of my base data in S3, as it is cheaper than most other methods.

 

Package s3

Python’s s3 module connects to Amazon’s S3 REST API. See https://pypi.python.org/pypi/s3.

 

Package Boto

The Boto package is another useful tool that I recommend. It can be accessed from the boto library, as follows:

from boto.s3.connection import S3Connection

conn = S3Connection('<aws access key>', '<aws secret key>')
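Once connected, a short usage sketch (assuming a hypothetical bucket named vermeulen) lists the keys you could retrieve:

from boto.s3.connection import S3Connection

conn = S3Connection('<aws access key>', '<aws secret key>')
bucket = conn.get_bucket('vermeulen')   # hypothetical bucket name
for key in bucket.list():
    print(key.name, key.size)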

 

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

The Python package redshift-sqlalchemy is an Amazon Redshift dialect for SQLAlchemy that opens this data source to your data science.
See https://pypi.python.org/pypi/redshift-sqlalchemy.

 

This is a typical connection:

from sqlalchemy import create_engine

engine = create_engine("redshift+psycopg2://username@host.amazonaws.com:5439/database")

 

You can also connect to Redshift via the package pandas-redshift.

(See https://pypi.python.org/pypi/pandas-redshift.)

It combines the capability of packages psycopg2, pandas, and boto3 and uses the following connection:

engine = create_engine('redshift+psycopg2://username:password@us-east-1.redshift.amazonaws.com:port/vermeulen')

 

You can effortlessly connect to numerous data sources, and I suggest you investigate how you can use their capabilities for your data science.

 

Amazon Web Services

When you are working with Amazon Web Services, I suggest you look at the boto3 package (see https://pypi.python.org/pypi/boto3).

 

It is an Amazon Web Services library for Python that provides interfaces to Amazon Web Services. For example, the following Python code will create an EC2 cloud data processing ecosystem:

import boto3
ec2 = boto3.resource('ec2')
instance = ec2.create_instances(
    ImageId='ami-1e299d7e',
    MinCount=1,
    MaxCount=1,
    InstanceType='t2.micro')
print(instance[0].id)
Once that is done, you can download a file from S3’s data storage.
import boto3
import botocore

BUCKET_NAME = 'vermeulen'  # replace with your bucket name
KEY = 'Key_IP_DATA_ALL'    # replace with your object key
s3 = boto3.resource('s3')
try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'IP_DATA_ALL.csv')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

By way of boto3, you have various options with which you can create the services you need in the cloud environment and then load the data you need. Mastering this process is a significant requirement for data scientists.
