Data Vault with Examples (Best Tutorial 2019)


Data vault modeling is a database modeling method designed by Dan Linstedt. Its data structure is designed to be responsible for long-term historical storage of data from multiple operational systems.

 

It supports chronological historical data tracking for full auditing and enables parallel loading of the structures. This tutorial explains how a data vault works, with examples and methods.

 

Hubs

Data vault hubs contain a set of unique business keys that normally do not change over time and, therefore, are stable data entities to store in hubs. Hubs hold a surrogate key for each hub data entry and metadata labeling the source of the business key.

 

Links

Data vault links are associations between business keys. These links are essentially many-to-many joins, with additional metadata to enhance the particular link.

 

Satellites

Data vault satellites hold the chronological and descriptive characteristics for a specific section of business data. The hubs and links form the structure of the model but have no chronological characteristics and hold no descriptive characteristics. Satellites consist of characteristics and metadata linking them to their specific hub.

 

Metadata labeling the origin of the association and characteristics, along with a timeline of start and end dates for the characteristics, is kept for future use. Each satellite holds the entire chronological history of the data entities within that satellite.

 

Reference Satellites

Reference satellites are referenced from satellites but are never bound to hub keys with metadata. They prevent redundant storage of reference characteristics that are used regularly by other satellites.

 

Typical reference satellites are

  • Standard codes: These are codes such as ISO 3166 for country codes, ISO 4217 for currencies, and ISO 8601 for dates and times.

 

  • Fixed lists for specific characteristics: These can be reduced versions of standard lists. For example, the list of countries in which your business has offices may be a reduced, fixed subset of the ISO 3166 list. You can also generate your own list of, say, business regions, per your own reporting structures.

 

  • Conversion lookups: Look at Global Positioning System (GPS) transformations, such as the Molodensky transformation, Helmert transformation, Molodensky-Badekas transformation, or Gauss-Krüger coordinate system.

 

The most common is WGS84, the U.S. Department of Defense's standard definition of a global location system for geospatial information, which is also the reference system for GPS.
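Before moving on to the model itself, here is a minimal sketch of how such a reference satellite could be stored with sqlite3 and pandas. The table name, columns, and the three sample ISO 3166 codes are my own illustration, not part of the tutorial's data vault.

import sqlite3 as sq
import pandas as pd

# Hypothetical reference satellite holding a small slice of ISO 3166 country codes.
country_codes = pd.DataFrame({
    'CountryCode': ['GB', 'US', 'IN'],
    'CountryName': ['United Kingdom', 'United States', 'India']
})

conn = sq.connect(':memory:')  # swap for your data vault database file
country_codes.to_sql('Satellite-Reference-CountryCode', conn, index=False, if_exists='replace')

# Any other satellite can now look up the description without storing it again.
print(pd.read_sql_query('SELECT * FROM [Satellite-Reference-CountryCode]', conn))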

 

Time-Person-Object-Location-Event Data Vault



Time Section

The time section contains the complete data structure for all data entities related to recording the time at which everything occurred.

 

Time Hub

The time hub consists of the following fields:

CREATE TABLE [Hub-Time] (
IDNumber VARCHAR (100) PRIMARY KEY,
IDTimeNumber INTEGER,
ZoneBaseKey VARCHAR (100),
DateTimeKey VARCHAR (100),
DateTimeValue DATETIME
);

This time hub acts as the core connector between time zones and other date-time associated values.

 

Time Links


The time links connect the time hub to the other hubs. The following links are supported.

 

Time-Person Link

This connects date-time values within the person hub to the time hub. The physical link structure is stored as a many-to-many relationship within the data vault. Later in this blog, I will supply details on how to construct the tables.

 

Dates such as birthdays, marriage anniversaries, and the date of reading this blog can be recorded as separate links in the data vault. The normal format is a named link such as BirthdayOn, MarriedOn, or ReadBlogOn. The format is simply a pair of keys between the time and person hubs.

 

Time-Object Link

This connects date-time values within the object hub to the time hub. Dates such as those on which you bought a car, sold a car, and read this blog can be recorded as separate links in the data vault. The normal format is BoughtCarOn, SoldCarOn, or ReadBlogOn. The format is simply a pair of keys between the time and object hubs.

 

Time-Location Link

This connects date-time values in the location hub to the time hub.

Dates such as moved to post code SW1, moved from post code SW1, and read blog at post code SW1 can be recorded as separate links in the data vault.

 

The normal format is MovedToPostCode, MovedFromPostCode, or ReadblogAtPostCode. The format is simply a pair of keys between the time and location hubs.

 

Time-Event Link

This connects date-time values in the event hub with the time hub. Dates such as those on which you have moved house and changed vehicles can be recorded as separate links in the data vault. The normal format is a named link such as MovedHouse or ChangedVehicle. The format is simply a pair of keys between the time and event hubs.

 

Time Satellites

Time satellites are the part of the vault that stores the following fields.

Note The <Time Zone> part of the table name is from a list of time zones that I will discuss later in the blog.

CREATE TABLE [Satellite-Time-<Time Zone>] (
IDZoneNumber VARCHAR (100) PRIMARY KEY,
IDTimeNumber INTEGER,
ZoneBaseKey VARCHAR (100),
DateTimeKey VARCHAR (100),
UTCDateTimeValue DATETIME,
Zone VARCHAR (100),
DateTimeValue DATETIME
);

Time satellites enable you to work more easily with international business patterns. You can move between time zones to look at such patterns as “In the morning . . . ” or “At lunchtime . . . ” These capabilities will be used during the Transform superstep, to discover patterns and behaviors around the world.
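As a small, hedged illustration of the idea, the following sketch converts a single UTC value into a few zones and labels the local part of the day. The zone list and the morning/lunchtime thresholds are my own assumptions, not part of the data vault itself.

from datetime import datetime
from pytz import timezone

utc_time = datetime(2001, 2, 3, 4, 5, 6, tzinfo=timezone('UTC'))

def day_part(local_hour):
    # Illustrative thresholds only; adjust to the business definition of "morning".
    if 5 <= local_hour < 12:
        return 'morning'
    if 12 <= local_hour < 14:
        return 'lunchtime'
    return 'other'

for zone in ['Europe/London', 'America/New_York', 'Asia/Kolkata']:
    local_time = utc_time.astimezone(timezone(zone))
    print(zone, local_time.strftime('%H:%M'), day_part(local_time.hour))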

 

Person Section

The person section contains the complete data structure for all data entities related to recording the person involved.

 

Person Hub

The person hub consists of a series of fields that supports a “real” person. The person hub consists of the following fields:

CREATE TABLE [Hub-Person] (
IDPersonNumber INTEGER,
FirstName VARCHAR (200),
SecondName VARCHAR (200),
LastName VARCHAR (200),
Gender VARCHAR (20),
TimeZone VARCHAR (100),
BirthDateKey VARCHAR (100),
BirthDate DATETIME
);

 

Person Links

These link the person hub to the other hubs. The following links are supported.

 

Person-Time Link

This link joins the person to the time hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Person-Time] (
IDPersonNumber INTEGER,
IDTimeNumber INTEGER,
ValidDate DATETIME
);

 

Person-Object Link

This link joins the person to the object hub to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Person-Object] (
IDPersonNumber INTEGER,
IDObjectNumber INTEGER,
ValidDate DATETIME
);

 

Person-Location Link

This link joins the person to the location hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Person-Location] (
IDPersonNumber INTEGER,
IDLocationNumber INTEGER,
ValidDate DATETIME
);

 

Person-Event Link

This link joins the person to the event hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Person-Event] (
IDPersonNumber INTEGER,
IDEventNumber INTEGER,
ValidDate DATETIME
);

 

Person Satellites

The person satellites are the part of the vault that stores the temporal attributes and descriptive attributes of the data. The satellite is of the following format:

CREATE TABLE [Satellite-Person-Gender] (
PersonSatelliteID VARCHAR (100),
IDPersonNumber INTEGER,
FirstName VARCHAR (200),
SecondName VARCHAR (200),
LastName VARCHAR (200),
BirthDateKey VARCHAR (20),
Gender VARCHAR (10)
);

 

Object Section


The object section contains the complete data structure for all data entities related to recording the object involved.

 

Object Hub

The object hub consists of a series of fields that supports a “real” object. The object hub consists of the following fields:

CREATE TABLE [Hub-Object-Species] (
IDObjectNumber INTEGER,
ObjectBaseKey VARCHAR (100),
ObjectNumber VARCHAR (100),
ObjectValue VARCHAR (200)
);

This structure enables you as a data scientist to categorize the objects in the business environment.

 

Object Links

These link the object hub to the other hubs.

 

Object-Time Link

This link joins the object to the time hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Object-Time] (
IDObjectNumber INTEGER,
IDTimeNumber INTEGER,
ValidDate DATETIME
);

 

Object-Person Link

This link joins the object to the person hub to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Object-Person] (
IDObjectNumber INTEGER,
IDPersonNumber INTEGER,
ValidDate DATETIME
);

Object-Location Link

This link joins the object to the location hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Object-Location] (
IDObjectNumber INTEGER,
IDLocationNumber INTEGER,
ValidDate DATETIME
);

Object-Event Link

This link joins the object to the event hub to describe the relationships between the two hubs.

 

Object Satellites

Object satellites are the part of the vault that stores and provisions the detailed characteristics of objects. The typical object satellite has the following data fields:

CREATE TABLE [Satellite-Object-Make-Model] (
IDObjectNumber INTEGER,
ObjectSatelliteID VARCHAR (200),
ObjectType VARCHAR (200),
ObjectKey VARCHAR (200),
ObjectUUID VARCHAR (200),
Make VARCHAR (200),
Model VARCHAR (200)
);

The object satellites will hold additional characteristics, as each object requires additional information to describe the object.

 

I keep each set separately, to ensure future expansion, as the objects are the one hub that evolves at a high rate of change, as new characteristics are discovered by your data science.

 

Location Section


The location section contains the complete data structure for all data entities related to recording the location involved.

Location Hub

The location hub consists of a series of fields that supports a GPS location. The location hub consists of the following fields:

CREATE TABLE [Hub-Location] (
IDLocationNumber INTEGER,
ObjectBaseKey VARCHAR (200),
LocationNumber INTEGER,
LocationName VARCHAR (200),
Longitude DECIMAL (9, 6),
Latitude DECIMAL (9, 6)
);

The location hub enables you to link any location, address, or geospatial information to the rest of the data vault.

Location Links

The location links join the location hub to the other hubs.

The following links are supported.

 

Location-Time Link

The link joins the location to the time hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Location-Time] (
IDLocationNumber INTEGER,
IDTimeNumber INTEGER,
ValidDate DATETIME
);

These links support business actions such as ArrivedAtShopAtDateTime or ShopOpensAtTime.

 

Location-Person Link

This link joins the location to the person hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Location-Person] (
IDLocationNumber INTEGER,
IDPersonNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as ManagerAtShop or SecurityAtShop.

 

Location-Object Link

This link joins the location to the object hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Location-Object] (
IDLocationNumber INTEGER,
IDObjectNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as ShopDeliveryVan or RackAtShop.

 

Location-Event Link


This link joins the location to the event hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Location-Event] (
IDLocationNumber INTEGER,
IDEventNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as happened or PostCodeDeliveryStarted.

 

Location Satellites

The location satellites are the part of the vault that stores and provisions the detailed characteristics of where entities are located. The typical location satellite has the following data fields:

CREATE TABLE [Satellite-Location-PostCode] (
IDLocationNumber INTEGER,
LocationSatelliteID VARCHAR (200),
LocationType VARCHAR (200),
LocationKey VARCHAR (200),
LocationUUID VARCHAR (200),
CountryCode VARCHAR (20),
PostCode VARCHAR (200)
);

The location satellites will also hold additional characteristics that are related only to your specific customer. They may split their business areas into their own regions, e.g., Europe, Middle-East, and China. These can be added as a separate location satellite.

 

Event Section

The event section contains the complete data structure for all data entities related to recording the event that occurred.

Event Hub

The event hub consists of a series of fields that supports events that happen in the real world. The event hub consists of the following fields:

CREATE TABLE [Hub-Event] (
IDEventNumber INTEGER,
EventType VARCHAR (200),
EventDescription VARCHAR (200)
);

Event-Time Link

This link joins the event to the time hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Event-Time] (
IDEventNumber INTEGER,
IDTimeNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as DeliveryDueAt or delivered.

 

Event-Person Link

This link joins the event to the person hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Event-Person] (
IDEventNumber INTEGER,
IDPersonNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as ManagerAppointAs or StaffMemberJoins.

 

Event-Object Link

This link joins the event to the object hub, to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Event-Object] (
IDEventNumber INTEGER,
IDObjectNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as VehicleBuy, VehicleSell, or ItemInStock.

 

Event-Location Link

The link joins the event to the location hub to describe the relationships between the two hubs. The link consists of the following fields:

CREATE TABLE [Link-Event-Location] (
IDEventNumber INTEGER,
IDLocationNumber INTEGER,
ValidDate DATETIME
);

These links support such business actions as DeliveredAtPostCode or PickupFromGPS.

 

Event Satellites

The event satellites are the part of the vault that stores the details related to all the events that occur within the systems you will analyze with your data science. I suggest that you keep to one type of event per satellite. This enables future expansion and easier long-term maintenance of the data vault.

 

Engineering a Practical Process Superstep

I will now begin with a series of informational and practical sections with the end goal of having you use the Process step to generate a data vault using the T-P-O-L-E design model previously discussed.

 

Note I use the T-P-O-L-E model, but the “pure” data vault can use various other hubs, links, or satellites. The model is a simplified one that enables me to reuse numerous utilities and processes I have developed over years of working with data science.

 

Time


Time is the characteristic of the data that is directly associated with time recording. ISO 8601-2004 describes an international standard for data elements and interchange formats for dates and times. Following is a basic set of rules to handle the data interchange.

 

The calendar is based on the Gregorian calendar. The following data entities (year, month, day, hour, minute, second, and a fraction of a second) are officially part of the ISO 8601-2004 standard.

 

The data is recorded from largest to smallest. Values must have a pre-agreed fixed number of digits that are padded with leading zeros. I will now introduce you to these different data entities.

 

Year

The standard year must use four-digit values within the range 0000 to 9999, with the optional agreement that any date denoted as AD/BC requires a conversion whereby AD 1 = year 1, 1 BC = year 0, and 2 BC = year -1. That means that AD 2017 becomes +2017, but 2017 BC becomes -2016.
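A minimal sketch of that conversion rule, with a helper function name of my own choosing:

def to_iso_year(year, era):
    # ISO 8601 maps AD 1 to year 1, 1 BC to year 0, 2 BC to year -1, and so on.
    if era.upper() == 'AD':
        return '+{0:04d}'.format(year)
    return '{0:+05d}'.format(1 - year)

print(to_iso_year(2017, 'AD'))  # +2017
print(to_iso_year(2017, 'BC'))  # -2016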

 

Valid formats are YYYY and +/-YYYY. The rule for a valid year is 20 October 2017 becomes 2017-10-20 or +2017-10-20 or a basic 20171020. The rule for missing data on a year is 20 October becomes --10-20 or --1020.

 

The use of YYYYMMDD is common in source systems. I advise you to translate to YYYY-MM-DD during the Assess step, to ensure consistency with dates in later stages of processing.

 

Warning Discuss with your customer any dates before October 1582 (the official start of the Gregorian calendar). The Julian calendar was used from January 1, 45 BC.

 

If the earlier dates are vital to your customer’s data processing, I suggest that you perform an in-depth study into the other, more complex differences in dates. This is very important when you work with text mining on older documents.

 

Open your IPython editor and investigate the following date format:

from datetime import datetime
from pytz import timezone, all_timezones
now_date = datetime(2001,2,3,4,5,6,7)
now_utc=now_date.replace(tzinfo=timezone('UTC'))
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Year:',str(now_utc.strftime("%Y")))
You should see
Date: 2001-02-03 04:05:06 (UTC) (+0000)
Year: 2001

This enables you to easily extract a year value from a given date.

 

Month

The standard month must use two-digit values within the range of 01 through 12. The rule for a valid month is 20 October 2017 becomes 2017-10-20 or +2017-10-20. The rule for a valid month without a day is October 2017 becomes 2017-10 or +2017-10.

 

Warning You cannot use YYYYMM, so 201710 is not valid.

This standard avoids confusion with the truncated representation YYMMDD, which is still often used in source systems.

201011 could be 2010-11 or 1920-10-11.

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Month:',str(now_utc.strftime("%m")))
You should get back
Date: 2001-02-03 04:05:06 (UTC) (+0000)
Month: 02
Now you can get the month.
This follows the standard English names for months.
Number Name
1 January
2 February
3 March
4 April
5 May
6 June
7 July
8 August
9 September
10 October
11 November
12 December
Open your IPython editor and investigate the following date format:
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('MonthName:', str(now_utc.strftime("%B")))

 

This gives you the months in words.

Date: 2001-02-03 04:05:06 (UTC) (+0000)

MonthName: February

Day

 

The standard day of the month must use a two-digit value within the range of 01 through 31, as per the agreed format for the specific month and year. The rule for a valid day is 20 October 2017 becomes 2017-10-20 or +2017-10-20.

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

print('Day:',str(now_utc.strftime("%d")))

This returns the day of the date.

Date: 2001-02-03 04:05:06 (UTC) (+0000)

Day: 03

 

There are two cases that apply when validating dates: non-leap years and leap years. Because the length of the solar year is marginally less than 365¼ days (by about 11 minutes), a leap day is added every 4 years, except in century years that are not divisible by 400. This corrects for the minutes gained over 400 years.

 

Open your IPython editor and investigate the following date format. I will show you a way to find the leap years.

import datetime
for year in range(1960, 2025):
    month = 2
    day = 29
    hour = 0
    correctDate = None
    try:
        newDate = datetime.datetime(year=year, month=month, day=day, hour=hour)
        correctDate = True
    except ValueError:
        correctDate = False
    if correctDate == True:
        if year % 400 == 0:
            print(year, 'Leap Year (400)')
        else:
            print(year, 'Leap Year')
    else:
        print(year, 'Non Leap Year')

The result shows that the leap years are as follows: 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000 (this was a 400), 2004, 2008, 2012, 2016, 2020, and 2024.

Now, we can look at the time component of the date time.

 

Hour

The standard hour must use a two-digit value within the range of 00 through 24. The valid format is hhmmss or hh:mm:ss.

The shortened format hhmm or hh:mm is accepted but not advised.

The format hh is not supported.

The use of 00:00:00 is the beginning of the calendar day. The use of 24:00:00 is only to indicate the end of the calendar day.

 

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Hour:',str(now_utc.strftime("%H")))
The results are:
Date: 2001-02-03 04:05:06 (UTC) (+0000)
Hour: 04
Minute

 

The standard minute must use two-digit values within the range of 00 through 59.

The valid format is hhmmss or hh:mm:ss. The shortened format hhmm or hh:mm is accepted but not advised.

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Minute:',str(now_utc.strftime("%M")))
The results are
Date: 2001-02-03 04:05:06 (UTC) (+0000)
Minute: 05
Second

 

The standard second must use two-digit values within the range of 00 through 59. The valid format is hhmmss or hh:mm:ss.

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

print('Second:',str(now_utc.strftime("%S")))

This returns the seconds of the date time.

Date: 2001-02-03 04:05:06 (UTC) (+0000)

Second: 06

 

Note This is normally where most systems stop recording time, but I have a few clients that require a more precise time-recording level. So, now we venture into sub-second times. This is a fast world with interesting dynamics.

 

Fraction of a Second

The fraction of a second is defined only as a format: hhmmss,sss or hh:mm:ss,sss or hhmmss.sss or hh:mm:ss.sss.

I prefer the hh:mm:ss.sss format, as the use of the comma causes issues with exports to CSV file formats, which are common in the data science ecosystem.

The current commonly used formats are the following:

  • hh:mm:ss.s: A tenth of a second
  • hh:mm:ss.ss: Hundredth of a second
  • hh:mm:ss.sss: A thousandth of a second

 

Data science works best if you can record the data at the lowest possible levels. I suggest that you record your time at the hh:mm:ss.sss level.

Open your IPython editor and investigate the following date format:

print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

print('Millionth of Second:',str(now_utc.strftime("%f")))

 

This returns the fraction of a second in one-millionth of a second, or microsecond, as follows:

Date: 2001-02-03 04:05:06 (UTC) (+0000)

Millionth of Second: 000007

There are some smaller units, such as the following:

  • Nanosecond: One thousand-millionth of a second
  • Scale comparison: One nanosecond is to one second as one second is to 31.71 years.
  • Picosecond: One trillionth or one-millionth of one-millionth of a second
  • Scale comparison: A picosecond is to one second as one second is to 31,710 years (see the quick check below).
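A quick sanity check of those scale comparisons, using a non-leap year of 31,536,000 seconds:

seconds_per_year = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year

# One nanosecond is to one second as one second is to about 31.7 years.
print(round(1e9 / seconds_per_year, 2))   # 31.71

# One picosecond is to one second as one second is to about 31,710 years.
print(round(1e12 / seconds_per_year))     # 31710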

 

Local Time and Coordinated Universal Time (UTC)


The date and time we use are bound to a specific location on the earth and the specific time of the calendar year. The agreed local time is approved by the specific country.

 

I am discussing time zones because, as a data scientist, you will have to amalgamate data from multiple data sources across the globe. The issue is the impact of time zones and the irregular application of rules by customers.

 

Local Time

The approved local time is agreed by the specific country, by approving the use of a specific time zone for that country.

 

Daylight Saving Time may also apply, which shifts a country's local time during specific periods of the year. The general format for local time is hh:mm:ss.

Open your IPython editor and investigate the following date format:

from datetime import datetime

now_date = datetime.now()

print('Date:',str(now_date.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

The result is

2017-10-01 10:45:58 () ()

 

The reason for the two empty brackets is that the date-time is in a local time setting.

Warning Most data systems record the local time in the local time setting for the specific server, PC, or tablet.

As a data scientist, you must understand that not all systems in the ecosystem record time in a reliable manner.

 

Tip There is a Simple Network Time Protocol (SNTP) that you can use to automatically synchronize your system's clock with that of a remote server.

 

This keeps your machine’s time accurate, by synchronizing with servers that are known to have accurate times. I normally synchronize my computers that record data at a customer with the same NTP server the customer is using, to ensure that I get the correct date and time, as per their system.
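A hedged sketch of querying a time server from Python follows. It assumes the third-party ntplib package (pip install ntplib) and network access, and it uses pool.ntp.org purely as an illustrative server; your customer's own NTP server is the better choice.

import ntplib
from datetime import datetime, timezone

client = ntplib.NTPClient()
response = client.request('pool.ntp.org', version=3)  # illustrative public NTP pool

# tx_time is the transmit timestamp reported by the server.
server_time = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
print('NTP server time (UTC):', server_time.strftime('%Y-%m-%d %H:%M:%S'))
print('Clock offset (seconds):', round(response.offset, 3))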

 

Now you can start investigating the world of international date and time settings.

 

Coordinated Universal Time (UTC)


Coordinated Universal Time is a measure of time within about 1 second of mean solar time at 0° longitude. The valid format is hh:mm:ss±hh:mm or hh:mm:ss±hh.

 

For example, ten a.m. UTC on September 11, 2017, is 11:00:00 local time in London, UK, recorded as 11:00:00+01:00 (British Summer Time).

In New York, US, it would be 06:00:00 local time, recorded as 06:00:00-04:00.

In New Delhi, India, it would be 15:30:00 local time, recorded as 15:30:00+05:30.

 

The more precise method would be to record it as 2017-09-11T10:00:00Z. Over the next part of this blog, I will explain why this is important when you work with multiple time zones as a data scientist.
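Sticking with pytz, as used elsewhere in this tutorial, the following sketch shows the same instant rendered in the three example zones:

from datetime import datetime
from pytz import timezone

utc_moment = timezone('UTC').localize(datetime(2017, 9, 11, 10, 0, 0))
print('Recorded as:', utc_moment.strftime('%Y-%m-%dT%H:%M:%SZ'))

for zone in ['Europe/London', 'America/New_York', 'Asia/Kolkata']:
    local = utc_moment.astimezone(timezone(zone))
    print(zone, local.strftime('%Y-%m-%d %H:%M:%S (%Z) (%z)'))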

 

Open your IPython editor and investigate the following date format:

from datetime import datetime
from pytz import timezone

now_date = datetime.now()
print('Local Date Time:',str(now_date.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
now_utc=now_date.replace(tzinfo=timezone('UTC'))
print('UTC Date Time:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

Warning I am working in the London/Edinburgh time zone, so my results will be based on that fact. My time zone is Europe/London.

from datetime import datetime
from pytz import timezone

now_date = datetime.now()
print('Local Date Time:',str(now_date.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
now_utc=now_date.replace(tzinfo=timezone('UTC'))
print('UTC Date Time:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
now_london=now_date.replace(tzinfo=timezone('Europe/London'))
print('London Date Time:',str(now_london.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

So, my reference results are

Local Date Time: 2017-10-03 16:14:43 () ()
UTC Date Time: 2017-10-03 16:14:43 (UTC) (+0000)
London Date Time: 2017-10-03 16:14:43 (LMT) (-0001)

Now, with that knowledge noted, let's look at your results again.

from datetime import datetime
from pytz import timezone

now_date = datetime.now()
print('Local Date Time:',str(now_date.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
now_utc=now_date.replace(tzinfo=timezone('UTC'))
print('UTC Date Time:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

 

My results are

Local Date Time: 2017-10-03 16:14:43 () ()

UTC Date Time: 2017-10-03 16:14:43 (UTC) (+0000)

 

This shows how date and time are investigated by data scientists. As humans, we have separated the concepts of date and time for thousands of years. The date was calculated from the stars.

 

Time was once determined by an hourglass. Later, the date was on a calendar, and time is now on a pocket watch. But in the modern international world, there is another date-time data item. Mine is a smartphone—period!

 

Combining Date and Time


When using date and time in one field, ISO supports the format YYYY-MM-DDThh:mm:ss.sss for local time and YYYY-MM-DDThh:mm:ss.sssZ for Coordinated Universal Time.

These date and time combinations are regularly found in international companies’ financials or logistics shipping and on my smartphone.
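A minimal sketch of producing the combined value in Python (the timespec argument requires Python 3.6 or later; the sample timestamp is my own):

from datetime import datetime
from pytz import timezone

moment = timezone('UTC').localize(datetime(2017, 10, 20, 9, 0, 0, 123000))

# YYYY-MM-DDThh:mm:ss.sssZ for Coordinated Universal Time.
print(moment.isoformat(timespec='milliseconds').replace('+00:00', 'Z'))
# 2017-10-20T09:00:00.123Z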

 

Time Zones

Time zones enable the real world to create a calendar of days and hours that people can use to run their lives.

The date and time is normally expressed, for example, as October 20, 2017, and 9:00.

 

In the world of the data scientist, this is inadequate information. The data scientist requires a date, time, and time zone. YYYY-MM-DDThh:mm:ss.sssZ is the format that covers all three for the data scientist's data processing needs.

 

Note This format was discussed earlier in this blog, as an alternative to UTC.

 

In international data, I have also seen time zones set in other formats and must discuss with you how to handle these formats in a practical manner.

Open your IPython editor and investigate the following date format:

from pytz import all_timezones

for zone in all_timezones:
    print(zone)

 

There are 593 zones. Yes, you read that correctly! The issue is that companies do not even adhere to these standard zones, due to a lack of understanding.

For example, one of my customers uses YYYY-MM-DDThh:mm:ss.sssY for UTC-12:00, the Yankee time zone.

 

Warning These zones are not valid ISO formats! But they are generally used by source systems around the world. I will offer you the list that I have seen before, if you have to convert back to UTC format or local time.

 

The time zone at the point of data source recording is normally the time zone you use.

Warning The time zone is dependent on your source system’s longitude and latitude or the clock settings on the hardware. The data below is generalized. I suggest you investigate the source of your data first, before picking the correct time zone.

 

Once you have found the correct original time zone, I suggest you do the following:

now_date_local=datetime.now()
now_date=now_date_local.replace(tzinfo=timezone('Europe/London'))
print('Local Date Time:',str(now_date_local.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Local Date Time:',str(now_date.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

The result is

Local Date Time: 2017-10-01 12:54:03 () ()
Local Date Time: 2017-10-01 12:54:03 (LMT) (-0001)

'Europe/London' applies the correct time zone from the list of 593 zones. Now convert to UTC.

now_utc=now_date.astimezone(timezone('UTC'))
print('UTC Date Time:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))

The result is

UTC Date Time: 2017-10-01 12:55:03 (UTC) (+0000)

 

I advise that you use this conversion with caution, as your preferred conversion.

You will note that there is a more or less ambiguous meaning to YYYY-MM-DDThh:mmM, which converts to UTC+12:00, UTC+12:45, UTC+13:00, and UTC+14:00.

 

UTC+14:00 is UTC+14 and stretches as far as 30° east of the 180° longitude line, creating a large fold in the International Date Line and interesting data issues for the data scientist.

 

Intervals Identified by Start and End Dates and Time


There is a standard format for supporting a date and time interval in a single field: YYYYMMDDThhmmss/YYYYMMDDThhmmss or YYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss.

 

So, October 20, 2017, can be recorded as 2017-10-20T000000/2017-10-20T235959 and Holiday December 2017 as 2017-12-22T120001/2017-12-27T085959, in the same data table.

 

Note The preceding example was found in one of my source systems. Complex data issues are common in data science. Spot them quickly, and deal with them quickly!

 

As a general note of caution, I advise that when you discover these complex structures, you convert them to simpler structures as soon as you can. I suggest that you add an assess step, to split these into four fields: Start Date Time, End Date Time, Original Date Time, and Business Term.

 

Using the previously mentioned format, October 20, 2017, recorded as 2017-10-20T000000/2017-10-20T235959, becomes the following:

Start Date Time: 2017-10-20T000000 | End Date Time: 2017-10-20T235959 | Original Date Time: 2017-10-20T000000/2017-10-20T235959 | Business Term: 20 October 2017

Holiday December 2017 can be recorded as 2017-12-22T120001/2017-12-27T085959 in the same data table, as follows (every row carries the Original Date Time 2017-12-22T120001/2017-12-27T085959 and the Business Term Holiday December 2017):

Start Date Time: 2017-12-22T120001 | End Date Time: 2017-12-22T235959
Start Date Time: 2017-12-23T000000 | End Date Time: 2017-12-23T235959
Start Date Time: 2017-12-24T000000 | End Date Time: 2017-12-24T235959
Start Date Time: 2017-12-25T000000 | End Date Time: 2017-12-25T235959
Start Date Time: 2017-12-26T000000 | End Date Time: 2017-12-26T235959
Start Date Time: 2017-12-27T000000 | End Date Time: 2017-12-27T085959

Now you have same-day time periods that are easier to convert into date-time formats.
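A minimal sketch of the splitting step for a single-day interval follows; the function name is mine, and the business term is supplied by the caller. Multi-day intervals, such as the holiday example above, would additionally be expanded into one row per same-day period.

def split_interval(interval_value, business_term):
    # interval_value looks like '2017-10-20T000000/2017-10-20T235959'
    start_value, end_value = interval_value.split('/')
    return {
        'StartDateTime': start_value,
        'EndDateTime': end_value,
        'OriginalDateTime': interval_value,
        'BusinessTerm': business_term
    }

print(split_interval('2017-10-20T000000/2017-10-20T235959', '20 October 2017'))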
Special Date Formats

The following are special date formats.

Day of the Week

ISO 8601-2004 sets the following standard:

1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday

Open your IPython editor and investigate the following date format:

now_date_local=datetime.now()
now_date=now_date_local.replace(tzinfo=timezone('Europe/London'))
print('Weekday:',str(now_date.strftime("%w")))

The result is (it is Sunday)

Weekday: 0

 

Caution I have seen variations! Sunday = 0, Sunday = 1, and Monday = 0. Make sure you have used the correct format before joining data.
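A quick way to see the variations side by side (the sample date is mine; October 1, 2017, was a Sunday):

from datetime import datetime

sample_date = datetime(2017, 10, 1)  # a Sunday

print(sample_date.strftime('%w'))   # '0' -> Sunday = 0 (strftime convention)
print(sample_date.isoweekday())     # 7   -> Sunday = 7 (ISO 8601: Monday = 1)
print(sample_date.weekday())        # 6   -> Sunday = 6 (Python: Monday = 0)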

 


 

Week Dates

The use of week dates in many source systems causes some complications when working with dates. Week dates are denoted by the use of a capital W and W with D in the format.

 

The following are all valid formats: YYYY-Www, YYYYWww, YYYY-Www-D, or YYYYWwwD.

For example, October 22, 2017, is 2017-W42 if you use ISO 8601 (European), but 2017-W43 if you use ISO 8601 (US).

 

Warning ISO 8601(European) is an official format. The standard is that the first week of a year is a week that includes the first Thursday of the year or that contains January 4th.

 

By this method, January 1st, 2nd, and 3rd can be included in the last week of the previous year, or December 29th, 30th, and 31st can be included in the first week of next year.

 

ISO 8601(US) is widely used, and according to this standard, the first week begins on January 1st. The next week begins on the next Sunday.

 

Other options are that the first week of the year begins on January 1st and the next week begins on the following Monday, or that the first week of the year begins on January 1st and the next week begins on January 8th.

now_date_local=datetime.now()

now_date=now_date_local.replace(tzinfo=timezone('Europe/London'))

print('Week of Year:',str(now_date.strftime("%YW%WD%w")))

The result is

Week of Year: 2017W39D0

Day of the Year

 

The day of the year is calculated simply by counting the number of days from January 1st of the given year.

For example, October 20, 2017, is 2017/293.

Caution Change this to the ISO standard as soon as possible!

 

I found this format stored as 2017293 in a birth date field. The reason was to hide a staff member’s age. This successfully broke the pension planner’s projection, which read the value as the US date for March 29, 2017, hence, 2017/29/3.

 

This staff member would potentially have received two birthday cards and cake from his management in 2017. I fixed the problem in September 2017.

 

So, let’s look at this day of the year as it was intended to be used by the format. The day of the year is useful to determine seasonality in the data when comparing data between years. Check if the same thing happens on the same day every year. Not really that simple, but close!

 

Here is the format for the day of the year:

now_date_local=datetime.now()

now_date=now_date_local.replace(tzinfo=timezone('Europe/London'))

print('Day of year:',str(now_date.strftime("%Y/%j")))

The result is

Day of year: 2017/274

 

Well done. You have now completed the basic introduction to date and time in the Process step.

I will show how important the YYYY-MM-DDThh:mm:ss.sssZ value for date and time is to data science.

 

This completes the basic information on date and time. Next, you will start to build the time hub, links, and satellites.

 

Open your Python editor and create a file named Process_Time.py. Save it into the directory ..\VKHCG\01-Vermeulen\03-Process.

 

The first part consists of standard imports and environment setups.

# -*- coding: utf-8 -*-
import sys
import os
from datetime import datetime
from datetime import timedelta
from pytz import timezone, all_timezones
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
pd.options.mode.chained_assignment = None
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='03-Hillman'
InputDir='00-RawData'
InputFileName='VehicleData.csv'
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Hillman.db'
conn1 = sq.connect(sDatabaseName)
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
You will set up a time hub for a period of ten years before January 1, 2018. If you want to experiment with different periods, simply change the parameters.
base = datetime(2018,1,1,0,0,0)
numUnits=10*365*24
date_list = [base - timedelta(hours=x) for x in range(0, numUnits)]
t=0
for i in date_list:
    now_utc=i.replace(tzinfo=timezone('UTC'))
    sDateTime=now_utc.strftime("%Y-%m-%d %H:%M:%S")
    sDateTimeKey=sDateTime.replace(' ','-').replace(':','-')
    t+=1
    IDNumber=str(uuid.uuid4())
    TimeLine=[('ZoneBaseKey', ['UTC']),
              ('IDNumber', [IDNumber]),
              ('nDateTimeValue', [now_utc]),
              ('DateTimeValue', [sDateTime]),
              ('DateTimeKey', [sDateTimeKey])]
    if t==1:
        TimeFrame = pd.DataFrame.from_items(TimeLine)
    else:
        TimeRow = pd.DataFrame.from_items(TimeLine)
        TimeFrame = TimeFrame.append(TimeRow)
TimeHub=TimeFrame[['IDNumber','ZoneBaseKey','DateTimeKey','DateTimeValue']]
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
TimeFrame.set_index(['IDNumber'],inplace=True)
sTable = 'Process-Time'
print('Storing :',sDatabaseName,' Table:',sTable)
TimeHubIndex.to_sql(sTable, conn1, if_exists="replace")
sTable = 'Hub-Time'
print('Storing :',sDatabaseName,' Table:',sTable)
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
You have successfully built the hub for time in the data vault. Now, we will build satellites for each of the time zones. So, let’s create a loop to perform these tasks:
active_timezones=all_timezones
z=0
for zone in active_timezones:
    t=0
    for j in range(TimeFrame.shape[0]):
        now_date=TimeFrame['nDateTimeValue'][j]
        DateTimeKey=TimeFrame['DateTimeKey'][j]
        now_utc=now_date.replace(tzinfo=timezone('UTC'))
        sDateTime=now_utc.strftime("%Y-%m-%d %H:%M:%S")
        now_zone = now_utc.astimezone(timezone(zone))
        sZoneDateTime=now_zone.strftime("%Y-%m-%d %H:%M:%S")
        t+=1
        z+=1
        IDZoneNumber=str(uuid.uuid4())
        TimeZoneLine=[('ZoneBaseKey', ['UTC']),
                      ('IDZoneNumber', [IDZoneNumber]),
                      ('DateTimeKey', [DateTimeKey]),
                      ('UTCDateTimeValue', [sDateTime]),
                      ('Zone', [zone]),
                      ('DateTimeValue', [sZoneDateTime])]
        if t==1:
            TimeZoneFrame = pd.DataFrame.from_items(TimeZoneLine)
        else:
            TimeZoneRow = pd.DataFrame.from_items(TimeZoneLine)
            TimeZoneFrame = TimeZoneFrame.append(TimeZoneRow)
    TimeZoneFrameIndex=TimeZoneFrame.set_index(['IDZoneNumber'],inplace=False)
    sZone=zone.replace('/','-').replace(' ','')
    #############################################################
    sTable = 'Process-Time-'+sZone
    print('Storing :',sDatabaseName,' Table:',sTable)
    TimeZoneFrameIndex.to_sql(sTable, conn1, if_exists="replace")
    #
    sTable = 'Satellite-Time-'+sZone
    print('Storing :',sDatabaseName,' Table:',sTable)
    TimeZoneFrameIndex.to_sql(sTable, conn2, if_exists="replace")
    #

 

You will note that your databases are growing at an alarming rate. I suggest that we introduce a minor database maintenance step that compacts the database to take up less space on the disk.

 

The required command is a vacuum. It performs an action similar to that of a vacuum-packing system, by compressing the data into the smallest possible disk footprint.

print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
sql.execute(sSQL,conn1)
sql.execute(sSQL,conn2)
print('################')
#
print('### Done!! ############################################')
#

 

Congratulations! You have built your first hub and satellites for a time in the data vault.

The data vault has been built in the directory ..\VKHCG\88-DV\datavault.db.

You can access it with your SQLite tools if you want to investigate what you achieved.

 

Person

This structure records the person(s) involved within the data sources. A person is determined as a set of data items that, combined, describes a single “real” person in the world.

Remember Angus MacGyver, the baker down the street, is a person.

Angus, my dog, is an object.

 

The data scientist defended his actions by reasoning that he tests his data science code by hard-coding data into his data science algorithms. Not the hallmark of a good data scientist!

 

Tip Remove self-generated data errors from production systems immediately.

They represent bad science! And, yes, they can damage property and hurt people.

 

The Meaning of a Person’s Name


We all have names, and our names supply a larger grouping, or even hidden meanings, to data scientists. If people share the same last name, there is a probability they may be related. I will discuss this again later.

 

Now, we load the person into the data vault. Open your Python editor and create a new file named Process-People.py in the directory ..\VKHCG\04-Clark\03-Process.
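The full listing is not reproduced here, so the following is only a minimal sketch of the idea, following the same storage pattern as Process_Time.py. The sample people, their values, and the in-memory database are my own stand-ins; in the full example, the data comes from the Clark company's raw files.

import sqlite3 as sq
import pandas as pd

# Illustrative people only; replace with the raw person data for your own run.
people = pd.DataFrame({
    'FirstName': ['Angus', 'Jane'],
    'SecondName': ['', 'A'],
    'LastName': ['MacGyver', 'Smith'],
    'Gender': ['Male', 'Female'],
    'BirthDateKey': ['1960-01-01', '1985-06-15']  # placeholder values
})
people.index.names = ['IDPersonNumber']

conn2 = sq.connect(':memory:')  # point this at ..\VKHCG\88-DV\datavault.db in your setup
people.to_sql('Hub-Person', conn2, if_exists='replace')
print(pd.read_sql_query('SELECT * FROM [Hub-Person]', conn2))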

 

Object


This structure records the other objects and nonhuman entities that are involved in the data sources. The object must have three basic fields: Object Type, Object Group, and Object Code to be valid. I also add a short and long description for information purposes.

 

The data vault's object hub is a simple combination of Object Type, Object Group, and Object Code that formulates a full description of any nonhuman entity in the real world.

 

I must note that as you start working with objects, you will discover that most Object Types have standards to support the specific object’s processing. 

 

Loop through all nodes and return the shortest path from the entire node set, using Dijkstra’s algorithm:

import networkx as nx

# G is assumed to be an existing networkx graph loaded earlier.
for node1 in G.nodes():
    for node2 in G.nodes():
        d = nx.dijkstra_path(G, source=node1, target=node2, weight=None)
        print('Path:', d)
Question: Could you create a link table that uses this path as the link?
Here’s the table you must populate:
CREATE TABLE [Link-Object-Species] (
LinkID VARCHAR (100),
NodeFrom VARCHAR (200),
NodeTo VARCHAR (200),
ParentNode VARCHAR (200),
Node VARCHAR (200),
Step INTEGER
);

Try your new skills against the graph. Can you get it to work?
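One possible answer to the question, as a hedged sketch: the small sample graph is my own, and the mapping of the Dijkstra path onto the [Link-Object-Species] columns is an assumption rather than the tutorial's definitive solution.

import uuid
import sqlite3 as sq
import networkx as nx
import pandas as pd

# Tiny illustrative graph; in the tutorial, G comes from the Assess step.
G = nx.Graph()
G.add_edges_from([('Animal', 'Mammal'), ('Mammal', 'Dog'), ('Mammal', 'Cat')])

rows = []
for node1 in G.nodes():
    for node2 in G.nodes():
        path = nx.dijkstra_path(G, source=node1, target=node2, weight=None)
        link_id = str(uuid.uuid4())
        for step, node in enumerate(path):
            rows.append({'LinkID': link_id,
                         'NodeFrom': node1,
                         'NodeTo': node2,
                         'ParentNode': path[step - 1] if step > 0 else None,
                         'Node': node,
                         'Step': step})

LinkFrame = pd.DataFrame(rows)
conn = sq.connect(':memory:')  # point this at datavault.db in your own setup
LinkFrame.to_sql('Link-Object-Species', conn, index=False, if_exists='replace')
print(LinkFrame.head())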

 

Algorithms such as deep learning, logistic regression, and random forests, work with improved net results if you have already created the basic relationships, by linking different areas of insight.

 

Tip Start with Object Type and Object Group first when identifying your business requirements. If you have the two fields sorted, you are 80+% of the way to completing the object structure.

 

This is just a basic start to the processing of objects in data science. This part of the data vault will always be the largest portion of the data science work in the Process superstep. I have also enriched my customers’ data by getting external experts to categorize objects for which particular customers have never identified the common characteristics.

 

Warning It is better to load in large amounts of overlapping details than to have inadequate and incomplete details. Do not over-summarize your Object Type and Object Group groupings.

 

For the purpose of the practical examples, I will cover the algorithms and techniques that will assist you to perform the categorization of the objects in data sources.

 

Location


This structure records any location where interaction between data entities occurred in the data sources. The location of a data action is a highly valuable piece of data for any business. It enables geospatial insight of the complete data sets. Geospatial refers to information that categorizes where defined structures are on the earth’s surface.

 

Global Positioning System (GPS)

The Global Positioning System and Global Navigation Satellite System are space-based radio navigation systems that supply data to any GPS/GLONASS receiver, to enable the receiver to calculate its own position on the earth’s surface.

 

This system still uses latitude, longitude, and altitude but has an advanced technique that uses a series of overlapping distance measurements from a minimum of four satellites, to map the receiver's location. Hence, this system is highly accurate.

 

The relative location of a location is useful when you want to find the closest shop or calculate the best method of transport between locations.

 

Natural Characteristics

A topographical map is a good format for the physical characteristics of a location. I suggest that you collect data on a specific location, to enhance your customer’s data.

 

I also suggest that you look at data sources for weather measurements, temperature, land types, soil types, underground minerals, water levels, plant life, and animal life for specific locations.

 

Note Any natural information that can enhance the location data is good to collect.

Tip Collect the information over a period. That way, you get a time series that you can use to check for changes in the natural characteristics of the location.

 

These natural characteristics are a good source of data for testing the correlations between business behavior and the natural location. Such natural characteristics as a forest, desert, or lake region can indicate to businesses like a chain store when it will or will not sell, say, a motorboat or walking boots.

 

An area covered in red mud 90% of the year may not be selling white trousers. The lack or excess of wind in an area may determine the success of energy-saving solutions, such as wind power generation. Or sinkholes will increase the cost of homeowner’s insurance.

 

Human Characteristics


I suggest that you collect data on any human-designed or cultural features of a location. These features include the type of government, land use, architectural styles, forms of livelihood, religious practices, political systems, common foods, local folklore, transportation, languages spoken, money availability, and methods of communication.

 

These human characteristics are a good source of data to test for correlations between business behavior and the location. Human characteristics can also assist with the categorization of the location.

 

Human-Environment Interaction

The interaction of humans with their environment is a major relationship that guides people’s behavior and the characteristics of the location. Activities such as mining and other industries, roads, and landscaping at a location create both positive and negative effects on the environment, but also on humans.

 

A location earmarked as a green belt, to assist in reducing the carbon footprint, or a new interstate changes its current and future characteristics. The location is a main data source for the data science, and, normally, we find unknown or unexpected effects on the data insights.

 

In the Python editor, open a new file named Process_Location.py in the directory ..\VKHCG\01-Vermeulen\03-Process. Start with the standard imports and setup of the ecosystem.
# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='01-Vermeulen'
InputAssessGraphName='Assess_All_Animals.gml'
EDSAssessDir='02-Assess/01-EDS'
InputAssessDir=EDSAssessDir + '/02-Python'
sFileAssessDir=Base + '/' + Company + '/' + InputAssessDir
if not os.path.exists(sFileAssessDir):
    os.makedirs(sFileAssessDir)
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
tMax=360*180
t=0
You will now loop through longitude and latitude.
Warning I am using a step value of 1 in the following sample code. In real-life systems, I use a step value as small as 0.000001. This will result in a long-running loop. To do this, generate the pandas data frame.
for Longitude in range(-180,180,1):
    for Latitude in range(-90,90,1):
        t+=1
        IDNumber=str(uuid.uuid4())
        LocationName='L'+format(round(Longitude,3)*1000, '+07d') +\
            '-'+format(round(Latitude,3)*1000, '+07d')
        print('Create:',t,' of ',tMax,':',LocationName)
        LocationLine=[('ObjectBaseKey', ['GPS']),
                      ('IDNumber', [IDNumber]),
                      ('LocationNumber', [str(t)]),
                      ('LocationName', [LocationName]),
                      ('Longitude', [Longitude]),
                      ('Latitude', [Latitude])]
        if t==1:
            LocationFrame = pd.DataFrame.from_items(LocationLine)
        else:
            LocationRow = pd.DataFrame.from_items(LocationLine)
            LocationFrame = LocationFrame.append(LocationRow)
LocationHubIndex=LocationFrame.set_index(['IDNumber'],inplace=False)
sTable = 'Process-Location'
print('Storing :',sDatabaseName,' Table:',sTable)
LocationHubIndex.to_sql(sTable, conn1, if_exists="replace")
#
sTable = 'Hub-Location'
print('Storing :',sDatabaseName,' Table:',sTable)
LocationHubIndex.to_sql(sTable, conn2, if_exists="replace")
#
print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
sql.execute(sSQL,conn1)
sql.execute(sSQL,conn2)
print('################')
print('### Done!! ############################################')

 

Congratulations! You have created the hub for location. You can expand the knowledge by adding other satellites, for example, postcode, common names, regional information, and other business-specific descriptions.

 

You should investigate libraries such as geopandas (see the GeoPandas documentation), to enhance your location with extra knowledge. You can install it by using conda install -c conda-forge geopandas.
Example:

import geopandas as gpd
from matplotlib import pyplot as pp
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
southern_world = world.cx[:, :0]
southern_world.plot(figsize=(10, 3))
world.plot(figsize=(10, 6))
pp.show()

The library supports an easy technique to handle locations. Now that we have completed the location structure successfully, we have only to handle events, to complete the data vault.

 

Event

This structure records any specific event or action that is discovered in the data sources. An event is any action that occurs within the data sources. Events are recorded using three main data entities: Event Type, Event Group, and Event Code. The details of each event are recorded as a set of details against the event code. There are two main types of events.

 

Explicit Event

This type of event is stated in the data source clearly and with full details. There is clear data to show that the specific action was performed.

Following are examples of explicit events:

  • A security card with number 1234 was used to open door A.
  • I bought ten cans of beef curry.

Explicit events are the events that the source systems supply, as these have direct data that proves that the specific action was performed.

 

Implicit Event

This type of event is formulated from characteristics of the data in the source systems plus a series of insights on the data relationships.

The following are examples of implicit events:

  • A security card with number 8887 was used to open door X.
  • A security card with number 8887 was issued to Mr. Vermeulen.
  • Room 302 is fitted with a security reader marked door X.

 

These three events would imply that Mr. Vermeulen entered room 302 as an event. Not true!

Warning Beware the natural tendency of humans to assume that a set of actions always causes a specific known action. Record only proven facts as events!

 

If I discover that all visitor passes are issued at random and not recovered on exit by the visitor’s desk, there is a probability that there is more than one security card with the number 1234.

 

As a data scientist, you must ensure that you do not create implicit events from assumptions that can impact the decisions of the business. You can only record facts as explicit events.

 

The event is the last of the core structures of the Process step. I will now assist you to convert the basic knowledge of the time, person, object, location, and event structures into a data source that will assist the data scientist to perform all the algorithms and techniques required to convert the data into business insights.

 

These five main pillars (time, person, object, location, and event), as supported by the data extracted from the data sources, will be used by the data scientist to construct the data vault.

 

I will return to the data vault later in this blog. I want to introduce a few data science concepts that will assist you in converting the assess data into a data vault. I suggest you open your Python editor and see if you can create the event hub on your own.

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='01-Vermeulen'
InputFileName='Action_Plan.csv'
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('Loading :',sFileName)
EventRawData=pd.read_csv(sFileName,header=0,low_memory=False,encoding="latin-1")
EventRawData.index.names=['EventID']
EventHubIndex=EventRawData
sTable = 'Process-Event'
print('Storing :',sDatabaseName,' Table:',sTable)
EventHubIndex.to_sql(sTable, conn1, if_exists="replace")
#
sTable = 'Hub-Event'
print('Storing :',sDatabaseName,' Table:',sTable)
EventHubIndex.to_sql(sTable, conn2, if_exists="replace")
#
print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
sql.execute(sSQL,conn1)
sql.execute(sSQL,conn2)
print('################')
print('### Done!! ############################################')

Well done. You just completed the next hub.

 

Tip It takes just a quick why—times five—to verify!

The next technique I employ is useful in 90% of my data science. It usually offers multifaceted solutions but requires a further level of analysis.

 

Monte Carlo Simulation

I want to introduce you to a powerful data science tool called a Monte Carlo simulation.

 

This technique performs analysis by building models of possible results, by substituting a range of values—a probability distribution—for parameters that have inherent uncertainty. It then calculates results over and over, each time using a different set of random values from the probability functions.

 

Depending on the number of uncertainties and the ranges specified for them, a Monte Carlo simulation can involve thousands or tens of thousands of recalculations before it is complete. Monte Carlo simulation produces distributions of possible outcome values.
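
To make the mechanics concrete, here is a minimal Monte Carlo sketch in Python, using numpy. The two task-duration distributions and their parameters are illustrative assumptions only, not values from any real system.

import numpy as np

np.random.seed(1234)
nSimulations = 10000

# Assumed probability distributions for the uncertain input parameters
task_a = np.random.normal(loc=5.0, scale=1.0, size=nSimulations)    # hours
task_b = np.random.uniform(low=2.0, high=6.0, size=nSimulations)    # hours

# Recalculate the outcome for every set of random inputs
total_duration = task_a + task_b

# The result is a distribution of possible outcomes, not a single number
print('Mean total duration :', total_duration.mean())
print('5th / 95th percentile :', np.percentile(total_duration, [5, 95]))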

 

As a data scientist, this gives you an indication of how your model will react under real-life conditions. It also gives you a tool with which to check complex systems in which the input parameters are high in volume or complexity. I will show you a practical use of this tool in the next section.

 

Causal Loop Diagrams

A causal loop diagram (CLD) is a causal diagram that aids in visualizing how a number of variables in a system are interrelated and drive cause-and-effect processes. The diagram consists of a set of nodes and edges.

 

Nodes represent the variables, and edges are the links that represent a connection or a relation between the two variables.

 

I normally use graph theory to model the paths through the system. I simply create a directed graph and then step through it, one step at a time.
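
To make this concrete, the following is a minimal sketch of such a directed graph, assuming the networkx library is available; the node names and edge polarities are illustrative only.

import networkx as nx

CLD = nx.DiGraph()
# Each edge carries a polarity: +1 (variables move together) or -1 (opposite)
CLD.add_edge('Workload', 'Overtime', polarity=+1)
CLD.add_edge('Overtime', 'Fatigue', polarity=+1)
CLD.add_edge('Fatigue', 'Productivity', polarity=-1)
CLD.add_edge('Productivity', 'Workload', polarity=-1)

# Step through the loop, one edge at a time
for source, target, data in CLD.edges(data=True):
    print(source, '->', target, 'polarity:', data['polarity'])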

 

Example: The challenge is to keep the “Number of Employees Available to Work and Productivity” as high as possible.

 

Our first conclusion was “We need more staff.” I then simulated the process, using a hybrid data science technique formulated from two other techniques—Monte Carlo simulation and Deep Reinforcement Learning—against a graph data model.

 

The result was the discovery of several other additional pieces of the process that affected the outcome I was investigating. 

 

The Monte Carlo simulation cycled through a cause-and-effect model by changing the values at points R1 through R4 for three hours, as a training set. The simulation was then run for three days with a reinforced deep learning model that adapted the key drivers in the system.

 

The result was “Managers need to manage, not to work.” R2, the percentage of managers doing employees’ duties, was the biggest cause-and-effect driver in the system.

 

Tip Do not deliver results early if you are not sure of the true data science results. Work effectively and efficiently, not quickly!

 

I routinely have numerous projects going through analysis in parallel. This way, I can process parallel data science tasks while still supporting excellent, effective, and efficient data science.

 

Pareto Chart

Pareto charts are a technique I use to perform a rapid processing plan for data science. Pareto charts can be constructed by segmenting the range of data into groups (also called segments, bins, or categories). 

 

Questions the Pareto chart answers:

  • What are the largest issues facing our team or my customer’s business?
  • What 20% of sources are causing 80% of the problems (80/20 Rule)?
  • Where should we focus our efforts to achieve the greatest improvements?

 

I perform a rapid assessment of the data science processes and determine what it will take to do 80% of the data science effectively and efficiently in the most rapid time frame. It is a maximum-gain technique.
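
As an illustration, the following sketch builds a basic Pareto chart with pandas and matplotlib; the issue categories and counts are made up for the example.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative issue counts, segmented into categories
issues = pd.Series({'Data Quality': 42, 'Late Loads': 18, 'Schema Drift': 9,
                    'Access Rights': 6, 'Other': 3}).sort_values(ascending=False)
cumulative = issues.cumsum() / issues.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(issues.index, issues.values)
ax1.set_ylabel('Issue Count')

ax2 = ax1.twinx()
ax2.plot(issues.index, cumulative.values, marker='o', color='red')
ax2.set_ylabel('Cumulative %')
ax2.axhline(80, linestyle='--', color='grey')   # the 80% line of the 80/20 rule

plt.title('Pareto Chart of Issues')
plt.show()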

 

Tip If you are saving your customer 80% of its proven losses, it is stress-free to find the resources to complete the remaining 20% of the savings.

 

Correlation Analysis

The most common analysis I perform at this step is the correlation analysis of all the data in the data vault. Feature development is performed between data items, to find relationships between data values.

import pandas as pd

# Sample data: each row is an observation; each column is a data item
a = [[1, 2, 3], [5, 6, 9], [7, 3, 13], [5, 3, 19],
     [5, 3, 12], [5, 6, 11], [5, 6, 13], [5, 3, 4]]
df = pd.DataFrame(data=a)

# Pairwise (Pearson) correlation between the columns
cr = df.corr()
print(cr)
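
If you want to run the same analysis against the data vault built earlier, a sketch such as the following should work, assuming the datavault.db and Hub-Event table created in the event hub example above; correlating only the numeric columns is an illustrative choice.

import sys
import os
import sqlite3 as sq
import pandas as pd

# Reuse the data vault location from the earlier examples
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
conn = sq.connect(Base + '/88-DV/datavault.db')

# Pull the event hub and correlate its numeric columns
df = pd.read_sql_query('SELECT * FROM [Hub-Event];', conn)
print(df.select_dtypes(include='number').corr())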

 

Data Science

You must understand that data science works best when you follow approved algorithms and techniques. You can experiment and wrangle the data, but in the end, you must verify and support your results.

 

Applying good data science ensures that you can support your work with acceptable proofs and statistical tests. I believe that data science that works follows these basic steps:

 

1. Start with a question. Make sure you have fully addressed the 5 Whys.

2. Follow a good pattern to formulate a model. Guess a prototype for the data and start a virtual simulation of the real-world parameters.

 

If you have operational research or econometrics skills, this is where you will find a use for them. Mix some mathematics and statistics into the solution, and you have the start of a data science model.

 

Remember: Every question must relate to the customer’s business, provide insights that result in an actionable outcome, and, normally, deliver a quantifiable return on investment from the data science processes you plan to deploy.

 

3. Gather observations and use them to generate a hypothesis. Start the investigation by collecting the required observations, as per your model. Process your model against the observations and prove your hypothesis to be true or false.

 

4. Use real-world evidence to judge the hypothesis. Relate the findings back to the real world and, through storytelling, convert the results into real-life business advice and insights.

 

Caution You can have the best results, but if you cannot explain why they are important, you have not achieved your goal.

 

5. Collaborate early and often with customers and with subject matter experts along the way. Communicate early and often with your relevant experts, to ensure that you take them with you on the journey of discovery.

 

Business people want to be part of solutions to their problems. Your responsibility is to supply good scientific results to support the business.

 

Summary

You have successfully completed the Process step of this blog. You should now have a solid data vault with processed data ready for extraction by the Transform step, to turn your new knowledge into wisdom and understanding.

 

At this point, the Retrieve, Assess, and Process steps have enabled you to change the unknown data in the data lake into a classified series of hubs, i.e., the T-P-O-L-E structures. You should have the following in the data vault:

 

  • Time hub with satellites and links
  • Person hub with satellites and links
  • Object hub with satellites and links
  • Location hub with satellites and links
  • Event hub with satellites and links

The rest of the data science is now performed against the standard structure. This will help you to develop solid and insightful data science techniques and methods against the T-P-O-L-E structure.

 

Note If you have processed the features from the data lake into the data vault, you should now have completed most of the data engineering or data transformation tasks. You have achieved a major milestone by building the data vault.

 

Note The data vault creates a natural history, as it stores all of the preceding processing runs as snapshots in the data vault. You can reprocess the complete Transform super step and recover a full history of every step in the processing.

 

The next step is for you to start performing more pure data science tasks, to investigate what data you have found and how you are going to formulate the data from the data vault into a query-able solution for the business.
