Data Science Process (Best Tutorial 2019)


The Data Science Process

Data science is about extracting value from data in a grounded manner where one realizes that data requires a lot of treatment and work from a lot of stakeholders before becoming valuable.


This tutorial explains the data science process with examples. It also covers several data science transform utilities and functions using Python.


As such, data science as an organizational activity is oftentimes described by means of a “process”: a workflow describing the steps that have to be undertaken in a data science project.


As a data scientist, you have probably already experienced that “data science” has become a very overloaded term indeed.


Companies are coming to terms with the fact that data science incorporates a very broad skill set that is near impossible to find in a single person, hence the need for a multidisciplinary team involving:


  • Fundamental theorists, mathematicians, statisticians (think: regression, Bayesian modeling, linear algebra, Singular Value Decomposition, and so on).


  • Analysts and modelers (think: building a random forest or neural network model using R or SAS).


  • Database administrators (think: DB2 or MSSQL experts, people with a solid understanding of databases and SQL).


  • Business intelligence experts (think: reporting and dashboards, as well as data warehouses and OLAP cubes).


  • IT architects and “DevOps” people (think: people maintaining the modeling environment, tools, and platforms).


  • Big data platform experts (think: people who know their way around Hadoop, Hive, and Spark).


The raw source is data. Many such process frameworks have been proposed, with CRISP-DM and the KDD (Knowledge Discovery in Databases) process being the two most popular ones these days.


Data Science Process

Following are the five fundamental data science process steps that are the core of my approach to practical data science.


Start with a What-if Question

Decide what you want to know, even if it is only the subset of the data lake you want to use for your data science, which is a good start.


Take a Guess at a Potential Pattern

Use your experience or insights to guess a pattern you want to discover, to uncover additional insights from the data you already have.

Once you start working with massive volumes, velocities, and variance in data, you will need a more structured framework to handle the data science.


Data science views data models as a set of algorithms and processing rules applied as part of data processing pipelines. So, make sure that when you talk to people, they are clear on what you are talking about in all communications channels.


It consists of several structures, as follows:

Data schemas and data formats: Functional data schemas and data formats deploy onto the data lake’s raw data, to perform the required schema-on-query via the functional layer.


Data models: These form the basis for future processing to enhance the processing capabilities of the data lake, by storing already processed data sources for future use by other processes against the data lake.


Processing algorithms: The functional processing is performed via a series of well-designed algorithms across the processing chain.


Provisioning of infrastructure: The functional infrastructure provision enables the framework to add processing capability to the ecosystem, using technology such as Apache Mesos, which enables the dynamic provisioning of processing work cells.


The processing algorithms and data models are spread across six super steps for processing the data lake.

1. Retrieve

This super step contains all the processing chains for retrieving data from the raw data lake into a more structured format.

2. Assess

This super step contains all the processing chains for quality assurance and additional data enhancements.

3. Process

This super step contains all the processing chains for building the data vault.

4. Transform

This super step contains all the processing chains for building the data warehouse from the core data vault.

5. Organize

This super step contains all the processing chains for building the data marts from the core data warehouse.

6. Report

This super step contains all the processing chains for building virtualization and reporting of the actionable knowledge.


These six supersteps, discussed in detail in individual blogs devoted to them, enable the reader to master both them and the relevant tools from the Data Science Technology Stack.


  • Identify the business problem:


Similar to the “Business Understanding” step in CRISP-DM, the first step consists of a thorough definition of the business problem.


Some examples: customer segmentation of a mortgage portfolio, retention modeling for a postpaid telco subscription, or fraud detection for credit cards. Defining the scope of the analytical modeling exercise requires a close collaboration between the data scientist and business expert.


Both need to agree on a set of key concepts such as: How do we define a customer, transaction, churn, fraud, etc.; what is it we want to predict (how do we define this), and when are we happy with the outcome.


  • Identify data sources:

Next, all source data that could be of potential interest need to be identified. This is a very important step as data is the key ingredient to any analytical exercise and the selection of data has a deterministic impact on the analytical models built in a later step.


  • Select the data:

The general golden rule here is the more data, the better, though data sources that have nothing to do with the problem at hand should be discarded during this step. All appropriate data will then be gathered in a staging area and consolidated into a data warehouse, data mart, or even a simple spreadsheet file.


  • Clean the data:

After the data has been gathered, a long series of preprocessing and data wrangling steps follows to remove all inconsistencies, such as missing values, outliers, and duplicate data.


  • Transform the data:

The preprocessing step will often include a lengthy transformation part as well. Additional transformations may be considered, such as alphanumeric to numeric coding, geographical aggregation, logarithmic transformation to improve symmetry, and other smart “featurization” approaches.
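As a minimal sketch of the cleaning and transformation steps, the snippet below uses only the Python standard library. The record layout, the field names (`region`, `income`), and the region coding are assumptions made for illustration; they do not come from a real data set.

```python
import math

# Hypothetical raw records: one duplicate and one missing income value.
raw = [
    {"id": 1, "region": "north", "income": 52000.0},
    {"id": 2, "region": "south", "income": None},
    {"id": 2, "region": "south", "income": None},   # duplicate of id 2
    {"id": 3, "region": "north", "income": 480000.0},
]

def clean(records):
    """Drop duplicate ids and rows with a missing income."""
    seen, out = set(), []
    for row in records:
        if row["id"] in seen or row["income"] is None:
            continue
        seen.add(row["id"])
        out.append(row)
    return out

def transform(records):
    """Code the region as a number and log-transform the skewed income."""
    region_codes = {"north": 0, "south": 1}  # alphanumeric-to-numeric coding
    return [
        {
            "id": row["id"],
            "region_code": region_codes[row["region"]],
            "log_income": math.log(row["income"]),
        }
        for row in records
    ]

prepared = transform(clean(raw))
```

Dropping rows with missing values is only one option; imputing them may be more appropriate, depending on the business problem.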


  • Analyze the data:

The steps above correspond with the “Data Understanding” and “Data Preparation” steps in CRISP-DM. Once the data is sufficiently cleaned and processed, the actual analysis and modeling can begin (referred to as “Modeling” in CRISP-DM).


Here, an analytical model is estimated on the preprocessed and transformed data. Depending on the business problem, a particular analytical technique will be selected and implemented by the data scientist.


  • Interpret, evaluate, and deploy the model:

Finally, once the model has been built, it will be interpreted and evaluated by the business experts (denoted as “Evaluation” in CRISP-DM). Trivial patterns and insights that may be detected by the analytical model can still be interesting as they provide a validation of the model.


But, of course, the key challenge is to find the unknown, yet interesting and actionable patterns that can provide new insights into your data.


Once the analytical model has been appropriately validated and approved, it can be put into production as an analytics application (e.g., decision support system, scoring engine, etc.).


It is important to consider how to represent the model output in a user-friendly way, how to integrate it with other applications (e.g., marketing campaign management tools, risk engines, etc.), and how to ensure the analytical model can be appropriately monitored on an ongoing basis.


Often, the deployment of an analytical model will require the support of IT experts that will help to “productionize” the model.


Where Does Web Scraping Fit In?


There are various parts of the data science process where web scraping can fit in. In most projects, however, web scraping will form an important part in the identification and selection of data sources. That is, to collect and gather data you can include in your data set to be used for modeling.


It is important to provide a warning here, which is to always keep the production setting of your constructed models in mind (the “model train” versus “model run” gap).


If you are building a model as a one-shot project that will be used to describe or find some interesting patterns, then, by all means, utilize as much scraped and external data as desired.


In case a model will be productionized as a predictive analytics application, however, keep in mind that the model will need to have access to the same variables at the time of deployment as when it was trained.


You’ll hence carefully need to consider whether it will be feasible to incorporate scraped data sources in such a setup, as it needs to be ensured that the same sources will remain available and that it will be possible to continue scraping them as you go forward.


Websites can change, and a data collection part depending on web scraping requires a great deal of maintenance to implement fixes or changes in a timely manner. In these cases, you still might wish to rely on a more robust solution like an API.


Depending on your project, this requirement might be more or less troublesome to deal with. If the data you’ve scraped refers to aggregated data that “remains valid” for the duration of a whole year, for instance, then you can, of course, continue to use the collected data when running the models during deployment as well (and schedule a refresh of the data well before the year is over).


Always keep the production setting of a model in mind: Will you have access to the data you need when applying and using the model as well, or only during the time of training the model?


Who will be responsible to ensure this data access? Is the model simply a proof of concept with a limited shelf life, or will it be used and maintained for several years going forward?


In some cases, the web scraping part will form the main component of a data science project. This is common in cases where some basic statistics and perhaps an appealing visualization are built over scraped results to present findings and explore the gathered data in a user-friendly way.


Still, the same questions have to be asked here: Is this a one-off report with a limited use time, or is this something people will want to keep up to date and use for a longer period of time?


How you answer these questions will have a great impact on the setup of your web scraper.


In case you only need to gather results using web scraping for a quick proof of concept, a descriptive model, or a one-off report, you can afford to sacrifice robustness for the sake of obtaining data quickly.


In case scraped data will be used during production as well (as was the case for the yearly aggregated information), it can still be feasible to scrape results, although it is a good idea to already think about the next time you’ll have to refresh the dataset and keep your setup as robust and well-documented as possible.


If information has to be scraped every time the model is run, the “scraping” part now effectively becomes part of the deployed setup, including all headaches that come along with it regarding monitoring, maintenance, and error handling. Make sure to agree upfront on which teams will be responsible for this!


There are two other “managerial warnings” we wish to provide here. One relates to data quality.


If you’ve been working with data in an organizational setting, you’ve no doubt heard about the GIGO principle: garbage in, garbage out. When you rely on the World Wide Web to collect data — with all the messiness and unstructuredness that goes with it — be prepared to take a “hit” regarding data quality.


Indeed, it is crucial to incorporate as much cleaning and fail-safes as possible in your scrapers, though you will nevertheless almost always eventually encounter a page where an extra unforeseen HTML tag appears or the text you expected is not there, or something is formatted just slightly differently. A final warning relates to reliability.


The same point holds, in fact, not just for web scraping but also for APIs. Many promising startups over the past years have appeared that utilize Twitter’s, Facebook’s or some other API to provide a great service.


What happens when the provider or owner of that website decides to increase their prices for what they’re offering to others? What happens if they retire their offering?


Many products have simply disappeared because their provider changed the rules. Using external data, in general, is oftentimes regarded as a silver bullet — “If only we could get out the information Facebook has!” — though think carefully and consider all possible outcomes before getting swayed too much by such ideas.


Business Layer


The business layer is the transition point between the nontechnical business requirements and desires and the practical data science, where, I suspect, most readers of this blog will have a tendency to want to spend their careers, doing the perceived more interesting data science.


The business layer does not belong to the data scientist 100%, and normally, its success represents a joint effort among such professionals as business subject matter experts, business analysts, hardware architects, and data scientists.


The business layer is where we record the interactions with the business. This is where we convert business requirements into data science requirements. The business layer must support the comprehensive collection of entire sets of requirements, to be used successfully by the data scientists.


If you want to process data and wrangle with your impressive data science skills, this blog may not be the start of a blog about practical data science that you would expect.


I suggest, however, that you read this blog if you want to work in a successful data science group. As a data scientist, you are not in control of all aspects of a business, but you have a responsibility to ensure that you identify the true requirements.


The Functional Requirements

Functional requirements record the detailed criteria that must be followed to realize the business’s aspirations from its real-world environment when interacting with the data science ecosystem. These requirements are the business’s view of the system, which can also be described as the “Will of the Business.”


Tip: Record all the business’s aspirations. Make everyone supply their input. You do not want to miss anything, as later additions are expensive and painful for all involved.


I use the MoSCoW method as a prioritization technique, to indicate how important each requirement is to the business. I revisit all outstanding requirements before each development cycle, to ensure that I concentrate on the requirements that have the maximum impact on the business at present. Businesses evolve, and you must be aware of their true requirements.


MoSCoW Options

  • Must have: Requirements with the priority “must have” are critical to the current delivery cycle.


  • Should have: Requirements with the priority “should have” are important but not necessary to the current delivery cycle.


  • Could have: Requirements prioritized as “could have” are those that are desirable but not necessary, that is, nice to have to improve the user experience for the current delivery cycle.


  • Won’t have: Requirements with a “won’t have” priority are those identified by stakeholders as the least critical, lowest payback requests, or just not appropriate at that time in the delivery cycle.
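The four options above can be applied mechanically when planning a delivery cycle. The following is a small sketch, not a prescribed tool: the `Requirement` class, the priority keys, and the sample backlog are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Rank of each MoSCoW priority; lower numbers are handled first.
MOSCOW_RANK = {"must": 0, "should": 1, "could": 2, "wont": 3}

@dataclass
class Requirement:
    title: str
    priority: str  # one of the keys in MOSCOW_RANK

def plan_cycle(requirements):
    """Return the requirements for the current cycle, most critical first."""
    in_scope = [r for r in requirements if r.priority != "wont"]
    return sorted(in_scope, key=lambda r: MOSCOW_RANK[r.priority])

backlog = [
    Requirement("Export results to PDF", "could"),
    Requirement("Load daily sales data", "must"),
    Requirement("Real-time dashboard", "wont"),
    Requirement("Email weekly summary", "should"),
]
cycle = plan_cycle(backlog)
```

Re-running `plan_cycle` before each development cycle mirrors the habit of revisiting all outstanding requirements as priorities shift.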


General Functional Requirements

As a [user role] I want [goal] so that [business value] is achieved.


Specific Functional Requirements

The following requirements specific to data science environments will assist you in creating requirements that enable you to transform a business’s aspirations into technical descriptive requirements.


I have found these techniques highly productive in aligning requirements with my business customers, while I can easily convert or extend them for highly technical development requirements.


Data Mapping Matrix

The data mapping matrix is one of the core functional requirement recording techniques used in data science. It tracks every data item that is available in the data sources. I advise that you keep this useful matrix up to date as you progress through the processing layers.


Sun Models

The sun model is a requirement mapping technique that assists you in recording requirements at a level that allows your nontechnical users to understand the intent of your analysis, while providing you with an easy transition to the detailed technical modeling of your data scientists and data engineers.


Note: Over the next few pages, I will introduce several new concepts. Please read on, as the section will help in explaining the complete process.


The Nonfunctional Requirements

Nonfunctional requirements record the precise criteria that must be used to appraise the operation of a data science ecosystem.


Accessibility Requirements

Accessibility can be viewed as the “ability to access” and benefit from some system or entity. The concept focuses on enabling access for people with disabilities, or special needs, or enabling access through assistive technology.


Assistive technology covers the following:

  • Levels of blindness support: Must be able to increase font sizes or types to assist with reading for affected people
  • Levels of color-blindness support: Must be able to change a color palette to match individual requirements
  • Use of voice-activated commands to assist disabled people: Must be able to use voice commands for individuals that cannot type commands or use a mouse in a normal manner


Audit and Control Requirements


Audit is the ability to investigate the use of the system and report any violations of the system’s data and processing rules. Control is making sure the system is used only in the manner, and only by the people, pre-approved to use it.


An approach called role-based access control (RBAC) is the most commonly used approach to restricting system access to authorized users of your system. RBAC is an access-control mechanism formulated around roles and privileges.


The components of RBAC are role-permission, user-role, and role-role relationships, which together describe the system’s access policy.
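To make the three RBAC relationships concrete, here is a minimal in-memory sketch. The role names, permissions, and users are invented for illustration, and a production system would rely on the access-control features of its platform rather than on hand-written dictionaries.

```python
# Role-permission relationships: what each role may do.
ROLE_PERMISSIONS = {
    "analyst": {"read_reports"},
    "data_scientist": {"run_models"},
    "admin": {"manage_users"},
}
# Role-role relationships: a role inherits the permissions of its parents.
ROLE_PARENTS = {
    "data_scientist": {"analyst"},
    "admin": {"data_scientist"},
}
# User-role relationships: which roles each user holds.
USER_ROLES = {"alice": {"admin"}, "bob": {"analyst"}}

def permissions_for_role(role, seen=None):
    """Collect a role's own permissions plus everything it inherits."""
    seen = set() if seen is None else seen
    if role in seen:          # guard against cycles in the role hierarchy
        return set()
    seen.add(role)
    perms = set(ROLE_PERMISSIONS.get(role, set()))
    for parent in ROLE_PARENTS.get(role, set()):
        perms |= permissions_for_role(parent, seen)
    return perms

def is_allowed(user, permission):
    """True if any of the user's roles grants the permission."""
    return any(
        permission in permissions_for_role(role)
        for role in USER_ROLES.get(user, set())
    )
```

For example, `is_allowed("alice", "read_reports")` is true through inheritance, while `is_allowed("bob", "run_models")` is false. Unknown users have no roles and are denied by default.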


These audit and control requirements are also compulsory, by regulations on privacy and processing. Please check with your local information officer which precise rules apply.


Availability Requirements

Availability is the ratio of the expected uptime of a system to the total measured period (uptime plus downtime). For example, if your business hours are between 9h00 and 17h00, and you cannot have more than 1 hour of downtime during your business hours, you require 87.5% availability (7 of the 8 hours).
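The 87.5% figure can be reproduced by treating availability as uptime divided by the total measured period. The function name and its parameters below are illustrative choices, not part of any standard library.

```python
def availability_percent(period_hours, max_downtime_hours):
    """Availability as uptime divided by the total measured period."""
    if period_hours <= 0 or not 0 <= max_downtime_hours <= period_hours:
        raise ValueError("downtime must lie within a positive period")
    uptime = period_hours - max_downtime_hours
    return 100.0 * uptime / period_hours

# Business hours 09h00 to 17h00 (8 hours) with at most 1 hour of downtime.
required = availability_percent(period_hours=8, max_downtime_hours=1)
print(f"{required:.1f}%")  # 87.5%
```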


Take care to specify precisely at what point you expect the availability. If you are measuring at the edge of the data lake, it is highly possible that you will sustain 99.99999% availability with ease.


The distributed and fault-tolerant nature of the data lake technology would ensure a highly available data lake. But if you measure at critical points in the business, you will find that at these critical business points, the requirements are more specific for availability.


Record your requirements in the following format:

Component C will be entirely operational for P% of the time over an uninterrupted measured period of D days.


Your customers will understand this better than the general “24/7” or “business hours” terminology that I have seen used by some of my previous customers. No system can achieve these general requirement statements.


The business will also require high availability at specific periods during the day, week, month, or year. An example would be that every Monday morning the data science results for the weekly meeting have to be available. This could be recorded as the following:


Weekly reports must be entirely operational for 100% of the time between 06h00 and 10h00 every Monday for each office.


Note: Think what this means to a customer that has worldwide offices over several time zones. Be sure to understand every requirement fully!

The correct requirements are:

  • London’s weekly reports must be entirely operational for 100% of the time between 06h00 and 10h00 (Greenwich Mean Time or British Daylight Time) every Monday.


  • New York’s weekly reports must be entirely operational for 100% of the time between 06h00 and 10h00 (Eastern Standard Time or Eastern Daylight Time) every Monday.


Note: You can clearly see that these requirements are now more precise than the simple general requirement.


Identify single points of failure (SPOFs) in the data science solution. Ensure that you record this clearly, as SPOFs can impact many of your availability requirements indirectly.


Highlight dependencies between components that may not be available at the same time. These must be recorded, and requirements specified, to reflect this availability requirement fully.


Note: Recording different availability requirements for different components in the same solution is the optimum requirement recording option.


Backup Requirements


A backup, or the process of backing up, refers to the archiving of the data lake and all the data science programming code, programming libraries, algorithms, and data models, with the sole purpose of restoring these to a known good state of the system, after a data loss or corruption event.


Remember: Even with the best distribution and self-healing capability of the data lake, you have to ensure that you have a regular and appropriate backup to restore. Remember a backup is only valid if you can restore it.


The merit of any system is its ability to return to a good state. This is a critical requirement.


For example, suppose that your data scientist modifies the system with a new algorithm that erroneously updates an unknown amount of the data in the data lake. Oh, yes, that silent moment before every alarm in your business goes mad! You want to be able at all times to return to a known good state via a backup.


Warning: Please ensure that you can restore your backups in an effective and efficient manner. The process is backup-and-restore. Just generating backups does not ensure survival. Understand the impact it has on the business if it goes back two hours or what happens while you restore.
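One way to check that a backup can really be restored is to restore it somewhere else and compare checksums. The sketch below does this for a single file in a temporary directory. The file name and directory layout are hypothetical, and a real data lake backup would use its platform's own tooling rather than plain file copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source, backup_dir, restore_dir):
    """Back up a file, restore it elsewhere, and confirm the copies match."""
    backup = Path(shutil.copy2(source, backup_dir))
    restored = Path(shutil.copy2(backup, restore_dir))
    return sha256_of(source) == sha256_of(restored)

with tempfile.TemporaryDirectory() as work:
    work = Path(work)
    (work / "backup").mkdir()
    (work / "restore").mkdir()
    data_file = work / "model_output.csv"   # hypothetical file name
    data_file.write_text("id,score\n1,0.9\n")
    restore_ok = backup_and_verify(data_file, work / "backup", work / "restore")
```

Timing the restore step in the same test run also gives a first estimate of how long the business would wait during a real recovery.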


Capacity, Current, and Forecast

Capacity is the ability to load, process, and store a specific quantity of data by the data science processing solution.


You must track the current and forecast the future requirements because as a data scientist, you will design and deploy many complex models that will require additional capacity to complete the processing pipelines you create during your processing cycles.


Warning: I have inadvertently created models that generate several terabytes of workspace requirements, simply by setting the parameters marginally more in-depth than optimal. Suddenly, my model was demanding disk space at an alarming rate!



Capacity is measured per the component’s ability to consistently maintain specific levels of performance as data load demands vary in the solution. The correct way to record the requirement is:

Component C will provide P% capacity for U users, each with M MB of data during a time frame of T seconds.

For example: The data hard drive will provide 95% capacity for 1000 users, each with 10 MB of data during a time frame of 10 minutes.
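To see what such a requirement implies, it helps to turn it into a sustained data rate. The arithmetic below uses the figures from the example. The function name is an illustrative choice, and the sketch does not try to interpret the P% part of the template.

```python
def required_throughput_mb_per_s(users, mb_per_user, window_seconds):
    """Sustained data rate needed to serve every user within the window."""
    return users * mb_per_user / window_seconds

# 1000 users, 10 MB each, within 10 minutes (600 seconds).
total_mb = 1000 * 10
rate = required_throughput_mb_per_s(1000, 10, 600)
print(total_mb, round(rate, 2))  # 10000 16.67
```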


Warning: Investigate the capacity required to perform a full rebuild in one process. I advise researching new cloud on-demand capacity, for disaster recovery or capacity top-ups. I have been consulted after major incidents that crippled a company for weeks, owing to a lack of proper capacity top-up plans.



Concurrency

Concurrency is the measure of a component’s ability to maintain a specific level of performance under multiple simultaneous load conditions.

The correct way to record the requirement is:

Component C will support a concurrent group of U users running predefined acceptance script S simultaneously.



The memory will support a concurrent group of 100 users running a sort algorithm of 1000 records simultaneously.
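A requirement like this can be exercised with a small load test. The sketch below simulates 100 users, each sorting 1000 records, using a thread pool. The `acceptance_script` function is a stand-in invented for illustration. Note that CPU-bound Python threads interleave rather than run truly in parallel, so this only approximates simultaneous load.

```python
import random
from concurrent.futures import ThreadPoolExecutor

USERS = 100      # size of the concurrent group
RECORDS = 1000   # records sorted by each user

def acceptance_script(user_id):
    """Stand-in for acceptance script S: sort 1000 pseudo-random records."""
    rng = random.Random(user_id)          # per-user seed keeps runs repeatable
    records = [rng.random() for _ in range(RECORDS)]
    result = sorted(records)
    return all(a <= b for a, b in zip(result, result[1:]))

# Submit all users at once; the pool runs up to USERS scripts concurrently.
with ThreadPoolExecutor(max_workers=USERS) as pool:
    outcomes = list(pool.map(acceptance_script, range(USERS)))

passed = sum(outcomes)
```

If `passed` equals `USERS`, every simulated user completed the script correctly under the shared load.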


Note: Concurrency is the ability to handle a subset of the total user base effectively. I have found that numerous solutions can handle substantial volumes of users with as little as 10% of the users running concurrently.


Concurrency is an important requirement to ensure an effective solution at the start. Capacity can be increased by adding extra processing resources, while concurrency normally involves complete replacements of components.


Design Tip: If on average you have short-running data science algorithms, you can support a high concurrency-to-maximum-capacity ratio. But if your average running time is higher, your concurrency must be higher too. This way, you will maintain an effective throughput performance.
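The link between running time and concurrency can be estimated with Little's Law, which states that the average number of jobs in flight equals the arrival rate multiplied by the average time each job spends in the system. Framing the design tip this way is an interpretation added here, and the job rates below are invented for illustration.

```python
def required_concurrency(jobs_per_second, avg_run_seconds):
    """Little's Law: average jobs in flight = arrival rate x time in system."""
    return jobs_per_second * avg_run_seconds

# Same throughput target, different average running times.
short_jobs = required_concurrency(jobs_per_second=10, avg_run_seconds=0.5)
long_jobs = required_concurrency(jobs_per_second=10, avg_run_seconds=6)
```

At the same target of 10 jobs per second, half-second jobs need about 5 concurrent slots, while six-second jobs need about 60.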


Throughput Capacity

This is the number of transactions the system must handle at peak time under specific conditions.


Storage (Memory)

This is the volume of data the system will persist in memory at runtime to sustain an effective processing solution.


Tip: You can never have too much memory, or memory that is too fast.


Storage (Disk)

This is the volume of data the system stores on disk to sustain an effective processing solution.


Tip: Make sure that you have a proper mix of disks, to ensure that your solutions are effective.

You will need short-term storage on fast solid-state drives to handle the while-processing capacity requirements.


Warning: There are data science algorithms that produce larger data volumes during data processing than the input or output data.


The next requirement is your long-term storage. The basic rule is to plan for bigger but slower storage.


Investigate using clustered storage, whereby two or more storage servers work together to increase performance, capacity, and reliability. Clustering distributes workloads to each server and manages the transfer of workloads between servers while ensuring availability.


The use of clustered storage will benefit you in the long term, during periods of higher demand, to scale out horizontally with extra equipment.


Tip: Ensure that the server network is more than capable of handling any data science load. Remember: The typical data science algorithm requires massive data sets to work effectively.


The big data revolution is now bringing massive amounts of data into the processing ecosystem. So, make sure you have enough space to store any data you need.

Warning: If you have a choice, do not share disk storage or networks with a transactional system. The data science will consume any spare capacity on the shared resources. It is better to have a lower performance dedicated set of resources than to share a volatile process.


Storage (GPU)


This is the volume of data the system will persist in GPU memory at runtime to sustain an effective parallel processing solution, using the graphical processing capacity of the solution.


A CPU consists of a limited number of cores that are optimized for sequential processing, while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores intended for handling many tasks simultaneously.


The big advantage is to connect an effective quantity of very high-speed memory as closely as possible to these thousands of processing units, to use this increased capacity. I am currently deploying systems such as Kinetica and MapD, which are GPU-based database engines.


This improves the processing of my solutions by factors of a hundred in speed. I suspect that we will see key enhancements in the capacity of these systems over the next years.


Tip: Investigate a GPU processing grid for your high-performance processing. It is an effective solution with the latest technology.


Year-on-Year Growth Requirements

The biggest growth in capacity will be for long-term storage. These requirements are specified as how much capacity increases over a period.

The correct way to record the requirement is:

Component C will be responsible for the necessary growth capacity to handle an additional M MB of data within a period of T.
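The value of M for each period can be estimated from a growth forecast. The sketch below assumes compound year-on-year growth, which is a modeling assumption rather than something the requirement format prescribes. The starting size and growth rate are invented figures.

```python
def additional_mb_per_year(current_mb, annual_growth_rate, years):
    """Extra storage (M MB) needed in each year under compound growth."""
    extra = []
    size = current_mb
    for _ in range(years):
        growth = size * annual_growth_rate
        extra.append(growth)
        size += growth
    return extra

# 1000 MB today, growing by 50% per year, planned over 3 years.
plan = additional_mb_per_year(current_mb=1000, annual_growth_rate=0.5, years=3)
print(plan)  # [500.0, 750.0, 1125.0]
```

Each entry in `plan` can be placed directly into the template as the M MB value for that year.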


Configuration Management

Configuration management (CM) is a systems engineering process for establishing and maintaining consistency of a product’s performance, functional, and physical attributes against requirements, design, and operational information throughout its life.



A methodical procedure for introducing data science to all areas of an organization is required. Investigate how to achieve a practical continuous deployment of the data science models. These skills are much in demand, as the process models change more frequently as the business adopts new processing techniques.




Data science requires a set of documentation to support the story behind the algorithms. I will explain the documentation required at each stage of the processing pipe.


Disaster Recovery

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.


Efficiency (Resource Consumption for Given Load)

Efficiency is the ability to accomplish a job with a minimum expenditure of time and effort. As a data scientist, you are required to understand the efficiency curve of each of your modeling techniques and algorithms. As I suggested before, you must practice with your tools at different scales.


Tip: If it works with a sample of 100,000 data points, try 200,000 data points or 500,000 data points. Make sure you understand the scaling dynamics of your tools.
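A quick way to observe scaling dynamics is to time the same operation at each of these sample sizes. The sketch below uses Python's built-in sort as a stand-in for a modeling technique, which is an assumption made for illustration. The timings depend on the machine, so no expected output is shown.

```python
import random
import time

def time_sort(n, seed=0):
    """Seconds taken to sort n pseudo-random numbers."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    start = time.perf_counter()
    sorted(data)
    return time.perf_counter() - start

timings = {n: time_sort(n) for n in (100_000, 200_000, 500_000)}
base = timings[100_000]
for n, seconds in timings.items():
    print(f"{n:>7} points: {seconds:.4f}s ({seconds / base:.1f}x the base run)")
```

If the ratio grows much faster than the data size, the technique may not scale to the full data lake without extra capacity.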


Effectiveness (Resulting Performance in Relation to Effort)

Effectiveness is the ability to accomplish a purpose; producing the precise intended or expected result from the ecosystem.


As a data scientist, you are required to understand the efficiency curve of each of your modeling techniques and algorithms. You must ensure that the process is performing only the desired processing and has no negative side effects.



Extensibility

Extensibility is the ability to add extra features and carry forward customizations at next-version upgrades within the data science ecosystem. The data science must always be capable of being extended to support new requirements.


Failure Management


Failure management is the ability to identify the root cause of a failure and then successfully record all the relevant details for future analysis and reporting.


I have found that most of the tools I would include in my ecosystem have adequate fault management and reporting capability already built into their native internal processing.


Tip: I found that it takes a simple but well-structured set of data science processes to wrap the individual failure logs into a proper failure-management system. Apply normal data science to it, as if it were just one more data source.


I always stipulate the precise expected process steps required when a failure of any component of the ecosystem is experienced during data science processing.


Acceptance script S completes and reports every one of the X faults it generates.

As a data scientist, you are required to log any failures of the system, to ensure that no unwanted side effects are generated that may cause a detrimental impact to your customers.
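The requirement that a script completes and reports every fault can be sketched as a runner that catches each failure and records its details. The step functions and the log fields below are hypothetical; a real ecosystem would feed these records into its own logging tools.

```python
from datetime import datetime, timezone

def run_with_failure_log(steps):
    """Run every step; record each fault instead of stopping at the first."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # broad on purpose: every fault is reported
            failures.append({
                "step": name,
                "error": type(exc).__name__,
                "detail": str(exc),
                "when": datetime.now(timezone.utc).isoformat(),
            })
    return failures

def load_ok():
    return [1, 2, 3]

def parse_bad():
    return int("not a number")   # raises ValueError

def divide_bad():
    return 1 / 0                 # raises ZeroDivisionError

steps = [("load", load_ok), ("parse", parse_bad), ("divide", divide_bad)]
failure_log = run_with_failure_log(steps)
```

Here two faults are generated and two are reported, so the acceptance condition holds. The resulting list of records can then be analyzed like any other data source.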




Fault Tolerance

Fault tolerance is the ability of the data science ecosystem to handle faults in the system’s processing. In simple terms, no single event must be able to stop the ecosystem from continuing the data science processing.


Here, I normally stipulate the precise operating system monitoring, measuring, and management requirements within the ecosystem, when faults are recorded.


Acceptance script S withstands the X faults it generates. As a data scientist, you are required to ensure that your data science algorithms can handle faults and recover from them in an orderly manner.
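
The two acceptance criteria (report every fault, and withstand every fault) can be illustrated together. This is a minimal sketch under stated assumptions: `run_all` and the step names are hypothetical, and a production ecosystem would write the failure records to a durable store rather than a list.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ecosystem")


def run_all(steps):
    """Run every step; record failures instead of stopping the pipeline.

    `steps` is a list of (name, callable) pairs. Returns a tuple of
    (results, failures), where failures holds the recorded details.
    """
    results, failures = {}, []
    for name, step in steps:
        try:
            results[name] = step()
        except Exception as exc:  # no single event may stop processing
            failures.append({"step": name, "error": repr(exc)})
            log.error("step %s failed: %r", name, exc)
    return results, failures


steps = [
    ("load", lambda: [1, 2, 3]),
    ("divide", lambda: 1 / 0),  # deliberately generated fault
    ("summarize", lambda: sum([1, 2, 3])),
]
results, failures = run_all(steps)
print(results)
print(failures)
```

The failure records can then be treated as one more data source for later analysis and reporting.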



Latency is the time it takes to get the data from one part of the system to another. This is highly relevant in the distributed environment of the data science ecosystems.


Acceptance script S completes within T seconds on an unloaded system and within T2 seconds on a system running at maximum capacity, as defined in the concurrency requirement.


Tip Remember: There is also an internal latency between components that make up the ecosystem that is not directly accessible to users. Make sure you also note these in your requirements.



Insist on a precise ability to share data between different computer systems under this section.


Explain in detail which systems must interact with which other systems. I normally investigate areas such as communication protocols, locations of servers, operating systems for different subcomponents, and the now-important end user’s Internet access criteria.


Warning Be precise with requirements, as open-ended interoperability can cause unexpected complications later in the development cycle.



Insist on a precise period during which a specific component is kept in a specified state. Describe precisely how changes to functionalities, repairs, and enhancements are applied while keeping the ecosystem in a known good state.



Stipulate the exact amount of change the ecosystem must support for each layer of the solution.


Tip State what the limits are for specific layers of the solution. If the database can only support 2024 fields to a table, share that information!


Network Topology

Stipulate and describe the detailed network communication requirements within the ecosystem for processing. Also, state the expected communication to the outside world, to drive successful data science.


Note Several distributed data science algorithms have a high impact on network traffic, so you must understand and record the network capacity the ecosystem needs to operate at an acceptable level.




I suggest listing the exact privacy laws and regulations that apply to this ecosystem. Seek legal advice if you are unsure.


This is a hot topic worldwide, as you will process and store other people’s data and execute algorithms against this data. As a data scientist, you are responsible for your actions.


Warning Remember: A privacy violation can result in a fine!


Tip I hold liability insurance against legal responsibility claims for the data I process.



Specify rigorously the faults discovered, faults delivered, and fault removal efficiency at all levels of the ecosystem. Remember: Data quality is a functional requirement. This requirement, by contrast, is a nonfunctional one that states the quality of the ecosystem, not of the data flowing through it.



The ecosystem must have a clear-cut mean time to recovery (MTTR) specified. The MTTR for specific layers and components in the ecosystem must be separately specified. I typically measure in hours, but for other extra-complex systems, I measure in minutes or even seconds.



The ecosystem must have a precise mean time between failures (MTBF). This measurement of availability is specified in a pre-agreed unit of time. I normally measure in hours, but there are extra sensitive systems that are best measured in years.
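
Both measures are simple ratios, and together they give the commonly used steady-state availability, MTBF / (MTBF + MTTR). The sketch below uses hypothetical figures for one component over one reporting period; the function names are illustrative only.

```python
def mtbf_hours(total_uptime_hours, failure_count):
    """Mean time between failures: uptime divided by number of failures."""
    return total_uptime_hours / failure_count


def mttr_hours(total_repair_hours, failure_count):
    """Mean time to recovery: repair time divided by number of failures."""
    return total_repair_hours / failure_count


def availability(mtbf, mttr):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf / (mtbf + mttr)


# Hypothetical figures: 990 hours up, 10 hours under repair, 5 failures.
mtbf = mtbf_hours(total_uptime_hours=990.0, failure_count=5)
mttr = mttr_hours(total_repair_hours=10.0, failure_count=5)
print(f"MTBF={mtbf} h, MTTR={mttr} h, availability={availability(mtbf, mttr):.2%}")
```

Specifying the two values separately per layer makes it possible to check each layer against its own target.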



Resilience is the capability to deliver and preserve a tolerable level of service when faults and issues to normal operations generate complications for the processing. The ecosystem must have a defined ability to return to the original form and position in time, regardless of the issues it has to deal with during processing.


Resource Constraints

Resource constraints are the physical requirements of all the components of the ecosystem. The areas of interest are processor speed, memory, disk space, and network bandwidth, plus, normally, several other factors specified by the tools that you deploy into the ecosystem.


Tip Discuss these requirements with your system’s engineers. This is not normally the area in which data scientists work.



Reusability is the use of pre-built processing solutions in the data science ecosystem development process. The reuse of preapproved processing modules and algorithms is highly advised in the general processing of data for the data scientists. The requirement here is that you use approved and accepted standards to validate your own results.


Warning I always advise that you use methodologies and algorithms that have proven lineage. An approved algorithm will guarantee acceptance by the business. Do not use unproven ideas!



Scalability is how you get the data science ecosystem to adapt to your requirements. I use three scalability models in my ecosystem: horizontal, vertical, and dynamic (on-demand).


Horizontal scalability increases capacity in the data science ecosystem through more separate resources, to improve performance and provide high availability (HA). The ecosystem grows by a scale-out, by adding more servers to the data science cluster of resources.


Tip Horizontal scalability is the proven way to handle full-scale data science ecosystems.


Warning Not all models and algorithms can scale horizontally. Test them first. I would counsel against making assumptions.

Vertical scalability increases capacity by adding more resources (more memory or an additional CPU) to an individual machine.


Warning Make sure that you size your data science building blocks correctly at the start, as vertical scaling of clusters can get expensive and complex to swap at later stages.


Dynamic (on-demand) scalability increases capacity by adding more resources, using either public or private cloud capability, which can be increased and decreased on a pay-as-you-go model.


This is a hybrid model using a core set of resources that is the minimum footprint of the system, with additional burst agreements to cover any planned or even unplanned extra scalability increases in capacity that the system requires.


I’d like to discuss scalability for your power users. Traditionally, I would have suggested high-specification workstations, but I have found that you will serve them better by providing them access to a flexible horizontal scalability on-demand environment.


This way, they use what they need during peak periods of processing but share the capacity with others when they do not require the extra processing power.




One of the most important nonfunctional requirements is security. I specify security requirements at three levels.



I would specifically note requirements that specify protection for sensitive information within the ecosystem. Types of privacy requirements to note include data encryption for database tables and policies for the transmission of data to third parties.


Tip Sources for privacy requirements are legislative or corporate. Please consult your legal experts.



I would specifically note requirements for the physical protection of the system. Include physical requirements such as power, elevated floors, extra server cooling, fire prevention systems, and cabinet locks.


Warning Some of the high-performance workstations required to process data science have stringent power requirements, so ensure that your data scientists are in a preapproved environment, to avoid overloading the power grid.



I purposely specify detailed access requirements with defined account types/groups and their precise access rights.


Tip I use role-based access control (RBAC) to regulate access to data science resources, based on the roles of individual users within the ecosystem and not on their individual names. This way, I simply move the role to a new person, without any changes to the security profile.
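
The idea of moving a role rather than editing rights can be sketched in a few lines. The roles, rights, and user names below are hypothetical, and a real system would load them from a managed store instead of module-level dictionaries.

```python
# Hypothetical roles and rights, for illustration only.
ROLE_RIGHTS = {
    "data_scientist": {"read_lake", "run_model"},
    "data_engineer": {"read_lake", "write_lake", "run_backup"},
}

USER_ROLES = {"alice": "data_scientist", "bob": "data_engineer"}


def is_allowed(user, right):
    """Check a right via the user's role, never via the user's name."""
    role = USER_ROLES.get(user)
    return right in ROLE_RIGHTS.get(role, set())


def reassign_role(old_user, new_user):
    """Move a role to a new person; the role's rights stay untouched."""
    USER_ROLES[new_user] = USER_ROLES.pop(old_user)


print(is_allowed("alice", "run_model"))
reassign_role("alice", "carol")
print(is_allowed("carol", "run_model"))
print(is_allowed("alice", "run_model"))
```

Note that `ROLE_RIGHTS` is never modified during the handover, which is the point of the approach.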



International standard IEEE 1233-1998 states that testability is the “degree to which a requirement is stated in terms that permit the establishment of test criteria and performance of tests to determine whether those criteria have been met.” In simple terms, if your requirements are not testable, do not accept them.


Remember A lower degree of testability results in increased test effort. I have spent too many nights creating tests for requirements that are unclear.


Following is a series of suggestions, based on my experience.



Knowing the precise degree to which I can control the state of the code under test, as required for testing, is essential.


The algorithms used by data science are not always controllable, as they include random start points to speed the process. Running distributed algorithms is not easy to deal with, as the distribution of the workload is not under your control.
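
For the random start points, one common way to regain control during testing is to fix the seed. A minimal sketch, assuming a hypothetical `random_start_points` helper; it does not address the distributed-workload problem, which remains outside your control.

```python
import random


def random_start_points(k, seed=None):
    """Return k pseudo-random start points in [0, 1).

    With a fixed seed the result is repeatable, which makes the code
    under test controllable; with seed=None it varies per run.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(k)]


first = random_start_points(3, seed=2019)
second = random_start_points(3, seed=2019)
print(first == second)
```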


Isolate Ability

The specific degree to which I can isolate the code under test will drive most of the possible testing. A process such as deep learning includes non-isolation, so do not accept requirements that you cannot test, owing to not being able to isolate them.



I have found that most algorithms have undocumented “extra features” or, in simple terms, “got-you” states. The degree to which the algorithms under test are documented directly impacts the testability of requirements.



I have found the degree to which I can automate testing of the code directly impacts the effective and efficient testing of the algorithms in the ecosystem. I am an enthusiast of known result inline testing.


I add code to my algorithms that tests specific subsections, to ensure that new code has not altered the previously verified code.
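
A minimal sketch of what known-result inline testing can look like, assuming a hypothetical `mean` utility with a `self_test` function shipped next to it. The checks compare the code against results that are known in advance.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def self_test():
    """Known-result checks that run before the utility is trusted."""
    assert mean([2, 4, 6]) == 4
    assert mean([1.5]) == 1.5
    return True


# Run the inline checks at load time, so a regression fails fast.
self_test()
print(mean([10, 20, 30, 40]))
```

Because the checks run every time the module is loaded, a change that breaks previously verified behavior is caught before any results are produced.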


Common Pitfalls with Requirements


I just want to list a sample of common pitfalls I have noted while performing data science for my customer base. If you are already aware of these pitfalls, well done!


Many seem obvious; however, I regularly work on projects in which these pitfalls have cost my clients millions of dollars before I was hired. So, let’s look at some of the more common pitfalls I encounter regularly.



Ambiguity occurs when a word within the requirement has multiple meanings.

Examples are listed following.



The system must pass between 96–100% of the test cases using current standards for data science.

What are the “current standards”? This is an example of an unclear requirement!



The report must easily and seamlessly integrate with the websites.

“Easily” and “seamlessly” are highly subjective terms where testing is concerned.



The solution should be tested under as many hardware conditions as possible.

“As possible” makes this requirement optional. What if it fails testing on every hardware setup? Is that okay with your customer?



The solution must support Hive 2.1 and other database versions.

Do other database versions only include other Hive databases, or also others such as HBase version 1.0 and Oracle version 10g?


Engineering a Practical Business Layer

The business layer follows general business analysis and project management principles. I suggest a practical business layer consist of a minimum of three primary structures.


For the business layer, I suggest using a directory structure. This enables you to keep your solutions clean and tidy for successful interaction with a standard version-control system.



Every requirement must be recorded with full version control, in a requirement-per-file manner. I suggest a numbering scheme of 000000-00, which supports up to a million requirements with up to a hundred versions of each requirement.
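
The 000000-00 scheme is easy to enforce in code. The helpers below are a hypothetical sketch for building and validating such identifiers; the function names are not part of any existing tool.

```python
import re

REQ_PATTERN = re.compile(r"(\d{6})-(\d{2})")


def format_requirement_id(number, version):
    """Build an ID such as 000123-04 from a number and a version."""
    if not (0 <= number <= 999_999 and 0 <= version <= 99):
        raise ValueError("number or version outside the 000000-00 scheme")
    return f"{number:06d}-{version:02d}"


def parse_requirement_id(req_id):
    """Split an ID back into (number, version); reject malformed IDs."""
    match = REQ_PATTERN.fullmatch(req_id)
    if match is None:
        raise ValueError(f"not a valid requirement ID: {req_id!r}")
    return int(match.group(1)), int(match.group(2))


print(format_requirement_id(123, 4))
print(parse_requirement_id("000123-04"))
```

Using the identifier as the file name keeps the requirement-per-file layout sortable in a version-control system.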


Requirements Registry

Keep a summary registry of all requirements in one single file, to assist with searching for specific requirements. I suggest you have columns for the requirement number, MoSCoW priority, a short description, date created, date of last version, and status. I normally use the following status values:

  • In-Development
  • In-Production
  • Retired

The registry acts as a control for the data science environment’s requirements.


Traceability Matrix

Create a traceability matrix against each requirement and the data science process you developed, to ensure that you know what data science process supports which requirement. This ensures that you have complete control of the environment. Changes are easy if you know how everything interconnects.
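
A traceability matrix can start as a simple mapping. The sketch below assumes hypothetical process names and requirement IDs, and also reports requirements that no process supports yet.

```python
# Hypothetical mapping: which process supports which requirement IDs.
PROCESS_SUPPORTS = {
    "retrieve_country_currency": ["000001-00", "000002-01"],
    "assess_currency_codes": ["000002-01"],
}
ALL_REQUIREMENTS = ["000001-00", "000002-01", "000003-00"]


def build_matrix(process_supports, requirements):
    """Return {requirement: sorted list of processes that support it}."""
    matrix = {req: [] for req in requirements}
    for process, reqs in process_supports.items():
        for req in reqs:
            matrix.setdefault(req, []).append(process)
    return {req: sorted(procs) for req, procs in matrix.items()}


matrix = build_matrix(PROCESS_SUPPORTS, ALL_REQUIREMENTS)
unsupported = [req for req, procs in matrix.items() if not procs]
print(matrix)
print("Requirements without a process:", unsupported)
```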


Utility Layer

The utility layer is used to store repeatable practical methods of data science. The objective of this blog is to define how the utility layer is used in the ecosystem.


Utilities are the common and verified workhorses of the data science ecosystem. The utility layer is a central storehouse for keeping all of your solution utilities in one place.


Having a central store for all utilities ensures that you do not use out-of-date or duplicate algorithms in your solutions. The most important benefit is that you can use stable algorithms across your solutions.


Tip Collect all your utilities (including source code) in one central place. Keep records on all versions for future reference.


If you use algorithms, I suggest that you keep any proof and credentials that show that the process is a high-quality, industry-accepted algorithm. Hard experience has taught me that you are likely to be tested, making it essential to prove that your science is 100% valid.


The additional value is the capability of larger teams to work on a similar project and know that each data scientist or engineer is working to the identical standards. In several industries, it is a regulated requirement to use only sanctioned algorithms.


On May 25, 2018, the European Union General Data Protection Regulation (GDPR) went into effect. The GDPR includes the following rules:


You must have valid consent as a legal basis for processing. For any utilities you use, it is crucial to test for consent.


You must assure transparency, with clear information about what data is collected and how it is processed. Utilities must generate complete audit trails of all their activities.


You must support the right to accurate personal data. Utilities must use only the latest accurate data.


You must support the right to have personal data erased. Utilities must support the removal of all information on a specific person. I will also discuss what happens if the “right to be forgotten” is requested. This sounds easy at first, but take warning from my experiences and ensure that this request is implemented with care.


You must have the approval to move data between service providers. I advise you to make sure you have 100% approval to move data between data providers. If you move the data from your customer’s systems to your own systems without clear approval, both you and your customer may be in trouble with the law.


You must support the right not to be subject to a decision based solely on automated processing.


This item is the subject of debate in many meetings that I attend. By the nature of what we as data scientists perform, we are conducting, more or less, a form of profiling. The actions of our utilities support decisions from our customers. The use of approved algorithms in our utilities makes compliance easier.


I suggest you investigate the rules and conditions for processing any data you handle. In addition, I advise you to get your utilities certified, to show compliance. Discuss with your chief data officer what procedures are used and which prohibited procedures require checking.
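
As an illustration of two of these obligations (erasure and audit trails), the following sketch removes all rows for one person and records the action. Everything here is hypothetical: the record layout, the `erase_person` helper, and the in-memory `audit_trail`. Whether an identifier may itself remain in an audit entry is a question for your legal experts, not something this sketch decides.

```python
from datetime import datetime, timezone

audit_trail = []


def audit(action, detail):
    """Append one timestamped entry to the in-memory audit trail."""
    audit_trail.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
    })


def erase_person(records, person_id):
    """Return records without the given person and log the erasure."""
    kept = [row for row in records if row["person_id"] != person_id]
    audit("erase", {"person_id": person_id, "removed": len(records) - len(kept)})
    return kept


records = [
    {"person_id": 1, "city": "London"},
    {"person_id": 2, "city": "Berlin"},
    {"person_id": 1, "city": "Paris"},
]
records = erase_person(records, person_id=1)
print(records)
print(audit_trail[-1]["detail"])
```

A real implementation would also have to reach backups, derived data sets, and downstream copies, which is where most of the care is needed.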


Basic Utility Design

The basic utility must have a common layout to enable future reuse and enhancements. This standard makes the utilities more flexible and effective to deploy in a large-scale ecosystem.


I use a basic design for a processing utility, by building it as a three-stage process.

  1. Load data as per input agreement.
  2. Apply processing rules of utility.
  3. Save data as per output agreement.
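
The three stages above can be sketched as three small functions. This is a minimal illustration, assuming CSV text as both the input and the output agreement and a whitespace-stripping rule as the processing step; the function names and sample data are hypothetical.

```python
import csv
import io


def load(source):
    """Stage 1: load rows as per the input agreement (CSV text here)."""
    return list(csv.DictReader(io.StringIO(source)))


def process(rows):
    """Stage 2: apply the utility's rule (strip spaces from every value)."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def save(rows):
    """Stage 3: save rows as per the output agreement (CSV text here)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


raw = "country,currency\n United Kingdom , GBP \nGermany,EUR\n"
print(save(process(load(raw))))
```

Keeping the stages separate means a new processing rule only replaces `process`, while the load and save agreements stay stable.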


The main advantage of this methodology in the data science ecosystem is that you can build a rich set of utilities that all your data science algorithms require. That way, you have a basic pre-validated set of tools to use to perform the common processing and then spend time only on the custom portions of the project.


You can also enhance the processing capability of your entire project collection with one single new utility update.


Note I spend more than 80% of my non-project work time designing new utilities and algorithms to improve my delivery capability. I suggest that you start your utility layer with a small utility set that you know works well and build the set out as you go along.


In this blog, I will guide you through utilities I have found to be useful over my years of performing data science. I have split the utilities across various layers of the ecosystem, to assist you in connecting the specific utility to specific parts of the other blogs.

There are three types of utilities:

  • Data processing utilities
  • Maintenance utilities
  • Processing utilities


Data Processing Utilities

Data processing utilities are grouped together because they all perform some form of data transformation within the solutions.


Retrieve Utilities

Utilities for this super step contain the processing chains for retrieving data out of the raw data lake into a new structured format. 


I suggest that you build all your retrieve utilities to transform the external raw data lake format into the Homogeneous Ontology for Recursive Uniform Schema (HORUS) data format that I have been using in my projects.


HORUS is my core data format. It is used by my data science framework, to enable the reduction of development work required to achieve a complete solution that handles all data formats.


Data Stream to HORUS

These expert utilities enable your solution to handle data streams. Data streams are evolving as the fastest-growing data collecting interface at the edge of the data lake. I will offer extended discussions and advice later in the blog on the use of data streaming in the data science ecosystem.


In the Retrieve super step of the functional layer, I dedicate more text to clarifying how to use and generate full processing chains to retrieve data from your data lake, using optimum techniques.


Assess Utilities

Utilities for this super step contain all the processing chains for quality assurance and additional data enhancements.


The assess utilities ensure that the data imported via the Retrieve super step is of good quality and conforms to the prerequisite standards of your solution. I perform feature engineering at this level, to improve the data for better processing success in the later stages of data processing.


There are two types of assessing utilities:


Feature Engineering

Feature engineering is the process by which you enhance or extract data sources, to enable better extraction of characteristics you are investigating in the data sets. Following is a small subset of the utilities you may use.


Fixers Utilities

Fixers enable your solution to take your existing data and fix a specific quality issue. Examples include

Removing leading or lagging spaces from a data entry

The example in Python:

baddata = "  Data Science with too many spaces is bad!!!  "
print('>', baddata, '<')
cleandata = baddata.strip()  # strip() removes leading and trailing whitespace
print('>', cleandata, '<')

Removing nonprintable characters from a data entry

Example in Python:

import string
printable = set(string.printable)  # set of all printable ASCII characters
baddata = "Data\x00Science with\x02 funny characters is \x10bad!!!"
# Keep only the characters that are in the printable set.
cleandata = ''.join(filter(lambda x: x in printable, baddata))
print('Bad Data : ', baddata)
print('Clean Data : ', cleandata)


Process Utilities

Utilities for this super step contain all the processing chains for building the data vault. 

I will discuss the data vault’s (Time, Person, Object, Location, Event) design, model, and inner workings in detail during the Process super step of the functional layer.


For the purposes of this blog, I will at this point introduce the data vault as a data structure that uses well-structured design to store data with full history. The basic elements of the data vault are hubs, satellites, and links.


There are three basic process utilities.


Data Vault Utilities

The data vault is a highly specialized data storage technique that was designed by Dan Linstedt. The data vault is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between third normal form (3NF) and star schema.


Hub Utilities

Hub utilities ensure that the integrity of the data vault’s (Time, Person, Object, Location, Event) hubs is 100% correct, to verify that the vault is working as designed.


Satellite Utilities

Satellite utilities ensure the integrity of the specific satellite and its associated hub.


Link Utilities

Link utilities ensure the integrity of the specific link and its associated hubs. As the data vault is a highly structured data model, the utilities in the Process super step of the functional layer will assist you in building your own solution.
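
To make the integrity checks concrete, here is a minimal sketch of a miniature vault held in plain Python structures. The layout, keys, and helper names are hypothetical and far simpler than a real data vault; the sketch only shows the kind of check a satellite or link utility performs.

```python
# Hypothetical miniature vault: hub keys, satellites, and links.
hubs = {
    "person": {"P1", "P2"},
    "location": {"L1"},
}
satellites = [
    {"hub": "person", "key": "P1", "name": "Ann", "load_date": "2019-01-01"},
    {"hub": "person", "key": "P3", "name": "Bob", "load_date": "2019-01-02"},
]
links = [
    {"person": "P1", "location": "L1"},
    {"person": "P2", "location": "L9"},
]


def orphan_satellites(hubs, satellites):
    """Satellite rows whose key is missing from the associated hub."""
    return [s for s in satellites if s["key"] not in hubs.get(s["hub"], set())]


def broken_links(hubs, links):
    """Link rows that reference a key absent from one of their hubs."""
    return [
        link for link in links
        if any(key not in hubs.get(hub, set()) for hub, key in link.items())
    ]


print(orphan_satellites(hubs, satellites))
print(broken_links(hubs, links))
```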


Transform Utilities

Utilities for this super step contain all the processing chains for building the data warehouse from the results of your practical data science.


In the Transform super step, the system builds dimensions and facts to prepare a data warehouse, via a structured data configuration, for the algorithms in data science to use to produce data science discoveries. There are two basic transform utilities.


Dimensions Utilities

The dimensions use several utilities to ensure the integrity of the dimension structure. Concepts such as conformed dimension, degenerate dimension, role-playing dimension, mini-dimension, outrigger dimension, slowly changing dimension, late-arriving dimension, and dimension types (0, 1, 2, 3) all require specific utilities to ensure 100% integrity of the dimension structure.


Pure data science algorithms are the most used at this point in your solution. I will discuss extensively what data science algorithms are required to perform practical data science. I will show that even the most advanced of these are standard algorithms, which will result in common utilities.


Facts Utilities

These consist of a number of utilities that ensure the integrity of the dimensions structure and the facts. There are various statistical and data science algorithms that can be applied to the facts that will result in additional utilities.


Note The most important utilities for your data science will be the transform utilities, as they hold the accredited data science you need for your solution to be successful.


Data Science Utilities

There are several data science–specific utilities that are required for you to achieve success in the data processing ecosystem.


Data Binning or Bucketing

Binning is a data preprocessing technique used to reduce the effects of minor observation errors. Statistical data binning is a way to group a number of more or less continuous values into a smaller number of “bins.”
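
A minimal sketch of equal-width binning in plain Python. Equal-width is only one of several binning strategies, and the `equal_width_bins` helper and the sample ages are hypothetical.

```python
def equal_width_bins(values, bin_count):
    """Assign each value to one of bin_count equal-width bins.

    Returns a list of bin indexes (0 .. bin_count - 1), one per value.
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would land in bin_count, so clamp it to the last bin.
    return [min(int((v - low) / width), bin_count - 1) for v in values]


ages = [18, 22, 25, 31, 38, 44, 52, 60]
print(equal_width_bins(ages, 3))
```

With three bins over the range 18 to 60, each bin is 14 wide, so small differences between neighboring ages no longer affect the grouped result.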


Organize Utilities

Utilities for this super step contain all the processing chains for building the data marts. The organize utilities are mostly used to create data marts against the data science results stored in the data warehouse dimensions and facts. 


Report Utilities

Utilities for this super step contain all the processing chains for building virtualization and reporting of the actionable knowledge. The report utilities are mostly used to create data virtualization against the data science results stored in the data marts.


Maintenance Utilities

The data science solutions you are building are a standard data system and, consequently, require maintenance utilities, as with any other system. Data engineers and data scientists must work together to ensure that the ecosystem works at its most efficient level at all times.

Utilities cover several areas.


Backup and Restore Utilities

These perform different types of database backups and restores for the solution. They are standard for any computer system. For the specific utilities, I suggest you have an in-depth discussion with your own systems manager or the systems manager of your client.


I normally provide a wrapper for the specific utility that I can call in my data science ecosystem, without direct exposure to the custom requirements at each customer.


Checks Data Integrity Utilities

These utilities check the allocation and structural integrity of database objects and indexes across the ecosystem, to ensure the accurate processing of the data into knowledge.


Maintenance Cleanup Utilities

These utilities remove artifacts related to maintenance plans and database backup files.


Notify Operator Utilities

Utilities that send notification messages to the operations team about the status of the system are crucial to any data science factory.


Rebuild Data Structure Utilities

These utilities rebuild database tables and views to ensure that all the development is as designed. In blogs 6–11, I will discuss the specific rebuild utilities.


Reorganize Indexing Utilities

These utilities reorganize indexes in database tables and views, which is a major operational process when your data lake grows at a massive volume and velocity. The variety of data types also complicates the application of indexes to complex data structures. 


As a data scientist, you must understand when and how your data sources will change. An unclear indexing strategy could slow down algorithms without your taking note, and you could lose data, owing to your not handling the velocity of the data flow.


Shrink/Move Data Structure Utilities

These reduce the footprint size of your database data and associated log artifacts, to ensure an optimum solution is executing.


Solution Statistics Utilities

These utilities update information about the data science artifacts, to ensure that your data science structures are recorded. Call it data science on your data science.

The preceding list is comprehensive but not all-inclusive. I suggest that you speak to your development and operations organization staff, to ensure that your data science solution fits into the overall data processing structures of your organization.


Processing Utilities

The data science solutions you are building require processing utilities to perform standard system processing. The data science environment requires two basic processing utility types.

Scheduling Utilities

The scheduling utilities I use are based on the basic agile scheduling principles.

Backlog Utilities

Backlog utilities accept new processing requests into the system and hold them, ready to be processed in future processing cycles.


To-Do Utilities

The to-do utilities take a subset of backlog requests for processing during the next processing cycle. They use classification labels, such as priority and parent-child relationships, to decide what process runs during the next cycle.
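
The interplay between backlog and to-do utilities can be sketched with a priority queue. This is a simplified illustration: the `Backlog` class and task names are hypothetical, only the priority label is used, and parent-child relationships are left out.

```python
import heapq


class Backlog:
    """Hold processing requests until a cycle picks them up."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order stable

    def add(self, name, priority):
        """Accept a request; a lower number means a higher priority."""
        heapq.heappush(self._heap, (priority, self._counter, name))
        self._counter += 1

    def next_cycle(self, capacity):
        """Move up to `capacity` highest-priority requests to the to-do list."""
        todo = []
        while self._heap and len(todo) < capacity:
            todo.append(heapq.heappop(self._heap)[2])
        return todo


backlog = Backlog()
backlog.add("rebuild_indexes", priority=3)
backlog.add("retrieve_rates", priority=1)
backlog.add("assess_rates", priority=2)
todo = backlog.next_cycle(capacity=2)
print(todo)
```

Requests that do not fit into the current cycle simply stay in the backlog for a later one.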


Monitoring Utilities

The monitoring utilities ensure that the complete system is working as expected.

Engineering a Practical Utility Layer

The utility layer holds all the utilities you share across the data science environment. I suggest that you create three sublayers to help the utility layer support the better future use of the utilities.


Maintenance Utility

Collect all the maintenance utilities in this single directory, to enable the environment to handle the utilities as a collection. 


I suggest that you keep a maintenance utility registry, to enable your entire team to use the common utilities. Include enough documentation for each maintenance utility, to explain its complete workings and requirements.


Data Utility

Collect all the data utilities in this single directory, to enable the environment to handle the utilities as a collection. I suggest that you keep a data utility registry to enable your entire team to use the common utilities. Include enough documentation for each data utility to explain its complete workings and requirements.


Processing Utility

Collect all the processing utilities in this single directory to enable the environment to handle the utilities as a collection.

I suggest that you keep a processing utility registry, to enable your entire team to use the common utilities. Include sufficient documentation for each processing utility to explain its complete workings and requirements.


Warning Ensure that you support your company’s processing environment and that the suggested environment supports an agile processing methodology. This may not always match your own environment.


Caution Remember: These utilities are used by your wider team. If you interrupt them, you will pause other processing that is currently running. Take extra care with this layer’s artifacts.


Three Management Layers

This blog is about the three management layers that are must-haves for any large-scale data science system. I will discuss them at a basic level. I suggest you scale out these management capabilities as your environment grows.


Operational Management Layer

The operational management layer is the core store for the data science ecosystem’s complete processing capability. The layer stores every processing schedule and workflow for the all-inclusive ecosystem.

  • This area enables you to see a singular view of the entire ecosystem. It reports the status of the processing.
  • The operations management layer is the layer where I record the following.


Processing-Stream Definition and Management

The processing-stream definitions are the building blocks of the data science ecosystem. I store all my current active processing scripts in this section.


Definition management describes the workflow of the scripts through the system, ensuring that the correct execution order is managed, as per the data scientists’ workflow design.


Tip Keep all your general techniques and algorithms in a source-control-based system, such as GitHub or SVN, in the format of importable libraries. That way, you do not have to verify if they work correctly every time you use them.


Advice I spend 10% of my time generating new processing building blocks every week and 10% improving existing building blocks.


I can confirm that this action easily saves more than 20% of my time on processing new data science projects when I start them, as I already have a base set of tested code to support the activities required. So, please invest in your own and your team’s future, by making this a standard practice for the team. You will not regret the investment.


Warning When you replace existing building blocks, check for impacts downstream. I suggest you use a simple versioning scheme of mylib_001_01. That way, you can have 999 versions with 99 sub-versions.


This also ensures that your new version can be rolled out into your customer base in an orderly manner. The most successful approach is to support the process with a good version-control process that can handle multiple branched or forked code sets.



The parameters for the processing are stored in this section, to ensure a single location for all the system parameters. You will see in all the following examples that there is an ecosystem setup phase.

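import os
import sys

if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'  # assumption: base path on non-Linux systems; adjust as needed
print('Working Base :', Base, ' using ', sys.platform)
sFileDir = Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)  # create the working directory if it is missing
sFileName = Base + '/01-Vermeulen/00-RawData/Country_Currency.xlsx'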


In my production system, for each customer, we place all these parameters in a single location and then simply call the single location. Two main designs are used:

  1. A simple text file that we then import into every Python script
  2. A parameter database supported by a standard parameter setup script that we then include into every script


I will also admit to having several parameters that follow the same format as the preceding examples, and I simply collect them in a section at the top of the code.

Advice Find a way that works best for your team and standardize that method across your team.
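The first design, a single text file read by every script, could look like the following sketch. It assumes an INI-style file read with the standard configparser module; the file name, section name, and keys are made up for illustration.

```python
import configparser
import os
import tempfile

def load_parameters(path):
    """Read all parameters from one INI-style text file into a plain dict."""
    parser = configparser.ConfigParser()
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}

# Demonstration with a throwaway file; section and key names are hypothetical.
example = "[paths]\nbase = ~/VKHCG\nraw_data = 00-RawData\n"
with tempfile.TemporaryDirectory() as folder:
    parameter_file = os.path.join(folder, "parameters.ini")
    with open(parameter_file, "w", encoding="utf-8") as handle:
        handle.write(example)
    params = load_parameters(parameter_file)

print(params["paths"]["base"])
```

Each script then calls load_parameters once at the top, so a path changes in one place only.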




The scheduling plan is stored in this section, to enable central control and visibility of the complete scheduling plan for the system. In my solution, I use a Drum-Buffer-Rope methodology. The principle is simple.


Similar to a troop of people marching, the Drum-Buffer-Rope methodology is a standard practice to identify the slowest process and then use this process to pace the complete pipeline. You then tie the rest of the pipeline to this process to control the ecosystem’s speed.


So, you place the “drum” at the slow part of the pipeline, to set the processing pace, and attach the “rope” to the beginning and the end of the pipeline, by ensuring that no processing is done that is not attached to this drum. This ensures that your processes complete more efficiently, as nothing enters or leaves the process pipe without being recorded by the drum’s beat.
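The pacing idea can be sketched in a few lines. This is a simplified illustration, not my production scheduler: the stage names, rates, and buffer size are invented, and the "buffer" is modeled as a one-off extra release in front of the drum.

```python
def find_drum(stage_rates):
    """Return the name of the slowest stage (rates are items per minute)."""
    return min(stage_rates, key=stage_rates.get)

def release_plan(stage_rates, minutes, buffer_size):
    """Items to release into the pipeline each minute, paced by the drum.

    The first release also fills a small buffer in front of the drum so the
    slowest stage never starves; after that, the rope ties intake to the drum rate.
    """
    drum_rate = stage_rates[find_drum(stage_rates)]
    plan = [drum_rate] * minutes
    if minutes:
        plan[0] += buffer_size
    return plan

rates = {"retrieve": 120, "assess": 40, "process": 90}  # made-up rates
print(find_drum(rates))
print(release_plan(rates, minutes=3, buffer_size=10))
```

Releasing work faster than the drum rate would only grow the queue in front of the slowest stage, which is why intake is capped at that rate.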


I normally use an independent Python program that employs the directed acyclic graph (DAG) that is provided by the networkx library’s DiGraph structure.


This automatically resolves duplicate dependencies and enables the use of a topological sort, which ensures that tasks are completed in the order of requirement. Following is an example: Open this in your Python editor and view the process.


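Here you construct your network in any order.

import networkx as nx
import matplotlib.pyplot as plt
# Illustrative edges only; the original edge list is not shown in this text.
DG = nx.DiGraph([('Start', 'Retrieve'), ('Retrieve', 'Assess'), ('Assess', 'Process'),
                 ('Start', 'Process'), ('Process', 'Report')])
# Here is your network unsorted:
print("Unsorted Nodes")
print(list(DG.nodes()))
# You can test your network for valid DAG.
print("Is a DAG?", nx.is_directed_acyclic_graph(DG))
# Now you sort the DAG into a correct order.
print("Sorted Nodes")
print(list(nx.topological_sort(DG)))
# You can also visualize your network.
pos = nx.spring_layout(DG)
nx.draw_networkx_nodes(DG, pos=pos, node_size=1000)
nx.draw_networkx_edges(DG, pos=pos)
nx.draw_networkx_labels(DG, pos=pos)
plt.show()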

You can add some extra nodes and see how this resolves the ordering. I suggest that you experiment with networks of different configurations, as this will enable you to understand how the process truly assists with the processing workload.
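If you want to experiment without installing networkx, the same ordering idea can be tried with the standard library. This sketch swaps in graphlib.TopologicalSorter (Python 3.9 or later) in place of the DiGraph; the task names are illustrative.

```python
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set; the task names are illustrative.
requirements = {
    "Assess": {"Retrieve"},
    "Process": {"Assess"},
    "Report": {"Process"},
}

def execution_order(graph):
    """Return one valid execution order for a dependency mapping."""
    return list(TopologicalSorter(graph).static_order())

before = execution_order(requirements)

# Add an extra node that must run after Process and before Report.
# Report now lists Process twice over (directly and via Transform);
# the sort resolves the duplicate dependency on its own.
requirements["Transform"] = {"Process"}
requirements["Report"] = {"Process", "Transform"}
after = execution_order(requirements)

print(before)
print(after)
```

A cycle in the mapping raises graphlib.CycleError, which plays the same role as the DAG validity test.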


Tip I normally store the requirements for the nodes in a common database. That way, I can upload the requirements for multiple data science projects and resolve the optimum requirement with ease.




The central monitoring process is in this section to ensure that there is a single view of the complete system. Always ensure that you monitor your data science from a single point. Having various data science processes running on the same ecosystem without central monitoring is not advised.


Tip I always get my data science processes to set an active status in a central table when they start and a not-active status when they complete. That way, the entire team knows who is running what and can plan its own processing better.
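A minimal sketch of such a status table follows, using an in-memory SQLite database as a stand-in for the central store. The table name, column names, and process name are assumptions for illustration.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone

def create_status_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ProcessStatus ("
        "ProcessName TEXT PRIMARY KEY, Status TEXT NOT NULL, UpdatedAt TEXT NOT NULL)"
    )

def set_status(conn, process_name, status):
    conn.execute(
        "INSERT OR REPLACE INTO ProcessStatus VALUES (?, ?, ?)",
        (process_name, status, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

@contextmanager
def tracked(conn, process_name):
    """Mark the process active on entry and not-active on exit, even after an error."""
    set_status(conn, process_name, "active")
    try:
        yield
    finally:
        set_status(conn, process_name, "not-active")

conn = sqlite3.connect(":memory:")
create_status_table(conn)
with tracked(conn, "retrieve-country-currency"):
    during = conn.execute("SELECT Status FROM ProcessStatus").fetchone()[0]
after = conn.execute("SELECT Status FROM ProcessStatus").fetchone()[0]
print(during, after)
```

The context manager guarantees the not-active status is written even when the processing step raises an exception, so the table never shows a stale "active" entry for a crashed job.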


If you are running on Windows, install the wmi package from the command line:

conda install -c primer wmi

Then try the following:

import wmi
c = wmi.WMI()
for process in c.Win32_Process():
    print(process.ProcessId, process.Name)

For Linux, try this:

import os
pids = [pid for pid in os.listdir('/proc') if pid.isdigit()]
for pid in pids:
    try:
        print(open(os.path.join('/proc', pid, 'cmdline'), 'rb').read())
    except IOError:  # proc has already terminated
        continue

This will give you a full list of all running processes. I normally just load this into a table every minute or so, to create a monitor pattern for the ecosystem.
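Loading the list into a table could be sketched as follows. This assumes a Linux /proc file system (elsewhere the list is simply empty) and uses an in-memory SQLite database as a stand-in for the monitoring store; the table and column names are hypothetical.

```python
import os
import sqlite3
from datetime import datetime, timezone

def list_processes(proc_root="/proc"):
    """Return (pid, command line) pairs on Linux; an empty list elsewhere."""
    if not os.path.isdir(proc_root):
        return []
    found = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        try:
            with open(os.path.join(proc_root, pid, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:  # process ended between listing and reading
            continue
        text = raw.replace(b"\x00", b" ").decode(errors="replace").strip()
        found.append((int(pid), text))
    return found

def record_snapshot(conn, processes, taken_at=None):
    """Append one snapshot of the process list to the ProcessMonitor table."""
    taken_at = taken_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ProcessMonitor "
        "(TakenAt TEXT, Pid INTEGER, CommandLine TEXT)"
    )
    conn.executemany(
        "INSERT INTO ProcessMonitor VALUES (?, ?, ?)",
        [(taken_at, pid, cmd) for pid, cmd in processes],
    )
    conn.commit()
    return len(processes)

conn = sqlite3.connect(":memory:")
stored = record_snapshot(conn, list_processes())
print("processes recorded:", stored)
```

Calling record_snapshot from a scheduler once a minute builds up the time series that the monitor pattern is based on.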




All communication from the system is handled in this one section, to ensure that the system can communicate any activities that are happening. We use a complex communication process via Jira, to ensure that all our data science is tracked. I suggest you look at the jira Python package, which you can install with conda install -c conda-forge jira.


I will not provide further details on this subject, as I have found that the internal communication channel in any company is driven by the communication tools it uses. The only advice I will offer is to communicate! You would be alarmed if you lost a project at least once a week because someone did not communicate what they were running.



The alerting section uses communications to inform the correct person, at the correct time, about the correct status of the complete system. I use Jira for alerting, and it works well. If any issue is raised, alerting provides complete details of what the status was and the errors it generated.


I will now discuss each of these sections in more detail and offer practical examples of what to expect or create in each section.


The Audit, Balance, and Control Layer

The audit, balance, and control layer controls any processing currently underway. This layer is the engine that ensures that each processing request is completed by the ecosystem as planned.


The audit, balance, and control layer is the single area in which you can observe what is currently running within your data science environment.

It records

  • Process-execution statistics
  • Balancing and controls
  • Reject and error handling
  • Fault-code management

The three subareas are utilized in the following manner.




First, let’s define what I mean by the audit. An audit is a systematic and independent examination of the ecosystem. The audit sublayer records the processes that are running at any specific point within the environment.


This information is used by data scientists and engineers to understand and plan future improvements to the processing.


Tip Make sure your algorithms and processing generate a good and complete audit trail.


My experience shows that a good audit trail is crucial. The built-in audit capability of the data science technology stack’s components supplies you with a rapid and effective base for your auditing. I will discuss which audit statistics are essential to the success of your data science.


In the data science ecosystem, the audit consists of a series of observers that record preapproved processing indicators regarding the ecosystem. I have found the following to be good indicators for audit purposes.


Built-in Logging

I advise you to design your logging into an organized preapproved location, to ensure that you capture every relevant log entry.


I also recommend that you do not change the internal or built-in logging process of any of the data science tools, as this will make any future upgrades complex and costly. I suggest that you handle the logs in the same manner as you would any other data source.


Normally, I build a controlled systematic and independent examination of all the built-in logging vaults. That way, I am sure I can independently collect and process these logs across the ecosystem. I deploy five independent watchers for each logging location, as logging usually has the following five layers.


Debug Watcher

This is the most verbose logging level. If I discover any debug logs in my ecosystem, I normally raise an alarm, as this means that the tool is using precious processing cycles to perform low-level debugging.

Warning Tools running debugging should not be part of a production system.


Information Watcher

The information level is normally utilized to output information that is beneficial to the running and management of a system. I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem as I would any other data source.


Warning Watcher

The warning is often used for handled “exceptions” or other important log events. Usually, this means that the tool handled the issue and took corrective action for recovery.


I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem as I would any other data source. I also add a warning to the cause-and-effect analysis system data store. I will discuss this critical tool later in this blog.


Error Watcher

The error is used to log all unhandled exceptions in the tool. This is not a good state for the overall processing to be in, as it means that a specific step in the planned processing did not complete as expected. Now, the ecosystem must handle the issue and take corrective action for recovery.


I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem as I would any other data source. I also add an error to the cause-and-effect analysis system data store.


Fatal Watcher

Fatal is reserved for special exceptions or conditions that you must identify quickly. This is not a good state for the overall processing to be in, as it means a specific step in the planned processing has not completed as expected. This means the ecosystem must now handle the issue and take corrective action for recovery.


Once again, I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem as I would any other data source. I also add an error to the cause-and-effect analysis system data store, which I will discuss later in this blog.


I have discovered that by simply using built-in logging and a good cause-and-effect analysis system, I can handle more than 95% of all issues that arise in the ecosystem.
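The five watchers can be sketched with Python's standard logging module. This is an illustration rather than my production setup: the logger name and messages are invented, and in-memory lists stand in for the central data stores.

```python
import logging

class WatcherHandler(logging.Handler):
    """Collect records of exactly one level, like a single watcher."""

    def __init__(self, level):
        super().__init__(level)
        self.watch_level = level
        self.records = []

    def emit(self, record):
        if record.levelno == self.watch_level:
            self.records.append(record.getMessage())

LEVELS = {
    "debug": logging.DEBUG,
    "information": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "fatal": logging.CRITICAL,
}

logger = logging.getLogger("ecosystem.audit")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
watchers = {name: WatcherHandler(level) for name, level in LEVELS.items()}
for handler in watchers.values():
    logger.addHandler(handler)

logger.info("retrieve step started")
logger.warning("retrying a slow source")
logger.error("assess step failed")

# Warnings, errors, and fatals also feed the cause-and-effect analysis store.
cause_and_effect = [
    message
    for name in ("warning", "error", "fatal")
    for message in watchers[name].records
]
print(cause_and_effect)
```

A non-empty debug watcher is the signal to raise an alarm, because it means a tool in production is running with debugging switched on.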


Process Tracking

I normally build a controlled, systematic, and independent examination of the process for the hardware logging. There are numerous server-based software packages that monitor the temperature sensors, voltage, fan speeds, load, and clock speeds of a computer system.


I suggest you go with the tool with which you and your customer are most comfortable. I do, however, advise that you use the logs for your cause-and-effect analysis system.


Data Provenance

Keep records for every data entity in the data lake, by tracking it through all the transformations in the system. This ensures that you can reproduce the data, if needed, in the future and supplies a detailed history of the data’s source in the system.
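One way to track an entity through its transformations is to record a fingerprint of the data before and after each step. The following is a minimal sketch under that assumption; the step names, record fields, and sample values are invented.

```python
import hashlib
from datetime import datetime, timezone

def fingerprint(rows):
    """Hash the content of a data set so each state can be identified later."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def apply_step(rows, step_name, transform, provenance):
    """Run one transformation and append a provenance record for it."""
    result = [transform(row) for row in rows]
    provenance.append({
        "step": step_name,
        "input": fingerprint(rows),
        "output": fingerprint(result),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return result

provenance = []
raw = [" belgium ", " kenya "]
clean = apply_step(raw, "strip", str.strip, provenance)
final = apply_step(clean, "title-case", str.title, provenance)
print(final)
print([entry["step"] for entry in provenance])
```

Because the output fingerprint of one step equals the input fingerprint of the next, the records form a chain that can be replayed to reproduce the data.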


Data Lineage

Keep records of every change that happens to the individual data values in the data lake. This enables you to know what the exact value of any data record was in the past. It is normally achieved by a valid-from and valid-to audit entry for each data set in the data science environment.



The balance sublayer ensures that the ecosystem is balanced across the available processing capability, or can top up that capability during periods of extreme processing. The on-demand processing capability of a cloud ecosystem is highly desirable for this purpose.


Tip Plan your capability as a combination of always-on and top-up processing.
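As a small worked example of this planning, the following sketch computes how many top-up units are needed per hour on top of a fixed always-on base. The demand figures and the unit of capacity are made up.

```python
import math

def top_up_plan(hourly_demand, always_on_units):
    """Return the extra units to add for each hour, given a fixed always-on base."""
    return [max(0, math.ceil(demand) - always_on_units) for demand in hourly_demand]

demand = [3, 4, 9.5, 12, 5]  # made-up processing units needed per hour
plan = top_up_plan(demand, always_on_units=5)
print(plan)
```

Hours with a zero need no top-up; the non-zero hours are the ones to cover with on-demand capacity.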


By using the audit trail, it is possible to adapt to changing requirements and forecast what you will require to complete the schedule of work you submitted to the ecosystem. I have found that deploying a deep reinforcement learning algorithm against the cause-and-effect analysis system can handle any balance requirements dynamically.


Note In my experience, even the best preplanned solution for processing will fall apart when compared with a good deep-learning algorithm that uses reinforcement learning to handle the balance in the ecosystem.



The control sublayer controls the execution of the currently active data science. The control elements are a combination of the control element within the Data Science Technology Stack’s individual tools plus a custom interface to control the overarching work.


The control sublayer also ensures that when processing experiences an error, it can try a recovery, as per your requirements, or schedule a clean-up utility to undo the error. The cause-and-effect analysis system is the core data source for the distributed control system in the ecosystem.
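The recover-or-clean-up behavior can be sketched as a small wrapper. This is an assumption about one reasonable policy (a fixed number of retries, then clean-up), not a description of any specific tool; the function and task names are hypothetical.

```python
def run_with_recovery(task, cleanup, max_attempts=3):
    """Try a task up to max_attempts times; run cleanup if every attempt fails.

    Returns (succeeded, attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True, attempt
        except Exception as error:  # broad on purpose: any failure triggers recovery
            print(f"attempt {attempt} failed: {error}")
    cleanup()
    return False, max_attempts

calls = {"count": 0}
cleaned = []

def flaky_task():
    """Fail on the first call and succeed afterward, to simulate a slow source."""
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("source not ready")

result = run_with_recovery(flaky_task, cleanup=lambda: cleaned.append("undo"))
print(result)
```

In a real deployment, each failed attempt would also be written to the cause-and-effect analysis data store.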


I normally use a distributed yoke solution to control the processing. I create an independent process whose sole purpose is to monitor a specific portion of the data processing ecosystem.


So, the control system consists of a series of yokes, one at each control point, that use Kafka messaging to communicate the control requests. Each yoke then converts the requests into a process to execute and manage in the ecosystem.


The yoke system ensures that the distributed tasks are completed, even if it loses contact with the central services. The yoke solution is extremely useful in the Internet of Things environment, as you are not always able to communicate directly with the data source.
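How a yoke might keep working while the central services are unreachable can be sketched as follows. This is a simplified, in-memory illustration and not my actual design: the class, the callback, and the request strings are hypothetical, and a plain function stands in for the messaging layer.

```python
from collections import deque

class Yoke:
    """Local controller that keeps working when the central service is unreachable."""

    def __init__(self, send_to_central):
        # send_to_central is a callable that raises ConnectionError when offline.
        self.send_to_central = send_to_central
        self.pending_reports = deque()

    def handle(self, request, action):
        """Run the action for a control request, then try to report the outcome."""
        outcome = action(request)
        self.pending_reports.append({"request": request, "outcome": outcome})
        self.flush()
        return outcome

    def flush(self):
        """Deliver buffered reports in order; stop quietly if central is offline."""
        while self.pending_reports:
            try:
                self.send_to_central(self.pending_reports[0])
            except ConnectionError:
                return
            self.pending_reports.popleft()

central_log = []
online = {"up": False}

def send(report):
    if not online["up"]:
        raise ConnectionError("central service unreachable")
    central_log.append(report)

yoke = Yoke(send)
yoke.handle("Retrieve Person.csv", action=lambda req: "done")  # central offline
online["up"] = True
yoke.handle("Retrieve Last_Name.json", action=lambda req: "done")
print(len(central_log), len(yoke.pending_reports))
```

The task is executed either way; only the status report waits in the local buffer until contact is restored.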


Yoke Solution

The yoke solution is a custom design I have worked on over years of deployments. Apache Kafka is an open source stream processing platform developed to deliver a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka provides a publish-subscribe solution that can handle all activity-stream data and processing.


The Kafka environment enables you to send messages between producers and consumers, so that you can transfer control between different parts of your ecosystem while ensuring a stable process.

I will give you a simple example of the type of information you can send and receive.



The producer is the part of the system that generates the requests for data science processing, by creating structured messages for each type of data science process it requires. The producer is the end point of the pipeline that loads messages into Kafka.


Note This is for your information only. You do not have to code this and make it run.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:1234')
for _ in range(100):
    producer.send('Retrieve', b'Person.csv')
# Block until a single message is sent (or timeout)
future = producer.send('Retrieve', b'Last_Name.json')
result = future.get(timeout=60)
# Block until all pending messages are at least put on the network
# NOTE: This does not guarantee delivery or success! It is really
# only useful if you configure internal batching using linger_ms
producer.flush()
# Use a key for hashed-partitioning
producer.send('Yoke', key=b'Retrieve', value=b'Run')
# Serialize json messages
import json
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send('Retrieve', {'Retrieve': 'Run'})
# Serialize string keys
producer = KafkaProducer(key_serializer=str.encode)
producer.send('Retrieve', key='ping', value=b'1234')
# Compress messages
producer = KafkaProducer(compression_type='gzip')
for i in range(1000):
    producer.send('Retrieve', b'msg %d' % i)



The consumer is the part of the process that takes in messages and organizes them for processing by the data science tools. The consumer is the end point of the pipeline that offloads the messages from Kafka.

Note This is for your information only. You do not have to code this and make it run.

from kafka import KafkaConsumer
import msgpack

consumer = KafkaConsumer('Yoke')
for msg in consumer:
    print(msg)
# join a consumer group for dynamic partition assignment and offset commits
from kafka import KafkaConsumer
consumer = KafkaConsumer('Yoke', group_id='Retrieve')
for msg in consumer:
    print(msg)
# manually assign the partition list for the consumer
from kafka import TopicPartition
consumer = KafkaConsumer(bootstrap_servers='localhost:1234')
consumer.assign([TopicPartition('Retrieve', 2)])
msg = next(consumer)
# Deserialize msgpack-encoded values
consumer = KafkaConsumer(value_deserializer=msgpack.loads)
consumer.subscribe(['Yoke'])  # a consumer needs a topic before it can iterate
for msg in consumer:
    assert isinstance(msg.value, dict)


Directed Acyclic Graph Scheduling

This solution uses a combination of graph theory and publish-subscribe stream data processing to enable scheduling.


You can use the Python NetworkX library to resolve any conflicts, by simply sorting the graph into a definite order at a specific point before or after you send or receive messages via Kafka. That way, you ensure an effective and efficient processing pipeline.


Tip I normally publish the request onto three different message queues, to ensure that the pipeline is complete. The extra redundancy outweighs the extra processing, as the message is typically very small.
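The redundancy idea can be sketched with three in-memory queues standing in for the message queues. Tagging each request with a unique id so that the consumer can discard the redundant copies is my assumption about how duplicates would be handled; the payloads are invented.

```python
import uuid
from collections import deque

def publish(queues, payload):
    """Place the same request, tagged with one id, on every queue."""
    message = {"id": str(uuid.uuid4()), "payload": payload}
    for queue in queues:
        queue.append(message)
    return message["id"]

def consume(queues, seen_ids):
    """Drain all queues and yield each request once, skipping redundant copies."""
    for queue in queues:
        while queue:
            message = queue.popleft()
            if message["id"] not in seen_ids:
                seen_ids.add(message["id"])
                yield message["payload"]

queues = [deque(), deque(), deque()]
publish(queues, "Retrieve Person.csv")
publish(queues, "Retrieve Last_Name.json")
queues[0].clear()  # simulate losing one of the three queues
processed = list(consume(queues, seen_ids=set()))
print(processed)
```

Even with one queue lost, every request is still processed exactly once.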


Yoke Example

Following is a simple simulation of what I suggest you perform with your processing. Open your Python editor and create the following three parts of the yoke processing pipeline:

Create a file for the Run-Yoke script in directory ..\VKHCG\77-Yoke.
Enter this code into the file and save it.
# -*- coding: utf-8 -*-
import sys
import os
import shutil

def prepecosystem():
    if sys.platform == 'linux':
        Base = os.path.expanduser('~') + '/VKHCG'
    else:
        Base = 'C:/VKHCG'  # assumption: base directory on other platforms; adjust to your setup
    # Create the yoke directories if they do not exist yet.
    for sSubDir in ('', '/10-Master', '/20-Slave'):
        sFileDir = Base + '/77-Yoke' + sSubDir
        if not os.path.exists(sFileDir):
            os.makedirs(sFileDir)
    return Base

def makeslavefile(Base, InputFile):
    # Move one message file from the master queue to the slave queue.
    sFileNameIn = Base + '/77-Yoke/10-Master/' + InputFile
    sFileNameOut = Base + '/77-Yoke/20-Slave/' + InputFile
    if os.path.isfile(sFileNameIn):
        shutil.move(sFileNameIn, sFileNameOut)

if __name__ == '__main__':
    print('### Start ############################################')
    Base = prepecosystem()
    sFiles = sys.argv[1:]  # assumption: file names arrive as command-line arguments
    for sFile in sFiles:
        if sFile != '':
            makeslavefile(Base, sFile)
    print('### Done!! ############################################')
Next, create the Master Producer Script. This script will place nine files in the master message queue simulated by a directory called 10-Master.
Create a file for this script in directory ..\VKHCG\77-Yoke.
# -*- coding: utf-8 -*-
import sys
import os
import sqlite3 as sq
import uuid
from multiprocessing import Process

def prepecosystem():
    if sys.platform == 'linux':
        Base = os.path.expanduser('~') + '/VKHCG'
    else:
        Base = 'C:/VKHCG'  # assumption: base directory on other platforms; adjust to your setup
    # Create the yoke directories if they do not exist yet.
    for sSubDir in ('', '/10-Master', '/20-Slave', '/99-SQLite'):
        sFileDir = Base + '/77-Yoke' + sSubDir
        if not os.path.exists(sFileDir):
            os.makedirs(sFileDir)
    sDatabaseName = Base + '/77-Yoke/99-SQLite/Yoke.db'
    conn = sq.connect(sDatabaseName)
    print('Connecting :', sDatabaseName)
    sSQL = 'CREATE TABLE IF NOT EXISTS YokeData (PathFileName VARCHAR (1000) NOT NULL);'
    conn.execute(sSQL)  # executed through sqlite3 directly
    conn.commit()
    conn.close()
    return Base, sDatabaseName

def makemasterfile(sseq, Base, sDatabaseName):
    # Write one message file and register its name in the YokeData table.
    sFileName = Base + '/77-Yoke/10-Master/File_' + sseq +\
        '_' + str(uuid.uuid4()) + '.txt'
    sFileNamePart = os.path.basename(sFileName)
    smessage = "Practical Data Science Yoke \n File: " + sFileName
    with open(sFileName, "w") as txt_file:
        txt_file.write(smessage)
    connmerge = sq.connect(sDatabaseName)
    connmerge.execute("INSERT INTO YokeData (PathFileName) VALUES (?);", (sFileNamePart,))
    connmerge.commit()
    connmerge.close()

if __name__ == '__main__':
    print('### Start ############################################')
    Base, sDatabaseName = prepecosystem()
    for t in range(1, 10):
        sFile = str(t)  # assumption: the sequence number becomes part of the file name
        p = Process(target=makemasterfile, args=(sFile, Base, sDatabaseName))
        p.start()
        p.join()
    print('### Done!! ##########')

Execute the master script to load the messages into the yoke system.

Next, create the Slave Consumer Script. This script reads the nine file names and moves the files into the slave message queue, simulated by a directory called 20-Slave. Create a file for this script in the directory ..\VKHCG\77-Yoke.

# -*- coding: utf-8 -*-
import sys
import os
import sqlite3 as sq
import pandas as pd
from multiprocessing import Process

def prepecosystem():
    if sys.platform == 'linux':
        Base = os.path.expanduser('~') + '/VKHCG'
    else:
        Base = 'C:/VKHCG'  # assumption: base directory on other platforms; adjust to your setup
    # Create the yoke directories if they do not exist yet.
    for sSubDir in ('', '/10-Master', '/20-Slave', '/99-SQLite'):
        sFileDir = Base + '/77-Yoke' + sSubDir
        if not os.path.exists(sFileDir):
            os.makedirs(sFileDir)
    sDatabaseName = Base + '/77-Yoke/99-SQLite/Yoke.db'
    conn = sq.connect(sDatabaseName)
    print('Connecting :', sDatabaseName)
    sSQL = 'CREATE TABLE IF NOT EXISTS YokeData (PathFileName VARCHAR (1000) NOT NULL);'
    conn.execute(sSQL)  # executed through sqlite3 directly
    conn.commit()
    conn.close()
    return Base, sDatabaseName

def makeslavefile(Base, InputFile):
    sExecName = Base + '/77-Yoke/'  # append the file name of your Run-Yoke script here
    sExecLine = 'python ' + sExecName + ' ' + InputFile
    os.system(sExecLine)

if __name__ == '__main__':
    print('### Start ############################################')
    Base, sDatabaseName = prepecosystem()
    connslave = sq.connect(sDatabaseName)
    sSQL = "SELECT PathFileName FROM YokeData;"
    SlaveData = pd.read_sql_query(sSQL, connslave)
    for t in range(SlaveData.shape[0]):
        sFile = str(SlaveData['PathFileName'][t])  # file name registered by the master script
        p = Process(target=makeslavefile, args=(Base, sFile))
        p.start()
        p.join()
    print('### Done!! ############################################')


Execute the script and observe the slave script retrieving the messages and then using the Run-Yoke script to move the files between the 10-Master and 20-Slave directories.


This is a simulation of how you could use systems such as Kafka to send messages via a producer and, later, via a consumer to complete the process, by retrieving the messages and executing another process to handle the data science processing.

Well done, you just successfully simulated a simple message system.


Tip In this manner, I have successfully passed messages between five data centers across four time zones. It is worthwhile to invest time and practice to achieve a high level of expertise in using messaging solutions.


Cause-and-Effect Analysis System

The cause-and-effect analysis system is the part of the ecosystem that collects all the logs, schedules, and other ecosystem-related information and enables data scientists to evaluate the quality of their system. 

Advice Apply the same data science techniques to this data set, to uncover the insights you need to improve your data science ecosystem.
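As a very small first step in that direction, the following sketch ranks processes by how many problem-level log entries they produced. The log entries and the choice of levels are hypothetical; a real analysis would read from the cause-and-effect data store.

```python
from collections import Counter

# Hypothetical log entries: (process name, log level).
events = [
    ("retrieve", "warning"), ("retrieve", "error"), ("assess", "information"),
    ("retrieve", "error"), ("process", "warning"), ("assess", "information"),
]

def problem_ranking(entries, problem_levels=("warning", "error", "fatal")):
    """Rank processes by how many problem-level log entries they produced."""
    counts = Counter(name for name, level in entries if level in problem_levels)
    return counts.most_common()

print(problem_ranking(events))
```

The process at the top of the ranking is the first candidate for improvement.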