Data gathering and analysis
Data-driven projects will frequently present challenges of their own. With these types of projects, you need to ask something new of the data, as data scientists are always expected to solve a problem.
When I commence a new project, I like to think that I am having a conversation with the data; I am communicating with it to ensure that I represent it in the most honest and fruitful way for my client or the project’s stakeholders.
In data, then, there lies potential. That is what makes it so exciting. It will always tell us something, whether that information is new or not. And with data, you have the opportunity to continue exploring possibilities and thereby acquiring different results by asking different questions of it, by transforming it through different techniques, and by applying different algorithms to it.
Because of the immense potential of data, accessing it afresh can be daunting, particularly if it is a large dataset, if it comprises various kinds of data or if the company for which you are working simply does not know what data they have collected. That is precisely where the Data Science Process comes in.
This process offers a robust and reliable workflow for any kind of data project, irrespective of the amount and kind of data available, to help you structure your project from concept to delivery.
First devised by Joe Blitzstein and Hanspeter Pfister, the Data Science Process leads us through every stage of our project, from the moment we first consider how to approach the data, to presenting our findings in a clear and actionable way.
The process has five stages, which are:
Identify the question.
Prepare the data.
Analyze the data.
Visualize the insights.
Present the insights.
Each of these stages adds what I like to call a ‘layer of interest’ to your dataset. Although some of these stages can be returned to in the course of the process, following the stages in a linear fashion will reduce the chance of error at a later stage in the project, and will help you to retrace your steps, should you make a mistake.
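The five stages above can be sketched as a simple linear pipeline. This is only an illustration of the flow, not a real library; the stage functions and the toy churn question are placeholders of my own:

```python
# A minimal sketch of the five-stage Data Science Process as a linear
# pipeline. Each stage is a placeholder that receives the output of the
# previous one; in a real project each would do substantial work.

def identify_question(raw):
    return {"question": "What drives churn?", "data": raw}

def prepare_data(state):
    state["data"] = [x for x in state["data"] if x is not None]  # drop missing values
    return state

def analyze_data(state):
    state["insight"] = sum(state["data"]) / len(state["data"])   # a toy analysis
    return state

def visualize_insights(state):
    state["chart"] = f"bar chart of {state['insight']:.1f}"
    return state

def present_insights(state):
    return f"Answer to '{state['question']}': {state['insight']:.1f}"

STAGES = [identify_question, prepare_data, analyze_data,
          visualize_insights, present_insights]

def run_process(raw_data):
    state = raw_data
    for stage in STAGES:          # followed linearly, as recommended above
        state = stage(state)
    return state

print(run_process([10, None, 20, 30]))
# → Answer to 'What drives churn?': 20.0
```

Running the stages in order, each building on the last, mirrors the point made above: a linear pass reduces errors, and keeping each stage as a separate step makes it easy to retrace a mistake.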
Data scientist, private investigator
Today, we have an incredible amount of data at our fingertips. Think of the number of combinations we can get from a deck of 52 playing cards. Just shuffle a deck – it’s extremely unlikely that anyone else, in the course of human history, will have had the same order of cards.
Beginning to work with data is like being handed a pack of playing cards – sometimes there are more you can work with, and sometimes there are fewer, but the opportunities for variations are significant.
Once you’ve established some ground rules (for cards, this means a game; for data science, a hypothesis and an algorithm), you can truly get started. Identifying the question helps us to construct and plan our approach to the data, to ensure that we will get the most relevant results.
As Sherlock Holmes tells Dr. Watson in A Scandal in Bohemia:
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Holmes warns Watson against making assumptions about a mystery in the absence of evidence to prove them. But what Conan Doyle has also struck upon here is the need for data scientists to take a step back before they dive in and make any claims or responses to a problem.
With data, we have the advantage of deriving our insights from actual evidence, and so taking the time to consider what we want to ask of the data will help us to funnel it into telling us precisely what we need to know, without the outside biases of our own suppositions or those of others.
This is the first stage of the Data Science Process. Data scientists are required to exhibit some creativity at this point. We are not changing the information to suit our ideas, we are formulating ideas in order to derive insights that will be beneficial to us.
We will explore the various methods that we can apply to ensure that the questions we end up asking of our data will ultimately suit our project’s goals and cover us against gaps and ‘scope creep’ – the uncontrolled growth of a project beyond its original conditions.
The right ingredients for data analysis
We are well into the Computer Age, and most institutions across the public and private sectors will already be sitting on a vast quantity of their own data.
But data was being collected long before we knew what we could do with it, and the information is oftentimes gathered by workers who are not aware of the methods necessary to standardize and analyze information so that it is actually useful.
This knowledge gap will result in organized chaos at best, with datasets often comprising garbled and dirty data.
To clean up this data, to make it legible, we need to take our time. For an illustrative example of just how important it is to prepare our data before we do anything else with it, consider optical character recognition (OCR) scans.
OCR software will scan a physical page of written or printed text and will translate that text into a digital format. But OCR scans are not always 100 percent correct – their accuracy depends both on the capabilities of the software and on the quality of the printed page.
Handwritten documents from the 17th century will give the software more difficulty and will return more errors that must then be manually cleaned up at a later date.
Those who don’t know how to record data properly or those who are using outdated or non-optimal rules as set by their institution will generate datasets that must similarly be ‘cleaned up’.
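As a toy illustration of the kind of cleanup such scans demand, the sketch below repairs one classic class of OCR error: digits read in place of letters (‘1’ for ‘l’, ‘0’ for ‘o’). The correction table is hypothetical; a real one would be built by inspecting the actual errors in the scanned source:

```python
import re

# Hypothetical correction table for common OCR confusions (digit '1' read
# for letter 'l', '0' for 'o', '5' for 's'). A real table would be derived
# from the errors actually present in the scanned documents.
DIGIT_FIXES = {"0": "o", "1": "l", "5": "s"}

def fix_word(word):
    # Only repair tokens that mix letters and digits, e.g. 'c1ean' -> 'clean';
    # pure numbers such as '1642' are left untouched.
    if any(c.isalpha() for c in word) and any(c.isdigit() for c in word):
        return "".join(DIGIT_FIXES.get(c, c) for c in word)
    return word

def clean_ocr_text(text):
    tokens = re.split(r"(\s+)", text)     # split but keep the whitespace
    return "".join(fix_word(t) for t in tokens).strip()

print(clean_ocr_text("The c1ean-up of 0ld records from 1642"))
# → The clean-up of old records from 1642
```

Even this crude rule shows why preparation takes time: every correction embeds an assumption about the source, and a 17th-century handwritten document would need a far richer table than a modern printed page.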
‘When and where can I get it?’ Data gathering and analysis
Actual data analysis does not demand the same level of care as the previous two stages. If you have taken the time to postulate the right question and to prepare your data for answering what is required of it, you can afford to experiment with your analyses.
The beauty of working with datasets is that you can duplicate them, so running one type of algorithm on a dataset will not make it unusable by another.
That is the charm of digital information – it can be used, recaptured, restructured, and excerpted, yet you can still return to an earlier version once you have finished and start again.
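In Python, for instance, this round-trip is as simple as working on a deep copy of the dataset. A minimal sketch (the speed figures are made up for illustration):

```python
import copy

# The original dataset stays untouched while we experiment on a duplicate.
original = [{"speed": 42.0}, {"speed": 57.5}, {"speed": 61.2}]

experiment = copy.deepcopy(original)             # duplicate before transforming
for row in experiment:
    row["speed_mph"] = row["speed"] * 0.621371   # add a derived column

# The transformation did not leak back into the source data, so we can
# always return to the earlier version and start again.
print("speed_mph" in original[0])    # → False
print("speed_mph" in experiment[0])  # → True
```

Because the copy is deep, even nested structures in the duplicate can be restructured freely without making the original unusable by another algorithm.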
This is the stage where you can afford to play around. You have spent time building the scaffolding for your project and ensuring that it won’t collapse under the weight of the questions you ask of it, so now it’s time to explore.
Getting started with data analysis
While this part is largely theoretical, it has practical applications, and so I would strongly recommend that you consider applying each of the five stages outlined above to a project of your own in tandem with your reading. It will help to have some of the tools you need to hand before you get started with this part.
If you do not yet have a dataset of your own that you can work with, don’t worry. There are a good number of datasets that are publicly available and free for you to use in your own experiments.
The great benefit of this is that you will be able to get immediately stuck in with using real datasets rather than those that have been specifically created for training.
In my experience, real datasets will give you a sense of achievement from deriving insights from actual information and will add weight to the claim that data science is essential for future development in a significant number of disciplines.
There is a wide variety of truly exciting datasets available online for you to download and use however you so choose. Here are just a handful to get you started:
World Bank Data: a valuable resource for global development data.
European Union Open Data Portal: governmental data from EU member states.
Million Song Dataset: a compilation of metadata and audio features for popular music.
The CIA World Factbook: information on 267 world entities, covering subjects from history to infrastructure.
National Climatic Data Center: data on the US environment.
What is necessary for someone new to data science to understand is that data does not have a ‘language’ of its own and that it can only ‘speak’ to us through a machine or piece of software. By data’s ‘language’, I here mean the way in which a machine communicates data to the data scientist.
From the speed of a car and the flowering cycle of a plant to outside temperature and the number of residents in a city, data simply is. It is a series of components, but the relations that we make between them must be established by a human or a computer.
If we were to take the language analogy further, I would say that data can be compared to individual letters which are waiting for someone to put them in the relevant lexical and grammatical order to form words and sentences. So it is up to us (through the tools we apply) to make our data work.
Having access to software is not a requirement for this blog, because it focuses on practical application rather than coding. Even so, if you’d like to try out some of the examples provided here, I would recommend getting your hands on either R or Python, both of which are data analytics programming languages available as free downloads for Windows, Linux/UNIX, and Mac OS X. These two are currently the most commonly used open-source options in the industry.
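To give a feel for what a first session in Python might look like, here is a minimal sketch that reads a tiny inline dataset using only the standard library. The countries and figures are invented, standing in for one of the public datasets listed above:

```python
import csv
import io
import statistics

# A tiny made-up sample standing in for a downloaded public dataset.
raw = io.StringIO("""country,population_millions
Atlantis,3.2
Freedonia,11.5
Ruritania,7.8
""")

rows = list(csv.DictReader(raw))
values = [float(r["population_millions"]) for r in rows]

print(len(rows))                          # → 3
print(round(statistics.mean(values), 2))  # → 7.5
```

With a real download, the only change would be opening the CSV file from disk instead of the inline string; the reading and summarizing steps are identical.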
Identify the question
A grievance that I will often hear from other data scientists is that there is simply too much data out there and that the very idea of grappling with that amount of information to answer a business question is overwhelming.
Given our near constant data exhaust, how can we hope to manage the information that has been collected, in a way that will be conducive to examination? We cannot just dump all the information we have into an algorithm and cross our fingers that we will get the results we need.
Before we can prepare and analyze our data, then, we must know what kind of data we need. And for that, a little fine-tuning of our project’s question is necessary.
Business managers will often throw a question at a data scientist and expect them to dive straight into the database armed only with that. But that question first has to be understood, deconstructed, analyzed.
We have to know what is being asked of us; if we are not properly answering the question, our project’s outcomes will be useless.
Consider the process of school essay writing: going headfirst into answering a question as soon as it has been given to you will only (unless you are very lucky) result in an unwieldy, unstructured mess of information.
It is only if you take the time to step back and think about the big picture – to consider its multiple components and the context in which the question sits – that you can hope to make a persuasive and logical argument.
Fully understanding the question also helps us to stay on course and reduces the chances of our project diverging from its path. Let’s say that our history teacher wanted us to write about the American War of Independence.
Anecdotes from George Washington’s biography may be interesting, but they do not answer the question that has been asked of us.
Essays that are filled with such irrelevant information are the product of students going into a subject without having considered how to tackle the root of the question, instead taking everything that they can gather and not caring to trim the unnecessary material.
That is why, first and foremost, we need to identify the question.
In this section, I will show you the most fruitful way of tackling this stage of the Data Science Process.
As identifying the question can feel overwhelming, I have given you a tried-and-tested approach that will guide you through this stage and ensure that you have considered all pathways to the question, as well as protect you from bosses who want to slip in additional work after the project has begun.
Look, Ma, no data!
Even though it is so crucial, identifying the problem tends to be the most commonly neglected part in projects that use data. I have been guilty of it myself; for a long time, I started projects by preparing the data.
But that was not because I wanted to skip ahead; I had simply thought that identifying the problem was a given.
After all, data scientists are often called into work on problems, and in my first job at Deloitte, all my projects began with an engagement letter that specified the task and where help was needed.
It is no surprise that such a firm has standardized and streamlined this process, but that only got me into bad habits – jumping ahead of myself before I took note of the big picture.
Another reason for the neglect in identifying the question is that this stage does not use much, if any, data, leaving many data scientists feeling a little self-indulgent about carrying out this step.
But it is important to note that those who proposed the question are probably not data scientists, and they do not know the preparations needed for cleaning up and analyzing the data.
Few companies to date educate their employees about the importance of storing and accessing information, and this knowledge gap means that many data scientists are still being asked, ‘We have a lot of data, so can you find some insights for us?’
This type of question is commonplace, and yet it is obscure, vague, and will go nowhere in helping to solve a company’s problems.
So, even if a question has been formulated beforehand and your boss is querying why you’re not getting on with handling the data, don’t feel as though you’re being extravagant in undertaking this step.
Make your case for this stage if you feel it necessary. Simply put, it is not good enough for a question to be proposed – that question must be recalibrated in terms that the data will be able to understand, or the project will not move ahead.
How do you solve a problem like…
Problems that come from organizational boards or investors are frequently postulated as open-ended pathways to a question rather than a real question in their own right: ‘We are under-delivering on product units’ or ‘Our customers are leaving us at a rate higher than expected’ or ‘There is a defect in our product’.
None of these are questions – they are issues. I urge readers to take the following step-by-step approach to identify and solve the problem through data. That will both make this stage more efficient and reduce the risk of you focusing on the wrong problem.
Understand the problem
Anyone who plans to participate in data-driven projects must first be aware of the trap that they can be unwittingly led into before the project even begins: that if they follow the lead of a colleague who has given them a question to solve, they may in fact be solving the incorrect problem.
A colleague might have good intentions and try to be more helpful by presenting you with their own devised questions, but their questions will not necessarily be suitable for asking the data.
It may be tempting, upon being given a handful of apparently well-formulated queries, to not bother with identifying the question ourselves. But doing so can lead to disaster later on in the process; it falls to you to identify all the parameters of the business problem because you have been trained to do so.
Blindly taking a non-data scientist’s set of questions and applying them to your project could end up solving the wrong problem, or could simply return no results, because you don’t have the data for it.
Before we jump into our project, then, we must speak with the person who presented us with the problem in the first place. Understanding not only what the problem is but also why it must be resolved now, who its key stakeholders are, and what it will mean for the institution when it is resolved, will help you to start refining your investigation.
Without this step, the outcome can be dangerous for a data scientist, since later on in the project we will undoubtedly end up interpreting the question differently from our stakeholders. Once we have expanded upon the central problem, we can move on to the second step.
Stop to smell the roses
Although a good data scientist should begin with a problem and have a clear idea of its various facets, don’t worry at this stage about taking some unexpected detours.
We are actively looking into a company’s archive of information – much of which may not ever have been processed – and we are also carrying out a good deal of groundwork during this stage of the process to understand the company’s issue, so it stands to reason that we will be faced with surprises along the way.
Don’t write these detours off as ancillary to your project; make note of them as you go. This journey is what data science is all about, and additional insights are a big part of it.
These may at times be irrelevant to the problem, but they can add further value to the business. At times, I have found these insights can drive even more value than the solution to the problem that I have been asked to solve.
Develop your industry knowledge
If you have previous knowledge of the sector within which you are asked to work, you’re off to a great start. You can apply your existing experience to the problem.
You may already know, for example, the particular problems that companies operating in this sector typically run into; or the departments that have, in your experience, typically contributed to those problems, for better or worse; or you may be aware of competing companies that have identified and resolved precisely the problems that have been asked of you.
If you do not have knowledge of the industry, all is not lost. Spend some time researching it in more detail. What are the common pitfalls in the industry? Have your company’s competitors run into the same problems, or are there significant differences? How have they resolved similar problems?
Are the mission and goals for this company reflected across the industry? How is the company different in terms of output, organizational structure, and operations?
Google may be your best friend for answering many of these questions, but also remember that, as a data scientist, you are not working in a vacuum.
Having a solid knowledge of the environment within which you are operating, as well as its individual characteristics and its limitations, will help you to devise an approach that is meaningful to the project’s stakeholders.
Be a people person
Data-driven projects will often affect more than one area of a company. You will find that, in order to truly nail the question, you must work across multiple divisions – and with people who don’t often have reason to speak with each other.
Be prepared to work in this way, and to be social. I count this as one of the many joys of being a data scientist.
After all, having access to information across the organization puts you in a unique position to gather as much information as possible about how an organization operates, and how the business question you have been called in to answer will affect operations across the company. That’s quite exciting.
In one role where I analyzed customer experiences for an organization, I was technically based in the marketing department’s customer experience and insights team.
But half of my time was actually spent working in the IT department: naturally, I needed access to their databases, servers, tools, and software. Before I came along, there had been no data scientist at the company and these two departments were separate entities that rarely had reason to communicate.
When you start out on a project, you will find that you may act as a bridge between departments, helping to synchronize their efforts on behalf of data. This can help immensely in learning about how a company functions and how your project can help improve their services and methods of working.
By the end of my time with that organization, I was working with the marketing, IT and operations departments, and my reports frequently brought them all around the table to discuss their plans for the upcoming weeks.
Every time that you speak to a new department, make sure to maintain your connections so that the team knows about the work you are doing, thus keeping the channel of communication open if you need information from them further down the line.
Think like a consultant
Most will agree that data science requires a bottom-up approach: we use the company’s bedrock of data to make analyses, and we incrementally build upon our results as we get to better understand the problems within the company.
That potential to go exploring can be what makes this first step so exciting. But in order to identify the question, we need to pay more heed to business consulting methods.
In consulting, we outline the possible strategic approaches for our business. Consultants tend to be people who have worked in the business or the industry for several years, and they will have accrued a lot of knowledge about the sector.
These people are often concerned with improving the large-scale strategic and organizational aspects of a company, which demands a top-down approach – and this methodology to analyze the big picture requires certain assumptions about a given problem to be made.
It may seem counterproductive for us to use consulting methods; as data scientists, we are told to try and limit assumptions as much as we can in favor of factual evidence. But taking a leaf out of the consultants’ book can be extremely beneficial, particularly during this stage of the process.
So, put your technical knowledge to the side for a moment and look instead at the organization, the stakeholders, and the company’s strategy before even considering the data.
Identifying the question, after all, is all about filtering our options, and this third step will start to refine our questions so that they are specific to the needs of our company.
Jot down a list of the key stakeholders for the project, and make special note of who the final decision-makers will be. Spend some time with them and don’t move on to the fourth step until you are able to answer the following questions:
What does each of the project’s stakeholders think of the problem?
What are the facets of this problem?
Which departments need to be given priority in my investigation?
What might be the root causes of the problem?
Do the stakeholders think I should be talking to anyone else? Have I spoken to them?
Where is the data located, and who is in charge of it?
What would success look like for this project?
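One low-tech way to enforce the rule above is to keep the checklist itself as data and refuse to move to the fourth step until every question has an answer. A sketch, with a shortened set of placeholder questions:

```python
# Keep the stakeholder checklist as data; only proceed when nothing is blank.
checklist = {
    "What does each stakeholder think of the problem?": "",
    "Where is the data located, and who is in charge of it?": "",
    "What would success look like for this project?": "",
}

def ready_for_next_step(answers):
    # Returns (True, []) only once every question has a non-empty answer.
    unanswered = [q for q, a in answers.items() if not a.strip()]
    return (len(unanswered) == 0, unanswered)

checklist["Where is the data located, and who is in charge of it?"] = "CRM database; IT team"
ok, missing = ready_for_next_step(checklist)
print(ok)            # → False (two questions still unanswered)
print(len(missing))  # → 2
```

Trivial as it looks, writing the answers down in one place gives you a record to revisit when stakeholders later claim the question meant something else.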
Top-down or bottom-up?
Deloitte, PricewaterhouseCoopers, KPMG, Ernst & Young – these are the renowned ‘big four’ companies that all swear by a top-down consulting approach to all their projects. But what is a top-down method, how does it differ from a bottom-up approach, and is one really better than the other? Let’s take these questions one by one.
A top-down approach is one that considers the project’s big picture first and then drills down into its details.
Consultants who are brought onto a new project will begin their investigations at the top of a company’s hierarchy, reading the relevant executive reports and then working their way from CEOs to executives to managers until they reach the people at the very ‘bottom’ – if they ever get there.
However thorough their research may be, consultants who use a top-down approach will ultimately frame their project by their initial findings, such as those given in the company’s end-of-year financial reports. The top-down approach remains the typical method used by consultancy firms.
For a bottom-up approach, exactly the opposite path is taken. This method places the numbers first. It initially examines a company’s data before moving up the hierarchy, from project managers to heads of divisions, snowballing information before finally reaching the top floor of executives.
This approach relies on facts and figures to tell the investigator about the day-to-day workings of the company. Readers will be unsurprised to know that the bottom-up approach is almost always used by data scientists.
Considering this is a blog about data science, we might automatically reach for the bottom-up approach, thinking that a method that focuses on databases is surely preferable to one that begins with filtered data. This, however, would be jumping the gun.
Yes, as a bottom-up approach has its basis in facts, it is possible to reach a conclusion much faster than if we use a top-down approach. However, any project investigator using a bottom-up approach will tell you that it is near impossible to drive change with this method.
Companies don’t just run on data – they run on people and relationships. We cannot crunch numbers and then rush into the director’s office, proclaiming that we know how to solve a problem and waving a piece of paper with the data to prove it.
The numbers are, as you will find out in this blog, just one piece of the puzzle. We also have to understand a company’s culture, its mission, and its people.
So if we want to reach reliable conclusions, a little more detective work is necessary. For that, a top-down approach is perfect. To illustrate this necessary caution when using a bottom-up approach, Richard Hopkins, my mentor and a director at PricewaterhouseCoopers, says that if you only go by the numbers:
The company’s not really going to listen to you because you haven’t collaborated, you haven’t talked. Let’s say you find from the numbers that their sale of a certain product is losing them money, and draw the conclusion that they should stop production of those items.
But they might be selling those products for a specific reason – perhaps to sell another product, or perhaps it’s already being phased out.
Yes, a bottom-up approach will get you to an outcome quickly, from which you can make a decision. But without understanding the big picture, that decision may not be the best one to make.
With a top-down approach, we get stakeholder buy-in. We draw our conclusions from our colleagues’ input. With that level of engagement, we can make changes in the company much more quickly.
So, rather than using the bottom-up method, should we emulate the people at KPMG and use a top-down approach? Not necessarily. Getting the big picture takes a lot of time.
We have to work with stakeholders, organize workshops, and when we are given so much information at such an early stage we are forced to make crucial decisions about what information is useful and what should be omitted.
That’s a great deal of responsibility. What’s more, the results from a top-down approach may not end up matching the data.
What to do? Richard introduced me to the distinct advantages of applying both of these methods to a single project:
Quite often, there is a disconnect between the numbers that you get from the databases and those you get from the final reports. I’ve found that I get the best results when I do both analyses and then link them together.
This allows me to work out what happens with the final reports, and what levers in the operation are causing those financial results. So, using both a top-down and a bottom-up approach links the data to operations and gives us the full picture.
The best approach then, whenever possible, is to combine the two. This will not only ensure that you can reach an answer quickly and with the data on your side but will also give you the much-needed support from the project’s stakeholders to actually implement your best ideas.
Understand the limitations
We are at step four and are still not yet working with data, but we will eventually need to gather and prepare our information. That is going to be difficult if, after all the work we have done here, we find that the data we have identified as necessary for our investigation is not there. So what is to be done?
The most efficient approach is to understand – at a top level – what data we actually have, and whether more data gathering is necessary before the project can go ahead.
Again, this means speaking with the right people: those in charge of the company’s existing data. Talking with them further will help us to begin considering where in the data we might find problems, and where we might need further information in order to ensure that any sampling we carry out will be statistically relevant.
This step is ever so slightly chicken-and-egg, in that we must know the questions to ask of the data before we look at our datasets, but we should also ensure that we have the right data available early on so that we lose as little time as possible before proceeding with the next stage of the Data Science Process.
The best way to learn this step is through practice. Remember your thoughts on what types of data will be useful for answering your questions. List them next to your questions as you carry out this stage, and make a note to take stock of what you need at each checkpoint.
It may seem like plate spinning in your first project, but it will become much easier with experience.
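Taking stock at each checkpoint can be as simple as a set difference between the data each question needs and the data the company actually holds. A sketch with hypothetical dataset names:

```python
# Hypothetical inventory check: which datasets does each question need,
# and which of those does the company actually hold?
needed_by_question = {
    "Why are customers leaving?": {"subscriptions", "support_tickets"},
    "Which segment buys most?": {"subscriptions", "demographics"},
}
available = {"subscriptions", "web_logs"}

for question, needed in needed_by_question.items():
    missing = needed - available          # data we must gather first
    print(question, "->", sorted(missing))
# → Why are customers leaving? -> ['support_tickets']
# → Which segment buys most? -> ['demographics']
```

Listing the gaps this early tells you whether more data gathering is needed before the project can go ahead, rather than discovering it mid-analysis.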
CASE STUDY Filling the gaps
Ruben Kogel is Head of Data at VSCO, a California-based art and technology company that helps artists to create digital user experience tools through online subscriptions. At the time of his appointment, he was the only data scientist at his company, and VSCO did not have a standardized practice for creating data-driven reports.
But Kogel saw an opportunity in their data to answer key questions that the company wanted to answer: who are the people that buy VSCO subscriptions, and does their purchase make them ‘behave differently’ afterward?
Ruben knew that the problem required looking at the free account users who then upgraded to their paid-for subscription services. But that was only the tip of the iceberg – it was necessary for Ruben to start drilling down and specifying:
I needed to know more before I could get started. I wanted to know, among other things, what indicated changes in behavior, and why it was important to ‘know’ our customers.
At the time, I didn’t know how VSCO’s users were targeted for subscriptions, so that was a knowledge gap I needed to fill before I could consider answering the question.
Deconstructing the question in this way helped Ruben to focus his analysis. It became clear, by discussing its parameters with colleagues, that the question was really to do with marketing.
The question, then, was reformulated to specifically target marketing needs: ‘VSCO has millions of users who are all potential buyers, but they are not uniformly as likely to purchase a VSCO subscription.
To that end, (1) how are our users segmented in terms of their preferences, behaviors and demographics, and (2) which of these customer segments represent the most likely purchasers?’
Once that had been clarified, the rest was straightforward for Ruben. From speaking with the marketing department he knew the behaviors that predicted whether a customer would be likely to purchase a subscription.
What Ruben had effectively done was take a rather unstructured question at the beginning and formulate it into something that not only responded to precisely what the investor wanted to know but also identified where the company could be gathering more data in order to improve its marketing practices.
If you are working within an established company that has accumulated a lot of data over time, identifying the problem first and foremost becomes all the more important.
I have often found that, while colleagues might be open to the idea of using data-driven techniques to solve a problem, unless they are also working as data scientists they are less open to the idea of having to work for it.
This can even be true for the people who manage the databases. And it is where many companies start out on the back foot – in an ironic twist, due to the unmanageable amount of data they have, they lose the sense of its purpose and therefore devalue it. It is up to us to help them understand the data’s importance, and that journey begins here.
Get people on board
Your data science will shake things up. As the key game-changer, you may find yourself up against resistance. If you find yourself deliberately stonewalled by a colleague, do something about it. If you are missing information as the result of intentional stalling by a co-worker, do not hesitate to escalate the matter.
The bottom line for any data science project is to add value to the company, and if your stakeholders know this, they should also know that it must take priority. You won’t win everyone over to this idea, but my personal philosophy is to never take this kind of opposition lying down: be prepared to go over their heads to get the job done.
Mine the data (optional)
Data mining might be my favorite part of the process in any project. Preventing a data scientist from mining the data is a little like disallowing museum curators from researching the materials for which they are responsible.
This step is where you can afford to be explorative. To me, data mining is a process where you carry out tests to scrutinize the data at a top level and find areas that could yield interesting insights for further investigation.
At this experimental stage, I like to put the data into Tableau, which will read the data and help you create some initial descriptive visualizations such as easy-to-read tables, charts, and graphs. This generates a wonderful top layer – which you can use as a lens to formulate your project’s questions.
Ultimately, when carried out at this stage of the process, mining data is a basic way to better understand the problem and to guide your eventual analytics process. It is a test drive for your data, where you try it raw to see whether any trends might emerge even at this early stage. Data mining can save you a lot of effort later on.
At the same time, don’t be disheartened if this process does not result in anything. The data may or may not suggest steps or solutions to us, depending on several factors: the company, the quality of the data, and the complexity of the problem.
So, take this step with a pinch of salt. And if you do find something of interest, make a note of it and ensure that you are aware of your findings when you continue to step six…
Refine the problem
Now that we have understood the scope of the problem and the amount of data at our disposal, we can start to dig a little deeper. Here is where we begin to match up the project’s scope with the data, to distinguish the variables and data that will be useful from those which will not, and to truly deconstruct the question.
While all data has the potential to be useful, we cannot deploy all the information we have for every problem – which can only be a good thing: if all data was always useful, the amount of information returned to us would simply be too unwieldy to manage. For this reason, we can be choosy about the data we select.
This means making sure that we consider the parameters and context of the problem that we want to solve before moving forward. Ultimately, refining the problem saves us time by eliminating data that is irrelevant to our question.
Subdividing our data
Let’s take the example of a company suffering from low profits. Their top-level problem is low profitability. We may initially want to ask: what are the contributing factors here? Perhaps the company’s overall revenues are too low. Or the company’s expenses could be too high. These are already two possibilities.
Using these two categories (revenues and expenses), we may be able to further dissect the issue. What kind of expenses does the company incur? What are the different products and services that the company offers?
Asking these additional questions might take us back to our stakeholders with specific requests: to look at the finances of the company, to see how the product/service offer is segmented, to define what the company’s profit goals have been over time, and so on.
If stakeholders know their data, they might be able to give a direct answer to some of your questions. It is likely, for example, that they would know whether or not expenses have significantly changed over a period of time. Merely by asking this question, we can discount the potential problem of expenses. And that enables us to potentially whittle down the problem to one of revenue.
Gather additional data
At this stage, you will already have identified the data that you need and gathered a reasonable list of questions to solve the problem. Now is the time to evaluate the efficacy of your sub-questions. After all, it is simply not worth answering questions that you know the company will have no interest in or will not act upon.
Ask yourself now: what are the expected outcomes for these sub-questions? Do they help to solve the problem, or is there anything still missing?
This is where you will thank yourself for having worked through the previous six steps before reaching this point; isolating the key areas from which you need additional data will hone and therefore accelerate the data-gathering process. Draw up your plan and then put it aside; hold off from doing any data gathering at all until you have finished step eight.
Quantitative and qualitative methods
If we have identified that we need more data (and this is very likely; I have never been involved in a single project that did not require additional data), it is at this point that we must consider what kind of data would result in the best outcomes.
The most significant questions for data gathering are ‘Where are our sources?’ and ‘Do we want to use quantitative or qualitative research methods to interrogate them?’
What are the quantitative and qualitative methods? Simply put, quantitative methods gather numerical information, while qualitative methods gather non-numerical information. But there are a few additional differences that we should take note of before we decide which method to use in our additional data gathering.
Quantitative methods should be used when we want to gather statistical information. As we will discover in the next blog, they are much simpler to gather into a dataset than qualitative methods, but our final decision as to what kind of data we collect must not rest on simplicity for simplicity’s sake. Both methods will answer different questions.
We cannot, for example, decide to use a qualitative approach to collecting something like age, because it is a fact, not an opinion (whatever you may want to tell yourself). We would use quantitative methods if we needed to count items, or to measure salary development, or to find out more about consumer demographics.
Be careful to note here that quantitative data isn’t just numerical data – rather, it is data that can be counted. Questions that ask about respondents’ favorite brands or political affiliation are still quantitative because the answers can technically be counted into categories.
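To illustrate that point, here is a minimal sketch in Python (the survey question and the responses are hypothetical): the answers are non-numerical, yet they are quantitative because they can be counted into categories.

```python
from collections import Counter

# Hypothetical responses to "What is your favorite photo app?" --
# non-numerical answers, but still quantitative data, because each
# answer can be counted into a category.
responses = ["VSCO", "Instagram", "VSCO", "Snapseed", "VSCO", "Instagram"]

counts = Counter(responses)
print(counts.most_common())
# [('VSCO', 3), ('Instagram', 2), ('Snapseed', 1)]
```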
Qualitative methods use open-ended questions that can result in an infinite number of answers. They are exploratory in nature, and they help to isolate – but not quantify – trends in opinions, thoughts, and feelings.
We might want to use this approach when we need more context to understand a problem, or when a problem is too complex to be answered in a quantitative way.
Qualitative methods, then, are preferable for gathering data on a consumer’s emotional investment in a product, or when we want to build a broader picture of the way respondents might feel about a political party.
Inform your stakeholders
Once we have taken into account all of the previous seven steps, it is absolutely necessary that we, our team, and every one of the project’s stakeholders are on the same page.
Making clear the problem that you will be solving – as identified through the parameters you have devised – will ensure that everyone understands precisely what your approach will be, and this will reduce the chances of others moving the goalposts as the project gets underway.
The stakeholder who has requested that the project is carried out must agree to our problem outline, which ideally should sketch out not only the content of the project but also its timeframe.
I would highly recommend that you divide the project up into milestones, which will allow your stakeholders to stay informed of your progress and safeguard you against the backlash that can come at the end of a project when you have kept your cards too close to your chest.
It is also necessary that we highlight to our stakeholders that this is not a regular business project. Data science projects will not always fit the PRINCE models that are so familiar and favored in businesses.
This will help protect us against stakeholders coming to the project with any preconceptions and will give us an opportunity to explain to them precisely what steps we will be taking towards completing our tasks.
The one thing that I insist on at the outset of any data science project is to make sure that you get stakeholder buy-in in writing. You may be the best of mates in your personal life, but in my experience, stakeholders have a tendency to change their concept of what they want as the project develops, no matter who they are.
It’s understandable for undefined projects, but this contributes to scope creep, which can either overwork you beyond the original parameters of the project or it can kill the project completely. So before you move on to preparing your data, get that confirmation in writing.
The truth is not always pretty
How do you deal with not alienating stakeholders when your results aren’t giving them what they were hoping for? How can you avoid this minefield?
Unfortunately, there’s no positive spin that you can put on what the data is telling you if the truth is a negative result. The truth is not always pretty.
In data science, you are looking at hard facts – overlooking or prettying up results for the sake of sparing a stakeholder’s feelings is not conducive to anything.
If I can give any advice here, it would be this: prepare your audience for the possibility that the results may not be what they want. Make it clear up front that you do not know what the outcomes will be.
They may not like their insights. Put it in their heads right away that you are mining for facts, not flattery, and with luck, they won’t shoot the messenger if the report is less than favorable.
This stage of the Data Science Process should only take up a small amount of time in the project’s cycle. Sometimes, those who are new to the process can find themselves spending too long on it because they want to ensure that they have developed a robust methodology.
Remember: you can never refine a problem to the extent where you will know exactly what you want. If you do a good job at this stage, then you are likely to save time, but you must also learn to let go, and that comes with experience.
If you follow the steps outlined above, you can both cover yourself in the event of difficulties further down the line and gain the confidence to move on to preparing your data in a timely fashion.
Ultimately, unless the problem that you have been given is particularly fiendish or is one that requires speaking to multiple, disparate stakeholders, identifying and refining the problem should take you a week at the most.
But any deadlines you enforce here should, if possible, not be shared with others – that would only add undue pressure for you. If it takes you a few more days to be happy and comfortable with the progress you have made, then so much the better.
My recommendation? Give yourself ample time to meet the deadline. It is far better to under-promise and over-deliver than to do the opposite. A good rule of thumb is to work out how many days you think the project will take as a whole and then add 20 percent to that number.
More often than not in data science, you barely meet your deadline. And if you come up against any obstacles and you think that you may not meet the date on which you had initially agreed, make sure to tell the person who needs to know as early as possible. Keeping people informed will build trust between you and your stakeholders, and keep them on your side.
The art of saying no
As you become established within your organization and you start to add value and make people’s roles more efficient, people will flock to you for help on their projects. Unfortunately, you will not have the time to assist all of them – and for that reason, you will have to learn how to say no.
The first question to ask yourself when you’re asked to help with a problem is: is this a project that I want to work on?
Some may be interesting, and others may be run by friends, but ultimately your decision should be based on the business value that you derive. You can then compare your options and select the one that will be the most fruitful for your company.
Always remember that the essential job of a data scientist is to deliver value for the company. This defining criterion for success will also serve as an aid for you when you tell people why you are not taking on their project.
How do you say no? First, refrain from responding to a project idea immediately. Tell the person with whom you are speaking that you will think about it and that you will get back to them within a certain timeframe (I give myself two working days to mull a request over).
If you want to take a more diplomatic pathway, one trick is to make sure that you are not considered ‘the data guy’ or ‘the data girl’ – a number cruncher. If you position yourself instead as an adviser, even if you cannot assist them on the project itself, you can give them help with how they can use data.
I believe this to be far better than giving an outright ‘no’, and it also gives colleagues something to work with. Take just half an hour to research what they need – perhaps you will find a tool that will guide them in the right direction.
From there, they will perceive you as a consultant rather than the data holder; someone who will give valuable feedback and help them towards their goal, rather than giving them the answers. This will also benefit you, as you will gain exposure to different parts of the business.
By following the eight steps outlined in this blog, you will not only have protected yourself against the most common snags in data science projects but you will also already have begun to establish yourself within the company.
Enjoy this explorative and communicative stage of the process – in the next stage, you will be almost entirely confined to your desk.
Most of us have at some time been abroad in countries where we don’t speak the language. With this basic method of communication closed off to us, it can be very difficult to get our meaning across. Even if we do have some knowledge of the language, the gaps in our vocabulary and grammar will often cause us (and the listener) some frustration.
Language, then, is a fundamental necessity if we want to understand and communicate with another individual. And preparing data is all about establishing a common language between human and machine.
In this blog, we will learn why data should never be analyzed without first having been prepared, the step-by-step process in which data can be prepared and the best methods that I have learned for managing problems in datasets.
Encouraging data to talk
As practitioners, unless we are very lucky, data will often come to us ‘dirty’. It is often collected by people who do not standardize their records, managed by people who might tamper with their datasets’ column and row names to suit their own projects, and stored in non-optimal locations that can cause damage to the data.
With so many different people working on a single dataset and using different methods for adding data, the resulting datasets in many organizations are unsurprisingly riddled with errors and gaps. And we cannot expect a machine to know where the errors lie or how to fix inconsistencies in information.
So it is our job to prepare the data in such a way that it can be comprehended – and correctly analyzed – by a machine.
With great power comes great responsibility
Data preparation (also known as ‘data wrangling’) is a complex component of the entire process and, as it comprises a number of tasks that can only be completed manually, this stage normally takes the most amount of time.
The reason behind this close attention to data preparation is that, if the raw data is not first structured properly in the dataset, then the later stages of the process will either not work at all or, even worse, will give us inaccurate predictions and/or incorrect results.
This can spell disaster for you and your company, and at the very worst end of the scale, neglect of this stage can result in firings and, in the case of freelance work, even lawsuits.
It is not my intention to scare you – I simply mean to show you how essential it is to prepare our data. Surprisingly, despite its importance, this is where I have found a serious gap in the educational materials on data science, which largely focus on the later stages of the process: analyzing and visualizing.
These books and courses use datasets that have already been prepared. But while this approach may be fine if you’re just getting to grips with the discipline, paying no attention to preparation means that you are effectively only learning the cosmetic ways to work with data.
Working only with datasets from educational courses will merely show you data that has been cleaned up to fit the case example. But in the real world, data is often dirty, messy and corrupt, and without knowing the causes and symptoms of dirty data, we cannot adequately complete our project.
If you do not prepare your data, when you get out into the real world on your first project, and your algorithm inevitably returns ‘missing data’ errors, or ‘text qualifier’ errors, or ‘division by zero’, your project will grind to a halt.
How do we know, then, that our data has been sufficiently prepared? Quite simply, once we have ensured that it is suitable for our data analysis stage. It must:
be in the right format;
be free from errors; and
have all gaps and anomalies accounted for.
A common phrase that data scientists use is ‘garbage in, garbage out’, which means if you put unclean data into an algorithm, you will only get nonsensical results, making your analyses useless.
It is true that a number of practitioners struggle with this step, but that is only because they don’t have a framework to follow.
This usually leads to an unstructured, undocumented approach and means they have to reinvent the wheel every time they prepare their data; an ineffective and time-consuming approach in the long run.
So, let’s get started with the process of preparation.
CASE STUDY Ubisoft – making the case for data preparation
Ulf Morys is Finance Director at the German branch of Ubisoft, a game design, development, and distribution company that has created popular game franchises from Assassin’s Creed to Far Cry.
Ulf oversees a distribution subsidiary, which distributes Ubisoft’s video games in the area of Germany, Switzerland, and Austria (GSA) and is also responsible for the financial components of the company’s operations in central Europe.
Historically, Ubisoft’s data had been used solely by the production team for in-game analytics and monetization. Until Ulf changed matters, finance did not number among its strategic areas for data science.
But ignoring the improvements that data science makes can be a costly oversight, and having prior experience of leveraging data to make important business decisions (in his previous job, Ulf saved his company $40 million in a merger thanks to his attention to the data) meant that he knew that a deliberate strategy for using company data was essential.
He says, crucially:
Preparing the data doesn’t add more data, it just improves the way you look at it. It’s like that scene in the film The Wizard of Oz when Dorothy opens the door of her house to the kingdom of Oz and the black and white world of Kansas is replaced with Technicolor.
There’s nothing fundamentally different from the way the story functions on a technical level, and yet everything has changed. It has been cleaned up.
To better understand how Ubisoft, in particular, benefited from data preparation, Ulf turned to Ubisoft’s production team, who had been collecting data from its thousands of online gamers for years, from the length of time the game was played and the length of time that it took to complete individual levels, to what the player did in the game and where they failed on the game map.
Ulf found that they were using this past data to assess the likelihood that customers would purchase in-game products through a ‘freemium model’. Having the data to hand not only helped Ubisoft to find out the purchasing patterns of its core customers but also to identify the behaviors that could be applied to future players.
Talking to his team about the grand steps the production team was making thanks to data science turned their heads to the idea.
During a financial strategy meeting, Ulf’s team mapped out the sources of all Ubisoft’s available data and what was missing from the picture, which offered something tangible from which colleagues could bounce off their ideas. ‘Very simply,’ says Ulf, ‘if you don’t know something is there, you can’t ask questions of it.’
The gaps showed what key data they needed to gather from their customers (store dimensions, space provided for selling video games, attitudes of typical buyers to video games and consumer feelings about Ubisoft’s output) before they could make meaningful analyses. Ulf says:
It was evident why we weren’t taking a more systematic approach to our customers: we didn’t have the data. When I went to our sales department, I was told which of our customers were good based on their own insider knowledge. But that knowledge hadn’t been collected systematically. Getting the data – the hard facts – was absolutely necessary.
Gathering that information for 2,000 stores enabled Ulf to prepare statistically relevant data that would finally be suitable for analysis. This helped Ubisoft to better target its customers in a way that had not been possible before.
Preparing your data for a journey
In order to get our raw data (information that has not been prepared) to respond to analysis, we first have to prepare it. The method for doing so comprises just three stages:
Extract the data from its sources;
Transform the data into a comprehensible language for access in a relational database;
Load the data into the end source.
This process is known as ETL, and it will help us to gather our data into a suitable format in our end source – known as the ‘warehouse’ – which can be accessed and analyzed in the later stages of the Data Science Process.
A data warehouse stores otherwise disparate data in a single system. Oftentimes, it will comprise relational databases.
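The ETL steps above can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a production pipeline: the table name, the column names, and the sample values are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a .csv source (here, an in-memory sample).
raw = io.StringIO("country,revenue\nGermany,1000\nAustria,\n")
rows = list(csv.DictReader(raw))

# Transform: convert revenue strings to integers, treating the
# missing value as zero (one possible cleaning rule, chosen here
# purely for illustration).
cleaned = [
    (r["country"].strip(), int(r["revenue"]) if r["revenue"] else 0)
    for r in rows
]

# Load: insert the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)  # 1000
```

In a real project each step would be far more involved, but the shape – extract, transform, load – stays the same.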
What is a relational database?
Relational databases allow us to examine related data across multiple datasets. In this type of database, the relationships between the units of information across datasets matter.
The datasets in a relational database are linked by columns that share the same name. For example, if multiple datasets contained columns with the header ‘Country’, the data from those columns could be compared across them in a relational database.
The benefit of having this type of database is that they facilitate the methods of analysis and visualization that are required to derive insights, where data can be examined across multiple sets without the need for individual extraction.
Perhaps the best way to illustrate how advantageous a relational database can be is to compare it to Excel, which is frequently used by people unaccustomed to working with databases:
1 It maintains integrity.
Every cell in Excel is individual; there are no limitations to the types of values that you can place into them. You can add dates or text, for example, underneath phone numbers or monetary values, and Excel will be perfectly happy. A relational database will rap you over the knuckles for such negligence.
In a database, columns have predefined types, which means that a column that has been set up to accept dates will not accept any value that does not fit the date format. Databases, then, will keep an eye on the process for you, querying any values that do not match those predefined by the column.
2 It combines datasets.
Combining datasets within a relational database is easy; it is much harder to do in Excel. Relational databases have been designed for that purpose, and that makes it easy to create new datasets by combining common values across the relational database.
All that is required of you is the ability to execute a simple command. As combining tables is not a primary function of Excel, an advanced knowledge of programming is required to shoehorn your data into a single table.
3 It is scalable.
Relational databases have been specially designed for scalability; as they combine datasets, it is expected that they must be able to cope with a large number of informational units.
That means – regardless of whether you have five or five billion rows – your relational database is unlikely to crash at a crucial moment. Excel is far more limited in this capacity, and as your dataset grows, the software’s performance will deteriorate as it struggles to cope with the overload.
The data cleanse
We know that in the real world, data is more likely to come to us dirty, but there is some disagreement among practitioners as to how and when data should be cleaned. Some people clean before they transform, and others only once they have loaded it into the new database.
My preference is to clean the data at each stage of the ETL process – it might seem an inefficient use of your time but I have found this to be the best way to protect yourself against obstacles further on.
Unfortunately, data preparation is always going to be time-consuming, but the more due diligence you take in this stage, the more you will speed up the Data Science Process as a whole.
Extract your data
We need to extract data in the first instance 1) to ensure that we are not altering the original source in any way, and 2) because the data that we want to analyze is often stored across a number of different locations. Some examples of possible locations are:
an Excel spreadsheet;
a .csv file;
If we are using data from multiple sources, then we will have to extract it into a single database or warehouse in order for our analyses to work. But it is not always easy to extract from locations that use formatting particular to that system – Excel is one such culprit, to which we will return later on in this blog.
You will get to know these types of files quite intimately as a data scientist. They are the simplest type of raw files of data – completely stripped of any formatting – which makes them accessible to any number of programs into which we may want to import them.
In .csv files, rows are placed on new lines, and columns are separated by commas in each line. Hence the abbreviation, which stands for ‘comma separated values’.
The beauty of working with raw files is that you will never lose or corrupt information when you load your dataset into a program. This is why they are the standard for most practitioners.
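The structure described above – rows on new lines, columns separated by commas – can be seen by parsing a tiny .csv file with Python’s standard csv module (the dates and values here are hypothetical):

```python
import csv
import io

# The raw text of a small .csv file: each line is a row, and commas
# separate the columns within each line.
raw = "date,dividend\n26-Apr-15,0.52\n24-Apr-15,0.48\n"

rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    print(row)
# ['date', 'dividend']
# ['26-Apr-15', '0.52']
# ['24-Apr-15', '0.48']
```

Note that everything comes back as plain text – there is no hidden formatting for the parser to lose or corrupt, which is exactly why raw files are the standard.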
Why it’s important to extract data even if it is only in one location
Technically, you could analyze the data directly within its storage facility (the original database, an Excel spreadsheet and so on). While it is not recommended, this method is acceptable for making quick calculations, such as computing the sum of a column of values in Excel.
However, for serious data science projects, carrying out data tasks within its original storage facility is a huge red flag. In doing so, you might accidentally modify the raw data, thereby jeopardizing your work.
And this is the best-case scenario, as it only affects you and your individual project. Working within the storage facility rather than extracting the original data to a test database leaves it vulnerable to user damage, and your work may even end up crashing the internal systems of your institution.
That should give any data scientist pause when they start working with an organization’s data. They are entrusting us with important if not essential company information, so we must ensure that we leave the data just as it was when we started on the project.
Software for extracting data
There are a couple of exceptional free-to-use programs for extracting and reading data that are sure to wean you off any bad habits often formed by Excel users. These programs work well with data that is in a raw .csv file format.
Although it can take time, data can in most cases be stripped down to a raw .csv file. And if you’re working for a large organization where you have to request data extracts, then good news: the data will most likely be given to you in a .csv format anyway.
Notepad++ This is my go-to tool when I want to look at the data I have extracted. Among other features, it is a powerful editor for viewing .csv files, and it is much more user-friendly than the notepad software that comes as standard with Windows. Notepad++ also has a few other significant advantages:
Row numbering, enabling you to navigate through your files and keep tabs on where potential errors might be found;
A search and replace feature, which enables you to quickly find values or text that you don’t want in the dataset and amend them;
It has been designed for the purpose, which means that you can be confident it will not inadvertently modify your data as other spreadsheet software might;
While the Notepad software that comes with Windows generally has trouble dealing with large files, Notepad++ can open files up to 2 GB.
EditPad Lite is a program that is free for personal use. It offers similar features to Notepad++, with one major benefit: although both work well with files that are under 2 GB, I have noticed that Notepad++ can sometimes struggle with datasets at the top end of this file size.
As a result, I have found EditPad Lite to perform much better with my larger files. If you find that you are overworking Notepad++ with your files, consider EditPad Lite.
Transform your data
You cannot simply dump your data from its original source directly into a data warehouse. Not unless you want to work with a messy dataset. By transforming your data, you can reformat the information you plan to use into a language that will suit your objectives.
In a broad sense, the transformation step includes alterations such as joining, splitting and aggregating data. These are functions that allow us to create derived tables to better suit the problem at hand. But the most important function of transformation is data cleaning – and that’s what we will focus on.
In this step, we must identify and manage any errors in our original database, which in the real world will often run the gamut from formatting inconsistencies, through outliers, to significant gaps in information. But to do so, we first have to understand what we are looking for. So, how can we identify dirty data?
Dirty data is information that is either incorrect, corrupt or missing. These three qualifiers are due to the following factors.
Incorrect data In these instances, information has been (partially or completely) incorrectly added to the database (eg inputting a currency value into a date cell). Sometimes, we will know that the data is incorrect. It may be evident when there is a mismatch between columns.
For example, if we had a single row, where the country cell was ‘France’ and the city cell was ‘Rome’, we would know that one was incorrect. We may also be able to identify incorrect data by simply using our common sense – we would know that an entry in the date of birth column that has been given as ‘12/41/2001’ simply cannot be correct.
Corrupt data Corrupt data refers to information that may originally have been correct in the dataset but is now mangled. Information can become corrupted in different ways.
Contributing factors can include if the database to which it belongs has been physically damaged, if it has been altered by another software or if it has been previously extracted in unadvisable ways.
Sometimes, data can simply become corrupted due to transfer to a database that does not support the format it had in the previous storage.
Missing data Missing data either occurs when no information is available for a given cell, or when the person responsible for inserting the data has neglected to add it into the cell. Missing data is a common topic in data science, and it is most likely to occur because of human error.
What can happen when we don’t deal with missing data
We should always be aware of any gaps in our information. Below, you’ll see a real-life example of data that we have extracted from an Excel spreadsheet into a .csv file that shows dividend pay-outs, organized by year.
As you can see from the commas enclosing no information, five of the columns at the highlighted row 504 (26-Apr-15) have missing fields of data.
We have been lucky in this instance that the missing columns have survived the extraction – oftentimes, the missing data values are not defined by commas.
What this would mean is that when we plug the dataset into an algorithm, it would recalibrate our data incorrectly, pushing the data in the row below up to fit the number of columns required in the dataset.
In the above example, this would mean that the date 24-Apr-15 would be brought up to the column directly to the right of the ‘10000000/10000000 Stock Split’ value.
Missing data in this way can cause us significant trouble in the analysis stage if we don’t catch the problem beforehand.
I have known some newbie data scientists who will check the top 100 rows of their dataset, but this is a rookie mistake – if there are errors in the data, you are much more likely to see them at the end of the dataset because the errors will shift information.
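A quick way to avoid that rookie mistake is to inspect the bottom of the dataset and count the gaps per column programmatically. The sketch below uses a small in-memory stand-in for the dividend file described above; the file contents are illustrative.

```python
import pandas as pd
from io import StringIO

# A small stand-in for the dividend file described above. In a shifted
# dataset, errors accumulate toward the bottom, so inspect the tail too.
csv = StringIO(
    "date,dividend,note\n"
    "24-Apr-14,0.5,ok\n"
    "26-Apr-15,,\n"        # a row with missing fields, as in the example
    "24-Apr-16,0.6,ok\n"
)
df = pd.read_csv(csv)

tail = df.tail(100)                   # check the bottom, not just the top
missing_per_column = df.isna().sum()  # count gaps column by column
```

Because the empty fields here are still delimited by commas, pandas reads them as missing values rather than shifting the rows – exactly the lucky case described above.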
Fixing corrupt data
To fix corrupt data so that it can be read by a machine, we can first try the following:
re-extract it from its original file to see if something has corrupted the file during the first extraction;
talk to the person in charge of the data to see if they can cast light on what the actual data should be; or
exclude the rows that contain corrupt data from your analysis.
If you find yourself in a situation where you are missing data and need to retrace your steps to obtain additional input before the project can progress, here are three ways in which you can facilitate the process:
Always be courteous to the people who are giving you data. Some people may find data collection frustrating and will let that show in their communication, but try to stay neutral.
Remember that they are not data scientists, and may not take the same joy in the process of gathering data as you! Explain to them that every data-driven project will have different outcomes and that each project requires different types of data.
You may need to approach the team responsible for your datasets multiple times, so be friendly and get them on your side.
Make sure that anyone you speak to fully understands the problem that you are trying to solve, as well as their role in combating it. Seeing the bigger picture will help your colleagues be more patient with your requests.
Keep a list of the company’s data assets with you at all times. Having this to hand means that when you’re off on the hunt for new data, you will be able to cross-check what the organization already has and reduce the likelihood of you collecting duplicates.
When you list the data assets, I recommend recording the names of the data sources as well as the databases’ columns and their descriptors.
Fixing missing data
If we cannot resolve our problem by using any one of these methods, then we must consider our data as missing. There are various methods for resolving the problem of missing fields in spreadsheets:
Predict the missing data with 100 percent accuracy. We can do this for information that we can derive from other data. For example, say we have a spreadsheet with customer location data that contains column values for both ‘State’ and ‘City’; the entry for State is missing but the
City entry is ‘Salt Lake City’. Then we can be certain that the state is ‘Utah’. It is also possible to derive a missing value based on more than one value, for example, to derive a profit value from both revenue and expenses values.
Bear in mind that when we are inputting information in both examples, we are doing so on the assumption that there were no errors in the collection of the data.
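Both derivations can be sketched in a few lines of pandas. The city-to-state lookup below is a hypothetical, deliberately tiny mapping; in practice it would only contain cities that belong to exactly one state.

```python
import pandas as pd

# Illustrative lookup; only safe for cities that belong to one state.
city_to_state = {"Salt Lake City": "Utah"}

df = pd.DataFrame({
    "city": ["Salt Lake City", "Springfield"],
    "state": [None, None],
    "revenue": [100.0, 80.0],
    "expenses": [60.0, 50.0],
    "profit": [None, None],
})

# Fill the state only where the mapping is unambiguous;
# ambiguous cities (like Springfield) stay missing.
df["state"] = df["state"].fillna(df["city"].map(city_to_state))

# Profit is fully determined by the other two columns.
df["profit"] = df["profit"].fillna(df["revenue"] - df["expenses"])
```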
Leave the record as it is. In this scenario, you would simply leave the cell with no data empty. This is most useful when specific fields have no bearing on our analysis and therefore can be left out of our testing.
But it can also be used if we are planning to use a method that isn’t significantly affected by missing data (ie methods that can use averaged values) or if we use a software package that can deal appropriately with this lack of information.
In cases where you leave the record as it is, I would recommend keeping notes of where your data contains gaps so that any later anomalies can be accounted for.
Remove the record entirely. Sometimes, the data that is missing would have been critical to our analysis. In these instances, our only option is to remove the entire row from our analysis, as the missing information renders it unusable. Obviously, the major drawback here is that our results will become less significant as the sample size has decreased.
So this approach is likely to work best with large datasets, where the omission of a single row will not greatly affect the dataset’s statistical significance.
Replace the missing data with the mean/median value. This is a popular approach for columns that contain numerical information, as it allows us to arbitrarily fill any gaps without tampering too significantly with our dataset. To calculate the mean, we add all of the values together and divide that total by the number of values.
To calculate the median, we find the sequential middle value in our data range (if there are an uneven number of values, just add the two middle numbers and divide that total by two).
Calculating the median rather than the mean is usually preferable because the median is less affected by outliers: extreme values at either end of the range will not skew our results.
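The difference is easy to see on a column with one outlier. In the hypothetical series below, a single extreme value drags the mean far above the typical price, while the median stays representative.

```python
import pandas as pd

# Hypothetical column of unit prices with gaps and one outlier.
prices = pd.Series([10.0, 12.0, 11.0, None, 200.0, None])

mean_fill = prices.fillna(prices.mean())      # pulled upward by the 200 outlier
median_fill = prices.fillna(prices.median())  # robust to the outlier
```

The gaps are filled with 58.25 under the mean but 11.5 under the median – the latter is far closer to a typical value in the column.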
Fill in by exploring correlations and similarities. This approach is again dependent on your missing data value being numerical, and it requires the use of models to predict what the missing values might have been.
For instance, we could use a predictive algorithm to forecast the missing data based on existing similarities among records in your dataset.
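As a minimal sketch of this idea, assuming a hypothetical dataset where employee headcount correlates strongly with revenue, we can fit a simple linear model on the complete rows and predict the missing values. A real project might use a richer model, but the workflow is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical data: employees correlates with revenue, so we fit a
# straight line on the complete rows and predict the gaps.
df = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0, 40.0],
    "employees": [100.0, 200.0, 300.0, None],
})

known = df.dropna(subset=["employees"])
slope, intercept = np.polyfit(known["revenue"], known["employees"], 1)

missing = df["employees"].isna()
df.loc[missing, "employees"] = slope * df.loc[missing, "revenue"] + intercept
```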
Introduce a dummy variable for missing data. This requires adding a column to our dataset: wherever we find missing values in the dataset, we allocate a ‘yes’ value to it – and when it is not missing we give it a ‘no’ value.
We can then explore how the variable correlates with other values in our analysis, and so retrospectively consider the implications of why this data might be missing.
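The dummy-variable approach amounts to one extra column. The sketch below, on a hypothetical growth column, records a ‘yes’/‘no’ flag for each row so that the pattern of missingness can itself be analyzed later.

```python
import pandas as pd

df = pd.DataFrame({"growth": [0.1, None, 0.3, None]})

# Add a yes/no flag recording where the value was missing, so the
# pattern of missingness can be correlated with other variables later.
df["growth_missing"] = df["growth"].isna().map({True: "yes", False: "no"})
```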
Dealing with outliers
Let’s say that we are working for a company selling phone accessories and we want to find the average number of units that we have sold one of our phone cases to each of our distributors. We have been in business for years, and so our datasets are large.
The person responsible for inputting these values into our database was having a bad day, and instead of inputting the number of product units into the product column, they put the distributor’s telephone number.
That error would abnormally raise our average in this column (and would mean that a single distributor has purchased at least 100 million units!).
If we were to analyze that record on its own, we would probably notice the error. But if we simply calculated the average without looking at the data, our report would be skewed by that outlier – and that would make the report unusable.
Nevertheless, it’s important to distinguish between outliers that can be attributed to erroneous information and outliers that are correct but that fall outside the normal range of values.
The value for a distributor that did purchase 100 million units of your product will still be an outlier, as the value is higher than the normative number of units purchased.
Many datasets will have outliers – our job is to understand where they are and to ensure that they do not unfairly skew our reports. This will largely depend on the type of analysis that we want to carry out.
For example, if we wanted to work out for a publishing house the average number of units sold to book stores around the world, and we know that the outlier was an exceptional purchase order, we might choose to remove the record even though it’s valid.
It is possible to find outliers in your dataset without searching for them manually, by generating a distribution curve (also known as a bell curve for normal distributions) from your column values.
Distribution curves graphically depict the most probable value or event from your data by way of their apex, and it is simple enough to create them directly, even in Excel. Once you have created your distribution curve, you can identify the values that fall outside the normal range.
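The same idea can be expressed numerically: values that fall far out in the tails of the distribution have large standardized (z-) scores. The sketch below, on a hypothetical units-sold column containing the phone-number mistake described earlier, flags values more than three standard deviations from the mean.

```python
import pandas as pd

# Hypothetical units-sold column: twenty normal orders plus one row where
# a phone number was typed into the units column by mistake.
units = pd.Series([100.0 + i for i in range(20)] + [4_155_550_123.0])

# Standardize each value; large |z| means far out in the curve's tails.
z_scores = (units - units.mean()) / units.std()
outliers = units[z_scores.abs() > 3]
```

The three-standard-deviation cut-off is a common rule of thumb for roughly bell-shaped data; whether a flagged value is an error or a genuine extreme still has to be judged case by case, as discussed above.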
CASE STUDY Applied approaches to dealing with dirty data
We have been given a dataset from an imaginary venture capital fund that is looking at the overall growth of start-ups in the United States. As the data was gathered without the start-ups’ involvement, some information was missing: it was either not publicly available or the start-ups were unwilling to provide that level of detail.
As you can see, various types of information are missing across our columns, and sometimes multiple values are empty in a single row. Let’s put the methods for fixing missing data into practice. Return to the methods given above and consider how you might resolve the problem of missing data yourself before reading the answers below.
Replace the missing data with the mean/median value. This is a numerical value, and so we can proxy any of the missing employee values with the overall or industry median for that column. (The industry median is preferable as it will provide a like-for-like figure.)
Leave the record as it is or predict the missing data with 100 percent accuracy or remove the record entirely. It should be relatively easy to find out to which industry the company belongs by simply investigating what it does and taking your cues from there.
But our choice depends on how important the industry is to our analysis. If the industry is important, and we cannot research it, we must remove the record from the analysis.
Leave the record as it is or predict the missing data with 100 percent accuracy or remove the record entirely. Even though inception is written as a number, it is not a numerical value in the arithmetic sense – averaging years of inception would be meaningless.
For that reason, we cannot proxy it with an average, and so if we cannot find out when the company was established, then we must accept it as missing.
Leave the record as it is or predict the missing data with 100 percent accuracy or remove the record entirely. In the example given above, we can predict the missing data with 100 percent certainty.
But we must be careful that we are being accurate: for values where a city name can belong to more than one state, we will not be able to predict the data with 100 percent accuracy and so must decide how important the data is to our analysis.
Predict the missing data with 100 percent accuracy. This is an easy one; we can calculate expenses by simply subtracting profit from revenue.
Revenue, expenses, profit and growth
Replace the missing data with the mean/median value. Calculating this block of missing values requires more than one step. We first proxy our growth, revenue and expenses using the industry medians, and then we can calculate profit as the difference between revenue and expenses.
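The case-study fix above can be sketched in a few lines of pandas. The industries and figures below are invented for illustration; the key point is the two-step order: fill with the industry median first, then derive profit from the filled-in columns.

```python
import pandas as pd

# Hypothetical start-up records; one tech row is missing its figures.
df = pd.DataFrame({
    "industry": ["tech", "tech", "tech", "retail"],
    "revenue": [100.0, 120.0, None, 80.0],
    "expenses": [60.0, 70.0, None, 50.0],
})

# Step 1: proxy the gaps with the median of the same industry,
# which gives a like-for-like figure rather than an overall average.
for col in ["revenue", "expenses"]:
    df[col] = df[col].fillna(df.groupby("industry")[col].transform("median"))

# Step 2: profit is now fully determined by the filled-in columns.
df["profit"] = df["revenue"] - df["expenses"]
```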
Transforming data from MS Excel
Excel tries to make things easier by automatically reformatting certain values. This can lead to various hiccups during the ETL process, and as this program is so frequently used to store data, I will give special attention to it here.
One common complaint I have heard from Excel users is the program’s insistence on converting long numerical values (such as phone and credit card numbers) into scientific notation. And that’s not the worst of it.
Excel can also convert dates and monetary amounts into a single format that accords to your computer’s regional settings.
While this may be convenient for single spreadsheets, which are often used for business intelligence, those kinds of automation will only end up doing you a disservice in data science, as Excel’s formatting does not translate well into a database.
And if we are dealing with a lot of data, unpicking all of the instances that Excel has altered can be time-consuming.
If we do not transform the data from Excel into a .csv file, then we will only be presented with problems further down the line. While it may be possible to restore dates that have been altered, it is near impossible to restore credit card numbers once they have been converted to scientific notation.
Just imagine the consequences for an organization that loses its customers’ credit card numbers, especially if you had been working on the only copy of the file.
Some of the most common issues are to do with dates and currency, as these values are not international and are therefore susceptible to our machines’ regional settings.
Date formats The formatting of dates will differ depending on our geographic region, and Excel has been programmed to display the date that accords to our computer’s regional settings. Most countries use a little-endian date format that begins with the day, followed by the month and the year.
In the United States, however, the date format begins with the month, followed by the day and year. We need to ensure that we are working with a consistent date format in our database.
How to fix them The best method to prevent Excel from making changes to our records is to change all of our dates to the yyyy-mm-dd format, as this is the unambiguous international standard that is not subject to regional rules. In Excel, select the column that you want to fix, right-click, and select ‘Format Cells’.
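When the data has already been extracted, the same normalization can be done in code. The sketch below, using pandas with invented sample dates, parses each regional format explicitly and writes everything out in the unambiguous yyyy-mm-dd form.

```python
import pandas as pd

# Hypothetical raw date strings as exported from regional spreadsheets.
uk_dates = pd.Series(["26/04/2015", "01/02/2015"])  # little-endian: day first
us_dates = pd.Series(["04/26/2015", "02/01/2015"])  # US order: month first

# Parse with an explicit format so nothing is guessed, then emit ISO dates.
iso_from_uk = pd.to_datetime(uk_dates, format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
iso_from_us = pd.to_datetime(us_dates, format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
```

Note that ‘01/02/2015’ and ‘02/01/2015’ both become 2015-02-01 once their regional formats are declared – exactly the ambiguity the yyyy-mm-dd standard removes.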
Currency formats Currency will also depend on our computer’s regional settings. In these cases, it is not only necessary to consider the currency symbol but also the decimal marks that are being used. Currency symbols should be stripped completely from your data, as they will otherwise be read as text.
Countries use different decimal marks for their currency – indicated either by a point (eg £30.00 in the UK) or a comma (eg €30,00 in Germany).
Take note that this affects both the decimal place and the thousands separator. The sum of £30,000 would be read as thirty thousand pounds in countries such as Australia that use the comma to indicate thousands, but it may be read as thirty pounds in countries such as Sweden that use the comma to indicate decimal places.
Databases function with the decimal point system, and any commas, including thousands separators, must be stripped from your data.
How to fix them We want to strip our numbers of symbols and commas. If your country uses a decimal comma system, you must first change the regional settings of your computer to make sure the comma is changed to a dot. Select the column, right-click it and select ‘Format Cells’. In the ‘Category’ window, select ‘Currency’.
Uncheck the ‘Use 1000 separator’ box to ensure that no commas will be used, choose ‘None’ from the ‘Symbol’ drop-down box and select ‘2’ as the number of decimal places. That will remove the commas and symbols from our data.
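Outside Excel, the same clean-up can be scripted. The sketch below, on invented UK and German amounts, strips the currency symbols and normalizes both separator conventions to the decimal point that databases expect.

```python
import pandas as pd

# Hypothetical raw amounts as they might arrive from regional spreadsheets.
uk_amounts = pd.Series(["£30,000.00", "£1,250.50"])  # comma = thousands
de_amounts = pd.Series(["€30.000,00", "€1.250,50"])  # comma = decimals

# Strip symbols and separators, normalizing to the decimal point.
uk_clean = (uk_amounts.str.replace("£", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(float))
de_clean = (de_amounts.str.replace("€", "", regex=False)
                      .str.replace(".", "", regex=False)   # drop thousands dots
                      .str.replace(",", ".", regex=False)  # comma -> decimal point
                      .astype(float))
```

Both series end up as the same numeric values – thirty thousand and 1,250.50 – regardless of the regional convention they started in.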
Load your data
Once we have transformed our data into the format we need, we can load our data into our end target: the warehouse. Once this process is complete, we should manually look through our data one last time before we run it through a machine algorithm, to be absolutely certain that we are not working with underprepared data.
Quality assurance after the load
Loading the data into a warehouse can sometimes cause problems. You may have missed cleaning up some of the dirty data in the previous stage, or some of the data may have simply been loaded incorrectly. For that reason, you must learn to double-check your data within the warehouse.
The following are the quality assurance (QA) checks that you should always make at this stage:
Count the number of rows that you have in your final dataset and compare it to the initial dataset. If it is different, return to the initial dataset to find out what happened.
Unfortunately, sometimes the quickest way to check is just by looking at it, and this will mean scrolling through the data line by line. The quickest way to do this is to go from the bottom up rather than the top down because any errors in data are likely to carry downwards.
Check the columns for skewness. To completely safeguard yourself against problems in the analysis stage, check both the top 100 and the bottom 100 rows.
Check the columns that are susceptible to corruption. This usually refers to dates and balances, as we have earlier established that they are the most prone to error.
Check text values. If we have free-form text values from surveys where respondents have typed up answers to an open-ended question, then uploading this kind of text to a database can be tricky. Usually, databases will limit the maximum number of letters in a column.
That might result in cutting off an answer, which leaves our data as missing or can even sometimes affect the rest of the dataset. Free-form text can also contain symbols that databases either cannot recognize or misuse because they are qualifier symbols such as quotation marks.
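Two of these QA checks – the row-count comparison and the truncated-text check – are easy to script. The sketch below uses hypothetical source and loaded tables, and assumes a column width of 255 characters in the warehouse; your database’s actual limit may differ.

```python
import pandas as pd

# Hypothetical source table and the table as loaded into the warehouse.
source = pd.DataFrame({"id": [1, 2, 3], "answer": ["ok", "a" * 255, "fine"]})
loaded = pd.DataFrame({"id": [1, 2, 3], "answer": ["ok", "a" * 255, "fine"]})

# QA check 1: row counts must match between source and warehouse.
rows_match = len(source) == len(loaded)

# QA check 2: free-form text that sits exactly at the column limit may
# have been cut off on load, so flag those rows for manual review.
LIMIT = 255  # assumed column width in the warehouse
at_limit = loaded["answer"].str.len() >= LIMIT
suspect_rows = loaded[at_limit]
```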
Think (again) like a consultant
Quality assurance is the most important part of data preparation and, as it comes right at the end of the process, be careful not to lose steam at this stage. I was lucky to enter the field of data science through the world of consulting, which applies great diligence to QA.
With quality assurance, work is peer-reviewed. The numbers need to add up, and the results have to make sense. Don’t be afraid of this stage – it’s not designed to trip you up, it is there to help protect you from making errors later on in the process.
Those companies that have worked with data for a while have set up rigorous, predetermined procedures that data scientists must follow to the letter before any analysis can be carried out. Some companies will even have consultants to check your process and expect them to take a lot of time with this step.
Delivering an incorrect result will at the very least cost money and, in the worst case, may severely affect business operations. That is why it is so important to ensure that QA is carried out before you move on to the next step.
Now that you have a beautiful warehouse of squeaky-clean data, and you know the question or series of questions that you want to pose to it, you can finally move on to my favorite part: the analysis.