Data Science Tutorial 2019
Data Science can be seen as the interdisciplinary field that deals with the creation of insights or data products from a given set of data files (usually in unstructured form), using analytics methodologies.
The data it handles is often what is commonly known as “big data,” although data science is just as often applied to conventional data streams, such as the ones usually encountered in the databases, spreadsheets, and text documents of a business. We’ll take a closer look at big data in the next section.
Data science is not a guaranteed tool for finding the answers to the questions we have about the data, though it does a good job at shedding some light on what we are investigating. For example, we may be interested in figuring out the answer to “How can we predict customer attrition based on the demographics data we have on them?”
This is something that may not be possible with that data alone. However, investigating the data may help us come up with other questions, like “Can demographics data supplement a prediction system of attrition, based on the orders they have made?”
Also, data science is only as good as the data we have, so it doesn’t make sense to expect breathtaking insights if that data is of low quality.
Nowadays, a growing number of people talk about data science and its various merits. However, many people have a hard time distinguishing it from business intelligence and statistics.
What’s worse, some people who are adept at these other fields market themselves as data scientists, since they fail to see the difference and expect the hiring managers to be equally ignorant on this matter.
However, despite the similarities among these three fields, data science is quite different in terms of the processes involved, the domain, and the skills required. Let’s take a closer look at how these three fields compare.
As for business intelligence, although it too deals with business data (almost exclusively), it does so through rudimentary data analysis methods (mainly statistics), data visualization, and other techniques, such as reports and presentations, with a focus on business applications.
Also, it handles mainly conventional sized data, almost always structured, with little to no need for in-depth data analytics. Moreover, business intelligence is primarily concerned with getting useful information from the data and doesn’t involve the creation of data products (unless you count fancy plots as data products).
Business intelligence is not a kind of data science, nor is it a scientific field. Business intelligence is essential in many organizations, but if you are after hard-to-find insights or have challenging data streams in your company’s servers, then business intelligence is not what you are after.
Nevertheless, business intelligence is not completely unrelated to data science either. Given some training and a lot of practice, a business intelligence analyst can evolve into a data scientist.
Statistics is a field that is similar to data science and business intelligence, but it has its own domain. Namely, it involves doing basic manipulations on a set of data (usually tidy and easy to work with) and applying a set of tests and models to that data. It’s like a conventional vehicle that you drive on city roads.
It does a decent job, but you wouldn’t want to take that vehicle to the country roads or off-road. For this kind of terrain, you’ll need something more robust and better-equipped for messy data: data science.
If you have data that comes straight from a database, it’s fairly clean, and all you want to do is create a simple regression model or check whether February sales are significantly different from January sales, statistical analysis will work. That’s why statisticians remain in business, even if most of the methods they use are not as effective as the techniques a data scientist employs.
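As a rough illustration of the January versus February check mentioned above, here is a minimal sketch that computes Welch's t statistic using only the Python standard library. The sales figures and the `welch_t` helper are made up for this example; a full test would also convert the statistic into a p-value using the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical daily sales figures, invented purely for illustration.
january_sales = [102, 98, 110, 105, 99, 101, 97, 108]
february_sales = [111, 115, 109, 120, 113, 117, 110, 118]

t_stat = welch_t(february_sales, january_sales)
print(f"Welch t statistic: {t_stat:.2f}")
```

The larger the absolute value of the statistic, the less plausible it is that the two months share the same average.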
Scientists make use of statistics, though it is not formally a scientific field. This is an important point. In fact, even mathematicians look down on the field of statistics, for the simple reason that it fails to create robust theories that can be generalized to other aspects of Mathematics.
So, even though statistical techniques are employed in various areas, they are inherently inferior to most principles of Mathematics and of Science.
Also, statistics is not a fool-proof framework when it comes to drawing inferences about the data. Despite the confidence metrics it provides, its results are only as good as the assumptions it makes about the distribution of each variable, and how well these assumptions hold.
This is why many scientists also employ simulation methods to ensure that the conclusion their statistical models come up with is indeed viable and robust enough to be used in the real world.
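One common simulation method of this kind is a permutation test, which checks a difference between two groups without assuming any particular distribution. The sketch below is only an illustration under stated assumptions: the function name, the number of rounds, and the sample values are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_rounds=2000, seed=0):
    """Estimate how often a random relabeling of the pooled data yields a
    difference in means at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if simulated >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_rounds


# Made-up measurements for two groups.
control = [1, 2, 3, 4, 5, 6]
treated = [101, 102, 103, 104, 105, 106]
print(permutation_p_value(control, treated))
```

A small returned value suggests the observed difference is unlikely to be an artifact of chance, regardless of how the variables are distributed.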
Big Data, Machine Learning, and AI
Big data can mean a wide variety of things, depending on who you ask. For a data architect, for example, big data may be what is usually used in a certain kind of database, while for the business person, it is a valuable resource that can have a positive effect on the bottom line.
For the data scientist, big data is our prima materia, the stuff we need to work with through various methods to extract useful and actionable information from it.
Or as the Merriam-Webster dictionary defines it, “an accumulation of data that is too large and complex for processing by traditional database management tools.”
Whatever the case, most people are bound to agree that it’s a big deal since it promises to solve many business problems, not to mention even larger issues (e.g. climate change or the search for extra-terrestrial life).
It is not clear how big data came to get so much traction so quickly, but one thing is for certain: those who knew about it and knew how to harness its potential in terms of information could make changes wherever they were.
This may sound obvious, but remember that back in the early 2000s, it was only data architects, experienced software developers, and database administrators who were adept at the ins and outs of data.
So it was rare for an analyst to know about this new beast called “big data.” Whatever the case, these data analytics professionals who got a grip on big data first came to pinpoint its main characteristics, which distinguish it from other, more traditional kinds of data, namely, the so-called 4 V’s of big data:
Volume
Big data spans from a very large number of terabytes (TB) and beyond. In fact, a good rule of thumb about this characteristic of big data is that if there is so much data that it can’t be handled by a single computer, then it’s probably big data.
That’s why big data is usually stored in computer clusters and cloud systems, like Amazon’s S3 and Microsoft’s Azure, where the total amount of data you can store is virtually limitless (although there are limits on the size of individual uploads, as described in the corresponding web pages, e.g. the AWS page for Amazon Simple Storage Service (S3)). Naturally, even if the technology to store this data is available, having data at this volume makes analyzing it a challenging task.
Velocity
Big data also travels fast, which is why we often refer to the data we work with as data streams. Naturally, data moving at high bandwidths makes for a completely different set of challenges, which is one of the reasons why big data isn’t easy to work with (e.g. fast-changing data makes training of certain models unfeasible, while the data becomes stale quickly, making the constant retraining of the models necessary).
Although not all big data is this way, it is often the case that among the data streams that are available in an organization, there are a few that have this attribute.
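To give a feel for what working with fast-moving data involves, here is a minimal sketch of an estimate that is updated one value at a time and gradually forgets stale observations (an exponentially weighted running mean). The class name and the `alpha` parameter are assumptions made for this illustration, not something prescribed by any particular tool.

```python
class StreamingMean:
    """Exponentially weighted running mean, updated one observation at a time."""

    def __init__(self, alpha=0.1):
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in the interval (0, 1]")
        self.alpha = alpha  # higher alpha = stale values are forgotten faster
        self.value = None

    def update(self, x):
        if self.value is None:
            # The first observation initializes the estimate.
            self.value = float(x)
        else:
            # Move the estimate a fraction alpha of the way toward the new value.
            self.value += self.alpha * (x - self.value)
        return self.value


tracker = StreamingMean(alpha=0.5)
for reading in [10, 20, 20, 20]:
    print(tracker.update(reading))
```

Because each update touches only the current estimate, the stream never has to be stored or reprocessed in full.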
Variety
Big data is rarely uniform, as it tends to be an aggregate of various data streams that stem from completely different sources. Some of the data is dynamic (e.g. stock prices over time), while other data is fairly static (e.g. the area of a country).
Some of the data can come from a database, while the rest of it may be derived from the API of a social medium. Putting all that data together into a format that can be used in a data analytics model can be challenging.
Veracity
Big data is also plagued with the issue of veracity, meaning the reliability of a data stream. This is due to the inherent uncertainty in the measurements involved or the unreliability of the sources (e.g. when conducting a poll for a sensitive topic).
Whatever the case, more is not necessarily better, and since the world’s data tends to have its issues, handling more of it only increases the chances of it being of questionable veracity, resulting in unreliable or inaccurate predictive models.
Some people talk about additional characteristics (also starting with the letter V, such as variability), to show that big data is an even more unique kind of data.
Also, even though it is not considered to be a discernible characteristic of big data specifically, value is also important, just like in most other kinds of data.
However, the value in big data usually becomes apparent only after it is processed through data science. All of this is not set in stone since just like data science, big data is an evolving field.
One of the best ways to work with big data is through a set of advanced analytics methods commonly referred to as Machine Learning (ML). Machine learning is not derived from statistics.
In fact, many ML methods take a completely different approach from statistical methods, as they are more data-driven, while statistical methods are generally model-driven.
Machine learning methods also tend to be far more scalable, requiring fewer assumptions to be made about the data at hand. This is extremely important when dealing with messy data, the kind of data that is often the norm in data science problems.
Even though statistical methods could also work with many of these problems, the results they would yield may not be as crisp and reliable as necessary.
Machine learning is not entirely divorced from the field of statistics. Some ML methods are related to statistical ones or may use statistical methods on the back-end, as in the case of many regression algorithms, in order to build something with a mathematical foundation that is proven to work effectively.
Also, many data science practitioners use both machine learning and statistics and sometimes combine the results to attain an even better accuracy in their predictions.
Keep that in mind when tackling a challenging problem. You don’t necessarily have to choose one method or the other. You do need to know the difference between the two frameworks in order to decide how to use each one of them with discernment.
Machine learning is a vast field, and since it has gained popularity in the data analytics community, it has spawned a large variety of methods as well as heuristics. However, you don’t need to be an expert in the latest and greatest of ML in order to use this framework.
Knowing enough background information can help you develop the intuition required to make good choices about the ML methods to use for a given problem and about the best way to combine the results of some of these methods.
AI – The Scientific Field, Not the Sci-fi Movie!
Machine learning has gained even more popularity due to its long-standing relationship with Artificial Intelligence (AI), an independent field of science that has to do with developing algorithms that emulate sentient beings in their information processing and decision making.
A sub-field of computer science, AI is a discipline dedicated to making machines smart so they can be of greater use to us. This includes making them more adept at handling data and using it to make accurate predictions.
Even though a large part of AI research is focused on how to make robots interact with their environment in a sentient way (and without creating a worldwide coup in the process!), AI is also closely linked to data science.
In fact, most data scientists rely on it so much that they have a hard time distinguishing it from other frameworks used in data science. When it comes to tackling data analytics problems using AI, we usually make use of artificial neural networks (ANNs), particularly large ones.
Since the term “large-scale artificial neural networks” sounds neither appealing nor comprehensible, the term “deep learning” was created to describe exactly that.
There are several other AI methods that also apply to data science, but this is by far the most popular one; it’s versatile and can tackle a variety of data science problems that go beyond predictive analytics (which has been traditionally the key application of ANNs).
The most popular alternative AI techniques that apply to data science are the ones related to fuzzy logic, which has been popular over the years and has found a number of applications in all kinds of machines with limited computational power.
However, even though such methods have been applied to data science problems, they are limited in how they handle data and don’t scale as well as ANNs. That’s why these fuzzy logic techniques are rarely referred to as AI in a data science setting.
The key benefit of AI in data science is that it is more self-sufficient and relies more on the data than on the person conducting the analysis. The downside of AI is that it makes the whole process of data science superficial and mechanical, not allowing for in-depth analysis of the data.
Also, even though AI methods are very good at adapting to the data at hand, they require a very large amount of data, making them impractical in many cases.
The Need for Data Scientists and the Products/Services Provided
Despite the variety of tools and automated processes for processing data available to the world today, there is still a great need for data scientists. There are a number of products and services that we as data scientists offer, even if most of them fall under the umbrella of predictive analytics or data products.
Examples are dashboards relaying information about a KPI in real-time, recommendation systems providing useful suggestions for blogs/videos, and insights geared toward what the demand of product X is going to be or whether patient Y is infected with a disease or not.
Also, what we do involves much more than playing around with various models, as is often the case in many Kaggle competitions or textbook problems. So, let’s take a closer look at what a data scientist does when working with the given data.
What Does a Data Scientist Actually Do?
A data scientist applies the scientific method to the provided data, to come up with scientifically robust conclusions about it, and to engineer software that makes use of their findings, adding value for whoever is on the receiving end of this whole process, be it a client, a visitor to a website, or the management team.
There are three major activities within the data science process:
Data engineering
This involves a number of tasks closely associated with one another, aiming at getting the data ready for use in the stages that follow. It is not a simple process, and it is difficult to automate. That’s why around 80% of our time as data scientists is spent in the stage of data engineering.
Luckily, some data is easier to work with than other data, so it’s not always that challenging. Also, once you find a way to deploy your creativity in data engineering, it can be a rewarding experience.
Regardless, it is a necessary stage of data science, as it is responsible for cleaning up the data, formatting it, and picking the most information-rich parts of it to use later on.
Data modeling
This is probably the most interesting part of the data scientist’s work. It involves creating a model or some other system (depending on the application) that takes the data from the previous stage and does something useful with it.
This is usually a prediction of sorts, such as “based on the characteristics of data point X, variable Y is going to take the value of 5.2 for that point.”
The data modeling phase also involves validating the prediction, as well as repeating the process until a satisfactory model is created. It is then applied to data that hasn’t been used in the development of this model.
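The idea of validating a model on data that played no part in its development can be sketched as a simple holdout split. Everything in this example is an assumption made for illustration: the synthetic data, the 80/20 split, the straight-line model, and the helper names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, xs, ys):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: a known linear trend plus random noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

# Shuffle the row indices, then hold out the last 20% for validation.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
test_error = mean_absolute_error(
    model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]
)
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, holdout MAE={test_error:.2f}")
```

The error on the held-out rows is the number to trust, since the model never saw them while it was being fitted.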
Information distillation
This aspect of the data scientist’s work has to do with delivering the insights acquired from the previous stages, communicating them, usually through informative visuals, or in some cases, developing a data product (e.g. an API that takes values of variables related to a client and returns how likely this person is to be a scammer).
Whatever the case, the data scientist ties up any loose ends, writes the necessary reports, and gets ready for the next iteration of the process.
This could be with the same data, sometimes enriched with additional data streams. The next iteration may focus on a somewhat different problem or an improved version of the model.
Naturally, all of these aspects of the data scientist’s work are highly sophisticated in practice and are heavily dependent on the problem at hand. The general parts of the data science process, however, remain more or less the same and are useful guidelines to have in mind.
We’ll go into more detail about all this in the next blog, where we’ll examine the various steps of the data science pipeline and how they relate to each other.
What Does a Data Scientist Not Do?
Equally important to knowing what a data scientist does is knowing what a data scientist doesn’t do, since there is a great deal of misconception about the limits of what data science can offer to the world.
One of the most obvious but often neglected things that a data scientist cannot do is turn low-veracity data into anything useful, no matter how much of it you give him or how sophisticated a model he employs.
A data scientist’s work may appear as magic to someone who doesn’t understand how data science works. However, a data scientist is limited by the data as well as the computing resources available to him. So, even a skilled data scientist won’t be able to do much with poor quality data or an undersized computer cluster.
Also, a data scientist does not create professional software independently, even if he is able to create an interactive tool that encapsulates the information he has created out of the data.
If you expect him to create the next killer app, you may be disappointed. This is why a data scientist usually works closely with software engineers who can build an app that looks good and works well, while also making use of his models on the back-end.
In fact, a data scientist tends to have effective collaborations with software developers since they have a common frame of reference (computer programming).
Moreover, a data scientist does not always create his own tools. He may be able to tweak existing data analytics systems and get the most out of them, but if you expect him to create the next state-of-the-art system, you are in for a big disappointment.
However, if he is on a team of data scientists who work well together, he may be able to contribute to such a product substantially. After all, most inventions in data science in today’s world tend to be the result of cumulative efforts and take place in research centers.
The Ever-growing Need for Data Science Professionals
If so many of us are willing to undergo the time-consuming process of pushing the data science craft to its limits, this is because there is a need for data science and the professionals that make it practical. If today’s problems could be solved by business intelligence people or statisticians, they would have been.
After all, these kinds of professionals are much more affordable to hire, and it’s easier to train an information worker in these disciplines.
However, if you want to gain something truly valuable from data that is too elusive to be tackled by conventional data analytics methods, you need to hire a data scientist, preferably someone with the right kind of mindset, one that includes not just technical aptitude but also creativity, the ability to communicate effectively, and other soft skills not so common among technical professionals.
The need for data science professionals is also due to the fact that most of the data today is highly unstructured and in many cases messy, making it inappropriate for conventional data analytics approaches.
Also, the sheer volume of such data being generated has created the need for more powerful, scalable predictive analytics. As data science is the best if not the only way to go when it comes to this kind of data analysis, data scientists are an even more valuable resource.
Start-ups tend to be appealing to individuals with some entrepreneurial vocation. However, many of them require a large capital at the beginning, which is hard to find, even if you are business savvy.
Nevertheless, data science start-ups don’t cost that much to build, as they rely mainly on a good idea and an intelligent implementation of that idea.
As for resources, with tech giants like Amazon, Microsoft, and IBM offering their cloud infrastructure equipped with a variety of analytics software at affordable prices, it’s feasible to make things happen in this field.
Naturally, such companies are bound to focus on spending a large part of their funding on product development, where data scientists play an integral part.
Finally, learning data science has never been as easy a task as it is today. With many blogs written on it (e.g. the ones from Technics Publications), many quality videos, and Massive Open Online Courses (MOOCs) (such as those on edX and Coursera), it is merely a matter of investing time and effort.
As for the software required, most of it is open-source, so there is no real obstacle to learning data science today.
This whole phenomenon is not random, however, since the increase in available data, the fact that it is messy, and the value this data holds for an organization, indicate that data science can actually be something worthwhile, both for the individual and for the whole.
This results in a growing demand for data scientists, which in turn motivates many people to dedicate a lot of time to making this possible. So, take advantage of this privilege, and make the most out of it to jump-start your data science career.
Big data has four key characteristics:
Volume – it is in very large quantities, unable to be processed by a single computer
Velocity – it is often being generated and transmitted at high speeds
Variety – it is very diverse and composed of a number of different data streams
Veracity – it is not always high quality, making it sometimes unreliable
Machine learning and AI are two distinct yet important technologies. Machine learning has to do with alternative approaches to data analysis, usually data-driven and employing heuristics and other methods. AI involves various algorithms that enable computers and machines in general to process information to make decisions in a sentient manner.
The three main stages of the data science process are:
1. Data engineering – preparing the data so it can be used in the stages that follow
2. Data modeling – creating and testing a model that does something useful with the data
3. Information distillation – delivering insights from the model, creating visuals, and in some cases, deploying a data product
A data scientist is not a magician, so if the data at hand is of low quality or if there is not enough computing power, it would be next to impossible to produce anything practically useful out of this, no matter how much data is available.
The Data Science Pipeline
Contrary to what many people think, the whole process of turning data into insights and data products is not at all straightforward. In fact, it’s more of an iterative process, with impromptu loops and unexpected situations causing delays and reevaluations of your assumptions.
That’s why we often talk about the data science pipeline, a complex process composed of a number of interdependent steps, each bringing us closer to the end result, be it a set of insights to hand off to our manager or client, or a data product for our end-user.
This whole process is organized in three general stages: data engineering, data modeling, and information distillation (this last one is a term coined by me). Each of these stages includes a number of steps, as you can see in this diagram:
Note that the approach to the data science pipeline described here is just one possible way of viewing it. There are other representations, all of which are equally valid.
Also, as the field matures, it may change to adapt to the requirements of the data scientist’s role. Keep an open mind when it comes to the data science pipeline because it is not set in stone.
Now let’s look at each one of these steps in more detail.
Data engineering involves getting your data ready for data analytics work. However, this is not an easy task because data comes in many varieties and degrees of data quality and documentation. In fact, this is the most time-consuming stage of the process, and it’s not uncommon for a data scientist to spend 70-80% of their time in this stage.
The main challenge is that most of the data streams involved in a data science project are unstructured or semi-structured data. However, most data analytics models work with structured data (aka datasets), so the raw data streams are practically useless for them.
Yet, even if they are structured enough to work, most likely they won’t produce sufficiently good results because they are unrefined. The process of refining the data so that they can be of use for modeling is also part of data engineering. In general, this stage involves the following steps:
Data preparation
Data exploration
Data representation
Some people consider data acquisition as part of the process, though nowadays it’s so straightforward and automated that it’s not worth describing in detail here. Most of the data acquired for a data science project comes from databases (through some query language, like SQL) or from an API.
One thing to note is that even though the aforementioned steps are executed in that order, it is often the case that we need to go back to a previous step and redo it.
This iteration in the data science process is quite common in most projects and often involves back-steps from other stages of the pipeline. Let’s look at each one of these steps in more detail.
Data preparation involves cleaning the data and putting it in a data file that can be used for some preliminary analysis.
The objective of this step is to get the data ready for exploration, by removing or smoothing out any outliers it may have, normalizing it if necessary, and putting it in a data structure that lends itself to some descriptive analytics.
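The outlier smoothing and normalization mentioned above could look something like the following sketch. The interquartile-range rule with a factor of 1.5 and min-max scaling are just one common pair of choices, assumed here for illustration; the helper names and sample values are hypothetical.

```python
from statistics import quantiles


def clip_outliers(values, k=1.5):
    """Pull values lying more than k * IQR outside the quartiles back to the fence."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up measurements with one obvious outlier (250.0).
raw = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 250.0]
prepared = min_max_normalize(clip_outliers(raw))
print(prepared)
```

Clipping before normalizing matters here: without it, the single extreme value would squash every other point close to zero.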
One of the most common such data structures is the data frame, which is the equivalent of a database table.
Data frames have been around for a while and are widely used in data science due to their intuitiveness. They are a very popular data structure in R, Python, Julia, and other programming platforms (even Spark has its variant of data frames). In general, data frames allow for:
Easy reference to variables by their name
Easy handling of missing values
Variety in the data types stored in them (you can have integers, strings, floats, and other data types in the same data frame, something impossible in the matrix and array data structures)
So, loading your data into a data frame is usually a good first step. Of course, depending on the complexity of the problem, you may need to use several data frames and combine them afterward.
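To make the three properties listed above concrete, here is a plain-Python stand-in for a data frame, written as a dictionary of named columns. This is only a sketch for illustration; in practice you would use a dedicated data frame library, and the column names, values, and helper functions below are all hypothetical.

```python
# Each column is referenced by name and may hold its own data type.
# None marks a missing value.
table = {
    "customer_id": [1, 2, 3, 4],
    "country": ["GR", "US", None, "DE"],
    "order_total": [25.5, None, 13.0, 40.0],
}


def column_mean(table, name):
    """Mean of a numeric column, skipping missing values."""
    present = [v for v in table[name] if v is not None]
    return sum(present) / len(present)


def drop_incomplete_rows(table):
    """Keep only the rows in which every column has a value."""
    names = list(table)
    rows = zip(*(table[name] for name in names))
    kept = [row for row in rows if None not in row]
    return {name: [row[i] for row in kept] for i, name in enumerate(names)}


print(column_mean(table, "order_total"))
print(drop_incomplete_rows(table))
```

A real data frame offers the same conveniences with far better performance and many more operations built in.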
What’s important at the data preparation stage is to get the data all in one place and put it into a form you can play with to see what kind of signals are there, if any at all. Whatever the case, you can’t be sure about the value of the data without going through the data exploration stage that follows.
Data exploration is the most interesting part of the pipeline, as it entails playing around with the data without any concrete expectations in order to understand it and find out how to best work with it.
This is done primarily by creating a variety of plots. Data exploration is a serious endeavor too, as it involves a lot of quantitative analysis using statistics, particularly descriptive statistics.
Just like in every viable creative endeavor, data exploration involves a combination of intuition and logic; whatever ideas you get by looking at the various plots you create must be analyzed and tested.
Several statistical tests come in handy for this, yet data exploration may also involve heuristics designed specifically for this purpose.
If you have heard from some boot-camp instructor that stats is the cornerstone of data exploration, you may want to rethink it. Data science is much more than statistical analysis, even if it involves using statistics to some extent.
At the end of this stage, you should be able to have a sense of what signals the data contains, how strong these signals are, what features are best suited as predictors (if you are dealing with a predictive analytics problem), and what other features you may want to construct using the original feature set.
If you have done all this, going to the next step of the data engineering stage will come naturally and without hesitation.
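One simple way to gauge how strong a signal each candidate predictor carries is to rank the features by the absolute value of their correlation with the target. The sketch below assumes a linear relationship is a reasonable first check; the feature names and values are invented for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical customer data: two candidate predictors and a target.
features = {
    "age": [23, 35, 41, 52, 60, 29],
    "visits_per_month": [2, 8, 5, 12, 3, 9],
}
monthly_spend = [20, 70, 50, 105, 35, 75]

# Strongest linear signal first.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], monthly_spend)),
    reverse=True,
)
for name in ranking:
    print(name, round(pearson(features[name], monthly_spend), 3))
```

Correlation only captures linear relationships, so a low score here is a prompt for a plot, not proof that a feature is useless.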
Data representation is about getting the data in the most appropriate data structures (particularly data types) and optimizing the resources used for storing and processing it in the steps that follow.
This matters because, even though the data types used to store the variables in the first step of the process may make sense, they may not be the best ones out there for this data.
The understanding you have gained from the data exploration step should help you decide on whether the structures should change. Also, you may need to create a few additional features based on the original ones. The data types of these new features also need to be sorted out at this point.
By the term features, we mean data that is in a form that can be used in a model. Features are not the same as variables. In data science, we distinguish between the two based on the processing that has been done on them. Also, a variable may be in the dataset but not be usable in the model as-is.
The transformation from variable to feature can be straightforward or not, depending on the data. If the data is messy, you may need to work on it before turning the variable into a feature.
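As an example of a messy variable that needs work before it becomes a feature, consider income values stored as inconsistently formatted text. The sketch below parses them into numbers and fills the gaps with the median; the cleaning rules, the median imputation, and the sample values are assumptions chosen for illustration.

```python
from statistics import median


def to_number(raw):
    """Parse a messy numeric string; return None if it cannot be parsed."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def variable_to_feature(raw_values):
    """Turn a raw text variable into a numeric feature with no missing values."""
    parsed = [to_number(v) for v in raw_values]
    fill = median(v for v in parsed if v is not None)
    return [fill if v is None else v for v in parsed]


# Hypothetical raw variable, exactly as it might arrive from a form.
raw_income = ["$52,000", " 48000", "n/a", "61,500 "]
income_feature = variable_to_feature(raw_income)
print(income_feature)
```

The raw variable was already in the dataset, but only the cleaned numeric version is something a model can use.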
Whatever the case, after the data representation step, you’ll have a set of features at your disposal and have some idea of what each one of them is worth in terms of information content.
Some data science practitioners don’t give this step enough attention, because it is often perceived as part of the data preparation or the data exploration phase.
However, if you talk to any computer scientist out there, they will tell you that it is very important to choose the right data type for your variables since it can make the difference between having a dataset that’s scalable and one that is not.
Regardless of the computing power you have access to, you always want to go for a data structure that is more economical in terms of resources, especially if you are planning to populate it with more and more data in the future.
This is because such a data structure is more future-proof, and it can also save you a lot of money on the cloud resources you use.
Statisticians may not have to worry about this matter because they usually deal with small or medium data, but in data science, scalability to the big data domain is something that must be kept in mind, even if the data we have at the moment is manageable. Proper data representation can ensure that.
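The effect of choosing an economical data type can be seen with Python's built-in `array` module, which stores numbers in a fixed-width format. In this sketch, the same made-up column of ages is stored once as 8-bit and once as 64-bit integers; the variable names are hypothetical.

```python
from array import array

# A made-up column of 80,000 ages, all between 18 and 97.
ages = list(range(18, 98)) * 1000

ages_compact = array("B", ages)  # unsigned 8-bit integers (0 to 255)
ages_wide = array("q", ages)     # signed 64-bit integers

compact_bytes = ages_compact.itemsize * len(ages_compact)
wide_bytes = ages_wide.itemsize * len(ages_wide)

print(f"8-bit storage:  {compact_bytes:,} bytes")
print(f"64-bit storage: {wide_bytes:,} bytes")
```

Both arrays hold identical values, yet the narrower type needs one eighth of the memory, a difference that compounds quickly as the dataset grows.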
The data modeling stage is by far the most essential of all three stages of the data science pipeline. This is where the data you have meticulously prepared in the previous stages is turned into something more useful, namely a prediction or a valuable insight.
Contrary to what many teach, data modeling is more than just taking functions from a specialized package and feeding them data. It involves much more, probably enough to cover a whole semester’s worth of classes.
Everyone can import a package and use it, given they are patient enough to read the corresponding documentation. If you want to do data modeling properly, however, you need to go beyond that.
Namely, you need to experiment with a number of models (the more diverse, the better), manage a robust sampling process, and then evaluate each one of these experiments with a few performance metrics.
Afterward, you may want to combine some of these models and do another set of experiments. This will not only enable your aggregate model to have a better performance but also help you delve deeper into the nature of the problem and figure out potential subtleties that you can use to make it better.
Before you even get started with the models, evaluate the features themselves and perhaps do some preliminary analysis on them, followed by some preprocessing.
This may involve the generation of meta-features, a process that is common in complex datasets. The two main steps in data modeling, building and evaluating models and working on the features, are referred to as data learning and data discovery respectively, and they are an essential part of insight generation.
Data discovery is an interesting part of data modeling, as it involves finding patterns and potential insights in the data and building the scaffolds of your model. It is similar to data exploration, but here the focus is on features and how they can be used to build a robust model. Apart from being more targeted, it also entails different techniques.
For example, in this step, you may be looking at how certain features correlate to each other, how they would collaborate as a set for predicting a particular target variable, how the graph representation of their information would look, and what insights it can yield.
Forming hypotheses and testing them is something that also plays an important role in this part of the pipeline (it is encountered in the data exploration step too, to some extent). This can only help you identify the signals in the datasets and the features that are of greater value for your model.
You also need to get rid of redundant features and perhaps blend the essential features into meta-features (aka synthetic features) for an even better encapsulation of the signals you plan to use.
So, in this step, you really go deep into the dataset and mine the insights that are low-hanging fruit, before employing more robust methods in the next step.
Data learning is about creating a robust model based on the discoveries made in the previous step, as well as testing the model in a reliable manner. If this sounds like a lot of fun, it’s because it is!
In fact, most people who take an interest in data science get involved in it because of this step, which is heavily promoted by places like Kaggle, a site hosting various data analytics competitions.
Whatever the case, it is definitely the core step of the whole pipeline and deserves a lot of attention from various angles.
The models built in this step are in most cases mappings between the features (inputs) and the target variable (output). This takes the form of a predictive model, or in some cases, some sort of organizing structure (when the target variable is absent).
For each general category of models, there are different evaluation criteria that measure the performance of each model. This has to do with how accurate the mapping is and how much time the whole process takes.
Note that the models are usually trained on some part of the dataset and tested on another. This way you can ensure their robustness and general usability (aka generalization).
This detail is important since if it is not taken into account, you risk having models that may seem very accurate but are useless in practice (because they don’t generalize).
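The following sketch illustrates why testing on held-out data matters. Everything in it is a toy assumption: the data is synthetic, the "memorizer" is a deliberately bad model that just stores its training set, and the threshold rule stands in for a model that has captured the underlying pattern.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Synthetic data: the label is 1 when x > 50, with about 10% of labels flipped.
rng = random.Random(1)
data = []
for _ in range(400):
    x = rng.uniform(0, 100)
    y = int(x > 50)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, test = train_test_split(data)

# A "model" that memorizes the training set and guesses 0 for anything else.
memory = dict(train)

def memorizer(x):
    return memory.get(x, 0)

# A simple rule that captures the underlying pattern instead.
def threshold_rule(x):
    return int(x > 50)

for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer looks perfect on the data it was trained on, and only the test split reveals that it has learned nothing that carries over to new data points.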
While a big part of this comes from experience, it is equally important (if not more important) to have a solid understanding of how models work and how they are applied, as well as the characteristics of a good model.
The best part is that the whole process of acquiring this expertise is fairly quick (you can master it within a year) and very enjoyable, as it involves trying out various options, comparing them, and selecting which one best suits the problem at hand.
This part of the data science pipeline, information distillation, is about summarizing everything you have done in the previous stages and making it available to your manager or client. This stage is important because of its visibility.
Since it’s at the end of the pipeline and very close to the deadline of the project, it might be tempting to rush through.
Yet, it’s imperative to resist this urge and spend the proper time and energy in this stage because it is tied to what the project stakeholders see or experience.
Besides, without some solid work in this phase, all the work you’ve done in the previous parts of the pipeline may not get the visibility it deserves.
Begin by planning for distillation early in the project. For example, keeping a good documentation notebook while you go through the various steps of the pipeline is bound to be useful for this stage, since it’s doubtful you will remember all the details of the work you have done in those steps.
Also, this kind of documentation will save you time when you prepare your presentations and product documents. In general, information distillation is comprised of two steps: data product creation (whenever there is a data product involved), and insight, deliverance, and visualization.
Data Product Creation
The creation of a data product is often a sophisticated task and isn’t mentioned much in other data science pipeline references. This is because in many cases, it has more to do with software engineering or website development. Still, it is an important aspect of data science, as it’s the main access point for most people to the data scientists’ work.
Data product creation involves an interface (usually accessible through a web browser, as in the case of an API), and an already trained model on the back-end. The user inputs some data through that interface and then waits for a second or so.
During that time, the system translates this input into the appropriate features, feeds them to the model, obtains the result of the model, and then outputs it in a form that is easy to understand. During this whole process, the user is completely insulated from all the processes that yield this result.
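The request flow just described can be sketched in a few lines. The input fields (`age`, `is_member`), the weights, and the function names below are all hypothetical placeholders; a real data product would load an actual trained model and sit behind a web interface.

```python
def to_features(raw):
    """Translate raw user input into the numeric features the model expects."""
    return [float(raw["age"]), 1.0 if raw["is_member"] == "yes" else 0.0]

def trained_model(features):
    """Stand-in for an already trained model (hypothetical fixed weights)."""
    weights, bias = [0.02, 0.5], -0.6
    return bias + sum(w * f for w, f in zip(weights, features))

def handle_request(raw):
    """What the back-end does while the user waits."""
    score = trained_model(to_features(raw))
    label = "likely to buy" if score > 0 else "unlikely to buy"
    # Only the easy-to-read result is returned; the features and the
    # raw score stay hidden from the user.
    return {"prediction": label}

print(handle_request({"age": "40", "is_member": "yes"}))
print(handle_request({"age": "20", "is_member": "no"}))
```

Processing many data points in bulk would amount to calling `handle_request` over a list of inputs instead of one at a time.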
In some cases, the user can have access to multiple results by paying a license fee to the company that owns the data product. This way, the user can obtain the results of many data points in bulk instead of having to input them one by one.
The creation of data products can be time-consuming and entail some running costs (the cloud servers they live on are not free!).
Also, they usually involve the collaborative effort of both data scientists and software engineers; not many projects have the creation of such products as part of their pipeline. However, whenever they have them, this step is usually in the first part of the information distillation stage.
Insight, Deliverance, and Visualization
The deliverance of insights and/or data products and the visualization of the outputs of your analysis are key elements of the data science pipeline and oftentimes the only output of the whole process that is viewed outside the data science team.
Still, many data science practitioners don’t do this step justice because it is very easy to get absorbed in other, more intriguing parts of the pipeline. The fact that it’s the last step of the pipeline doesn’t help either.
This part of the information distillation stage generally entails three things:
1. Summary of all the main findings of your analysis into something that is actionable or at least interesting and insightful (hence the first term of this step’s name)
2. Deliverance of your work, be it a model or a data product, to the stakeholders of the project
3. Visuals that demonstrate your findings and the performance of your model(s)
All of these are important, though what is most important depends on the project and the organization you work for. Regardless, it is best to know that beforehand so you put enough emphasis on that aspect of this phase, ensuring that everyone is as happy about the production cycle’s completion as you are.
This step of the data science pipeline is essential even if the rest of the steps can’t be completed for one reason or another. Half-baked results are better than no result at all in this case.
I don’t recommend you leave anything unfinished, but if the time constraints or the data don’t allow for what you had originally envisioned, it is best to make an effort to present what you’ve found so that everyone is on the same page with you.
If the data is of low veracity, for example, and that has jeopardized your work, your manager and colleagues need to know. It’s not your fault if the data you were given didn’t have any strong signals in it.
Putting It All Together
The data science pipeline is a complex beast. Nevertheless, with the right mindset, it is highly useful, providing structure to insight generation and the data product development process.
Just like in conventional science, data science’s processes are not straightforward, as every analysis is prone to unforeseen (and many times unforeseeable) situations.
As such, being flexible is of paramount importance. That’s why we often need to go back to a step we’ve already spent time on, viewing the problem from different angles until we come up with a model that makes more sense.
If the process seems cyclic in the diagram at the beginning of the blog, that is because typically it is cyclic in practice. It’s not uncommon to have several iterations of the pipeline, especially if the final stage is successful and the stakeholders are satisfied with your outputs.
Every iteration is bound to be different. Perhaps you gain access to new data streams or more data in the existing ones, or maybe you are asked to create a more elaborate model. Whatever the case, iterating over the data science process is far from boring, especially if you treat each iteration as a new beginning!
The data science pipeline is comprised of three distinct stages:
Data engineering involves refining the data so that it can be easily used in further analysis. It is comprised of three main steps:
1. Data preparation: Cleaning the data, normalizing it, and putting it in a form that can be useful for data analysis work
2. Data exploration: Playing around with the data to find potential signals and patterns that you can use in your models
3. Data representation: Putting the data in the appropriate data structures, saving resources, and optimizing the efficiency of the models that ensue
Data modeling involves creating a series of models that map the data into something of interest (usually a variable you try to predict), as well as evaluating these models. It is comprised of two main steps:
1. Data discovery: Finding useful patterns in the data that can be leveraged in the models as well as optimizing the feature set so that the information in it is expressed more succinctly
2. Data learning: Developing (training) a series of models, evaluating (testing) them, and selecting those that are better suited to the problem at hand
Information distillation involves summarizing the findings of your analyses and possibly creating a product that makes use of your models. It is comprised of two main steps:
1. Data product creation: Developing an application that uses your model(s) in the back-end
2. Insight, deliverance, and visualization: Summarizing your findings into actionable insights, delivering them, and creating information-rich visuals
Overall, the pipeline is a highly non-linear process. There are many back-and-forths throughout a data science project, which is why, as a data scientist, it’s best to be flexible when applying this formula.
Data Science Methodologies
As mentioned in the previous two blogs, data science is diverse in its applications, which is why the pipeline I described is bound to require some adaptation to the problem at hand.
This is because data science lends itself to a variety of different situations. Plus the data itself is quite diverse too, making the potential applications different from one another.
So, using data science, we can engage in a variety of methodologies, such as predictive analytics, recommender systems, automated data exploration (e.g. data mining), graph analytics, and natural language processing.
Predictive analytics is an umbrella of methodologies, all aiming at predicting the value of a certain variable. Predictive analytics methods are the most widely used in data science and also the most researched methods in the field.
Their objectives tend to be fairly simple to express mathematically, but achieving a good performance in them is not as straightforward as you may think.
The reason is that in order for a predictive model to work well, it needs to be able to generalize from the data it is trained on, so that it can grasp the underlying meta-pattern in the data and use that to predict certain things about data it has never encountered before.
Therefore, memorizing the patterns in the training data is not only inadequate as an approach but a terrible idea overall, as this approach is bound to generalize poorly. This condition is usually referred to as over-fitting, and it’s a major concern in many data science systems.
Predictive analytics covers a variety of different methodologies which can be grouped into five main categories: classification, regression, time-series analysis, anomaly detection, and text prediction. Let’s now look at each one of these predictive analytics methodologies in more detail.
Whenever we are dealing with a discrete target variable, we have a classification problem. In many cases, the target variable is a binary one (e.g. someone sharing a post in social media or not sharing it), but it can have several different values in the general case (e.g. different types of network attacks).
These values are usually referred to as classes, and they can be either numeric or text-based. In some cases, the class variable is transformed into a series of binary variables, one for each class.
Usually, we are more interested in predicting one particular value accurately (e.g. fraudulent transactions), rather than all the different values.
In the fraudulent transactions case, for example, the other values would be various types of normal transactions, which are expected and common. As such, if we miss a few, it’s not a big deal, but missing a few fraudulent transactions may have severe consequences.
The performance of a classification system, also known as a classifier, is often measured using a specialized metric called F1, which combines precision and recall for the class of interest. If we care about all the different classes equally, then we use a more conventional metric like accuracy rate to evaluate our classifier.
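To see why accuracy alone can be misleading when one class matters more, consider the sketch below. The labels are invented (95 normal transactions and 5 fraudulent ones), and F1 is computed in its usual form as the harmonic mean of precision and recall for the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class we care about."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented labels: 95 normal transactions (0) followed by 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# Classifier A never flags anything.
always_normal = [0] * 100
# Classifier B raises 4 false alarms but catches 4 of the 5 frauds.
some_alarms = [1] * 4 + [0] * 91 + [1] * 4 + [0] * 1

for name, y_pred in [("A", always_normal), ("B", some_alarms)]:
    _, _, f1 = precision_recall_f1(y_true, y_pred)
    print(name, "accuracy:", accuracy(y_true, y_pred), "F1:", round(f1, 3))
```

Both classifiers reach the same accuracy, yet only the second one is of any use for catching fraud, and F1 is the metric that exposes the difference.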
There are three different categories of classifiers: inductive/deductive, transductive, and heuristics-based. Inductive/deductive classifiers deal with the creation and application of rules (e.g. decision trees).
These excel in cases where the features are limited in number and relatively independent, but they fail to handle highly non-linear problems. Transductive classifiers are based on the distances of the unknown data points to the known ones (e.g. K Nearest Neighbor).
They are usually fast and good at non-linear problems, but they don’t scale very well. Heuristics-based classifiers, which are the most popular ones today, make use of various heuristics for creating meta-features which are then used for the classification through some clever aggregation process (e.g. Artificial Neural Networks, or ANNs for short).
Although these classifiers tend to perform exceptionally well and are quite scalable, they usually require a lot of data, and their interpretability is questionable.
Regression deals with problems where the target variable is a continuous one. A common example of this is the number of shares in a post on a social medium. Regression is often linked to classification, since defining a threshold in the target variable of the regression problem can turn it into a binary classification one.
Regression hasn’t been researched as much as classification, and many of its systems are usually variations of classification ones (e.g. regression trees, which are decision trees that just use a continuous target variable).
The key point of a regression system is that whatever you do, there are always going to be cases where the model is off in relation to the true value of the target variable.
The reason is that the model needs to be as simple as possible in order to be robust, and the evaluation functions used tend to focus on the average error (usually the mean squared error).
Regression systems are very useful, yet they are generally a bit left behind compared to other predictive analytics systems because the methods they use for assessing regression features are arcane and sometimes inaccurate (e.g. Pearson Correlation).
Contrary to what many people still think, the relationships among variables in the real world are highly non-linear, so treating them as linear is oversimplifying and distorting them.
Performing crude transformations on the variables so that these relationships become more or less linear is also a suboptimal approach to the problem. Nevertheless, regression systems are great at doing feature selection on the fly, making the whole process of data modeling faster and easier.
Whenever we have a target variable that changes over time and we need to predict its value in the near future, this involves a time-series analysis (e.g. predicting the value of a stock in the next few days).
Naturally, the values we use as features for a problem like that are the previous instances of the target variable, along with other temporal variables. How far back we need to go, however, greatly depends on the problem.
Also, the nature of these features, as well as their contributions to the model, are things that need to be determined.
Much like in the case of regression, time-series analysis involves minimizing the error of the target variable, as it also tends to be continuous. The problem in this situation is that a number of predicted data points will have to be used for predicting further into the future, so a small error in the predictions will likely accumulate.
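The accumulation of error can be shown with a deliberately simple setup. Assume, purely for illustration, a series that grows by 5% per step and a model that has estimated the growth at 6%. Forecasting several steps ahead means feeding each prediction back in as the next input.

```python
def recursive_forecast(last_value, coefficient, steps):
    """Forecast several steps ahead by feeding each prediction back in."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = coefficient * value
        forecasts.append(value)
    return forecasts

# Hypothetical "true" process versus a model whose coefficient is slightly off.
true_path = recursive_forecast(100.0, 1.05, 10)
predicted = recursive_forecast(100.0, 1.06, 10)

errors = [abs(p - t) for p, t in zip(predicted, true_path)]
for step, error in enumerate(errors, start=1):
    print(f"step {step}: absolute error {error:.2f}")
```

The one-step error is small, but because every later forecast is built on earlier forecasts, the gap widens with each step into the future.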
This is why problems in this category are more prone to the so-called butterfly effect (see glossary for definition), and why accurate measurements and predictions are essential for more robust performance in this kind of system.
The anomaly detection methodology, which is also known as novelty detection, is a very powerful tool for tackling a certain kind of problem that is very hard to solve otherwise:
identifying a certain kind of peculiar case (e.g. fraudulent transactions) among a large number of ordinary ones (in this case, normal transactions).
These anomalous cases usually correspond to a problematic situation in a system that, if left unattended, would create a lot of issues for the system and its end-users.
For example, if the data at hand refers to a computer network, an anomalous case could be a hack, blockage, or some system error. In the best-case scenario, these anomalies jeopardize the user experience; in worse cases, they may even bring about security issues for the users’ computers.
Although anomaly detection is in a way a kind of classification, the way it is carried out is not through conventional classification methods. The reason is that conventional classification requires sufficient examples from each of the classes that the predictive analytics system needs to predict.
Since this is not possible in some cases due to the anomalies being by definition very rare, classifiers are bound to not learn that particular class properly, making predictions of it inaccurate.
A special kind of anomaly detection is outlier prediction for a single variable. This scenario is fairly basic, since the outliers can usually be pinpointed very accurately, often without any calculations at all, yet it is not a trivial problem either.
As the dimensionality increases, it becomes more and more challenging, since the statistical methods which have been used traditionally in figuring out extreme cases fail to predict anomalies accurately. That’s why most modern anomaly detection techniques rely primarily on machine learning, or non-parametric data analytics, such as kernel methods.
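For the single-variable case mentioned above, one traditional statistical approach is the interquartile range (IQR) rule. The text does not name a specific method, so treat this as one possible choice; the data and the multiplier of 1.5 are conventional but illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical response times in milliseconds, with one obvious anomaly.
response_times = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(response_times))
```

A rule like this works on one variable at a time, which is exactly why it stops being sufficient once anomalies only show up in combinations of several variables.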
Anomaly detection is important in various domains, such as cybersecurity. Needless to say, even if the majority of cases are benign, the few malicious ones can be catastrophic, making their prediction a highly valuable feat that would be practically impossible without anomaly detection.
This predictive analytics methodology, text prediction, has to do with predicting the word or phrase that comes next in a textual data stream (e.g. when typing a query in the search bar of a search engine).
Although it may not seem like a difficult problem, it is. It requires a lot of data in order to build a robust text predictor.
For the system to work properly and in real-time, as is usually the case, it has to calculate loads of conditional probabilities on the fly.
This is so it can pick the most probable word among the thousands of possible words in a given language, to follow the stream of words it has at its disposal. It is a bit like time-series analysis, though in this case, the target variable is discrete and of a different type (i.e. not numeric).
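A bigram model is about the simplest way to turn "conditional probabilities of the next word" into code. The tiny corpus is made up, and a real text predictor would use far more data and longer contexts, but the mechanics are the same: count which words follow which, then pick the most probable follower.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most probable next word and its conditional probability."""
    followers = counts.get(word.lower())
    if not followers:
        return None, 0.0
    best, freq = followers.most_common(1)[0]
    return best, freq / sum(followers.values())

corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigrams(corpus)
print(predict_next(model, "the"))
```

Here the estimate of P(next word | "the") comes from four observations, two of which are "cat", which shows how quickly the amount of data becomes the limiting factor for a full vocabulary.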
Text prediction has many applications in the mobile world, especially related to smartphones, as it lends itself to facilitating text input. Also, a version of it is used in text editors like Notepad++ for inputting long variable names, for example.
In this kind of application, its scope is far more limited, so it’s a fairly easy scenario, but when dealing with the whole of a language’s vocabulary, a more robust predictive system is required.
Recommender systems, also known as recommendation systems, are probably one of the most widely known applications of data science, as they are quite ubiquitous. From online stores to the Internet Movie Database (IMDb), various sites make use of a recommender system.
Basically, a recommender system takes a series of entities (e.g. consumer products), which are often referred to as items, and which are associated with another list of entities (e.g. customers), often referred to as users.
It then tries to figure out what an existing member of the latter list will be interested in from all the members of the former list. This may seem straightforward, but considering the vast number of combinations and the fact that the overlap between two users is bound to be small, this makes for a challenging problem.
The matrix that encapsulates all the information on the relationship between the items and the users is usually referred to as the utility matrix, and it is key in all recommender systems.
However, as these relationships tend to be very specialized, this matrix is usually very sparse and very large. This is why it is usually represented as a sparse matrix type when doing data analytics work on it.
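One lightweight way to hold a sparse utility matrix is a dictionary of dictionaries that stores only the ratings that exist. The users, items, and ratings below are invented, and dedicated sparse-matrix libraries would be the usual choice at scale; this sketch only shows the idea.

```python
# Invented (user, item, rating) triples.
ratings = [
    ("ann", "book", 5),
    ("ann", "lamp", 3),
    ("bob", "book", 4),
    ("cy", "desk", 2),
]

# Sparse representation: only the known cells are stored.
utility = {}
for user, item, rating in ratings:
    utility.setdefault(user, {})[item] = rating

def get_rating(user, item):
    """Return the stored rating, or None for a blank cell."""
    return utility.get(user, {}).get(item)

users = sorted(utility)
items = sorted({item for _, item, _ in ratings})
dense_cells = len(users) * len(items)
stored_cells = sum(len(row) for row in utility.values())

print("dense cells:", dense_cells, "stored cells:", stored_cells)
print("ann/book:", get_rating("ann", "book"), "bob/desk:", get_rating("bob", "desk"))
```

Even in this toy case fewer than half of the cells are filled, and in a real store with many thousands of users and items the filled fraction is typically far smaller.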
In order to make use of the data in the utility matrix, there have been several methods developed that facilitate this recommendation process to make it fairly easy and quite scalable.
These methods aim to fill in the blanks in the utility matrix and give us hints about potential relationships between users and items not yet manifested in real life.
Let’s take a look at the most important categories, namely collaborative filtering and content-based systems, both of which play an important role in recommender systems.
If we are dealing with a case where various users rate a bunch of items in an e-commerce store, the collaborative filtering method deals with the user ratings, while the content-based approach focuses on using features of items to calculate their similarity.
On top of all that, there is a strictly analytical approach called Non-negative Matrix Factorization, which is also used in practice. Let’s look at each one of these methods in more detail.
Content-based systems take a more traditional approach to recommendations. They look at how we can create features based on the items and use them as a proxy for finding similar items to recommend. As a first step, this method attempts to formulate a profile for each item.
You can think of this as an abstraction of the item that is common among other items which share certain characteristics. In the case of movies, for example, these can be the genre, the director, the production company, the actors, and even the time it was first released.
Once you have decided on which features to use for the item profile, you can refactor all the available data so that it takes the form of a feature set.
These features can be binary ones, denoting the presence or absence of something. This is common in some applications, such as cases where the items are documents or web pages.
The features are then used for assessing similarities using metrics such as the inverted Jaccard distance and cosine distance. The specific metric you use greatly depends on the application at hand and the dimensionality of the feature set.
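Here is a minimal sketch of the two metrics on item profiles. It assumes that "inverted Jaccard distance" means Jaccard similarity (one minus the Jaccard distance) and it computes cosine similarity, from which the cosine distance follows as one minus that value. The movie profiles are invented.

```python
import math

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Invented item profiles expressed as sets of binary features.
movie_a = {"comedy", "romance", "1990s"}
movie_b = {"comedy", "drama", "1990s"}
print("Jaccard similarity:", jaccard_similarity(movie_a, movie_b))

# The same profiles as binary vectors over a fixed feature order.
feature_order = ["comedy", "romance", "drama", "1990s"]
vector_a = [1 if f in movie_a else 0 for f in feature_order]
vector_b = [1 if f in movie_b else 0 for f in feature_order]
print("cosine similarity:", round(cosine_similarity(vector_a, vector_b), 3))
```

The two metrics give different numbers for the same pair of items, which is one reason the choice of metric depends on the application.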
Also, these feature sets lend themselves to predictive analytics methods (particularly classification), so using the latter as part of a recommender system is also an effective option.
The collaborative filtering method of recommendation has been around the longest and involves finding similarities between the user ratings for two items, thereby finding similar users and applying transductive reasoning for finding the more relevant items to recommend.
As for measuring the similarity among the users, there are various methods for that. Like in the content-based recommender systems, the most common ones here are the inverted Jaccard distance and cosine distance, though the process of collaborative filtering also involves certain other processes.
For example, rounding the data and performing some kind of normalization are popular.
Collaborative filtering also entails some clustering methods. This shouldn’t come as a shock, considering that clustering is about finding similar data points and putting them into groups.
The recommendation problem is a classic use case for this methodology. One thing to keep in mind, though, is that it is best to keep the number of clusters large, at least at first, for the clusters to be more meaningful.
Beyond these methods of collaborative filtering, there is also the more analytical approach of UV decomposition, where the utility matrix is split (decomposed) into two other matrices.
This is a computationally expensive process, but it solves the problem very effectively. We’ll look into a popular UV decomposition method in the next paragraph.
Non-negative Matrix Factorization (NMF or NNMF)
Although this is technically a collaborative filtering method, its robustness and popularity render it a method worthy of its own section in this blog.
As the name suggests, it involves breaking a matrix into two other matrices that, when multiplied together, yield (approximately) the original matrix. Also, the values of these new matrices are either zero or positive, hence the “non-negative” part.
Factorizing a matrix can be modeled as an optimization problem, where the fitness function is the squared error. Elegant as it is, such a process is prone to over-fitting. One very robust remedy to this is a process called regularization (which is also used in regression problems extensively).
This is incorporated in the NMF algorithm, something that contributes to the effectiveness of this technique. The two matrices that this process yields, P and Q, are often referred to as the features matrix and the weights matrix respectively.
Each row in P represents the strength of the connections between a user and the features, while every row in Q denotes the strength of the connections between an item and the features.
The features in P are the latent features that emerge from this analysis, which may have real-world expressions (e.g. a film’s genre, in the case of a movie recommender system). In general, there are fewer features than items; otherwise, the system would be impractical.
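The sketch below shows the general shape of such a factorization: squared error on the known ratings, a regularization term, and factors kept non-negative. It uses plain stochastic gradient descent with clipping at zero, which is an assumption made for brevity and not the update scheme of any particular NMF library. The ratings, learning rate, and regularization strength are all made up.

```python
import random

def factorize(ratings, n_users, n_items, n_features=2, steps=2000,
              learning_rate=0.01, regularization=0.02, seed=0):
    """Find non-negative P (users x features) and Q (items x features)
    such that the dot product of P[u] and Q[i] approximates ratings[(u, i)]."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(n_features)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(n_features)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), rating in ratings.items():
            predicted = sum(P[u][f] * Q[i][f] for f in range(n_features))
            error = rating - predicted
            for f in range(n_features):
                p, q = P[u][f], Q[i][f]
                # Gradient step with regularization, clipped to stay >= 0.
                P[u][f] = max(0.0, p + learning_rate * (error * q - regularization * p))
                Q[i][f] = max(0.0, q + learning_rate * (error * p - regularization * q))
    return P, Q

def squared_error(ratings, P, Q):
    """Sum of squared errors over the known ratings only."""
    return sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
               for (u, i), r in ratings.items())

# Made-up utility matrix (user index, item index) -> rating, with blanks.
ratings = {
    (0, 0): 5, (0, 2): 1, (0, 3): 1,
    (1, 0): 4, (1, 1): 4, (1, 3): 1,
    (2, 0): 1, (2, 1): 1, (2, 2): 5,
    (3, 0): 1, (3, 1): 1, (3, 2): 4, (3, 3): 4,
}

P_start, Q_start = factorize(ratings, 4, 4, steps=0)  # random starting point
P, Q = factorize(ratings, 4, 4)

print("error before training:", round(squared_error(ratings, P_start, Q_start), 2))
print("error after training: ", round(squared_error(ratings, P, Q), 2))
# Fill in a blank of the utility matrix: user 0, item 1.
print("predicted rating for (0, 1):", round(sum(a * b for a, b in zip(P[0], Q[1])), 2))
```

With two latent features and four items, the factor matrices are smaller than the utility matrix itself, in line with the remark that there should be fewer features than items.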
Automated Data Exploration Methods
This group of data science methods involves an automated approach to data exploration. Generally, they are fast, insightful, and oftentimes useful for the problems they attempt to solve.
The main method of this group is data mining, but there are also other ones like association rules and clustering. Let’s look at each one of the main methods of this group in more detail.
Data mining has been around even before data science was an independent discipline. Now it can be considered as part of it since it is generally an automated form of data exploration.
The focus of data mining is finding patterns in the data without the use of a specific model. In fact, it rarely involves the development of any models, as it is mainly interested in finding relationships among the different variables of the dataset.
Oftentimes, there isn’t even a target variable involved in the problems that it’s trying to solve. So, for problems like that, data mining is an excellent approach. Afterward, you can shift to different methodologies to address specific questions that involve a target variable.
Data mining has fallen out of fashion in the past years, as it was geared toward data analysts who didn’t have the programming expertise and in-depth understanding of data analytics to go beyond the analysis that data mining yields.
As such, it is not considered a groundbreaking field anymore, although it is a useful methodology to know, particularly if you are dealing with huge sparse datasets, especially containing text.
Although association rules are technically part of data mining, they are worth describing a bit further on their own, as their results are very useful.
Association rules are usually related to a particular application: shopping carts. In essence, they involve insights derived from the creation of rules about what other products people buy when products A and B are in their shopping carts.
Rules may vary in applicability and reliability. These two attributes are the key characteristics of a rule and are referred to as support and confidence respectively once they are quantified as metrics. Usually, these metrics take the form of a percentage. The higher they are, the more valuable the rules are.
So, data exploration through finding/creating association rules is the process of pinpointing the most valuable such rules based on a given dataset. The number of rules and the thresholds of support and confidence tend to be the parameters of this process.
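Support and confidence are easy to compute directly. In the sketch below the shopping baskets are invented, the support of a rule is taken to be the fraction of baskets containing all of its items, and its confidence is the fraction of baskets with the antecedent that also contain the consequent.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(baskets, antecedent)
    if not base:
        return 0.0
    return support(baskets, set(antecedent) | set(consequent)) / base

# Invented shopping carts.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "jam", "milk"},
]

rule_if, rule_then = {"bread", "butter"}, {"jam"}
print("support:   ", support(baskets, rule_if | rule_then))
print("confidence:", round(confidence(baskets, rule_if, rule_then), 3))
```

Mining rules then amounts to enumerating candidate rules and keeping those whose support and confidence exceed the chosen thresholds.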
Clustering is a popular data exploration method that often goes beyond the data exploration part of the pipeline. Although most of the clustering methods out there are still rudimentary and require some parameter tuning, as a methodology, clustering is highly valuable and insightful.
Usually categorized as an unsupervised learning method (since it doesn’t require any labeled data), it finds groups of data points that are most similar, while at the same time most dissimilar to the data points of the other groups.
The rationale of this methodology is finding interesting patterns in a dataset if the data you have is quantitative. (If you have binary variables too, that would also work as long as they are not all binary variables.)
These patterns take the form of labels, which can be used to provide more structure to your dataset.
An important thing to keep in mind when using this method is to always have your data normalized, preferably to [0, 1], or to (0, 1) if you plan to include binary variables.
Also, it is important to pick the right distance function (the default is usually Euclidean, but Manhattan distance is also useful in certain problems), since different distance functions yield different results. If the number of variables is high, you may want to reduce it before applying clustering to your data.
This is because high dimensionality translates to a very sparse data space (making it harder to identify the signal in the dataset), in which most distance metrics fail to capture the differences/similarities of the various data points.
Finally, in some cases, like when using a variant of the k-means algorithm, you will need to provide the number of clusters (usually referred to as k), since it is an essential parameter of the algorithm.
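The points above (normalize first, pick a distance function, supply k) can be sketched in a few lines of plain Python. This is a bare-bones k-means written for illustration, assuming made-up data with two features on very different scales; the function names and the fixed random seed are assumptions, and a real project would more likely rely on an established library.

```python
import random


def min_max_normalize(rows):
    """Rescale every column of `rows` to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance; swap in another function to change the metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster label per point, plus the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Hypothetical data: the second feature is on a much larger scale than the first.
raw = [[1, 200], [2, 220], [1.5, 210], [9, 900], [10, 950], [9.5, 920]]
normalized = min_max_normalize(raw)
labels, centroids = k_means(normalized, k=2)
print(labels)
```

The labels returned are arbitrary integers; what matters is which points share a label, and these labels can then be attached to the dataset as an extra column.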
Graph analytics makes use of a fairly intuitive representation of the data, primarily focusing on the relationships among the various data points (or features/classes, in some cases). Although this methodology’s name hints towards a visual approach to data analytics, it is not bound by that aspect of it.
In fact, graph analytics involves a lot of calculations which are abstract in nature, much like traditional computer science.
The key facets of this methodology are that graphs are in a different domain, beyond the n-dimensional space of conventional data analytics, and that graphs are made useful through the various algorithms that make the innate information in these graphs available to the data scientists.
In general, a graph is comprised of three things:
Nodes (aka vertices): the entities that are of primary importance in the dataset
Arcs (aka edges): the connections among these entities, represented as lines
Weights: the values corresponding to the strength of these connections
Graphs live in an abstract plane where all the conventional dimensions are absent. Therefore, they are not prone to the limitations of the conventional space which most datasets inhabit.
This frees graphs from issues like the curse of dimensionality or the signal being diluted by useless features. Also, because graphs can be easily viewed on a screen or a print-out, they lend themselves to easy data visualization.
In order to get data into graph form, you need to define a similarity/dissimilarity metric, or a relationship that is meaningful in the data at hand. Once you do that, you can model the data as follows:
Nodes are the data points, features, clusters, or any other entities you find most important
Arcs are the relationships identified
Weights are the values of the similarities/dissimilarities calculated for these relationships
After you have identified the architecture of the graph, you can create a list of all the nodes, their connections, and their weights, or put all this information in a large, usually sparse, matrix.
The latter is typically referred to as the connectivity matrix and is very important in many algorithms that run on graphs. Let’s look at some of those algorithms and how they can be of use to your data science projects.
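To make the two storage options concrete, here is a small sketch that turns a list of weighted arcs into both a connectivity matrix and a sparse adjacency mapping. The node names and weights are invented for illustration, and the graph is assumed to be undirected.

```python
# Hypothetical weighted, undirected graph given as (node, node, weight) arcs.
arcs = [("A", "B", 0.9), ("A", "C", 0.4), ("C", "D", 0.7)]

# Collect the nodes and give each one a row/column position.
nodes = sorted({name for a, b, _ in arcs for name in (a, b)})
index = {name: i for i, name in enumerate(nodes)}

# Connectivity matrix as a list of lists; 0.0 means "no connection".
matrix = [[0.0] * len(nodes) for _ in nodes]
for a, b, weight in arcs:
    matrix[index[a]][index[b]] = weight
    matrix[index[b]][index[a]] = weight  # undirected: mirror the weight

# Sparse alternative: store only the connections that actually exist.
adjacency = {name: {} for name in nodes}
for a, b, weight in arcs:
    adjacency[a][b] = weight
    adjacency[b][a] = weight

for name, row in zip(nodes, matrix):
    print(name, row)
```

For large graphs most matrix entries are zero, which is why the sparse form (or a dedicated sparse matrix type) tends to be the more practical choice.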
Graph algorithms are specialized algorithms, designed for graph-based data, and are adept at providing insights related to the relationship dynamics and the connectivity of the entities involved in the graphs they are applied to.
They are not always relevant to data science, but it’s good to know them nevertheless, as they enable you to comprehend the dynamics of graph theory and how the flow of information is mirrored in the graph architecture. This makes them useful in data exploration.
The graph algorithms that are most commonly used are the following:
Calculating graph characteristics (e.g. centrality, order, power, and eccentricity) – These pertain to certain aspects of the graphs, the equivalent of descriptive statistics in a tabular dataset.
Creating the Minimum Spanning Tree of a graph (Prim’s and Kruskal’s algorithms) – This is the sub-graph that connects all the nodes with the smallest overall weight. It is the skeleton of the graph.
Finding the shortest path between two nodes (Dijkstra’s and Floyd’s algorithms) – These are the core algorithms for all navigation systems, particularly those that are GPS-based.
Finding connected components – These are “islands” of connectivity in a graph and are particularly useful in sparse graphs. Finding them can yield useful insights.
Finding cliques (fully connected sub-graphs) – Much like connected components, cliques are high-density areas of connectivity; in a clique, every node is directly connected to every other node. Another difference is that the nodes of a clique can be connected to other nodes outside the clique.
Finding a Maximal Independent Set (aka MIS; Luby’s algorithm) – An independent set in a graph is a set of nodes where any pair within it is not directly connected.
A maximal independent set is an independent set S, where if we were to add any other node to S, the independence would no longer hold. Finding sets like this can be useful in understanding a graph’s entities.
Finding all-pairs shortest paths (Johnson’s algorithm) – This is very much like finding the shortest path between two nodes, but it covers every pair of nodes at once and is most efficient on sparse graphs.
Calculating the PageRank value of a node (the algorithm behind Google’s original search ranking) – This is a clever way of figuring out the importance of a node in a graph based on how many nodes point toward it and how important those nodes are themselves.
It applies only to a certain kind of graph, where the relationship’s direction is modeled (also known as directed graphs). This is essential for ranking the nodes in a graph.
Clustering nodes in a graph – This is the equivalent of conventional clustering but applied to graph-based data. It yields subsets of the original graph that are more closely connected and is essential for understanding the nature of the different kinds of entities in the graph and how they are interrelated.
Although this list is not exhaustive, it is an excellent place to start in traversing the knowledge graph of graph analytics tools.
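Two of the algorithms listed above, the shortest path between two nodes (Dijkstra’s algorithm) and connected components, are compact enough to sketch in plain Python. The graph below is made up for illustration, its weights are assumed to be non-negative costs (distances rather than similarities), and the function names are not taken from any library.

```python
import heapq


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; assumes all weights are non-negative."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # target cannot be reached from source


def connected_components(graph):
    """Return the 'islands' of an undirected graph as a list of node sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node])  # visit all neighbors
        seen |= component
        components.append(component)
    return components


# Hypothetical undirected graph stored as a sparse adjacency mapping.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0},
    "D": {"E": 1.0},
    "E": {"D": 1.0},
}

print(shortest_path_length(graph, "A", "C"))
print(connected_components(graph))
```

Here the direct arc from A to C costs more than the detour through B, and nodes D and E form an island of their own, so no path exists between A and D.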
Other Graph-related Topics
Apart from graph modeling and graph processing (through the various algorithms out there), there are certainly other things that should be taken into account when using graphs in data science projects.
Storing graphs, for example, is quite different from storing conventional datasets and requires some attention. Ideally, your graph could be stored in a single file (provided it’s compact enough) as a graph object.
However, if your graph contains an excessive amount of data, it may need to be stored in a distributed format. Whatever the case, if you are planning to do a lot of work with graphs, it would make sense to use a graph database system, such as Neo4j.
This kind of software will not only store your graph efficiently, but also query it and help you enrich it through its dedicated language (in the case of Neo4j, it is Cypher, and its scripts have the .cql extension).
More often than not, these graph programs will have APIs for certain programming languages, so you won’t need to do everything through their native scripts.
Graphs can also be dynamic. As such, they often require special treatment. For example, algorithms for evolving graphs are great for this kind of situation and are popular in the corresponding domains (e.g. social media).
Evolving graphs have a temporal component. As a result, the analysis you perform on them is different from that of conventional (static) graphs.
Clearly, not every problem lends itself to a graph-based solution. Graph analysis is very appealing to anyone who understands the underlying theory, but that doesn’t mean it’s a panacea.
However, if your problem involves relationships among entities and how they change over time, then graph analysis is likely a good fit for your data science project.
Natural Language Processing (NLP)
Natural language processing is a fascinating part of data science that involves making sense out of text and figuring out certain characteristics of the information in it, such as its sentiment polarity, its complexity, and the themes or topics it entails.
Although there are a lot of data science professionals involved in this part of data science doing all sorts of interesting things, the field remains in its infancy, as most of them just do one or more of the following NLP tasks: sentiment analysis, topic extraction, and text summarization.
We have yet to see a system in NLP that actually understands what text it is being fed and does something intelligent with it.
Sentiment analysis is one of the most popular NLP methods. It involves finding out whether a piece of text has a positive or negative flavor (although in certain cases a neutral flavor may also be an option, depending on the application).
The idea is to identify if a piece of text is polarized emotionally, and to gauge the pulse of public opinion on a topic (in the case of multiple texts, usually stemming from social media). However, sentiment analysis also lends itself to the analysis of internal documents in an organization, such as call logs.
There are several ways to accomplish sentiment analysis; most of them involve creating a set of features which are then fed into a binary classifier (oftentimes a simple logistic regression system).
The effectiveness of this method lies in the selection of the most powerful features that most effectively encapsulate the information in the text.
The presence of certain keywords that are known to be either positive or negative can be good candidates for such features.
Usually, the selection of the words or phrases that are then used for these features is an automated process. The whole process can also be modeled as a regression problem, yielding a flavor score as an output, ranging between 0.0 (negative sentiment) and 1.0 (positive sentiment).
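A minimal sketch of this idea follows: keyword counts serve as the features, and a logistic function turns them into a score between 0.0 and 1.0. The keyword lists, the weights, and the bias are all hypothetical values picked by hand for illustration; in a real system they would be learned from labeled text.

```python
import math
import re

# Hypothetical keyword lists; a real system would select these automatically.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

# Hypothetical hand-picked model parameters (normally fitted to labeled data).
WEIGHTS = {"positive": 1.5, "negative": -1.5}
BIAS = 0.0


def features(text):
    """Count how many known positive and negative keywords appear in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "positive": sum(w in POSITIVE for w in words),
        "negative": sum(w in NEGATIVE for w in words),
    }


def sentiment_score(text):
    """Logistic score: near 0.0 is negative, near 1.0 is positive."""
    f = features(text)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in f.items())
    return 1.0 / (1.0 + math.exp(-z))


for review in ["I love this great product", "Terrible, I hate it", "The box arrived"]:
    print(f"{sentiment_score(review):.2f}  {review}")
```

Thresholding the score at 0.5 gives the binary (positive/negative) version of the classifier, while a band around 0.5 could be treated as neutral if the application calls for it.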
Sentiment analysis, even if it is not very accurate, is a very useful application, especially when it comes to analyzing text feeds from social media in order to gauge how people feel about a certain topic (e.g. the political situation, a certain event, or even a particular person).
An example of this is the sentiment analysis of a particular commercial product, using its reviews from various sources. The information stemming from the sentiment analysis of this data stream could help a company figure out the optimum marketing strategy for that product, as well as optimize the way to engage with its customers.
Don’t underestimate the value of this method because of its simplicity. Sentiment analysis is so insightful that there are companies out there that are built solely around this service.
Topic extraction (aka topic modeling) entails the discovery of the most distinct groups of topics in a set of documents based on a statistical analysis of the key terms featured. This analysis is usually based on some frequency model, which basically translates to the number of occurrences of the terms in the documents.
Whatever the model, it generally works well and provides a set of topics, along with a view of how the given documents relate to these topics.
The most commonly used algorithm for this is Latent Dirichlet Allocation (LDA), although there are others too. LDA is quite popular, and it can handle whole phrases (when these are treated as terms) rather than individual words only.
Naturally, regardless of the algorithm you use for finding these topics, certain words are usually excluded from the text because they don’t add much value to the documents. These are usually very common words (e.g. the, and, a, by), which are often referred to as stopwords.
Also, they can be other words, or even whole phrases, that are specific to the documents analyzed (e.g. USA, country, United States, economy, senator, if the document corpus is comprised of talks by US politicians, for example).
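The stopword filtering described above, followed by a simple count of the remaining key terms, might look like the sketch below. The stopword lists are deliberately tiny and hypothetical, and the function name `key_terms` is an assumption made for illustration.

```python
import re
from collections import Counter

# A tiny generic stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "and", "a", "by", "of", "to", "in", "is"}

# Hypothetical corpus-specific stopwords for a collection of political speeches.
CORPUS_STOPWORDS = {"senator", "country"}


def key_terms(document, extra_stopwords=frozenset()):
    """Count the terms of a document after dropping stopwords."""
    words = re.findall(r"[a-z]+", document.lower())
    excluded = STOPWORDS | set(extra_stopwords)
    return Counter(w for w in words if w not in excluded)


counts = key_terms(
    "The senator praised the economy and the jobs in the country.",
    CORPUS_STOPWORDS,
)
print(counts.most_common())
```

Term counts like these are the raw input that frequency-based topic models work from.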
Topic extraction/modeling is essential for organizing large collections of text that are in the form of individual documents. These can be articles, website content, or even whole blogs.
This NLP method enables you to figure out what these documents are about without having to parse them yourself so that you can then analyze the document corpus more effectively and efficiently.
As its name suggests, text summarization has to do with encapsulating the key ideas in a document in a new document that is shorter, yet still in a comprehensive text form.
Coupled with this is the process of named entity recognition, which has to do with identifying certain entities (e.g. people involved, their titles, time, and location) and delivering all the key information related to them (e.g. what kind of event took place).
Some people consider sentiment analysis as part of text summarization, although I would rather keep it separate here, as it has grown to be a fairly independent field as of late.
There are several methods for text summarization, all of which fall into one of two categories: supervised and unsupervised. Just like the corresponding machine learning method categories, these ones share the same philosophy: if there is labeled data and it’s leveraged in the method, it is supervised. Otherwise, it is unsupervised.
Some of the major supervised methods of text summarization are grouped as follows:
Statistical: Conditional Random Field, KEA, KEA++ (Naive Bayes)
Conventional ML: Turney (SVM), GenEx (Decision Tree), KPSpotter (Information Gain coupled with a combination of other NLP techniques)
AI related: Artificial Neural Networks
In the unsupervised methods category, some of the methods that exist are:
Statistical: Frequency analysis, TF-IDF, BM25, POS-tags
Other: TextRank, GraphRank, RAKE
Other NLP Methods
Beyond these methods in NLP, there are a few other NLP methods that deserve your attention if you are to employ NLP in your projects. For example, the breakdown of text into its parts-of-speech (POS) structure is also popular and is usually carried out with a package like NLTK or spaCy, in Python.
Other methods of NLP may focus more on the complexity of the text, using custom algorithms that analyze the breadth of the vocabulary used, the length of each sentence, and other similar factors.
This is particularly useful when it is taken to the next level to look into more subtle things, like writing style (expressed in different features, such as text entropy and sentence structure). This can be used to check if two documents are written by the same person (authorship attribution), which is useful in plagiarism detection.
Finally, there are other NLP methods that are used in conjunction with graph analysis or other data science methodologies.
These may involve information retrieval topics such as the relevance of a document to a given phrase (query), using methods like TF-IDF (term frequency-inverse document frequency), a heuristic for assessing how important a particular term is in a document in relation to other terms across a set of documents.
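To illustrate the heuristic, here is one simple variant of TF-IDF in plain Python, using the relative term frequency and the natural logarithm of the inverse document frequency, without smoothing. Other variants exist, and the three tiny documents are made up for illustration.

```python
import math

# Hypothetical corpus: each document is already split into lowercase terms.
docs = [
    "graph analytics finds patterns in graph data".split(),
    "clustering finds groups in data".split(),
    "sentiment analysis of text data".split(),
]


def tf(term, doc):
    """Term frequency: share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: rarer terms across the corpus score higher."""
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing) if containing else 0.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


for term in ["graph", "finds", "data"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

A term such as "data" that occurs in every document gets a score of zero, whereas "graph", which is frequent in the first document and absent from the others, scores highest there.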
All of these methods merely scratch the surface of what is possible in the NLP field. For example, NLP also includes methods for generating text, figuring out the text’s parts-of-speech, and many more.
Beyond the aforementioned methodologies that are used in data science, there are other ones as well, most of which have come about relatively recently.
They are mostly AI-related, and as this field is on the rise, you can expect to see more of these cropping up in the years to come. Let’s now look at some that are popular among data scientists.
Chatbots are NLP systems that are interactive, dwell in the cloud, and are usually associated with a particular set of real-world tasks.
They serve as virtual assistants of sorts, always there to help out with queries and carrying out relatively simple tasks. The function of a chatbot entails the following processes (in that order), all of which are data-science related to some extent:
Understanding what the user types (or in some cases, speaks) – This employs an NLP system that analyzes the language, identifies keywords and key phrases, and isolates them
Confirmation (if the input is audio and the system is uncertain, or if its settings demand it) – The system repeats what the user has said to ensure that it has understood it properly
Information retrieval/task completion – This involves carrying out the task that relates to the user’s request, be it a simple informational one or a more executive one
Final response – The chatbot yields the information retrieved or a brief report regarding the task performed to the user
Chatbots are great for customer service, general assistance with the visitors of a complex website, or even with day-to-day tasks, as in the case of Amazon Echo.
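The processes listed above can be mimicked with a toy, keyword-driven sketch. It is not how production chatbots work; the canned answers, the `from_audio` flag, and the function names are all hypothetical, and the "understanding" step is reduced to spotting known keywords.

```python
import re

# Hypothetical knowledge base mapping a keyword to a canned answer.
ANSWERS = {
    "hours": "We are open 9am to 5pm, Monday to Friday.",
    "refund": "Refunds are processed within 5 business days.",
}

log = []  # every exchange is recorded here


def understand(message):
    """Step 1: isolate the keywords the system knows about."""
    words = re.findall(r"[a-z]+", message.lower())
    return [w for w in words if w in ANSWERS]


def respond(message, from_audio=False):
    keywords = understand(message)
    if not keywords:
        reply = "Sorry, I did not understand that."
    else:
        # Step 3: retrieve the information for the first recognized keyword.
        reply = ANSWERS[keywords[0]]
        if from_audio:
            # Step 2 (simplified): echo what was understood before answering.
            reply = f'You asked about "{keywords[0]}". ' + reply
    log.append((message, keywords, reply))
    return reply  # Step 4: final response


print(respond("What are your hours?"))
print(respond("I want a refund", from_audio=True))
print(len(log), "exchanges logged")
```

The `log` list stands in for the logging mentioned in the text: it is the kind of record a better-designed back-end could mine to improve the system over time.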
There are several open-source systems, and they are often domain-specific. A key benefit of chatbots is that everything they do is logged.
If the systems in the back-end are properly designed, a chatbot can use this data to improve itself, making the communication with the user smoother and more reliable over time.
Although the technology is still fairly new and has plenty of room for improvement, it is gaining ground rapidly. As NLP methods continue to improve, chatbots are bound to be much more useful in the future.
If you are interested in NLP, this is definitely an area worth monitoring closely.
Artificial creativity is a niche methodology of data science that is really an AI technology. It involves the use of sophisticated systems, usually ANN-based, that mimic certain styles to create new data either from scratch or using preexisting material (e.g. a photograph, in the case of digital painting). However, artificial creativity is not only related to the arts (e.g. music, painting, and poetry).
It covers other, more practical areas of application too, such as document writing (e.g. in the case of news articles) and product design.
Automated software testing is also something that’s gained traction over the past few years (computer-generated tests tend to be broader and therefore the testing that ensues is more robust).
The application where artificial creativity truly shines, however, at least when it comes to data science, is coming up with novel solutions to problems. Artificial creativity is of paramount importance when it comes to feature engineering (e.g. through Deep Learning systems as well as Genetic Programming).
If you are interested in coming up with novel ways to expand your feature set, this is a field worth exploring. The catch is that you need to have a lot of (labeled) data in order to do this properly; otherwise, the new feature set is bound to be subject to over-fitting.
Other AI-based Methods
Beyond chatbots and artificial creativity, there are several other AI-based methodologies that are used in data science, directly or indirectly. Let’s take a brief look at some of them that have been around long enough to demonstrate their timelessness and viability in this ever-changing field.
First and foremost, optimization is one of the methods that are essential for any data-science related problem, and it is used under the hood in many of the aforementioned methodologies. It involves finding the best (or one of the best) set of parameter values of a function, in order to make its output maximum or minimum.
However, it can be necessary to apply it on its own for finding the best possible solution to a well-defined problem, under certain restrictions, especially in cases where there are several variables involved and the problem cannot be solved analytically.
Yet even in cases where an analytic solution is possible, we may resort to optimization in order to obtain a solution faster, even if that solution is not the absolute best one.
This is because an analytical solution may take much more time and computational resources to calculate, and this extra cost is not always justified, since a near-optimum solution is often good enough for all practical purposes.
A few AI methods for optimization that have been proven to be both robust and versatile are Genetic Algorithms, Particle Swarm Optimization (and its many variants), and Simulated Annealing.
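As an illustration of one of these methods, below is a bare-bones Simulated Annealing sketch that minimizes a one-variable function. The objective function, the step size, the cooling schedule, and the fixed random seed are all assumptions chosen to keep the example short; real problems usually involve many variables and need these settings tuned.

```python
import math
import random


def simulated_annealing(f, x0, steps=2000, temperature=1.0, cooling=0.99, seed=0):
    """Minimize f starting from x0; returns the best x found and its value."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for _ in range(steps):
        candidate = x + rng.uniform(-1.0, 1.0)  # random nearby solution
        f_candidate = f(candidate)
        delta = f_candidate - fx
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, fx = candidate, f_candidate
            if fx < best_fx:
                best_x, best_fx = x, fx
        temperature *= cooling
    return best_x, best_fx


def objective(x):
    """Hypothetical objective with its minimum value of 2 at x = 3."""
    return (x - 3.0) ** 2 + 2.0


best_x, best_value = simulated_annealing(objective, x0=0.0)
print(f"x = {best_x:.3f}, f(x) = {best_value:.3f}")
```

The result is a near-optimum rather than a guaranteed exact optimum, which is the trade-off described above.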
Which one is best depends on the application and the data at hand, so I recommend that you learn to use all of them, even if you don’t know the specifics of the algorithms involved in great detail.
AI-based reasoning is another AI methodology that is very popular among theoretical scientists. It involves getting the computer to perform logical operations on logical statements.
These are often used to conduct proofs of theorems, which augment the existing scientific knowledge by facilitating research (particularly in Mathematics).
They can also be useful for data science projects, particularly if you make use of a rule-based system. This kind of artificial reasoning can be useful for conveying insights as well as handling high-level administration of data analytics systems in a manner that is easy to understand and communicate with the stakeholders of a project.
Fuzzy logic is a classic AI methodology that is similar to reasoning but involves a different framework for modeling data. Also, contrary to all statistical approaches which are probabilistic in nature, fuzzy logic is possibilistic and involves levels of membership to particular states rather than chance.
Since variable states are not clear-cut in fuzzy logic, there is a sense of realism in how it treats its entities, as we as human beings tend not to see things in black-and-white for everyday things.
Nevertheless, fuzzy logic operators closely resemble the operators of conventional logic, though they are implemented differently. Regardless, the end result is always crisp, and the whole process of getting there is quite comprehensible, making this framework a very practical and easy-to-use tool.
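A small sketch can show what levels of membership and fuzzy operators look like in practice. It assumes triangular membership functions and the common min/max/complement definitions of AND, OR, and NOT, which are only one of several possible choices; the variable names and breakpoints are made up, and the final defuzzification step that produces a crisp result is not shown.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0.0 to 1.0) of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# One common choice of fuzzy operators (min, max, complement).
def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


# Hypothetical fuzzy sets: "warm" for temperature in Celsius, "humid" for humidity in %.
warm = triangular(22.0, left=15.0, peak=25.0, right=35.0)
humid = triangular(70.0, left=40.0, peak=80.0, right=100.0)

print("warm:", warm)
print("humid:", humid)
print("warm AND humid:", fuzzy_and(warm, humid))
print("warm OR humid:", fuzzy_or(warm, humid))
print("NOT warm:", fuzzy_not(warm))
```

Note that a reading of 22 degrees is neither fully "warm" nor fully "not warm"; it belongs to both states to different degrees, which is the possibilistic view described above.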
The main limitations of fuzzy logic are that it ceases to be intuitive when the number of variables increases, and it takes a lot of work to set up the membership functions, often requiring input from domain experts.
Nevertheless, it is a useful methodology to have in mind as it is easy to apply to existing data science systems, such as classifiers, improving their performance.