Data Science Questions and Hypotheses
This tutorial covers the kinds of questions that are most useful in data science and how they are used to form hypotheses. It also explains how we can turn these questions into a source of useful information through experiments and the analysis of their results.
This whole process involves the creation of hypotheses, the formal scientific means for turning these questions into something that can be tackled in an objective manner.
Importance of Asking (the Right) Questions
Although the people who drive the data science projects are usually the ones who ask the key questions that need to be answered through these projects, you need to ask your own questions for two reasons:
1. The questions your superiors ask tend to be more general and very hard to answer directly, leading to potential miscommunications or inadequate understanding of the underlying problem being investigated
2. As you work with the various data streams at your disposal, you gain a better understanding of problems and can ask more informative (specialized) questions that get to the heart of the data at hand
Now you may ask “What’s the point of asking questions in data science if the use of AI can solve so many problems for us?” Many people who are infatuated with AI tend to think that, and as a result, consider questions a secondary part of data science, if not something completely irrelevant.
Although the full automation of certain processes through AI may be a great idea for a sci-fi film, it has little to do with reality.
Artificial intelligence can be a great aid in data science work, but it is not at a stage where it can do all the work for us. Regardless of how sophisticated AI systems are, they cannot yet ask questions that are meaningful or useful, nor can they communicate them to anyone in a comprehensible and intuitive way.
Sometimes it helps to think of such things with metaphors so that we obtain a more concrete understanding of the corresponding concepts. Think of AI as a good vehicle that can take you from A to B in a reliable and efficient manner.
Yet, even if it is a state-of-the-art car (e.g. a self-driving one), it still needs to know where B is. Finding this is a matter of asking the right questions, something that AI is unable to do in its current state.
As insights come in all shapes and forms, it is important to remember that some of them can only be accessed by going deeper into the data. These insights also tend to be more targeted and valuable, so it’s definitely worth the extra effort.
After all, it’s easy to find the low-hanging fruit of a given problem! To obtain these more challenging insights, you need to perform an in-depth analysis that goes beyond data exploration.
An essential part of this endeavor is formulating questions about the different aspects of the data at hand. Failing to do that is equivalent to providing conventional data analytics (e.g. business intelligence, or econometrics), which although fine in and of itself, is not data science, and passing it off as data science would undermine your role and reputation.
Naturally, all these questions need to be grounded in a way that is both formal and unambiguous. In other words, there needs to be some scientific rigor in them and a sense of objectivity as to how they can be tackled.
That’s where hypotheses enter the scene, namely the scientific way of asking questions and expanding one’s knowledge of the problem studied. These are the means that allow for finding something useful, in a practical way, with the questions you come up with while pondering on the data.
Finally, asking questions and formulating hypotheses underline the human aspect of data science in a very hands-on way. Data may look like ones and zeros when handled by the computer, but it is more than that.
Otherwise, everything could be fully automated by a machine (which it can’t, at least not yet). It is this subtleness in the data and the information it contains that makes asking questions even more important and useful in every data science project.
Formulating a Hypothesis
Once you have your question down, you are ready to turn it into something your data is compatible with, namely a hypothesis. A hypothesis is something you can test in a methodical and objective manner.
In fact, most scientific research is done through the use of different kinds of hypotheses that are then tested against measurements (experimental data) and in some cases theories (refined information and knowledge). In data science, we usually focus on the experimental evidence.
Formulating a hypothesis is fairly simple as long as the question is quantifiable. You must make a statement that summarizes the question in a very conservative way (basically something like the negation of the statement you use as a question).
The statement that takes the form of the hypothesis always corresponds to a yes-or-no question, and it's referred to as the Null Hypothesis (usually denoted as H0). This is what you aim to disprove later on by gathering enough evidence against it.
Apart from the Null Hypothesis, you also need to formulate another hypothesis, which is what would be a potential answer to the question at hand.
This is called the Alternative Hypothesis (symbolized as Ha), and although you can never prove it with 100% certainty, gathering enough evidence against the Null Hypothesis makes the Alternative Hypothesis a more credible answer.
It is often the case that there are many possibilities beyond that of the Null Hypothesis, so this whole process needs to be repeated several times in order to obtain an answer with a reasonable level of confidence.
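As a concrete illustration of this workflow, here is a minimal sketch in Python using SciPy's one-sample t-test; the sample values, the hypothesized mean, and the significance threshold are all made up for the example:

```python
# A minimal sketch of the H0/Ha workflow with an illustrative one-sample
# t-test (the data and threshold are assumptions, not from the text).
from scipy import stats

sample = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7]
mu0 = 5.0  # H0: the population mean equals 5.0; Ha: it does not

t_stat, p_value = stats.ttest_1samp(sample, mu0)
alpha = 0.05  # significance threshold, chosen before running the test
if p_value < alpha:
    print("Reject H0: the evidence supports the alternative hypothesis")
else:
    print("Fail to reject H0: not enough evidence against it")
```

Note that failing to reject H0 is not the same as proving it; it only means the evidence gathered was insufficient.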
We’ll examine this dynamic in more detail in the following blog. For now, let’s look at the most common questions you can ask and how you can formulate hypotheses based on them.
Questions Related to Most Common Use Cases
Naturally, not all questions are suitable for data science projects. Also, certain questions lend themselves more to the discovery of insights, as they are more easily quantifiable and closer to the essence of the data at hand, as opposed to other questions that aim to mainly facilitate our understanding of the problem.
In general, the more specific a question is and the more closely it relates to the available data, the more valuable it tends to be. Specifically, we can ask questions about the relationship between two features, the difference between two subsets of a variable, how well two variables in a feature set collaborate with each other for predicting another variable, whether a variable ought to be removed from the set, how similar two variables are to each other, whether variable X causes the phenomenon mirrored in variable Y to occur, and more. Let's now look at each of these question types in more detail.
Is Feature X Related to Feature Y?
This is one of the simplest questions to ask and can yield very useful information about your feature set and the problem in general. Naturally, you can ask the same question with other variables in the dataset, such as the target ones.
However, since you usually cannot do much about the target variables, more often than not you would ask questions like this by focusing on the features.
This way, if you find that feature X is very closely related to feature Y, you may decide to remove X or Y from your dataset, since keeping both doesn't add a great deal of information. You can think of it as having two people in a meeting who always agree.
However, before taking any action based on the answer you obtain about the relationship between these two features, it is best to examine other features as well, especially if the features themselves are fairly rich in terms of the information they contain.
The hypothesis you can formulate based on this kind of question is also fairly simple. You can hypothesize that the values of these features are unrelated (i.e. H0: the similarity between the values of X and Y is zero).
If the features are continuous, it is important to normalize them first, as well as remove any outliers they may have (especially if you are using a basic metric to measure their relationship).
Otherwise, depending on how different the scale is or how different the values of the outliers are in relation to the other values, you may find the two features different when they are not.
The similarity itself is usually measured by a metric designed specifically for this task. The alternative to this hypothesis would be that the two features are related (i.e. Ha: there is a measurable similarity between the values of X and Y).
If this whole process is new to you, it helps to write your hypotheses down so that you can refer to them easily afterward. However, as you get more used to them, you can just make a mental note about the hypotheses you formulate as you ask your questions.
An example of this type of question is as follows: we have two features in a dataset, a person’s age (X1) and that person’s work experience (X2). Although they correspond to two different things, they may be quite related.
The question, therefore, would be, “Is there a relationship between a person’s age and their work experience?” Here are the hypotheses we can formulate to answer this question accurately:
H0: there is no measurable relationship between X1 and X2
Ha: X1 is related to X2
Although the answer may be intuitive to us, we cannot be sure unless we test these hypotheses, since the data at hand may have a different story to tell about how these two variables relate to each other.
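As a hypothetical illustration (the ages and years of experience below are invented), a Pearson correlation test in SciPy could be used to test these two hypotheses:

```python
# An illustrative sketch: testing whether age (X1) is related to work
# experience (X2) using a Pearson correlation test. The data is made up.
from scipy import stats

age = [22, 25, 30, 35, 40, 45, 50, 55, 60]          # X1
experience = [0, 2, 7, 10, 16, 20, 26, 30, 35]       # X2

r, p_value = stats.pearsonr(age, experience)
alpha = 0.05
if p_value < alpha:
    print(f"Reject H0: X1 and X2 are related (r = {r:.2f})")
else:
    print("Fail to reject H0: no measurable relationship found")
```

With fabricated data this linear, the test unsurprisingly rejects H0; real data is rarely this clean.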
Is Subset X Significantly Different from Subset Y?
This is a very useful question to ask when you are examining the values of a variable in more depth. In fact, you can’t do any serious data analysis without asking and answering this question.
The subsets X and Y are usually derived from the same variable, but they can come from any variable of your dataset, as long as both of them are of the same type (e.g. both are integers).
However, usually, X and Y are parts of a continuous variable. As such, they are both on the same scale, so no normalization is required.
Also, it is usually the case that any outliers that may exist in that variable have been either removed or changed to adapt to the variable’s distribution.
So, X and Y are in essence two sets of float numbers that may or may not be different enough as quantities to imply that they are really parts of two entirely different populations.
Whether they are or not will depend on how different these values are. In other words, say we have the following hypothesis that we want to check:
H0: the difference between X and Y is insubstantial (more or less zero)
The alternative hypothesis, in this case, would be:
Ha: the difference between X and Y is substantial (greater than zero in absolute value)
Note that it doesn’t matter if X is greater than Y or if Y is greater than X. All we want to find out is if one of them is substantially larger than the other since regardless of which one is larger, the two subsets will be different enough.
This “enough” part is something measurable, usually through a statistic, and if this statistic exceeds a certain threshold, it is referred to as “significant” in scientific terms.
If X and Y stem from a discrete variable, it requires a different approach to answer this question, but the hypothesis formulated is the same. We’ll look into the underlying differences between these two cases in the next blog.
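For the continuous case, here is a sketch of how this comparison might look with an independent two-sample t-test; the subsets below are purely illustrative:

```python
# A sketch of comparing two subsets X and Y of a continuous variable
# with an independent two-sample t-test (illustrative values).
from scipy import stats

X = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4]
Y = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

# H0: the difference between X and Y is more or less zero
t_stat, p_value = stats.ttest_ind(X, Y)
if p_value < 0.05:
    print("Reject H0: the subsets are significantly different")
else:
    print("Fail to reject H0: the difference is insubstantial")
```

Note that the sign of the t statistic (which subset is larger) is irrelevant here; only the magnitude of the difference matters.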
Do Features X and Y Collaborate Well with Each Other for Predicting Variable Z?
This is a very useful question to ask. Many people don’t realize they could ask it, while others have no idea how they could answer it properly. Whatever the case, it’s something worth keeping in mind, especially if you are dealing with a predictive analytics problem with lots of features.
It doesn’t matter if it’s a classification, a regression, or even a time-series problem: when you have several features, chances are that some of them don’t help much in the prediction, even if they are good predictors on their own.
Of course, how well two features collaborate depends on the problem they are applied to. So, it is important to first decide on the problem and on the metric you’ll rely on primarily for the performance of your model. For classification, you’ll probably go for the F1 score or the Area Under the Curve (AUC) of the ROC curve.
In this sense, the collaboration question can be viewed from the perspective of the evaluation metric’s value. Therefore, the question can take the form of the following hypothesis (which like before, we need to see if we can disprove):
H0: the addition of feature Y does not affect the value of evaluation metric M when using just X in the predictive model
The alternative hypothesis would be:
Ha: adding Y as a feature to a model consisting only of X will cause a considerable improvement in the performance of the model, as measured by evaluation metric M.
The following set of hypotheses would also be worth using to formalize the same question:
H0: the removal of feature Y does not affect the value of evaluation metric M when using both X and Y in the predictive model
Ha: removing Y from a model consisting of X and Y will cause considerable degradation in its performance, as measured by evaluation metric M
Note that in both of these approaches to creating a hypothesis, we took into account the direction of the change in the performance metric’s value. This is because the underlying assumption of features X and Y collaborating is that having them work in tandem is better than either one of them working on its own, as measured by our evaluation metric M.
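One way such an experiment could be sketched is shown below; the synthetic dataset, the logistic-regression model, and the choice of F1 as metric M are all illustrative assumptions, not prescriptions:

```python
# A hedged sketch of the feature-collaboration check: compare a model's
# cross-validated F1 score using feature X alone vs. X plus Y.
# The synthetic data and the model choice are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_all, y = make_classification(n_samples=500, n_features=2,
                               n_informative=2, n_redundant=0,
                               random_state=0)
model = LogisticRegression()

f1_x_only = cross_val_score(model, X_all[:, [0]], y, scoring="f1", cv=5)
f1_x_and_y = cross_val_score(model, X_all, y, scoring="f1", cv=5)

print(f"F1 with X only:  {f1_x_only.mean():.3f}")
print(f"F1 with X and Y: {f1_x_and_y.mean():.3f}")
# A considerably higher second score is evidence against H0.
```

In practice you would compare the per-fold scores with a statistical test rather than eyeballing the means.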
Should We Remove X from the Feature Set?
After pondering the potential positive effect of a feature on a model, the question that comes to mind naturally is the reciprocal of that: would removing a feature, say X, from the feature set be good for the model (i.e. improve its performance)? Or in other words, should we remove X from the feature set for this particular problem?
If you have understood the dynamics of feature collaboration outlined in the previous section, this question and its hypothesis should be fairly obvious. Yet, most people don’t pay enough attention to it, opting for more automated ways to reduce the number of features, oftentimes without realizing what is happening in the process.
If you would rather not take shortcuts and you would prefer to have a more intimate understanding of the dynamics at play, you may want to explore this question more.
This will not only help you explain why you let go of feature X but also gain some insight into the dynamics of features in a predictive analytics model in general. So, when should you take X out of the feature set? Well, there are two distinct possibilities:
1. X degrades the performance of the model (as measured by evaluation metric M)
2. The model’s performance remains the same whether X is present or not (based on the same metric)
One way of encapsulating this in a hypothesis setting is the following:
H0: having X in the model makes its performance, based on evaluation metric M, notably higher than omitting it from the model
Ha: removing X from the model either improves or maintains the same performance, as measured by metric M
Like in the previous question type, it is important to remember that the usefulness of a feature greatly depends on the problem at hand. If you find that removing feature X is the wisest choice, it’s best to still keep it around (i.e. don’t delete it altogether), since it may be valuable as a feature in another problem, or perhaps with some mathematical tinkering.
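As an illustrative sketch of this kind of ablation check, one could compare per-fold scores with and without X using a paired t-test; the synthetic data, the linear model, and R-squared as the metric are all assumptions made for the example:

```python
# An illustrative ablation sketch: compare per-fold scores of a model
# with and without feature X (here, the first column) via a paired
# t-test. The synthetic data, model, and metric are assumptions.
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_all, y = make_regression(n_samples=300, n_features=3, noise=10.0,
                           random_state=1)
model = LinearRegression()

with_x = cross_val_score(model, X_all, y, scoring="r2", cv=10)
without_x = cross_val_score(model, X_all[:, 1:], y, scoring="r2", cv=10)

t_stat, p_value = stats.ttest_rel(with_x, without_x)
if p_value >= 0.05:
    print("Performance is unchanged without X; it may be safe to drop it")
else:
    print("X affects the model's performance; investigate before dropping")
```

Even if the test suggests dropping X for this problem, keep the feature around, as advised above.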
How Similar are Variables X and Y?
Another question worth asking is related to the first one we covered in the blog, namely the measurement of the similarity between two variables X and Y.
While answering whether the variables are related may be easy, we may be interested in finding out exactly how much they are related. This is not based on some arbitrary mathematical sense of curiosity.
It has a lot of hands-on applications in different data science scenarios, such as predictive analytics. For example, in a regression problem, finding out how similar a feature X is to the target variable Y is a good sign that it ought to be included in the model.
Also, in any problem that involves continuous variables as features, finding that two such variables are very similar to each other may lead us to omit one of them, even without having to go through the process of the previous section, thus saving time.
In order to answer this question, we tend to rely on similarity metrics, so it’s usually not the case that we formulate hypotheses for this sort of question. Besides, most statistical similarity metrics come with a set of statistics that help clarify the significance of the result.
However, even though this is a possibility, the way statistics has modeled the whole similarity matter is both arbitrary and weak, at least for real-world situations, so we’ll refrain from examining this approach.
Besides, twisting the data into a preconceived idea of how it should be (i.e. a statistical distribution) may be convenient, but data science opts to deal with the data as it is rather than how we’d like it to be.
Therefore, it is best to measure similarity with various metrics (particularly ones that don’t make any assumptions about the distributions of the variables involved) rather than rely on some statistical method only.
Similarity metrics are a kind of heuristic designed to depict how closely related two features are on a scale of 0 to 1. You can think of them as the opposite of distances. We’ll look at these, along with other interesting metrics, in detail toward the last part of this blog.
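As a toy illustration of such a heuristic (the formula is one arbitrary choice among many, not a standard), a normalized Euclidean distance can be flipped into a 0-to-1 similarity score without any distributional assumptions:

```python
# A toy similarity heuristic: scale both variables to [0, 1], compute a
# normalized Euclidean distance, and flip it into a similarity score.
# This particular formula is an illustrative choice, not a standard.
import numpy as np

def similarity(x, y):
    """Return a similarity in [0, 1]; 1 means the scaled vectors match."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    for v in (x, y):            # min-max scale each variable in place
        v -= v.min()
        rng = v.max()
        if rng > 0:
            v /= rng
    d = np.linalg.norm(x - y) / np.sqrt(len(x))  # distance is in [0, 1]
    return 1.0 - d

print(similarity([1, 2, 3, 4], [2, 4, 6, 8]))  # scaled copies -> 1.0
```

Because the variables are min-max scaled first, the heuristic is insensitive to differences in scale, in line with the earlier note on normalization.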
Does Variable X Cause Variable Y?
Whether a phenomenon expressed through variable X is the root cause of a phenomenon denoted by variable Y is a tough problem to solve (this kind of problem is usually referred to as root-cause analysis), and it is definitely beyond the scope of this blog. Still, a question related to this is quite valid and worth asking.
Nevertheless, unless you have some control over the whole data acquisition pipeline linked to the data science one, you may find it an insurmountable task.
The reason is that in order to prove or disprove causality, you need to carry out a series of experiments designed for this particular purpose, collect the data from them, and then do your analytics work.
This is why more often than not when opting to investigate this kind of question, we go for a simpler set-up known as A-B testing.
This is not as robust; it merely provides evidence that the phenomenon of variable X contributes to that of variable Y, which is quite different from saying that X is the root cause of Y.
Nevertheless, it is still a valuable method as it provides us with useful insights about the relationship between the two variables in a way that correlation metrics cannot.
A-B testing is the investigation of what happens when a control variable has a certain value in one case and a different value in another.
The difference in the target variable Y between these two cases can show whether X influences Y in some measurable way. This is a much different question than the original one of this section. Still, it is worth looking into it, as it is common in practice.
The hypothesis that corresponds to this question is fairly simple. One way of formulating the null hypothesis is as follows:
H0: X does not influence Y in any substantial way
The alternative hypothesis, in this case, would be something like:
Ha: X contributes to Y in a substantial way
So, finding out if X is the root cause of Y involves first checking whether it influences Y, and then eliminating all other potential causes of Y one by one.
To illustrate how complex this kind of analysis can be, consider that establishing beyond a doubt that smoking cigarettes is a root cause of cancer (something that seems obvious to us now) took several years of research.
Note that in the majority of cases of this kind of analysis, at least one of the variables (X, Y) is discrete.
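As a hypothetical example of this setup, suppose X is the group (A or B) and Y is a binary outcome such as conversion; a chi-square test on made-up counts might look like this:

```python
# A hypothetical A-B test sketch: X is the group (A or B), Y is a binary
# outcome (e.g. conversion). The counts below are invented for illustration.
from scipy import stats

#                 converted  not converted
contingency = [[120,        880],   # group A
               [165,        835]]   # group B

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
if p_value < 0.05:
    print("Reject H0: the group (X) influences the outcome (Y)")
else:
    print("Fail to reject H0: no substantial influence detected")
```

As noted above, a significant result here shows influence, not root cause.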
Other Question Types
Apart from these question categories, there are several others that are more niche and therefore beyond the scope of this blog (e.g. questions relevant to graph analysis, data mining, and other methodologies discussed previously).
Nevertheless, you should explore other possibilities of questions and hypotheses so that at the very least, you develop a habit of doing that in your data science projects.
This is bound to help you cultivate a more inquisitive approach to data analysis, something that is in and of itself an important aspect of the data science mindset.
Questions Not to Ask
Questions that are very generic or too open-ended should not be asked about the data. For example, questions like “What’s the best methodology to use?” or “What’s the best feature in the feature set?” or “Should I use more data for the model?” are bound to merely waste your time if you try to answer them using the data as is.
The reason is that these questions are quite generic or their answers tend to be void of valuable information.
For example, even if you find out that the best feature in the feature set is feature X, what good will that do? Besides, a feature’s value often depends on the model you use, as well as how it collaborates with other features.
You may still want to answer these questions, but you will need to consult external sources of information (e.g. the client), or make them more specific.
For example, the second question can be transformed into these questions, which make more sense: “What’s the best feature in the feature set for predicting target variable X, if all other features are ignored?” or “What’s the feature that adds the most value in the prediction of target variable X, given the existing model?”
Also, questions that have a sense of bias in them are best to be avoided, as they may distort your understanding of the problem.
For example, “How much better is model A than model B in this problem?” assumes that model A is indeed better than model B, so it may not allow you to be open to the possibility that it’s not.
Also, very complex questions with many conditions in them are not too helpful either. Although they are usually quite specific, they may be complicated when it comes to testing them. So, unless you are adept in logic, you are better off tackling simpler questions that are easier to work with and answer.
Asking questions is essential since the questions given by the project managers are usually either insufficient or too general to guide you through the data science process of a given problem most effectively.
Formulating a hypothesis is an essential part of answering the questions you come up with, as it allows for rigorous testing that can provide a more robust answer that is as unbiased as possible.
Hypotheses are generally yes or no questions that are subject to statistical testing so that they can be answered in a clear-cut way, and the result is accompanied by a confidence measure.
There are various kinds of questions you can ask. Some of the most common ones are:
Is feature X related to feature Y?
Is subset X significantly different from subset Y?
Do features X and Y collaborate well with each other for predicting variable Z?
Should we remove X from the feature set?
How similar are variables X and Y?
Does variable X cause variable Y?
So, you’ve come up with a promising question for your data, and you have formulated a hypothesis around it (actually you’d probably have come up with a couple of them).
Now what? Well, now it’s time to test it and see if the results are good enough to make the alternative hypothesis you have proposed (Ha) a viable answer. This fairly straightforward process is something we will explore in detail in this blog before we delve deeper into it in the blog that ensues.
The Importance of Experiments
Experiments are essential in data science, and not just for testing a hypothesis. In essence, they are the means of applying the scientific method, an empirical approach to acquiring and refining knowledge objectively.
They are also the make-or-break way of validating a theory and the most concrete differentiator between science and philosophy, as well as between data science and the speculation of “experts” in a given domain.
Experiments may be challenging, and their results may not always comply with our expectations or agree with our intuition. Still, they are always insightful and gradually advance our understanding of things, especially in complex scenarios or cases where we don’t have sufficient domain knowledge.
In more practical terms, experiments are what turns a concept we have grasped or constructed into either a fact or a fleeting error in judgment. This not only provides us with confidence in our perception but also allows for a more rigorous and somewhat objective approach to data analytics.
Many people have lost faith in the results of statistics, and for good reason: statistical methods tend to draw conclusions that are so dependent on assumptions that they fail to have value in the real world.
The pseudo-scientists that often make use of this tool do so only to propagate their misconceptions rather than do proper scientific work.
In spite of all that, there are parts of statistics that are useful in data science, and one of them is the testing of hypotheses. Experiments are complementary to that, and since they are closely linked to these statistical tests, we’ll group them together for the purpose of this blog.
How to Construct an Experiment
First of all, let’s clarify what we mean by experiments, since sci-fi culture may have tainted your perception of them.
Experiments in data science usually take the form of a series of simulations you perform on a computer, typically in an environment like Jupyter. Alternatively, you can conduct all of your experiments from the command line.
The environment you use is not so important, though if you are comfortable with a notebook setting, this would be best, as that kind of environment allows for the inclusion of descriptive text and graphics along with several other options, such as exporting the whole thing as a PDF or an HTML file.
What an experiment entails depends on the question. If it’s a fairly straightforward question that you plan to answer and everything is formulated as a clear-cut hypothesis, the experiment will take the form of a statistical test or a series of such tests (in case you have several alternative hypotheses).
The statistical tests most commonly used are the t-test and the chi-square test. The former works on continuous variables, while the latter is suitable for discrete ones.
Oftentimes, it is the case that you employ several tests to gain a deeper understanding of the hypotheses you want to check, particularly if the probability value (p-value) of the test is close to the threshold.
I recommend that you avoid using statistical tests like the z-test unless you are certain that the underlying assumptions hold true (e.g. the distribution of each variable is Gaussian).
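One way to check such an assumption before committing to a test is the Shapiro-Wilk normality test; the exponential data below is deliberately non-Gaussian and purely illustrative:

```python
# A sketch of checking the normality assumption before reaching for a
# z-test, using the Shapiro-Wilk test (the data is illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)  # clearly non-Gaussian

stat, p_value = stats.shapiro(data)
if p_value < 0.05:
    print("Normality rejected: avoid z-tests; prefer a distribution-free test")
else:
    print("No evidence against normality")
```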
If it is a more complex situation you are dealing with, you may need to do several simulations and then do a statistical test on the outputs of these simulations. We’ll look into this case in more detail in the section that follows.
Something that may seem obvious is that every scientific experiment has to be reproducible. This doesn’t mean that another experiment based on the same data is going to output the exact same results, but the new results should be close to the original ones.
In other words, the conclusion should be the same no matter who performs the experiment, even if they use a different sample of the available data. If the experiment is not reproducible, you need to re-examine your process and question the validity of your conclusions.
Another thing worth noting when it comes to constructing an experiment is that just like everything else you do throughout the data science pipeline, your experiments need to be accompanied by documentation.
You don’t need to write a great deal of text or anything particularly refined, as this documentation is bound to remain unseen by the stakeholders of the project, but if someone reads it, they will want to get the gist of things quickly.
Therefore, the documentation of your experiment is better off being succinct and focusing on the essential aspects of the tasks involved, much like the comments in the code of your scripts.
In addition, even if you don’t share the documentation with anyone, it is still useful since you may need to go back to square one at some point and re-examine what you have done. Moreover, this documentation can be useful when writing up your report for the project, as well as any supplementary material for it, such as slideshow presentations.
Experiments for Assessing the Performance of a Predictive Analytics System
When it comes to assessing the performance of a predictive analytics system, we must take a specific approach to set up the experiments needed to check if the models we create are good enough to put into production.
Although this makes use of statistical testing, it goes beyond that, as these experiments must take into account the data used in these models as well as a set of metrics.
One way to accomplish this is through the use of sampling methods. The most popular such method is K-fold cross-validation, a robust way of splitting the dataset into training and testing sets K times while minimizing the bias of the samples generated.
When it comes to classification problems, the sampling takes into account the class distribution, which is particularly useful if there is a class imbalance.
This sort of sampling is referred to as stratified sampling, and it is more robust than conventional random sampling (although stratified sampling has a random element to it too, just like most sampling methods).
When you are trying out a model and want to examine if it’s robust enough, you’ll need to do a series of training and testing runs of that model using different subsets of the dataset, calculating a performance metric after each run.
Then you can aggregate all the values of this metric and compare them to a baseline or a desired threshold value, using some statistical test. It is strongly recommended that you gather at least 30 such values before trying out a statistical evaluation of them since the reliability of the results of the experiment depends on the number of data points you have.
Oftentimes, this experimental setup is in conjunction with the K-fold cross-validation method mentioned previously, for an even more robust result.
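Here is a sketch of this setup, under the assumption of a synthetic dataset, a logistic-regression model, and an arbitrary baseline of 0.70 for the F1 score:

```python
# A sketch of the experimental setup above: repeated stratified K-fold
# cross-validation gathers 30 F1 values, which are then compared to a
# baseline with a one-sample t-test. Data, model, and baseline are
# illustrative assumptions.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=6, random_state=0)  # 30 runs
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)

baseline = 0.70  # desired performance threshold (an assumption)
t_stat, p_value = stats.ttest_1samp(scores, baseline)
print(f"mean F1 = {scores.mean():.3f}, p-value vs baseline = {p_value:.4f}")
```

The 5-fold, 6-repeat configuration is one way to reach the 30 data points recommended above.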
Experiments of this kind are so rigorous that their results tend to be trustworthy enough to warrant a scientific publication (particularly if the model you use is something new or a novel variant of an existing predictive analytics system).
Since simulations like these often take some time to run, particularly on a large dataset, you may want to include more than one performance metric. Popular performance metrics are F1 for classification and MSE for regression or time-series analysis.
In addition, if you have several models you wish to test, you may want to include all of them in the same experiment so you can run each of them on the same data as the others.
This subtle point, which is often neglected, can add another layer of robustness to your experiments, as no one can say that the model you selected as the best performer simply lucked out. If you have enough runs, the probability of one model outperforming the others primarily due to chance diminishes greatly.
A Matter of Confidence
Confidence is an important matter, not just as a psychological attribute, but also in the way questions about the data are answered through this experiment-oriented process.
The difference is that in this latter case, we can have a reliable measure of confidence, which has a very important role in the hypothesis testing process, namely quantifying the answer.
What’s more, you don’t need to be a particularly confident person to exercise confidence in your data science work when it comes to this kind of task. You just need to do your due diligence when tackling this challenge and be methodical in your approach to the tests.
Confidence is usually expressed as a heuristic metric that takes the form of a confidence score, with values ranging between 0 and 1. This corresponds to the probability of the system's prediction being correct (or within an acceptable range of error).
When it comes to statistical methods, this confidence score is the complement of the p-value of the statistical metric involved (i.e. 1 minus the p-value).
Yet, no matter how well-defined all these confidence metrics are, their scope is limited and dependent on the dataset involved. This is one of the reasons why it is important to make use of a diverse and balanced dataset when trying to answer questions about the variables involved.
It is often the case that we need to pinpoint a given metric, such as the average value of a variable or its standard deviation. This variable may be mission-critical or a key performance indicator (KPI).
In cases like this, we tend to opt for a confidence interval, something that ensures a given confidence level for a range of values for the metric we are examining. More often than not, this interval’s level of confidence is set to 95%, yet it can take any value, usually near that point.
Whatever the case, its value directly depends on the p-value threshold we have chosen a priori (e.g. in this case, 0.05). Note that although statistics are often used for determining this interval, it doesn't have to be this way.
Nowadays, more robust, assumption-free methods, such as Monte Carlo simulations, are employed for deriving the borders of a confidence interval for any distribution of data.
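As an illustration of such an assumption-free approach, here is a minimal percentile-bootstrap sketch for a 95% confidence interval of the mean. The data, resample count, and seed are arbitrary choices for the example; the idea is simply to resample the data with replacement many times and read the interval off the resulting distribution of the statistic.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for `stat` over `data`."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = list(range(1, 101))  # toy sample with a true mean of 50.5
lo, hi = bootstrap_ci(data)
print(lo <= 50.5 <= hi)     # the 95% interval should cover the true mean
```

No distributional assumption about the data is needed here, which is what makes this family of methods attractive when the usual normality conditions are in doubt.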
Regardless of what methods you make use of for answering a data science question, it is important to keep in mind that you can never be 100% certain about your answer. The confidence score is bound to be less than 1, even if you use a large number of iterations in your experiments or a large number of data points in general.
For all practical purposes, however, a high enough confidence score is usually good enough for the project. After all, data science is more akin to engineering than pure mathematics, since just like all other applied sciences, data science focuses on being realistic and practical.
Embracing this ambiguity in the conclusions is a necessary step and a differentiator of the data science mindset from the other, more precision-focused disciplines of science.
In addition, this ambiguity is abundant in big data, which is one of the factors that make data science a priceless tool for dealing with this kind of data.
However, doing more in-depth analysis through further testing can help tackle a large part of this ambiguity and shed some light on the complex problems of big data by making it a bit more ordered and concrete.
Finally, it is essential to keep in mind when answering a question that even if you do everything right, you may still be wrong in your conclusions. This counter-intuitive situation could be because the data used was of low veracity, or it wasn’t cleaned well enough, or maybe parts of the dataset were stale.
All these possibilities go on to demonstrate that data science is not an exact science, especially if the acquisition of the data at hand is beyond the control of the data scientist, as is often the case.
This whole discipline has little room for arrogance; do not rely more on fancy techniques than a solid understanding of the field. It is good to remember this, particularly when communicating your conclusions.
You should not expect any groundbreaking discoveries unless you have access to large volumes of diverse, reliable, and information-rich data, along with sufficient computing resources to process it properly.
Evaluating the Results of an Experiment
The results of the experiment can be obtained in an automated way, especially if the experiment is fairly simple. However, evaluating these results and understanding how to best act on them is something that requires attention.
That’s because the evaluation and interpretation of the results are closely intertwined. If done properly, there is little room for subjectivity.
Let’s look into this in more detail by examining the two main types of experiments we covered in this blog, statistical tests, and performance of predictive analytics models, and how we can evaluate the results in each case.
If it is a statistical test that you wish to interpret the results of, you just need to examine the various statistics that came along with it. The one that stands out the most, since it is the one yielding the most relevant information, is the p-value, which we talked about previously.
This is a number which takes values between 0 and 1 (inclusive) and denotes the probability of obtaining a result at least as extreme as the observed one through chance alone (i.e. the aggregation of various factors that were not accounted for, which contributed to it, even if you were not aware of their role or even their existence).
The reason this is so important is that even in a controlled experiment, it is possible that the corresponding result has been strongly influenced by all the other variables that were not taken into account. If the end-result is caused by them, then the p-value is high, and that’s bad news for our test.
Otherwise, it should take a fairly small value (the smaller the better for the experiment). In general, we use 0.05 (5%) as the cut-off point, below which we consider the result statistically significant. If you find this threshold a bit too high, you can make use of other popular values for it, such as 0.01, 0.001, or even lower ones.
If you are dealing with the evaluation of the performance of a classifier, a regressor, or some other predictive analytics system, you just need to gather all the data from the evaluation metric(s) you plan to use for all the systems you wish to test and put that data into a matrix.
Following this, you can run a series of tests (usually a t-test would work very well for this sort of data) to find out which model’s performance, according to the given performance metrics, is both higher than the others and with a statistical significance.
The data you will be using will correspond to columns in the aforementioned matrix. Make sure you calculate the standard deviation or the variance of each column in advance though, since some tests assume equal variances, and you may need to pick a variant of the test accordingly.
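As a sketch of such a comparison, the following computes Welch's t-statistic, a t-test variant that does not assume equal variances, on two hypothetical columns of F1 scores (the values are made up). In practice you would use a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` to get the p-value as well.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples.

    Unlike Student's t-test, this variant does not assume equal variances.
    """
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )

# Hypothetical F1 scores: one column per model, one row per run
model_a = [0.84, 0.86, 0.85, 0.87, 0.83]
model_b = [0.80, 0.79, 0.81, 0.78, 0.82]
print(round(welch_t(model_a, model_b), 2))  # → 5.0
```

A large positive statistic here would support the claim that model A's performance advantage is statistically significant rather than a fluke of the particular runs.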
As we saw in the previous blog, there are cases where a similarity metric makes more sense for answering a question. These kinds of metrics usually take values between 0 and 1, or between -1 and 1.
In general, the higher the absolute value of the metric, the stronger the signal in the relationship examined. In most cases, this translates into a more affirmative answer to the question examined.
Keep in mind that when using a statistical method for calculating the similarity (e.g. a correlation metric), you will end up with not just the similarity value, but also with a p-value corresponding to it. Although this is usually small, it is something that you may want to consider in your analysis.
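For instance, SciPy's `scipy.stats.pearsonr` returns both the correlation coefficient and its p-value. The coefficient itself can be sketched in pure Python as follows, using toy data chosen to be perfectly linear:

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]    # perfectly linear in x
print(pearson_r(x, y))  # → 1.0
```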
It is good to keep in mind that whatever results your tests yield, they are not fool-proof. Regardless of whether they pass a statistical significance test or not, it is still possible that the conclusion is not correct.
While the chances of this happening are small (the p-value is a good indicator for this), it is good to remember that so that you maintain a sense of humility regarding your conclusions and you are not taken by surprise if the unexpected occurs.
Finally, it is sometimes the case that we need to carry out additional iterations in our experiments and even perform new tests to check additional hypotheses. This may seem frustrating, but it is a normal part of the process and to some degree expected in data science work.
If by the end of the experiments, you end up with a larger number of questions than answers to your original ones, that’s fine. This is just how science works in the real world!
The Importance of Sensitivity Analysis
No matter how well we design and conduct our experiments, there is always some bias there, be it in the data we use, the implicit assumptions, or in how we analyze the results and draw the conclusions of these processes.
Unfortunately, this bias is a kind of mistake that often goes by unnoticed, and unless we are aware of this issue and take action, our conclusions may not hold true in the future.
Imagine that you created a model based on the answers to certain questions about the data at hand, only to find out that these answers were not as reliable as they seemed. You probably would not want to put such a model into production!
These issues and uncertainties that may threaten the trustworthiness of your work can largely be addressed with a bit of extra validation work in the form of sensitivity analysis.
As a bonus, such a process is likely going to provide you with additional understanding of the experiments analyzed and help you gain deeper insight into the dynamics of the data you have.
The fact that many people don't use sensitivity analysis techniques because they don't know much about them should not be a reason for you to avoid them too.
There are two broad categories of techniques you can use for sensitivity analysis depending on the scope of the analysis: global and local. We will look into each one of these later on in this blog.
Before we do so, let’s take a look at a very interesting phenomenon called the butterfly effect, which captures the essence of the dynamics of such systems that need sensitivity analysis in the first place.
Global Sensitivity Analysis Using Resampling Methods
One popular and effective way of tackling the inherent uncertainty of experimental results in a holistic manner is resampling methods (there are other methods for global sensitivity analysis, but these are beyond the scope of this blog, as they are not as popular).
Resampling methods are a set of techniques designed to provide a more robust way of deriving reliable conclusions from a dataset by trying out different samples of the dataset (which is also a sample of sorts).
Although this methodology has been traditionally statistics-based, since the development of efficient simulation processes, it has come to include different ways of accomplishing that too.
Apart from purely statistical methods for resampling, such as bootstrapping, permutation methods, and jackknife, there is also Monte Carlo (a very popular method in all kinds of approximation problems). Let’s look at each category in more detail.
Permutation methods are a clever way to perform resampling while making sure that the resulting samples are different from one another, much like the bootstrapping method.
The main differences are that, in the case of permutation methods, the sampling is done without replacement, and the process aims to test against hypotheses of "no effect" rather than to find confidence intervals of a metric.
The number of permutations possible is limited since no two sub-samples should be the same. However, as the number of data points in each sub-sample is fairly small compared to the number of data points in the original sample, the number of possible permutations is very high.
As for the number of sub-samples needed, having around 10,000 sub-samples is plenty for all practical purposes.
Permutation methods are also known as randomization techniques, and they are well established as a resampling approach. It is also important to note that this set of methods is assumption-free: regardless of the distribution of the metrics we calculate (e.g. the p-value of a t-test), the results of this meta-analysis remain reliable.
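As a rough illustration, here is a pure-Python permutation test of a "no effect" hypothesis about the difference in mean performance between two hypothetical models. The scores, permutation count, and seed are made up for the example; note that the shuffling resamples without replacement, as described above.

```python
import random
import statistics

def permutation_test(a, b, n_perm=2000, seed=0):
    """One-sided permutation test: is mean(a) > mean(b) beyond chance?"""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random (no replacement)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # conservative p-value estimate

new_model = [0.86, 0.88, 0.87, 0.89, 0.85]
old_model = [0.80, 0.79, 0.81, 0.78, 0.82]
p = permutation_test(new_model, old_model)
print(p < 0.05)  # a small p-value rejects the "no effect" hypothesis
```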
Jackknife is a lesser-known method that offers an alternative way to perform resampling, different than the previous ones, as it focuses on estimating the bias and the standard error of a particular metric (e.g. the median of a variable in the original sample).
Using a moderate amount of calculations (more than the previous resampling methods), it systematically calculates the required metric by leaving a single data point out of the original sample in each calculation of the metric.
Although this method may seem time-consuming in the case of a fairly large original sample, it is very robust and allows you to gain a thorough understanding of your data and its sensitivity to the metric you are interested in.
If you delve deeper into this approach, you can pinpoint the specific data points that influence this metric. For the purpose of analyzing data stemming from an experiment, it is highly suitable, since it is rare that you will have an exceedingly large amount of data points in this kind of data.
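A small pure-Python sketch of the jackknife for the standard error of the mean follows (the data is a toy sample). For the mean specifically, the jackknife estimate coincides with the classic s/√n formula, which makes the result easy to sanity-check:

```python
import math
import statistics

def jackknife_se(data, stat=statistics.mean):
    """Jackknife standard error of `stat`, via leave-one-out re-estimates."""
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]  # leave one point out
    mean_loo = statistics.mean(loo)
    return math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))

data = list(range(1, 11))
# For the mean, the jackknife SE matches the classic s / sqrt(n) formula:
print(round(jackknife_se(data), 4))
print(round(statistics.stdev(data) / math.sqrt(len(data)), 4))
```

Inspecting the individual leave-one-out estimates in `loo` is also how you would pinpoint the specific data points that influence the metric the most.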
Monte Carlo is a fairly popular method for all kinds of approximations, particularly the more complex ones. It is used widely for simulating the behavior of complicated processes and is very efficient due to its simplicity and ease of use.
When it comes to resampling, Monte Carlo is applied as the following process:
1. Create a simulated sample using a non-biased randomizing method. This sample is based on the population whose behavior you plan to investigate.
2. Create a pseudo-sample emulating a real-life sample of interest (in our case, this can be a predictive model or a test for a question we are attempting to answer)
3. Repeat step 2 for a total of N times
4. Compute the probability of interest from the aggregate of all the outcomes of the N trials from steps 2 and 3

From this meta-testing, we obtain a rigorous insight regarding the stability of the conclusions of our previous experiments, expressed as a p-value, just like the statistical tests we saw in the previous blog.
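The steps above can be sketched as follows; the population, sample size, and test condition are placeholders chosen purely for illustration:

```python
import random
import statistics

def monte_carlo_probability(population, sample_size, test, n_trials=5000, seed=7):
    """Draw N pseudo-samples and estimate the probability that `test` passes."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(n_trials):
        sample = rng.sample(population, sample_size)  # steps 1-2: pseudo-sample
        if test(sample):                              # step 2: run the test on it
            passes += 1
    return passes / n_trials                          # step 4: aggregate outcomes

population = list(range(1, 101))
# How often does a sample of 20 have a mean within 5 units of the true mean (50.5)?
p = monte_carlo_probability(population, 20,
                            lambda s: abs(statistics.mean(s) - 50.5) < 5)
print(0 <= p <= 1)
```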
Local Sensitivity Analysis Employing “What If?” Questions
“What If?” questions are great, even if they do not lend themselves directly to data-related topics.
However, they have a place in the sensitivity analysis, as they can prove valuable in testing the stability of a conclusion and how specific parameters relate to this. Namely, such a question can be something like, “What if parameter X increases by 10%?
How does this affect the model?” Note that these parameters often relate to specific features, so these questions are also meaningful to the people not directly involved in the model (e.g. the stakeholders of the project).
The high-level comprehensiveness of this approach is one of its key advantages. Also, as this approach involves the analysis of different scenarios, it is often referred to as scenario analysis and is common in even non-data science related situations.
Finally, they allow you to delve deeper into the data and the models that are based on it, oftentimes yielding additional insights to complement the ones stemming from your other analyses.
Some Useful Considerations on Sensitivity Analysis
Although sensitivity analysis is more of a process that can help us decide how stable a model is or how robust the answers to your questions are, it is also something that can be measured.
However, in practice, we rarely put a numeric value on it, mainly because there are more urgent matters that demand our time.
Besides, in the case of a model, whatever model we decide on will likely be updated or even replaced altogether by another model, so even if it is not the most stable model in the world, that’s acceptable.
However, if you are new to this and find that you have the time to delve deeper into sensitivity analysis when evaluating the robustness of a model, you can calculate the sensitivity metric, an interesting heuristic that reflects how sensitive a model is to a particular parameter (we will look into heuristics more in blog 11). You can accomplish this as follows:
1. Calculate the relative change of a parameter X
2. Record the corresponding change in the model's performance
3. Calculate the relative change of the model's performance
4. Divide the outcome of step 3 by the outcome of step 1
The resulting number is the sensitivity corresponding to parameter X. The higher it is, the more sensitive (i.e. dependent) the model’s performance is to that parameter.
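A minimal sketch of this four-step calculation follows; the parameter and performance values are made up for the example:

```python
def sensitivity(param_before, param_after, perf_before, perf_after):
    """Ratio of the model's relative performance change to the relative
    change of parameter X (steps 1-4 above)."""
    rel_param = (param_after - param_before) / param_before  # step 1
    rel_perf = (perf_after - perf_before) / perf_before      # steps 2-3
    return rel_perf / rel_param                              # step 4

# Hypothetical run: raising X by 10% drops accuracy from 0.80 to 0.76
print(round(sensitivity(10, 11, 0.80, 0.76), 2))  # → -0.5
```

Here the negative sign shows the direction of the effect, while the magnitude (0.5) shows how strongly the model's performance depends on that parameter.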
In the previous section, we examined K-fold cross-validation briefly. This method is actually a very robust way to tackle instability issues proactively since it reduces the risk of having a good performance in a model due to chance in the sample used to train it.
However, the chance of having a lucky streak in sampling is not eliminated completely, which is why in order to ensure that you have a truly well-performing model, you need to repeat the K-fold cross-validation a few times.
Playing around with sensitivity analysis is always useful, even if you are an experienced data scientist. The fact that many professionals in these fields choose not to delve too much into it does not make it any less valuable.
Especially in the beginning of your data science career, sensitivity analysis can help you to better comprehend the models and understand how relative the answers to your questions are once they are tested with data-based experiments.
Even if a test clears an answer for a given question on your data, keep in mind that the answer you obtained may not always hold true, especially if the sample you have is unrepresentative of the whole population of the variables you work with.
The Data Scientist’s Toolbox
The data needs to live somewhere, and chances are that it is not always in neat .csv files like the ones you see in Kaggle (a machine learning competition site). More often than not, the data you use dwells in a database.
Whatever the case, even if it is not in a database originally, you may want to put it in one during the data engineering phase for easier handling, since it is not uncommon for the same data to be used in different data science projects.
If the data is all over the place, you may want to tidy it up and put it in a database where you can easily access it, to create the datasets you will use in your project(s).
Databases come in all shapes and forms these days. However, they are generally in one of three broad categories: SQL-based, NoSQL, and graph-based.
Graph databases are somewhat different than the other ones, as they focus on creating, storing, querying, and processing graphs. However, several of the modern database platforms are multipurpose, so they handle all kinds of objects, including graphs.
Here are five of the most powerful databases that can handle graphs (some of them exclusively):
Neo4j – native graph database, specializing in this particular kind of data structure
AllegroGraph – a high-performance graph database
Teradata Aster – multipurpose high-performance database
ArangoDB – multipurpose distributed database
Graphbase – distributed and resource-cheap database specializing in very large graphs
Although most of these databases are similar, Neo4j stands out, as it is one of the most mature ones in this domain.
Also, it is highly different from the multipurpose ones; it is built around graphs, rather than just to accommodate them. You can learn more about it either at its site’s Getting Started page or at its Tutorials Point section.
Programming Languages for Data Science
Data science is inherently different than conventional data analytics approaches, and one of the key differentiating factors is programming. This is the process of creating a custom program that runs on a computer or a cloud server and processes data to provide an output related to this data, using a specialized coding language.
Excel scripting is not programming, just as creating an automated process in SAS doesn’t qualify. Programming is a separate discipline that is not rocket science, though it is not simple either.
Of course, there are programming languages that are fairly simple, like Scratch (a game-focused language ideal for kids), but these are designed for very specialized applications and do not lend themselves to real-world scenarios.
In data science, we have a set of programming languages that, because of their ease of use and variety of packages (programming libraries), lend themselves to the various methodologies we examined in the previous blog. The most important of these languages are Julia, Python, R, and Scala.
Julia is probably the best data science language out there, partly because it is super-fast, and partly because it is very easy to prototype in. Also, whatever code you write in Julia is going to be as fast as the code you can find in its data science packages.
What’s more, the combination of these features allows it to be the production language for your code, so you won’t need to translate everything into C++ or Java in order to deploy it. Julia was developed at MIT and has gained popularity all over the world since then through conferences (JuliaCon) and meet-ups.
This growing user-community allows it to provide some support to new users, making it easier for people to learn it.
Also, as it is designed with the latest programming language standards, it is more future-proof than most other languages out there.
However, due to its young age, it does not have all the data science libraries of other, more established languages, and there are lots of changes from one release to the next.
Another thing to note, which is a more neutral characteristic of Julia, is that it encourages a functional style of programming. This is a different paradigm from the object-oriented languages that dominate the programming world these days, which focus on objects and classes rather than on functions performing certain tasks. Yet, Julia has a lot of elements of the object-oriented paradigm too.
Overall, Julia is a great multipurpose language that lends itself to data science. Its newness should not deter you from using it in your data science projects.
In fact, this newness can be a good thing. You’d be one of the early adopters of a language which is gaining popularity quickly in both industry and academia.
This way, in a few years, when more companies are aware of this language’s merits and have it as a requirement for their recruits, you’ll be among the most experienced candidates.
Python is what Julia would look like if it were developed 20 years earlier. An easy, object-oriented language, it has been used extensively for all kinds of scripts (some people classify it as a scripting language).
Over the past few years, it has become popular in the data science community, particularly version 2.7, as that fork of it has the majority of the data science packages of the language.
Python's high-level style makes it easy to learn and debug, yet it is fairly slow, which is why performance-critical code in this language is usually delegated to libraries implemented in low-level languages.
Its large set of libraries for all kinds of data science methods enable its users to utilize it without having to do much coding themselves.
The downside of this is that if you cannot find what you are looking for, you will have to develop it yourself, and it is bound to be significantly less efficient than a pre-existing package.
Python has a large user community and a plethora of tutorials and blogs, so learning it and mastering it is a straightforward process. You can even find working scripts on the web from other users, so you often don't need to do much coding yourself. Moreover, it runs on pretty much every computer out there, even tiny computers like the Raspberry Pi.
Apart from its overall low speed, Python has other issues, such as the fact that its loops are quite inefficient; because of this, they are often a taboo of sorts. Also, the fact that there are two main versions (2.x and 3.x) makes things confusing: packages available in one version are often not available in the other.
Even if the newest version of Python (right now, 3.6) is pretty robust for a high-level language, it lacks the packages that older versions of it have. Nevertheless, Python is widely popular and has established itself as a data science language, even if it is not the best one for this purpose.
There is some controversy over whether R is a programming language or not, due to the fact that it is not a multipurpose coding tool by any stretch.
However, for all practical purposes, it is a programming language, albeit a weak one. R offers easy access to statistics and other data analysis tools and requires little programming from the user.
Just like Julia and Python, it is intuitive, and many people find it easy to learn, even if they do not have any programming experience.
This is partly due to its exceptional IDE, RStudio, which is open-source. Although it was primarily designed for statistics and its target group has traditionally been academia, it has gained popularity in the data science community and has a large user base. The fact that it has some excellent plotting libraries may have contributed to this.
One major downside to R is that it is particularly slow; its for-loops, in particular, are so inefficient that they are frowned upon and not even included in some R tutorials.
Also, because of various reasons, including the fact that there are newcomers in the data science languages arena, R is on the wane lately. Still, it is a useful tool for small-scale data analytics, particularly if you just want to use it in a proof-of-concept project.
Just like Julia, Scala is a functional language with elements of object-oriented programming. As such, it is fairly future-proof, making it worthy as an investment of your time (particularly if you know Java). The reason for this is that Scala stems from Java and even runs on the JVM.
Due to its clean code and its seamless integration with Apache Spark, it has a growing user community. Finally, it is fairly fast and easier to use than low-level languages, though not as intuitive as the other languages I mentioned in this section.
Unfortunately, Scala doesn't have many data science libraries, since its data science ecosystem is still relatively young. Also, it does not collaborate very well with other programming languages (apart from Java, of course).
Finally, Scala is not easy to learn, although there are several blogs and tutorials out there demonstrating how it can be used for various applications.
Generally, it is a useful language for data science, particularly if you are into the functional paradigm, if you have worked with Java before, or if you are fond of Spark.
Which Language is Best for You?
The Most Useful Packages for Julia and Python
Since Julia and Python dominate the data science world in terms of overall usefulness and ease of use, here is a list of libraries (packages) that are popular and/or useful for data science projects:
Python:
1. NumPy – You cannot do anything useful with Python without NumPy, as it provides the math essentials for any data-related project
2. SciPy – very useful package for math, with an emphasis on scientific computing
3. Matplotlib – 2-D plotting
4. Pandas – a package that allows for data frame structures
5. Statsmodels – statistical models package, including one for regression
6. Distance – the go-to package for implementations of various distance metrics
7. IPython – Although not strictly data-related, it is essential for making Python easy to use, and it is capable of integrating with Jupyter notebooks
8. scikit-learn – a collection of implementations of various machine learning and data mining algorithms
9. Theano – a package designed to facilitate computationally heavy processes using GPUs; useful for deep learning, among other resource-demanding applications
10. NLTK – Although its parts-of-speech (POS) tagging is sub-optimal, NLTK is an essential package for NLP, and it's easy to use with good documentation
11. tsne – the Python implementation of the t-SNE algorithm for reducing a dataset into 2-D and 3-D, for easier visualization
12. Scrapy – a useful web-scraping package
13. Bokeh – a great visualization package that allows for interactive plots
14. NetworkX – a very useful graph analytics package that is both versatile and scalable
15. mxnet – the Python API for the state-of-the-art deep learning framework, developed by Apache and used heavily on the Amazon cloud
16. elm – a package for developing Extreme Learning Machine systems, a not-so-well-known machine learning system
17. tensorflow – the Python API for the TensorFlow deep learning framework

Julia:
1. StatsBase – a collection of various statistics functions
2. HypothesisTests – a package containing several statistical methods for hypothesis testing
3. Gadfly – one of the best plotting packages, written entirely in Julia
4. PyPlot – a great plotting package, borrowed from Python's Matplotlib; ideal for heat maps
5. Clustering – a package specializing in clustering methods
6. DecisionTree – decision trees and random forests package
7. Graphs – a graph analytics package (the most complete one out there)
8. LightGraphs – a package with relatively fast graph algorithms
9. DataFrames – the equivalent of pandas for Julia
10. Distances – a very useful package for distance calculation, covering all major distance metrics
11. GLM – generalized linear model package for regression analysis
12. TSNE – a package implementing the t-SNE algorithm by the creators of the algorithm
13. ELM – the go-to package for Extreme Learning Machines, having a single hidden layer
14. MultivariateStats – a great place to obtain various useful statistics functions, including PCA
15. MLBase – an excellent resource for various support functions for ML applications
16. MXNet – the Julia API for the MXNet deep learning framework, a great option for deep-learning applications
17. TensorFlow – the Julia API for the TensorFlow deep learning framework
Other Data Analytics Software
Apart from the aforementioned data analytics software, there are others that have a more specialized or niche role in data science.
You do not need to know them in order to perform your projects, but you may hear about them and find companies that use them. These are mainly proprietary programs, so their user base is somewhat smaller than the communities of open-source software.
Although hardly any data scientists use MATLAB anymore, it is still the go-to framework for many researchers, and there is a lot of free code (.m files) available at the MathWorks site.
There are also a couple of open-source clones of MATLAB out there, for those unwilling to pay the expensive license fee. These are Octave and Scilab. Also, it is noteworthy that the SciPy package in Python borrows a lot of MATLAB’s functionality, so it is a bit like MATLAB for Python users.
Despite its issues, MATLAB is a great tool for data visualization, which is why it is sometimes preferred by researchers across different fields.
This lesser-known software is one of the best programs out there for modeling processes in such a way that it enables even outsiders to understand what’s happening. Analytica mimics the functionality of flowcharts, making them executable programs by adding code on the back-end.
Although it seems old-fashioned these days due to the more traditional look-and-feel that it exhibits, it is still a useful tool, particularly for modeling and analyzing high-level data analytics processes.
Analytica is very intuitive and easy to use, and its visualization tools make it shine. However, it fetches a high price and doesn’t collaborate with any other data analytics tools (apart from MS Excel), making it fairly unpopular among data scientists today.
Still, it is much better than many other alternatives that require you to take a whole course in order to be able to do anything meaningful with them. There is a free version of it, able to handle small datasets, should you want to give it a try.
Geared more towards scientific analytics work, Mathematica is another niche piece of software that lends itself to data analytics, perhaps more than other programs that specialize in the data science field.
Its intuitive language that resembles Python and its excellent documentation make it the go-to option for anyone in the research industry who wants to create a functional model of the phenomenon being studied.
It would be unfair to compare it with MATLAB, though, as some people do, since Mathematica can handle abstract (symbolic) math as well. Additionally, its plotting capabilities are so good that one might mistake it for a dedicated visualization tool.
What is more, all of its functionality is embedded in its kernel, meaning that you don't have to worry about finding, installing, and loading the right package for your task.
If it weren’t for its name, which implies a mathematics-oriented application, Mathematica would probably have gotten more traction in the data science community. Its high price doesn’t help its case either.
Nevertheless, if you find yourself in a company that has invested in a license for it, then it would be worth learning it in depth through its numerous educational materials that are made available on the Wolfram site, as well as on Amazon.
Also, its interface is so well-designed that you do not need any programming background to carry out meaningful tasks, though if you do know a programming language already, it would be much easier to learn.
One thing I hope became abundantly clear is that visualization plays a big role in the data science pipeline. What good are all the insights you create if they are not delivered properly, with the help of visuals that are both insightful and, possibly, even impressive?
This is particularly important if data science is new in the organization you work for and needs to assert itself there, or if you are new to the field.
Even if you cannot hide a lack of usefulness behind a pretty plot, you may need to make your graphics as glamorous as possible if you are to drive home the point you are trying to make in the final stage of the pipeline. Fortunately, there are several tools out there that specialize in just that. Here are some of the more widely known ones.
Modern Visualization for the Data Era
The Canadian company that came up with this visualization tool must have really known what they were doing, because their product is one of the best visualization tools out there.
Its key benefit is that it allows you to work your plots on different platforms, while also being easy to use, and the quality of these plots is quite good.
If you plan to share your graphics with other people who prefer to use a different programming language but who may want to add to or change them, this tool is the way to go.
Even the free edition is good enough to make a difference, although if you want to create something more professional, like a dashboard, you will need to pay a certain amount per month to take advantage of the full set of features this tool has.
Another tool worth knowing in this space is D3.js, a JavaScript visualization library. You can download it as a .zip file and make use of its scripts in the web page documents you create to showcase your plots.
Its key features are that it is very fast, supports a variety of datasets, and allows for animations as well as interactive plots. On its website (www.d3js.org), you can find tutorials as well as examples.
The same company that created Mathematica has also developed a search engine specializing in knowledge related to science and engineering: Wolfram|Alpha. One of the features of this online product is data visualization.
Although it is not ideal for this kind of application, it is good enough for some data exploration tasks, especially considering that it is free and that it is primarily designed for knowledge retrieval rather than anything data science related.
It will be interesting to see how it evolves, considering that Wolfram Inc. has been looking at expanding its domain to include data science applications. The Wolfram language lends itself to it, as it is now armed with a series of machine learning algorithms alongside the ones for interactive plots.
Tableau is used to create professional-looking plots for BI professionals and data scientists alike. Although there are also free versions of it available, if you want to do anything meaningful with it, you need to have one of the paid versions.
Whatever the case, it is popular, so you will likely encounter several companies that require it as a skill (even though there are several better and cheaper alternatives out there, including some open-source ones).
Data Governance Software
Data governance involves storing, managing, and processing data in a distributed environment, something essential when it comes to big data.
There are several software options for this, in which a lot of the low-level work has been abstracted away through user-friendly interfaces, making the whole process much easier.
The key data governance software out there includes Spark, Hadoop, and Storm. There are also other ones available that may be useful, even if they are not as popular.
There is little doubt nowadays that Spark is the best way to go when it comes to data governance, even if it is still a fairly new framework. Its increased popularity over the past years is remarkable, and its integration with Python and R is a big plus.
Still, if you want to do serious work with it, you may want to pick up Scala, since that’s the language it is most compatible with (Spark is also written in that language).
The key advantages of Spark are the following:
Generality – it can be used for different data science applications, not just data analytics tasks
Versatility – it can run in different environments, such as Hadoop clusters, Amazon's S3 infrastructure, and more
Speed – it is significantly faster than its alternatives
Ease of use – it employs a high-level programming paradigm and can be used from within your programming language by importing the corresponding package
Community – it has a vibrant community of users
Despite its simplicity, Spark covers various verticals of data governance, the most important of which are:
Data querying through a SQL-like language and a robust data frames system (Spark SQL)
Machine learning analytics (MLlib)
Graph analytics (GraphX)
Stream analysis (Spark Streaming)
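To give a concrete feel for Spark's high-level programming paradigm, here is a minimal sketch in plain Python that mimics the lazy chaining of transformations (map, filter) and actions (reduce) found in Spark's RDD API. The MiniRDD class is purely illustrative and not part of Spark; in actual PySpark, you would obtain a distributed dataset through SparkContext.parallelize() instead.

```python
from functools import reduce


class MiniRDD:
    """A toy, single-machine imitation of Spark's RDD chaining.

    Transformations (map, filter) are lazy; actions (collect, reduce)
    trigger the actual computation, just as in Spark.
    """

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending (lazy) transformations

    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):
        # Action: apply all pending transformations and return the results
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce(self, fn):
        # Action: fuse the collected results into a single value
        return reduce(fn, self.collect())


# Usage: square the numbers 1..5, keep the even squares, and sum them
rdd = MiniRDD(range(1, 6))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
print(result)  # 4 + 16 = 20
```

The real thing distributes the data across a cluster and evaluates the chain in parallel, but the declarative, chained style of the code stays essentially the same.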
Hadoop is the most established big data platform. Although it has lately been on the wane, many companies still use it, and there are plenty of data scientists who swear by it.
Geared toward the traditional cluster-based approach to big data governance, Hadoop appeals more to very large organizations with lots of computers linked to each other.
Its various components are low-level, yet they are manageable by several programming languages through APIs. The main aspects of the Hadoop ecosystem are:
MapReduce – the main algorithm for splitting a task into smaller sub-tasks, spreading them across the various nodes of the computer cluster, and then fusing the results
JobTracker – the coordinator program of the various parallel processes across the cluster
HDFS (Hadoop Distributed File System) – the file system for managing files stored in a cluster
HBase – Hadoop’s NoSQL database system
Hive – a query platform for Hadoop databases
Mahout – a machine learning platform
Pig – Hadoop’s scripting language
Sqoop – Hadoop’s data integration component
Oozie – a workflow and scheduling system for Hadoop jobs
ZooKeeper – Hadoop's coordination component
Ambari – a management and monitoring tool for the Hadoop ecosystem
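To illustrate the MapReduce algorithm listed above, here is a minimal single-machine sketch of its map, shuffle, and reduce stages in plain Python, using the classic word-count task. This is only a conceptual illustration; in a real Hadoop cluster, each stage runs distributed across many nodes, with the shuffle moving data between them.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fuse each key's values into a single result (here, a sum)."""
    return {word: sum(counts) for word, counts in groups.items()}


# Usage: count word occurrences across a few tiny "documents"
docs = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(word_counts["the"])  # 3
print(word_counts["fox"])  # 2
```

The split-process-fuse pattern shown here is exactly what makes MapReduce parallelizable: the map and reduce steps can each run independently on different nodes of the cluster.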
Storm is, like Spark, a project of the Apache Software Foundation, and it focuses on handling data that moves at high speeds (each node of a cluster running Storm can process over a million tuples per second).
Its simplicity, ease of use, and versatility in terms of programming languages make it an appealing option for many people. The fact that it is scalable, fault-tolerant, and quite user-friendly in terms of setting up and using makes it a great tool to be familiar with, even if you will not use it in every data science project you undertake.
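To convey the gist of Storm's streaming model, here is a hedged, minimal sketch in plain Python: a "spout" emits a stream of tuples, and "bolts" process them one at a time. The function names are purely illustrative; real Storm topologies are defined through Storm's own APIs and run distributed and fault-tolerant across a cluster.

```python
def sensor_spout(readings):
    """Spout: emits a stream of tuples, one at a time."""
    for reading in readings:
        yield ("sensor-1", reading)


def threshold_bolt(stream, limit):
    """Bolt: passes through only the tuples whose value exceeds a limit."""
    for sensor_id, value in stream:
        if value > limit:
            yield (sensor_id, value)


def count_bolt(stream):
    """Terminal bolt: counts the tuples that reach it."""
    return sum(1 for _ in stream)


# Wire the mini-topology together: spout -> filtering bolt -> counting bolt
readings = [3, 8, 12, 5, 20, 7]
alerts = count_bolt(threshold_bolt(sensor_spout(readings), limit=10))
print(alerts)  # 2 readings (12 and 20) exceed the limit
```

Because everything is expressed as tuples flowing through small processing units, the same topology can be scaled out simply by running more instances of each bolt on different nodes.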
Version Control Systems (VCS)
Version control is an important matter in all situations where more than one person works on the same code base for a project.
Although it may sometimes seem unnecessary when working on a project by yourself, it is often the case that you need to revert to previous versions of your programs, so a version control system can be useful even then.
A VCS can be either centralized (i.e., the whole repository resides in a central location) or decentralized (i.e., the data is spread out across various computers). Lately, the focus has been on the latter category, as it has several advantages over centralized VCSs.
Chances are that the organization you work for will have a VCS in place and will already be committed to it, as such frameworks are very popular among software developers.
Since you may often be working with them, you will be expected to know at least one of the most well-known such systems, namely Git, Github, and CVS. Let's take a closer look at each one.
Git is one of the most established VCSs out there. It was developed by Linus Torvalds, who is also the creator of the Linux kernel, which is used widely across all kinds of devices (including most smartphones) as the core component of their operating systems.
Although this may have contributed to Git’s popularity, the fact is that it is so easy to learn, has so much educational material out there (including lots of videos), and is so widespread in the industry that it is the go-to VCS for the majority of coding applications, including data science.
Also, although there are GUIs out there for it, most users prefer its command line interface, alongside its cloud counterpart, Github. Although these two are often used in conjunction, Git is compatible with other cloud storage platforms too, such as Dropbox.
Just like most VCSs, Git is open-source and cross-platform, and it has a respectably sized community around it.
If you are still not convinced about its usefulness, it is also being used by a variety of companies, like Microsoft, Facebook, Twitter, Android/Google, and Netflix. You can learn more about it at its official website: https://git-scm.com.
Github, a cloud-based hosting service for Git repositories, is probably the most commonly used VCS platform, both among developers and data scientists. Although it is not in any way superior to Git itself, it is still a good option, especially if you are not fond of the command line interface.
Github's key selling points are that it allows for easy code reviews, provides end-to-end encryption, and makes changes/branches in the code more manageable (note that it is not free if you want to take advantage of its most important features, such as private repositories).
This means that, through the comments you are encouraged to include with your code revisions and the difference-tracking system it provides, you can easily revert to previous versions of your programs, as well as examine how your current script differs from an earlier one.
Github also allows for easy sharing of your code, as most programmers have an account there (oftentimes even companies, such as Julia Computing, publish their material there).
Bottom line, even if you decide to use something else as a VCS, it is a good idea to become intimately acquainted with Github; chances are, you will visit it at one point.
CVS (Concurrent Versions System) is one of the most well-known centralized VCSs. Just like Git, it is free and fairly easy to use. Its key advantages over other (centralized) VCSs are the following:
It allows the running of scripts for logging purposes
It enables its users to have local versions of their files (vendor branches), something useful for large teams
Several people can work on the same file simultaneously
Entire collections of files can be manipulated with a single command, allowing for a modular approach to version control
It works on most Unix-like operating systems, as well as on different versions of Windows.