Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining. An important principle of data science is that data mining is a process with fairly well-understood stages.
Some involve the application of information technology, such as the automated discovery and evaluation of patterns from data, while others mostly require an analyst’s creativity, business knowledge, and common sense.
Understanding the whole process helps to structure data mining projects, so they are closer to systematic analyses than to heroic endeavors driven by chance and individual acumen.
Since the data mining process breaks up the overall task of finding patterns from data into a set of well-defined subtasks, it is also useful for structuring discussions about data science.
In this blog, we will use the process as an overarching framework for our discussion. This blog introduces the data mining process, but first, we provide additional context by discussing common types of data mining tasks.
This blog discusses a set of important business analytics subjects, focusing on data warehousing, basic statistics for business problems, and data science solutions.
From Business Problems to Data Mining Tasks
Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires, constraints, and even personalities. As with much engineering, though, there are sets of common tasks that underlie business problems.
In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem.
Some of these subtasks are unique to the particular business problem, but others are common data mining tasks. For example, our telecommunications churn problem is unique to MegaTelCo: there are specifics of the problem that are different from the churn problems of any other telecommunications firm.
However, a subtask that will likely be part of the solution to any churn problem is to estimate from historical data the probability of a customer terminating her contract shortly after it has expired.
Once the idiosyncratic MegaTelCo data have been assembled into a particular format, this probability estimation fits the mold of one very common data mining task. We know a lot about solving the common data mining tasks, both scientifically and practically.
In later blogs, we also will provide data science frameworks to help with the decomposition of business problems and with the re-composition of the solutions to the subtasks.
A critical skill in data science is the ability to decompose a data-analytics problem into pieces such that each piece matches a known task for which tools are available. Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel.
It also allows people to focus attention on more interesting parts of the process that require human involvement—parts that have not been automated, so human creativity and intelligence must come into play.
Despite a large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address. It is worth defining these tasks clearly.
The next several blogs will use the first two (classification and regression) to illustrate several fundamental concepts. In what follows, the term “an individual” will refer to an entity about which we have data, such as a customer or a consumer, or it could be an inanimate entity such as a business.
In many business analytics projects, we want to find “correlations” between a particular variable describing an individual and other variables.
For example, in historical data, we may know which customers left the company after their contracts expired. We may want to find out which other variables correlate with a customer leaving in the near future. Finding such correlations is the most basic example of classification and regression tasks.
1. Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually, the classes are mutually exclusive.
An example classification question would be: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond.
For a classification task, a data mining procedure produces a model that, given a new individual, determines which class that individual belongs to. A closely related task is scoring or class probability estimation.
A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class.
In our customer response scenario, a scoring model would be able to evaluate each individual customer and produce a score of how likely each is to respond to the offer. Classification and scoring are very closely related; as we shall see, a model that can do one can usually be modified to do the other.
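To make the link between scoring and classification concrete, here is a minimal sketch. The scores are made-up response probabilities (in practice they would come from a model), and the customer names and the 0.5 threshold are assumptions chosen only for illustration.

```python
# Hypothetical response-probability scores for four customers
# (illustrative values, not the output of any real model).
scores = {"alice": 0.82, "bob": 0.35, "carol": 0.51, "dave": 0.07}


def classify(score, threshold=0.5):
    """Turn a class-probability score into a class prediction."""
    return "will respond" if score >= threshold else "will not respond"


# Classification: one class label per individual.
predictions = {name: classify(s) for name, s in scores.items()}

# Scoring also supports ranking, e.g. to contact the likeliest responders first.
ranked = sorted(scores, key=scores.get, reverse=True)

print(predictions)
print(ranked)
```

Moving the threshold changes the class predictions without changing the scores, which is one reason a scoring model is often the more flexible of the two.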
2. Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. An example regression question would be: “How much will a given customer use the service?”
The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage. A regression procedure produces a model that, given an individual, estimates the value of the particular variable specific to that individual.
Regression is related to classification, but the two are different. Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen. The difference will become clearer as the blog progresses.
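As a rough illustration of value estimation, the sketch below fits a straight line by ordinary least squares, which is only one of many possible regression procedures. The predictor (number of lines on an account) and all data values are invented for the example.

```python
# Hypothetical historical data: number of lines on an account (x) and
# monthly usage in minutes (y). Values are invented for illustration.
lines_on_account = [1, 2, 3, 4]
monthly_minutes = [110, 190, 310, 390]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(lines_on_account, monthly_minutes)


def predict_usage(n_lines):
    """Estimate a numeric value (usage in minutes), not a class."""
    return intercept + slope * n_lines


print(predict_usage(5))
```

Note that the output is a quantity ("how much"), in contrast to the class label or class probability produced by a classification model.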
3. Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. For example, IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities.
They use similarity matching based on “firmographic” data describing characteristics of the companies. Similarity matching is the basis for one of the most popular methods for making product recommendations (finding people who are similar to you in terms of the products they have liked or have purchased).
Similarity measures underlie certain solutions to other data mining tasks, such as classification, regression, and clustering.
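A minimal sketch of similarity matching follows. It ranks prospects by Euclidean distance to a best customer, which is one common similarity measure among many. The company names and feature values are hypothetical, and the features are assumed to be already scaled to a comparable range.

```python
import math

# Hypothetical firmographic features, already scaled to the 0-1 range:
# (relative size, relative revenue, relative IT spend). Invented values.
companies = {
    "best_customer": (0.9, 0.8, 0.7),
    "prospect_a": (0.85, 0.75, 0.8),
    "prospect_b": (0.2, 0.3, 0.1),
    "prospect_c": (0.6, 0.9, 0.5),
}


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=2):
    """Names of the k companies closest to `target` (smaller distance = more similar)."""
    others = [name for name in candidates if name != target]
    others.sort(key=lambda name: euclidean(candidates[target], candidates[name]))
    return others[:k]


print(most_similar("best_customer", companies))
```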
4. Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. An example clustering question would be: “Do our customers form natural groups or segments?”
Clustering is useful in preliminary domain exploration to see which natural groups exist because these groups, in turn, may suggest other data mining tasks or approaches.
Clustering also is used as input to decision-making processes focusing on questions such as: What products should we offer or develop? How should our customer care teams (or sales teams) be structured?
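To show what "grouping by similarity without a target" can look like, here is a small sketch of k-means on a single attribute. K-means is just one clustering method, and the spending figures and starting centroids are assumptions made for the example.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on one numeric attribute; `centroids` holds the starting guesses."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer (invented values).
monthly_spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(monthly_spend, centroids=[10, 52])
print(centroids)
print(clusters)
```

Nothing in the procedure says what the groups mean or whether they are useful; that interpretation is left to the analyst.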
5. Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
An example co-occurrence question would be: What items are commonly purchased together? While clustering looks at the similarity between objects based on the objects’ attributes, co-occurrence grouping considers the similarity of objects based on their appearing together in transactions.
For example, analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect. Deciding how to act upon this discovery might require some creativity, but it could suggest a special promotion, product display, or combination offer.
Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis. Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people (“people who bought X also bought Y”).
The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
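The frequency and "surprise" statistics mentioned above can be sketched with two standard measures: support (how often items appear) and lift (how much more often a pair appears together than independence would predict). The baskets below are invented, and lift is only one of several possible measures of surprise.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (invented for illustration).
baskets = [
    {"ground meat", "hot sauce"},
    {"ground meat", "hot sauce", "buns"},
    {"milk", "bread"},
    {"milk", "buns"},
    {"ground meat", "bread"},
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(count):
    """Fraction of all baskets in which an item (or pair) appears."""
    return count / n


def lift(a, b):
    """Observed co-occurrence divided by what independence would predict."""
    pair = tuple(sorted((a, b)))
    return support(pair_counts[pair]) / (
        support(item_counts[a]) * support(item_counts[b])
    )


print(lift("ground meat", "hot sauce"))
```

A lift well above 1 suggests the items occur together more often than chance alone would explain; a lift near 1 suggests no notable association.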
6. Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population. An example profiling question would be: “What is the typical cell phone usage of this customer segment?”
Behavior may not have a simple description; profiling cell phone usage might require a complex description of the night and weekend airtime averages, international usage, roaming charges, text minutes, and so on. Behavior can be described generally over an entire population, or down to the level of small groups or even individuals.
Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems (such as someone breaking into your iTunes account).
For example, if we know what kind of purchases a person typically makes on a credit card, we can determine whether a new charge on the card fits that profile or not. We can use the degree of a mismatch as a suspicion score and issue an alarm if it is too high.
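One simple way to turn "degree of mismatch" into a suspicion score is to measure how many standard deviations a new charge lies from the cardholder's historical mean. This is only a sketch: the charge history and the threshold of 3 are assumptions, and a real profile would likely consider much more than the amount.

```python
import statistics

# Hypothetical history of one cardholder's charges, in dollars (invented values).
past_charges = [20, 25, 30, 25, 20, 30]


def suspicion_score(new_charge, history):
    """How many standard deviations the new charge lies from the historical mean."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    return abs(new_charge - mean) / spread


def is_suspicious(new_charge, history, threshold=3.0):
    """Raise an alarm when the mismatch with the profile is too large."""
    return suspicion_score(new_charge, history) > threshold


print(is_suspicious(27, past_charges))   # a charge close to the usual amounts
print(is_suspicious(500, past_charges))  # a charge far outside the profile
```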
7. Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist and possibly also estimating the strength of the link.
Link prediction is common in social networking systems: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?” Link prediction can also estimate the strength of a link.
For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
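The "you and Karen share friends" idea can be sketched by counting common neighbors in a friendship graph. Counting common neighbors is one of the simplest link-prediction heuristics; the tiny graph below is invented.

```python
# Hypothetical friendship graph (symmetric): person -> set of friends.
friends = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "cat", "dan"},
    "ann": {"you", "karen"},
    "bob": {"you", "karen"},
    "cat": {"you", "karen"},
    "dan": {"karen"},
}


def common_friends(a, b):
    """Number of friends that a and b share; used as the predicted link strength."""
    return len(friends[a] & friends[b])


def suggest_links(person):
    """People not yet linked to `person`, strongest predicted links first."""
    candidates = [p for p in friends if p != person and p not in friends[person]]
    scored = [(p, common_friends(person, p)) for p in candidates]
    scored = [(p, s) for p, s in scored if s > 0]
    return sorted(scored, key=lambda ps: ps[1], reverse=True)


print(suggest_links("you"))
```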
8. Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or to process. Moreover, the smaller dataset may better reveal the information.
For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences). Data reduction usually involves loss of information. What is important is the trade-off for improved insight.
9. Causal modeling attempts to help us understand what events or actions actually influence others. For example, consider that we use predictive modeling to target advertisements to consumers, and we observe that indeed the targeted consumers purchase at a higher rate subsequent to having been targeted.
Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?
Techniques for causal modeling include those involving substantial investment in data, such as randomized controlled experiments (e.g., so-called “A/B tests”), as well as sophisticated methods for drawing causal conclusions from observational data.
Both experimental and observational methods for causal modeling generally can be viewed as “counterfactual” analysis: they attempt to understand what would be the difference between the situations, which cannot both happen, where the “treatment” event (e.g., showing an advertisement to a particular individual) were to happen and were not to happen.
In all cases, a careful data scientist should always include with a causal conclusion the exact assumptions that must be made in order for the causal conclusion to hold (there always are such assumptions—always ask).
When undertaking causal modeling, a business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
Even in the most careful randomized, controlled experimentation, assumptions are made that could render the causal conclusions invalid. The discovery of the “placebo effect” in medicine illustrates a notorious situation where an assumption was overlooked in carefully designed randomized experimentation.
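For the advertising example, a randomized experiment lets us compare purchase rates between customers who were and were not shown the advertisement. The sketch below uses invented counts and assumes customers were assigned to the two groups at random; it does not assess statistical significance.

```python
# Hypothetical A/B test results: (number of purchases, number of customers).
# Customers are assumed to have been assigned to the two groups at random.
shown_ad = (60, 1000)      # "treatment" group
not_shown_ad = (45, 1000)  # control group


def purchase_rate(purchases, customers):
    """Fraction of customers in a group who purchased."""
    return purchases / customers


def estimated_effect(treated, control):
    """Difference in purchase rates; under random assignment, an estimate of the ad's effect."""
    return purchase_rate(*treated) - purchase_rate(*control)


effect = estimated_effect(shown_ad, not_shown_ad)
print(effect)
```

If the treated group had instead been chosen by a predictive model, the same subtraction would mix the ad's influence with the model's skill at finding likely buyers, which is exactly the ambiguity described above.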
Discussing all of these tasks in detail would fill multiple books. In this blog, we present a collection of the most fundamental data science principles—principles that together underlie all of these types of tasks.
We will illustrate the principles mainly using classification, regression, similarity matching, and clustering, and will discuss others when they provide important illustrations of the fundamental principles.
Consider which of these types of tasks might fit our churn-prediction problem. Often, practitioners formulate churn prediction as a problem of finding segments of customers who are more or less likely to leave.
This segmentation problem sounds like a classification problem, or possibly clustering, or even regression. To decide the best formulation, we first need to introduce some important distinctions.
Supervised Versus Unsupervised Methods
Consider two similar questions we might ask about a customer population. The first is: “Do our customers naturally fall into different groups?” Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised. Contrast this with a slightly different question:
“Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?” Here there is a specific target defined: will a customer leave when her contract expires?
In this case, segmentation is being done for a specific reason: to take action based on the likelihood of churn. This is called a supervised data mining problem.
A note on the terms: Supervised and unsupervised learning
The terms supervised and unsupervised were inherited from the field of machine learning. Metaphorically, a teacher “supervises” the learner by carefully providing target information along with a set of examples.
An unsupervised learning task might involve the same set of examples but would not include the target information. The learner would be given no information about the purpose of the learning but would be left to form its own conclusions about what the examples have in common.
The difference between these questions is subtle but important. If a specific target can be provided, the problem can be phrased as a supervised one. Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful.
A supervised technique is given a specific purpose for the grouping—predicting the target. Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities are meaningful or will be useful for any particular purpose.
Technically, another condition must be met for supervised data mining: there must be data on the target. It is not enough that the target information exists in principle; it must also exist in the data.
For example, it might be useful to know whether a given customer will stay for at least six months, but if in historical data this retention information is missing or incomplete (if, say, the data are only retained for two months) the target values cannot be provided.
Acquiring data on the target often is a key data science investment. The value for the target variable for an individual is often called the individual’s label, emphasizing that often (not always) one must incur the expense to actively label the data.
Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either.
Clustering, co-occurrence grouping, and profiling generally are unsupervised. The fundamental principles of data mining that we will present underlie all these types of technique.
Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric target while classification involves a categorical (often binary) target. Consider these similar questions we might address with supervised data mining:
“Will this customer purchase service S1 if given incentive I?”
This is a classification problem because it has a binary target (the customer either purchases or does not).
“Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?”
This is also a classification problem, with a three-valued target.
“How much will this customer use the service?”
This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.
There are subtleties among these questions that should be brought out. For business applications, we often want a numerical prediction over a categorical target.
In the churn example, a basic yes/no prediction of whether a customer is likely to continue to subscribe to the service may not be sufficient; we want to model the probability that the customer will continue.
This is still considered classification modeling rather than regression because the underlying target is categorical. Where necessary for clarity, this is called “class probability estimation.”
A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining (and for which we can obtain values for some example data).
Data Mining and Its Results
There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining.
Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.
In our churn example, consider the deployment scenario in which the results will be used. We want to use the model to predict which of our customers will leave. Specifically, assume that data mining has created a class probability estimation model M. Given each existing customer, described using a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of attrition. This is the use of the results of data mining. The data mining produces the model M from some other, often historical, data.
The upper half of the figure illustrates the mining of historical data to produce a model. Importantly, the historical data have the target (“class”) value specified. The bottom half shows the result of the data mining in use, where the model is applied to new data for which we do not know the class value. The model predicts both the class value and the probability that the class variable will take on that value.
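The two phases can be kept apart in code as well. In this minimal sketch, "mining" builds a very crude model M (the historical churn rate per plan type) and "use" applies M to customers whose outcome is unknown. The `plan` attribute, the records, and the modeling method are all assumptions made for illustration.

```python
from collections import defaultdict

# --- Mining: historical data WITH the target value (churned: True/False). ---
historical = [
    {"plan": "monthly", "churned": True},
    {"plan": "monthly", "churned": True},
    {"plan": "monthly", "churned": False},
    {"plan": "annual", "churned": False},
    {"plan": "annual", "churned": False},
    {"plan": "annual", "churned": True},
    {"plan": "annual", "churned": False},
]


def mine_model(records):
    """Build M: the estimated churn probability for each plan type."""
    counts = defaultdict(lambda: [0, 0])  # plan -> [churners, total]
    for r in records:
        counts[r["plan"]][0] += int(r["churned"])
        counts[r["plan"]][1] += 1
    return {plan: churners / total for plan, (churners, total) in counts.items()}


M = mine_model(historical)


# --- Use: new customers for whom the target value is NOT known. ---
def score(model, customer):
    """Apply the mined model to one customer's characteristics."""
    return model[customer["plan"]]


new_customers = [{"id": 101, "plan": "monthly"}, {"id": 102, "plan": "annual"}]
scores = {c["id"]: score(M, c) for c in new_customers}
print(scores)
```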
The Data Mining Process
Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well.
But as with many mature crafts, there is a well-understood process that places a structure on the problem, allowing reasonable consistency, repeatability, and objectiveness. A useful codification of the data mining process is given by the Cross-Industry Standard Process for Data Mining (CRISP-DM).
This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure.
Often the entire process is an exploration of the data, and after the first iteration, the data science team knows much more. The next iteration can be much better informed. Let’s now discuss the steps in detail.
Business Understanding
Initially, it is vital to understand the problem to be solved. This may seem obvious, but business projects seldom come pre-packaged as clear and unambiguous data mining problems.
Often recasting the problem and designing a solution is an iterative process of discovery. The initial formulation may not be complete or optimal so multiple iterations may be necessary for an acceptable solution formulation to appear.
The Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role. Data science has some things to say, as we will describe, but often the key to great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems. High-level knowledge of the fundamentals helps creative business analysts see novel formulations.
We have a set of powerful tools to solve particular data mining problems: the basic data mining tasks discussed in “From Business Problems to Data Mining Tasks”. Typically, the early stages of the endeavor involve designing a solution that takes advantage of these tools.
This can mean structuring (engineering) the problem such that one or more subproblems involve building models for classification, regression, probability estimation, and so on.
In this first stage, the design team should think carefully about the use scenario. This itself is one of the most important concepts of data science. What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models?
In discussing this in more detail, we will begin with a simplified view of the use scenario, but as we go forward we will loop back and realize that often the use scenario must be adjusted to better reflect the actual business need.
We will present conceptual tools to help our thinking here; for example, framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks.
Data Understanding
If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built. It is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem.
Historical data often are collected for purposes unrelated to the current business problem, or for no explicit purpose at all.
A customer database, a transaction database, and a marketing response database contain different information, may cover different intersecting populations, and may have varying degrees of reliability.
It is also common for the costs of data to vary. Some data will be available virtually for free while others will require effort to obtain.
Some data may be purchased. Still other data simply won’t exist and will require entire ancillary projects to arrange their collection. A critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
Even after all datasets are acquired, collating them may require additional effort. For example, customer records and product identifiers are notoriously variable and noisy. Cleaning and matching customer records to ensure only one record per customer is itself a complicated analytics problem.
As data understanding progresses, solution paths may change direction in response, and team efforts may even fork. Fraud detection provides an illustration of this. Data mining has been used extensively for fraud detection, and many fraud detection problems involve classic supervised data mining tasks.
Consider the task of catching credit card fraud. Charges show up on each customer’s account, so fraudulent charges are usually caught—if not initiated by the company, then later by the customer when account activity is reviewed.
We can assume that nearly all fraud is identified and reliably labeled since the legitimate customer and the person perpetrating the fraud are different people and have opposite goals. Thus credit card transactions have reliable labels (fraud and legitimate) that may serve as targets for a supervised technique.
Now consider the related problem of catching Medicare fraud. This is a huge problem in the United States costing billions of dollars annually. Though this may seem like a conventional fraud detection problem, as we consider the relationship of the business problem to the data, we realize that the problem is significantly different.
The perpetrators of fraud—medical providers who submit false claims, and sometimes their patients—are also legitimate service providers and users of the billing system. Those who commit fraud are a subset of the legitimate users; there is no separate disinterested party who will declare exactly what the “correct” charges should be.
Consequently, the Medicare billing data have no reliable target variable indicating fraud, and a supervised learning approach that could work for credit card fraud is not applicable. Such a problem usually requires unsupervised approaches such as profiling, clustering, anomaly detection, and co-occurrence grouping.
The fact that both of these are fraud detection problems is a superficial similarity that is actually misleading.
In data understanding, we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary.
Data Preparation
The analytic technologies that we can bring to bear are powerful but they impose certain requirements on the data they use. They often require data to be in a form different from how the data are provided naturally, and some conversion will be necessary.
Therefore a data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.
Typical examples of data preparation are converting data to tabular format, removing or inferring missing values, and converting data to different types. Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values.
In addition, numerical values must often be normalized or scaled so that they are comparable. Standard techniques and rules of thumb are available for doing such conversions.
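Two of the standard conversions alluded to here are min-max scaling and standardization (z-scores). The sketch below shows both on invented income and age values; which one is appropriate depends on the technique that will consume the data.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread: every value is identical
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Convert values to z-scores: mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical attributes on very different scales (invented values).
incomes = [20000, 50000, 80000]
ages = [20, 35, 50]

print(min_max_scale(incomes), min_max_scale(ages))
print(standardize(ages))
```

After scaling, income and age occupy the same range, so neither dominates a distance calculation simply because of its units.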
In general, though, this blog will not focus on data preparation techniques. We will define basic data formats and will only be concerned with data preparation details when they shed light on some fundamental principle of data science or are necessary to present a concrete example.
More generally, data scientists may spend considerable time early in the process defining the variables used later in the process. This is one of the main points at which human creativity, common sense, and business knowledge come into play.
Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it).
One very general and important concern during data preparation is to beware of “leaks”. A leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
As an example, when predicting whether at a particular point in time a website visitor would end her session or continue surfing to another page, the variable “total number of web pages visited in the session” is predictive.
However, the total number of web pages visited in the session would not be known until after the session was over—at which point one would know the value for the target variable!
As another illustrative example, consider predicting whether a customer will be a “big spender”: the categories of the items purchased (or worse, the amount of tax paid) are very predictive but are not known at decision-making time. Leakage must be considered carefully during data preparation because data preparation typically is performed after the fact, from historical data.
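One practical guard against leaks is to record, for every candidate variable, whether its value is actually known when the decision is made, and to keep only those that are. The catalog below is a hypothetical example for the website-session case; the variable names and flags are assumptions.

```python
# Hypothetical catalog: when does each variable's value become known?
# Names and flags are illustrative assumptions.
variables = {
    "pages_visited_so_far": "before_decision",
    "time_on_current_page": "before_decision",
    "total_pages_in_session": "after_outcome",  # only known once the session ends
    "referrer_site": "before_decision",
}


def usable_features(catalog):
    """Variables whose values are available at decision time."""
    return sorted(name for name, when in catalog.items() if when == "before_decision")


def leaky_features(catalog):
    """Variables that would leak information about the target."""
    return sorted(name for name, when in catalog.items() if when != "before_decision")


print(usable_features(variables))
print(leaky_features(variables))
```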
Modeling
Modeling is the subject of the next several blogs and we will not dwell on it here, except to say that the output of modeling is some sort of model or pattern capturing regularities in the data.
The modeling stage is the primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques and algorithms that exist because this is the part of the craft where the most science and technology can be brought to bear.
Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. If we look hard enough at any dataset we will find patterns, but they may not survive careful scrutiny.
We would like to have confidence that the models and patterns extracted from the data are true regularities and not just idiosyncrasies or sample anomalies.
It is possible to deploy results immediately after data mining but this is inadvisable; it is usually far easier, cheaper, quicker, and safer to test a model first in a controlled laboratory setting.
Equally important, the evaluation stage also serves to help ensure that the model satisfies the original business goals. Recall that the primary goal of data science for business is to support decision making and that we started the process by focusing on the business problem we would like to solve.
Usually, a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such. Further, even if a model passes strict evaluation tests “in the lab,” there may be external considerations that make it impractical.
For example, a common flaw with detection solutions (such as fraud detection, spam detection, and intrusion monitoring) is that they produce too many false alarms.
A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual business context may reveal that it still produces too many false alarms to be economically feasible. (How much would it cost to provide the staff to deal with all those false alarms? What would be the cost in customer dissatisfaction?)
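A little arithmetic shows how high accuracy and too many false alarms can coexist when the event being detected is rare. All figures below are invented: one million transactions, 0.1% of them fraudulent, and a detector that catches 99% of fraud while wrongly flagging 0.5% of legitimate transactions.

```python
def alarm_breakdown(n_cases, fraud_rate, true_positive_rate, false_positive_rate):
    """Count the alarms a detector would raise, split into true and false alarms."""
    frauds = round(n_cases * fraud_rate)
    legitimate = n_cases - frauds
    true_alarms = round(frauds * true_positive_rate)
    false_alarms = round(legitimate * false_positive_rate)
    correct = true_alarms + (legitimate - false_alarms)
    return {
        "accuracy": correct / n_cases,
        "true_alarms": true_alarms,
        "false_alarms": false_alarms,
        "share_of_alarms_that_are_false": false_alarms / (true_alarms + false_alarms),
    }


# Invented scenario; see the description above for the assumed rates.
result = alarm_breakdown(1_000_000, 0.001, 0.99, 0.005)
print(result)
```

Under these assumptions the detector is more than 99% accurate, yet more than 80% of the alarms it raises are false, and each of those would still need someone to review it.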
Evaluating the results of data mining includes both quantitative and qualitative assessments. Various stakeholders have interests in the business decision-making that will be accomplished or supported by the resultant models.
In many cases, these stakeholders need to “sign off” on the deployment of the models, and in order to do so need to be satisfied by the quality of the model’s decisions.
What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes.
To facilitate such qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists).
And if the model itself is not comprehensible (e.g., maybe the model is a very complex mathematical formula), how can the data scientists work to make the behavior of the model comprehensible?
Finally, a comprehensive evaluation framework is important because getting detailed information on the performance of a deployed model may be difficult or impossible. Often there is only limited access to the deployment environment so making a comprehensive evaluation “in production” is difficult.
Deployed systems typically contain many “moving parts,” and assessing the contribution of a single part is difficult. Firms with sophisticated data science teams wisely build testbed environments that mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.
Nonetheless, in some cases, we may want to extend evaluation into the deployment environment, for example by instrumenting a live system to be able to conduct randomized experiments.
In our churn example, if we have decided from laboratory tests that a data mined model will give us better churn reduction, we may want to move on to an “in vivo” evaluation, in which a live system randomly applies the model to some customers while keeping other customers as a control group.
Such experiments must be designed carefully, and the technical details are beyond the scope of this blog. The interested reader could start with the lessons-learned articles by Ron Kohavi and his coauthors.
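The mechanics of the random split alone can be sketched as follows. This is a minimal illustration, not an experimental design: the function names, the 50/50 split, and the fixed seed are all assumptions made for the example.

```python
import random


def assign_groups(customer_ids, treatment_fraction=0.5, seed=0):
    """Randomly split customers into a treatment group (the model is applied)
    and a control group (business as usual)."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    treatment, control = [], []
    for customer_id in customer_ids:
        if rng.random() < treatment_fraction:
            treatment.append(customer_id)
        else:
            control.append(customer_id)
    return treatment, control


def churn_rate(group, churned_ids):
    """Fraction of a group that later churned."""
    if not group:
        return 0.0
    return sum(1 for customer_id in group if customer_id in churned_ids) / len(group)


treatment, control = assign_groups(range(1000))
print(len(treatment), len(control))
```

Once outcomes are observed, comparing `churn_rate(treatment, churned_ids)` with `churn_rate(control, churned_ids)` gives a first estimate of the model's effect; whether the difference is meaningful is a question for hypothesis testing.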
We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making. For example, behavior can change—in some cases, like fraud or spam, in direct response to the deployment of models.
Additionally, the output of the model is critically dependent on the input data; input data can change in format and in substance, often without any alerting of the data science team. Raeder et al. present a detailed discussion of system design to help deal with these and other related evaluation-in-deployment issues.
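As a rough illustration of instrumenting the input side, the sketch below checks a batch of records for changes in format (missing fields, unexpected types) and in substance (a shifted average). It is not the design Raeder et al. describe; the field names, the baseline mean, and the 25% tolerance are hypothetical.

```python
def check_input_batch(records, expected_types, numeric_field, baseline_mean,
                      max_relative_shift=0.25):
    """Return a list of human-readable problems found in a batch of input records."""
    problems = []
    # Format checks: every expected field must be present with the expected type.
    for i, rec in enumerate(records):
        for field, field_type in expected_types.items():
            if field not in rec:
                problems.append(f"record {i}: missing field '{field}'")
            elif not isinstance(rec[field], field_type):
                found = type(rec[field]).__name__
                problems.append(f"record {i}: field '{field}' has type {found}")
    # Substance check: has the batch mean drifted far from the historical baseline?
    values = [rec[numeric_field] for rec in records
              if isinstance(rec.get(numeric_field), (int, float))]
    if values:
        batch_mean = sum(values) / len(values)
        if abs(batch_mean - baseline_mean) > max_relative_shift * abs(baseline_mean):
            problems.append(
                f"'{numeric_field}' mean shifted from {baseline_mean} to {batch_mean}"
            )
    return problems


EXPECTED = {"age": int, "monthly_minutes": float}  # hypothetical schema
good_batch = [{"age": 30, "monthly_minutes": 400.0},
              {"age": 50, "monthly_minutes": 420.0}]
bad_batch = [{"age": "30", "monthly_minutes": 400.0},  # age arrives as text
             {"monthly_minutes": 900.0}]               # age missing, usage jumps

print(check_input_batch(good_batch, EXPECTED, "monthly_minutes", 410.0))
print(check_input_batch(bad_batch, EXPECTED, "monthly_minutes", 410.0))
```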
In deployment, the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
The clearest cases of deployment involve implementing a predictive model in some information system or business process. In our churn example, a model for predicting the likelihood of churn could be integrated with the business process for churn management, for example, by sending special offers to customers who are predicted to be particularly at risk. A new fraud detection model may be built into a workforce management information system, to monitor accounts and create “cases” for fraud analysts to examine.
2. As an example of stakeholder sign-off before deployment, in one data mining project a model was created to diagnose problems in local phone networks, and to dispatch technicians to the likely site of the problem. Before deployment, a team of phone company stakeholders requested that the model be tweaked so that exceptions were made for hospitals.
Increasingly, the data mining techniques themselves are deployed. For example, for targeting online advertisements, systems are deployed that automatically build (and test) models in production when a new advertising campaign is presented.
Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for its data science team to manually curate each model individually.
In these cases, it may be best to deploy the data mining phase into production. In doing so, it is critical to instrument the process to alert the data science team of any seeming anomalies and to provide fail-safe operation.
Deployment can also be much less “technical.” In a celebrated case, data mining discovered a set of rules that could help to quickly diagnose and fix a common error in industrial printing.
The deployment succeeded simply by taping a sheet of paper containing the rules to the side of the printers. Deployment can also be much more subtle, such as a change to data acquisition procedures, or a change to strategy, marketing, or operations resulting from insight gained from mining the data.
Deploying a model into a production system typically requires that the model be re-coded for the production environment, usually for greater speed or compatibility with an existing system.
This may incur substantial expense and investment. In many cases, the data science team is responsible for producing a working prototype, along with its evaluation. These are passed to a development team.
Practically speaking, there are risks with “over the wall” transfers from data science to development. It may be helpful to remember the maxim: “Your model is not what the data scientists design, it’s what the engineers build.” From a management perspective, it is advisable to have members of the development team involved early on in the data science project.
They can begin as advisors, providing critical insight to the data science team. Increasingly in practice, these particular developers are “data science engineers,” software engineers who have particular expertise both in the production systems and in data science. These developers gradually assume more responsibility as the project matures.
At some point, the developers will take the lead and assume ownership of the product. Generally, the data scientists should still remain involved in the project into final deployment, as advisors or as developers depending on their skills.
Regardless of whether deployment is successful, the process often returns to the Business Understanding phase. The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution.
A second iteration can yield an improved solution. Just the experience of thinking about the business, the data, and the performance goals often leads to new ideas for improving business performance, and even new lines of business or new ventures.
Note that it is not necessary to fail in deployment to start the cycle again. The Evaluation stage may reveal that results are not good enough to deploy, and we need to adjust the problem definition or get different data.
This is represented by the “shortcut” link from Evaluation back to Business Understanding in the process diagram.
In practice, there should be shortcuts back from each stage to each prior one because the process always retains some exploratory aspects, and a project should be flexible enough to revisit prior steps based on discoveries made.
Implications for Managing the Data Science Team
It is tempting—but usually a mistake—to view the data mining process as a software development cycle. Indeed, data mining projects are often treated and managed as engineering projects, which is understandable when they are initiated by software departments, with data generated by a large software system and analytics results fed back into it.
Managers are usually familiar with software technologies and are comfortable managing software projects. Milestones can be agreed upon and success is usually unambiguous.
Software managers might look at the CRISP data mining cycle and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way.
This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based on exploration; it iterates on approaches and strategy rather than on software designs.
Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem. Engineering a data mining solution directly for deployment can be an expensive premature commitment.
Instead, analytics projects should prepare to invest in information to reduce uncertainty in various ways. Small investments can be made via pilot studies and throwaway prototypes. Data scientists should review the literature to see what else has been done and how it has worked.
On a larger scale, a team can invest substantially in building experimental testbeds to allow extensive agile experimentation. If you’re a software manager, this will look more like research and exploration than you’re used to, and maybe more than you’re comfortable with.
3. Software professionals may recognize the similarity to the philosophy of “Fail faster to succeed sooner”.
Software skills versus analytics skills
Although data mining involves software, it also requires skills that may not be common among programmers. In software engineering, the ability to write efficient, high-quality code from requirements may be paramount. Team members may be evaluated using software metrics such as the amount of code written or the number of bug tickets closed.
In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results.
In building a data science team, these qualities, rather than traditional software engineering expertise, are skills that should be sought.
Other Analytics Techniques and Technologies
Business analytics involves the application of various technologies to the analysis of data. Many of these go beyond this blog’s focus on data-analytic thinking and the principles of extracting useful patterns from data.
Nonetheless, it is important to be acquainted with these related techniques, to understand what their goals are, what role they play, and when it may be beneficial to consult experts in them.
To this end, we present six groups of related analytic techniques. Where appropriate we draw comparisons and contrasts with data mining.
The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data. It is important to keep in mind that it is rare for the discovery to be completely automated. The important factor is that data mining automates at least partially the search and discovery process, rather than providing technical support for manual search and discovery. An important skill for a business analyst is to be able to recognize what sort of analytic technique is appropriate for addressing a particular problem.
The term “statistics” has two different uses in business analytics. First, it is used as a catchall term for the computation of particular numeric values of interest from data (e.g., “We need to gather some statistics on our customers’ usage to determine what’s going wrong here.”) These values often include sums, averages, rates, and so on.
Let’s call these “summary statistics.” Often we want to dig deeper, and calculate summary statistics conditionally on one or more subsets of the population (e.g., “Does the churn rate differ between male and female customers?” and “What about high-income customers in the Northeast?”, where the Northeast denotes a region of the USA). Summary statistics are the basic building blocks of much data science theory and practice.
Summary statistics should be chosen with close attention to the business problem to be solved (one of the fundamental principles we will present later), and also with attention to the distribution of the data they are summarizing.
For example, the average (mean) income in the United States according to the 2004 Census Bureau Economic Survey was over $60,000. If we were to use that as a measure of the average income in order to make policy decisions, we would be misleading ourselves.
The distribution of incomes in the U.S. is highly skewed, with many people making relatively little and some people making fantastically much. In such cases, the arithmetic mean tells us relatively little about how much people are making. Instead, we should use a different measure of “average” income, such as the median.
The median income in the U.S. (the amount where half the population makes more and half makes less) in the 2004 Census study was only $44,389, considerably less than the mean.
This example may seem obvious because we are so accustomed to hearing about the “median income,” but the same reasoning applies to any computation of summary statistics: have you thought about the problem you would like to solve or the question you would like to answer?
Have you considered the distribution of the data, and whether the chosen statistic is appropriate?
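The effect of skew on the mean, and the idea of a summary statistic computed conditionally on a subpopulation, can both be sketched with Python's standard library. The incomes and customer records below are made up for illustration; they are not Census data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical incomes: nine modest earners and one very high earner.
incomes = [22_000, 28_000, 31_000, 35_000, 40_000,
           44_000, 52_000, 60_000, 75_000, 1_200_000]

print("mean:  ", mean(incomes))    # pulled far upward by the single outlier
print("median:", median(incomes))  # half earn more, half earn less


def rate_by_group(records, group_key, flag_key):
    """Summary statistic computed conditionally: share of flagged records per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[group_key]] += 1
        if rec[flag_key]:
            hits[rec[group_key]] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical customer records.
customers = [
    {"region": "NE", "churned": True},
    {"region": "NE", "churned": False},
    {"region": "SW", "churned": False},
    {"region": "SW", "churned": False},
]
print(rate_by_group(customers, "region", "churned"))
```

With these made-up values the mean is 158,700 while the median is 42,000: nine of the ten people earn less than half of the “average.”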
The other use of the term “statistics” is to denote the field of study that goes by that name, for which we might differentiate by using the proper name, Statistics. The field of Statistics provides us with a huge amount of knowledge that underlies analytics and can be thought of as a component of the larger field of Data Science.
For example, Statistics helps us to understand different data distributions and what statistics are appropriate to summarize each. Statistics helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions.
In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset. Most relevant to this blog, many of the techniques for extracting models or patterns from data have their roots in Statistics.
For example, a preliminary study may suggest that customers in the Northeast have a churn rate of 22.5%, whereas the nationwide average churn rate is only 15%. This may be just a chance fluctuation since the churn rate is not constant; it varies over regions and over time, so differences are to be expected.
But the Northeast rate is one and a half times the U.S. average, which seems unusually high. What is the chance that this is due to random variation? Statistical hypothesis testing is used to answer such questions.
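One simple way to pose the question is a one-sample z-test for a proportion, sketched below. The text gives only the two rates, so the sample sizes are assumptions; the sketch also treats the 15% nationwide rate as a known constant and relies on the normal approximation, which is reasonable only for fairly large samples.

```python
import math


def proportion_z_test(successes, n, p0):
    """Two-sided z-test of an observed proportion against a reference rate p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)  # spread expected under p0
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided normal tail area
    return z, p_value


# Assumed sample: 400 Northeast customers, 90 of whom churned (22.5%).
z_large, p_large = proportion_z_test(successes=90, n=400, p0=0.15)

# Same 22.5% rate observed in a much smaller assumed sample of 40 customers.
z_small, p_small = proportion_z_test(successes=9, n=40, p0=0.15)

print(f"n=400: z={z_large:.2f}, p={p_large:.5f}")
print(f"n=40:  z={z_small:.2f}, p={p_small:.3f}")
```

The same observed rate can be strong evidence in a large sample and entirely consistent with chance in a small one, which is why the raw comparison of 22.5% with 15% cannot settle the question by itself.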
Closely related is the quantification of uncertainty into confidence intervals. The overall churn rate is 15%, but there is some variation; traditional statistical analysis may reveal that 95% of the time the churn rate is expected to fall between 13% and 17%.
This contrasts with the (complementary) process of data mining, which may be seen as hypothesis generation. Can we find patterns in data in the first place? Hypothesis generation should then be followed by careful hypothesis testing.
In addition, data mining procedures may produce numerical estimates, and we often also want to provide confidence intervals on these estimates. We will return to this when we discuss the evaluation of the results of data mining.
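A confidence interval like the 13% to 17% range mentioned above can be approximated with the standard normal-approximation formula for a proportion. The sample size below (1,200 customers, 180 churners) is an assumption chosen so that the result lands near that range; it is not taken from any study.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 gives about 95%)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Assumed sample: 180 churners among 1,200 customers (15%).
low, high = proportion_confidence_interval(successes=180, n=1200)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval narrows as the sample grows, so the same 15% estimate carries much more uncertainty when it comes from a few dozen customers.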
In this blog, we are not going to spend more time discussing these basic statistical concepts. There are plenty of introductory books on statistics and statistics for business, and any treatment we would try to squeeze in would be either very narrow or superficial.
That said, one statistical term that is often heard in the context of business analytics is “correlation.” For example, “Are there any indicators that correlate with a customer’s later defection?”
As with the term statistics, “correlation” has both a general-purpose meaning (variations in one quantity tell us something about variations in the other), and a specific technical meaning (e.g., linear correlation based on a particular mathematical formula).
The notion of correlation will be the jumping off point for the rest of our discussion of data science for business.
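The distinction between the two meanings can be made concrete with the Pearson linear correlation coefficient, computed here from its textbook formula. The small data sets are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / math.sqrt(variation_x * variation_y)


print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfectly linear, decreasing

# y is fully determined by x (y = x squared), yet the linear correlation is zero.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(pearson_r(xs, ys))
```

The last case shows why the two meanings should not be confused: one quantity can tell us everything about another while the specific linear measure reports no relationship at all.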
A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system. Many tools are available to answer one-off or repeating queries about data posed by an analyst.
These tools are usually frontends to database systems, based on Structured Query Language (SQL) or a tool with a graphical user interface (GUI) to help formulate queries (e.g., query-by-example, or QBE).
For example, if the analyst can define “profitable” in operational terms computable from items in the database, then a query tool could answer: “Who are the most profitable customers in the Northeast?”
The analyst may then run the query to retrieve a list of the most profitable customers, possibly ranked by profitability. This activity differs fundamentally from data mining in that there is no discovery of patterns or models.
Database queries are appropriate when an analyst already has an idea of what might be an interesting subpopulation of the data and wants to investigate this population or confirm a hypothesis about it.
For example, if an analyst suspects that middle-aged men living in the Northeast have some particularly interesting churning behavior, she could compose a SQL query:
SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE';
If those are the people to be targeted with an offer, a query tool can be used to retrieve all of the information about them (“*”) from the CUSTOMERS table in the database.
In contrast, data mining could be used to come up with this query in the first place— as a pattern or regularity in the data.
A data mining procedure might examine prior customers who did and did not defect, and determine that this segment (characterized as “AGE is greater than 45 and SEX is male and DOMICILE is Northeast-USA”) is predictive with respect to churn rate. After translating this into a SQL query, a query tool could then be used to find the matching records in the database.
Query tools generally have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, and more. Data scientists often become quite adept at writing queries to extract the data they need.
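A runnable version of such queries can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table layout, the `CHURNED` column, and the six rows are assumptions made for the example; only the `CUSTOMERS` table name and the `AGE`, `SEX`, and `DOMICILE` columns come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE CUSTOMERS ("
    "ID INTEGER PRIMARY KEY, AGE INTEGER, SEX TEXT, DOMICILE TEXT, CHURNED INTEGER)"
)
# Hypothetical customers; CHURNED is 1 if the customer left, else 0.
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [
        (1, 52, "M", "NE", 1),
        (2, 47, "M", "NE", 0),
        (3, 33, "M", "NE", 0),
        (4, 61, "F", "NE", 0),
        (5, 58, "M", "SW", 1),
        (6, 29, "F", "SW", 0),
    ],
)

# The segment query from the text: men over 45 living in the Northeast.
segment = conn.execute(
    "SELECT ID FROM CUSTOMERS "
    "WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE' ORDER BY ID"
).fetchall()

# A summary statistic over subpopulations: churn rate per region.
churn_by_region = conn.execute(
    "SELECT DOMICILE, AVG(CHURNED) FROM CUSTOMERS "
    "GROUP BY DOMICILE ORDER BY DOMICILE"
).fetchall()

print(segment)
print(churn_by_region)
```

Both queries retrieve or summarize records that match criteria the analyst supplied; nothing here discovers which criteria are worth asking about.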
On-line Analytical Processing (OLAP) provides an easy-to-use GUI to query large data collections, for the purpose of facilitating data exploration. The idea of “on-line” processing is that it is done in real time, so analysts and decision makers can find answers to their queries quickly and efficiently.
Unlike the “ad hoc” querying enabled by tools like SQL, for OLAP the dimensions of analysis must be pre-programmed into the OLAP system. If we’ve foreseen that we would want to explore sales volume by region and time, we could have these three dimensions programmed into the system, and drill down into populations, often simply by clicking and dragging and manipulating dynamic charts.
OLAP systems are designed to facilitate manual or visual exploration of the data by analysts. OLAP performs no modeling or automatic pattern finding.
As an additional contrast, unlike with OLAP, data mining tools generally can incorporate new dimensions of analysis easily as part of the exploration. OLAP tools can be a useful complement to data mining tools for discovery from business data.
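The idea of pre-programmed dimensions can be illustrated with a toy roll-up function. This is only a conceptual sketch, not how OLAP products are implemented; the dimension names and sales figures are hypothetical.

```python
from collections import defaultdict

# Dimensions fixed in advance, as an OLAP system would require.
DIMENSIONS = ("region", "quarter")


def roll_up(facts, by):
    """Total sales for each combination of the requested dimensions."""
    unknown = [dim for dim in by if dim not in DIMENSIONS]
    if unknown:
        # A dimension that was not foreseen cannot be explored.
        raise ValueError(f"dimensions not configured: {unknown}")
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[dim] for dim in by)
        totals[key] += fact["sales"]
    return dict(totals)


facts = [
    {"region": "NE", "quarter": "Q1", "sales": 100.0},
    {"region": "NE", "quarter": "Q2", "sales": 200.0},
    {"region": "SW", "quarter": "Q1", "sales": 50.0},
]

print(roll_up(facts, ("region",)))            # coarse view
print(roll_up(facts, ("region", "quarter")))  # drill down by time
```

The analyst chooses which view to look at; the system only aggregates, which is the sense in which OLAP performs no modeling or automatic pattern finding.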
Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database. Analytical systems can access data warehouses. Data warehousing may be seen as a facilitating technology of data mining.
It is not always necessary, as most data mining does not access a data warehouse, but firms that decide to invest in data warehouses often can apply data mining more broadly and more deeply in the organization.
For example, if a data warehouse integrates records from sales and billing as well as from human resources, it can be used to find characteristic patterns of effective salespeople.
Some of the same methods we discuss in this blog are at the core of a different set of analytic methods, which often are collected under the rubric regression analysis, and are widely applied in the field of statistics and also in other fields founded on econometric analysis. This blog will focus on different issues than usually encountered in a regression analysis blog or class.
Here we are less interested in explaining a particular dataset than we are in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically, this will involve estimating or predicting values for cases that are not in the analyzed data set.
So, as an example, in this blog, we are less interested in digging into the reasons for churn (important as they may be) in a particular historical set of data and more interested in predicting which customers who have not yet left would be the best to target to reduce future churn.
Therefore, we will spend some time talking about testing patterns on new data to evaluate their generality, and about techniques for reducing the tendency to find patterns specific to a particular set of data, but that do not generalize to the population from which the data come.
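A minimal sketch of testing a pattern on new data: a one-variable rule is fitted to one set of customers and then checked on customers it has not seen. The rule form (“predict churn when monthly minutes are at or below a threshold”) and all data values are hypothetical.

```python
def accuracy(threshold, data):
    """Share of customers for whom 'minutes <= threshold' matches the churn label."""
    correct = sum(
        1 for minutes, churned in data if (minutes <= threshold) == bool(churned)
    )
    return correct / len(data)


def fit_threshold(train):
    """Pick the threshold (among observed values) with the best training accuracy."""
    candidates = sorted({minutes for minutes, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))


# Hypothetical (monthly_minutes, churned) pairs.
train = [(100, 1), (150, 1), (200, 1), (250, 1),
         (300, 0), (320, 0), (350, 0), (400, 0)]
holdout = [(120, 1), (220, 0), (260, 1), (280, 0), (390, 0)]

threshold = fit_threshold(train)
print("threshold:", threshold)
print("training accuracy:", accuracy(threshold, train))
print("holdout accuracy: ", accuracy(threshold, holdout))
```

With these invented values the rule is perfect on the data it was fitted to but noticeably worse on the held-out customers, which is the gap that evaluation on new data is meant to expose.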
The topic of explanatory modeling versus predictive modeling can elicit deep-felt debate, which goes well beyond our focus. What is important is to realize that there is considerable overlap in the techniques used, but that the lessons learned from explanatory modeling do not all apply to predictive modeling.
So a reader with some background in regression analysis may encounter new and even seemingly contradictory lessons.
Machine Learning and Data Mining
The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition.
Machine Learning as a field of study arose as a subfield of Artificial Intelligence, which was concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent’s experience in the world.
Such improvement often involves analyzing data from the environment and making predictions about unknown quantities, and over the years this data analysis aspect of machine learning has come to play a very large role in the field.
As machine learning methods were deployed broadly, the scientific disciplines of Machine Learning, Applied Statistics, and Pattern Recognition developed close ties, and the separation between the fields has blurred.
5. On explanatory versus predictive modeling, the interested reader is urged to read the discussion by Shmueli (2010).
6. Those who pursue the study of regression analysis and predictive modeling in depth will have the seeming contradictions worked out. Such deep study is not necessary to understand the fundamental principles.
The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns.
Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.
Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition (how will an intelligent agent use learned knowledge to reason and act in its environment?), which are not concerns of Data Mining.
Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is.
As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.
Answering Business Questions with These Techniques
To illustrate how these techniques apply to business analytics, consider a set of questions that may arise and the technologies that would be appropriate for answering them.
These questions are all related but each is subtly different. It is important to understand these differences in order to understand what technologies one needs to employ and what people may be necessary to consult.
1. Who are the most profitable customers?
If “profitable” can be defined clearly based on existing data, this is a straightforward database query. A standard query tool could be used to retrieve a set of customer records from a database. The results could be sorted by cumulative transaction amount or some other operational indicator of profitability.
2. Is there really a difference between the profitable customers and the average customer?
This is a question about a conjecture or hypothesis (in this case, “There is a difference in value to the company between the profitable customers and the average customer”), and statistical hypothesis testing would be used to confirm or disconfirm it.
Statistical analysis could also derive a probability or confidence bound that the difference was real. Typically, the result would be like: “The value of these profitable customers is significantly different from that of the average customer, with probability < 5% that this is due to random chance.”
3. But who really are these customers? Can I characterize them?
We often would like to do more than just list out the profitable customers. We would like to describe the common characteristics of profitable customers. The characteristics of individual customers can be extracted from a database using techniques such as database querying, which also can be used to generate summary statistics.
A deeper analysis should involve determining what characteristics differentiate profitable customers from unprofitable ones. This is the realm of data science, using data mining techniques for automated pattern finding—which we discuss in depth in the subsequent blogs.
4. Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
These questions could be addressed by data mining techniques that examine historical customer records and produce predictive models of profitability. Such techniques would generate models from historical data that could then be applied to new customers to generate predictions. Again, this is the subject of the following blogs.
Note that these last two questions are subtly different data mining questions. The first, a classification question, may be phrased as a prediction of whether a given new customer will be profitable (yes/no or the probability thereof). The second, a regression question, may be phrased as a prediction of the value (numerical) that the customer will bring to the company. More on that as we proceed.
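The difference between the two questions shows up in what a model returns. In the sketch below, k-nearest neighbors is used purely as a stand-in technique (the text does not prescribe one), and the customer features, labels, and revenue figures are hypothetical. The features are left unscaled for brevity, which a real application would need to address.

```python
def nearest(history, features, k):
    """The k historical customers closest to the new one (squared Euclidean distance)."""
    def distance(record):
        return sum((a - b) ** 2 for a, b in zip(record["features"], features))
    return sorted(history, key=distance)[:k]


def predict_profitable(history, features, k=3):
    """Classification: estimated probability that the new customer is profitable."""
    neighbors = nearest(history, features, k)
    return sum(1 for n in neighbors if n["profitable"]) / len(neighbors)


def predict_revenue(history, features, k=3):
    """Value estimation: expected revenue, averaged over similar past customers."""
    neighbors = nearest(history, features, k)
    return sum(n["revenue"] for n in neighbors) / len(neighbors)


# Hypothetical history; features are (monthly_visits, average_basket).
history = [
    {"features": (2, 20), "profitable": False, "revenue": 100.0},
    {"features": (3, 25), "profitable": False, "revenue": 150.0},
    {"features": (4, 30), "profitable": False, "revenue": 200.0},
    {"features": (10, 60), "profitable": True, "revenue": 900.0},
    {"features": (11, 55), "profitable": True, "revenue": 800.0},
    {"features": (12, 70), "profitable": True, "revenue": 1100.0},
]

new_customer = (11, 62)
print(predict_profitable(history, new_customer))  # a probability between 0 and 1
print(predict_revenue(history, new_customer))     # a numeric revenue estimate
```

The same historical records feed both functions; what changes is the target being predicted, a category in the first case and a quantity in the second.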
Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects.
We will refer back to the data mining process repeatedly throughout the blog, showing how each fundamental concept fits in. In turn, understanding the fundamentals of data science substantially improves the chances of success as an enterprise invokes the data mining process.
The various fields of study related to data science have developed a set of canonical task types, such as classification, regression, and clustering. Each task type serves a different purpose and has an associated set of solution techniques.
A data scientist typically attacks a new project by decomposing it such that one or more of these canonical tasks is revealed, choosing a solution technique for each, then composing the solutions. Doing this expertly may take considerable experience and skill.
A successful data mining project involves an intelligent compromise between what the data can do (i.e., what they can predict, and how well) and the project goals. For this reason, it is important to keep in mind how data mining results will be used and use this to inform the data mining process itself.
Data mining differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying (which have their own blogs and classes).
Though the boundaries between data mining and related techniques are not always sharp, it is important to know about other techniques’ capabilities and strengths to know when they should be used.
To a business manager, the data mining process is useful as a framework for analyzing a data mining project or proposal. The process provides a systematic organization, including a set of questions that can be asked about a project or a proposed project to help understand whether the project is well conceived or is fundamentally flawed.
We will return to this after we have discussed in detail some more of the fundamental principles themselves—to which we turn now.
Data Science and Business Strategy
Thinking Data-Analytically, Redux
Managers have to understand the fundamental principles well enough to envision and/or appreciate data science opportunities, to supply the appropriate resources to the data science teams, and to be willing to invest in data and experimentation.
Furthermore, unless the firm has on its management team a seasoned, practical data scientist, often the management must steer the data science team carefully to make sure that the team stays on track toward an eventually useful business solution.
This is very difficult if the managers don’t really understand the principles. Managers need to be able to ask probing questions of a data scientist, who often can get lost in technical details. We need to accept that each of us has strengths and weaknesses, and as data science projects span so much of a business, a diverse team is essential.
Just as we can’t expect a manager necessarily to have deep expertise in data science, we can’t expect a data scientist necessarily to have deep expertise in business solutions.
However, an effective data science team involves collaboration between the two, and each needs to have some understanding of the fundamentals of the other’s area of responsibility.
Just as it would be a Sisyphean task to manage a data science team where the team had no understanding of the fundamental concepts of business, it likewise is extremely frustrating at best, and often a tremendous waste, for data scientists to struggle under management that does not understand basic principles of data science.
For example, it is not uncommon for data scientists to struggle under management that (sometimes vaguely) sees the potential benefit of predictive modeling, but does not have enough appreciation for the process to invest in proper training data or in proper evaluation procedures.
Such a company may “succeed” in engineering a model that is predictive enough to produce a viable product or service but will be at a severe disadvantage to a competitor who invests in doing the data science well.
A solid grounding in the fundamentals of data science has much more far-reaching strategic implications. We know of no systematic scientific study, but broad experience has shown that as executives, managers, and investors increase their exposure to data science projects, they see more and more opportunities in turn.
We see extreme cases in companies like Google and Amazon (there is a vast amount of data science underlying web search, as well as Amazon’s product recommendations and other offerings). Both of these companies eventually built subsequent products offering “big data” and data-science related services to other firms.
Many, possibly most, data-science oriented start-ups use Amazon’s cloud storage and processing services for some tasks. Google’s “Prediction API” is increasing in sophistication and utility (we don’t know how broadly used it is).
Those are extreme cases, but the basic pattern is seen in almost every data-rich firm. Once the data science capability has been developed for one application, other applications throughout the business become obvious. Louis Pasteur famously wrote, “Fortune favors the prepared mind.”
Modern thinking on creativity focuses on the juxtaposition of a new way of thinking with a mind “saturated” with a particular problem. Working through case studies (either in theory or in practice) of data science applications helps prime the mind to see opportunities and connections to new problems that could benefit from data science.
For example, in the late 1980s and early 1990s, one of the largest phone companies had applied predictive modeling—using the techniques we’ve described in this blog—to the problem of reducing the cost of repairing problems in the telephone network and to the design of speech recognition systems.
With the increased understanding of the use of data science for helping to solve business problems, the firm subsequently applied similar ideas to decisions about how to allocate a massive capital investment to best improve its network, and how to reduce fraud in its burgeoning wireless business.
The progression continued. Data science projects for reducing fraud discovered that incorporating features based on social-network connections (via who-calls-whom data) into fraud prediction models improved the ability to discover fraud substantially.
In the early 2000s, telecommunications firms produced the first solutions using such social connections to improve marketing—and improve marketing it did, showing huge performance lifts over traditional targeted marketing based on socio-demographic, geographic, and prior purchase data.
Next, in telecommunications, such social features were added to models for churn prediction, with equally beneficial results. The ideas diffused to the online advertising industry, and there was a subsequent flurry of development of online advertising based on the incorporation of data on online social connections (at Facebook and at other firms in the online advertising ecosystem).
This progression was driven both by experienced data scientists moving among business problems and by data-science-savvy managers and entrepreneurs, who saw new opportunities for data science advances in the academic and business literature.
Achieving Competitive Advantage with Data Science
Increasingly, firms are considering whether and how they can obtain competitive advantage from their data and/or from their data science capability. This is important strategic thinking that should not be superficial, so let’s spend some time digging into it.
Data and data science capability are (complementary) strategic assets. Under what conditions can a firm achieve competitive advantage from such an asset? First of all, the asset has to be valuable to the firm. This seems obvious, but note that the value of an asset to a firm depends on the other strategic decisions that the firm has made.
Outside of the context of data science, in the personal computer industry in the 1990s, Dell famously gained a substantial early competitive advantage over industry leader Compaq by using web-based systems to allow customers to configure computers to their personal needs and liking. Compaq could not get the same value from web-based systems. One main reason was that Dell and Compaq had implemented different strategies:
Dell already was a direct-to-customer computer retailer, selling via catalogs; web-based systems held tremendous value given this strategy. Compaq sold computers mainly via retail outlets; web-based systems were not nearly as valuable given this alternative strategy.
When Compaq tried to replicate Dell’s web-based strategy, it faced a severe backlash from its retailers. The upshot is that the value of the new asset (web-based systems) was dependent on each company’s other strategic decisions.
The lesson is that we need to think carefully in the business understanding phase as to how data and data science can provide value in the context of our business strategy, and also whether it would do the same in the context of our competitors’ strategies. This can identify both possible opportunities and possible threats.
A direct data science analogy of the Dell-Compaq example is Amazon versus Borders. Even very early, Amazon’s data on customers’ book purchases allowed personalized recommendations to be delivered to customers while they were shopping online.
Even if Borders were able to exploit its data on who bought what books, its brick-and-mortar retail strategy did not allow the same seamless delivery of data science-based recommendations.
So, a prerequisite for competitive advantage is that the asset is valuable in the context of our strategy. We’ve already begun to talk about the second set of criteria: in order to gain competitive advantage, competitors either must not possess the asset or must not be able to obtain the same value from it.
We should think both about the data asset(s) and the data science capability. Do we have a unique data asset? If not, do we have an asset the utilization of which is better aligned with our strategy than with the strategy of our competitors? Or are we better able to take advantage of the data asset due to our better data science capability?
The flip side of asking about achieving a competitive advantage with data and data science is asking whether we are at a competitive disadvantage. It may be that the answers to the previous questions are affirmative for our competitors and not for us.
In what follows we will assume that we are looking to achieve a competitive advantage, but the arguments apply symmetrically if we are trying to achieve parity with a data-savvy competitor.
Sustaining Competitive Advantage with Data Science
The next question is: even if we can achieve competitive advantage, can we sustain it? If our competitors can easily duplicate our assets and capabilities, our advantage may be short-lived.
This is an especially critical question if our competitors have greater resources than we do: by adopting our strategy, they may surpass us.
One strategy for competing based on data science is to plan to always keep one step ahead of the competition: always be investing in new data assets, and always be developing new techniques and capabilities. Such a strategy can provide for an exciting and possibly fast-growing business, but generally few companies are able to execute it.
For example, you must have confidence that you have one of the best data science teams, since the effectiveness of data scientists has a huge variance, with the best being much more talented than the average. If you have a great team, you may be willing to bet that you can keep ahead of the competition. We will discuss data science teams more below.
The alternative to always keeping one step ahead of the competition is to achieve sustainable competitive advantage due to a competitor’s inability to replicate, or their elevated expense of replicating, the data asset or the data science capability. There are several avenues to such sustainability.
Formidable Historical Advantage
Historical circumstances may have placed our firm in an advantageous position, and it may be too costly for competitors to reach the same position. Amazon again provides an outstanding example. In the “Dotcom Boom” of the 1990s, Amazon was able to sell books below cost, and investors continued to reward the company.
This allowed Amazon to amass tremendous data assets (such as massive data on online consumers’ buying preferences and online product reviews), which then allowed them to create valuable data-based products (such as recommendations and product ratings).
These historical circumstances are gone: it is unlikely today that investors would provide the same level of support to a competitor that was trying to replicate Amazon’s data asset by selling books below cost for years on end (not to mention that Amazon has moved far beyond books).
This example also illustrates that the data products themselves can increase the cost to competitors of replicating the data asset. Consumers value the data-driven recommendations and product reviews/ratings that Amazon provides.
This creates switching costs: competitors would have to provide extra value to Amazon’s customers to entice them to shop elsewhere—either with lower prices or with some other valuable product or service that Amazon does not provide.
Thus, when the data acquisition is tied directly to the value provided by the data, the resulting virtuous cycle creates a catch-22 for competitors: competitors need customers in order to acquire the necessary data, but they need the data in order to provide equivalent service to attract the customers.
Entrepreneurs and investors might turn this strategic consideration around: what historical circumstances now exist that may not continue indefinitely, and which may allow me to gain access to or to build a data asset more cheaply than will be possible in the future? Or which will allow me to build a data science team that would be more costly (or impossible) to build in the future?
Unique Intellectual Property
Our firm may have unique intellectual property. Data science intellectual property can include novel techniques for mining the data or for using the results. These might be patented, or they might just be trade secrets.
In the former case, a competitor either will be unable to (legally) duplicate the solution or will have an increased expense of doing so, either by licensing our technology or by developing new technology to avoid infringing on the patent.
In the case of a trade secret, it may be that the competitor simply does not know how we have implemented our solution. With data science solutions, the actual mechanism is often hidden, with only the result being visible.
Unique Intangible Collateral Assets
Our competitors may not be able to figure out how to put our solution into practice. With successful data science solutions, the actual source of good performance (for example with effective predictive modeling) may be unclear. The effectiveness of a predictive modeling solution may depend critically on the problem engineering, the attributes created, the combining of different models, and so on.
It often is not clear to a competitor how performance is achieved in practice. Even if our algorithms are published in detail, many implementation details may be critical to getting a solution that works in the lab to work in production.
Furthermore, success may be based on intangible assets such as a company culture that is particularly suitable to the deployment of data science solutions. For example, a culture that embraces business experimentation and the (rigorous) supporting of claims with data will naturally be an easier place for data science solutions to succeed.
Alternatively, if developers are encouraged to understand data science, they are less likely to screw up an otherwise top-quality solution. Recall our maxim: Your model is not what your data scientists design, it’s what your engineers implement.
Superior Data Scientists
Maybe our data scientists simply are much better than our competitors’. There is a huge variance in the quality and ability of data scientists.
Even among well-trained data scientists, it is well accepted within the data science community that certain individuals have the combination of innate creativity, analytical acumen, business sense, and perseverance that enables them to create remarkably better solutions than their peers.
This extreme difference in ability is illustrated by the year-after-year results in the KDD Cup data mining competition. Every year, the top professional society for data scientists, the ACM SIGKDD, holds its annual conference (the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining).
Each year the conference holds a data mining competition. Some data scientists love to compete, and there are many competitions. The Netflix competition is one of the most famous, and such competitions have even been turned into a crowd-sourcing business (see Kaggle). The KDD Cup is the granddaddy of data mining competitions and has been held every year since 1997.
Why is this relevant? Some of the best data scientists in the world participate in these competitions. Depending on the year and the task, hundreds or thousands of competitors try their hand at solving the problem.
If data science talent were evenly distributed, then one would think it unlikely to see the same individuals repeatedly winning the competitions. But that’s exactly what we see.
There are individuals who have been on winning teams repeatedly, sometimes multiple years in a row and for multiple tasks each year (sometimes the competition has more than one task).
The point is that there is substantial variation in the ability even of the best data scientists, and this is illustrated by the “objective” results of the KDD Cup competitions.
The upshot is that because of the large variation in ability, the best data scientists can pick and choose the employment opportunities that suit their desires with respect to salary, culture, advancement opportunities, and so on.
The variation in the quality of data scientists is amplified by the simple fact that top-notch data scientists are in high demand. Anyone can call himself a data scientist, and few companies can really evaluate data scientists well as potential hires. This leads to another catch: you need at least one top-notch data scientist to truly evaluate the quality of prospective hires.
Thus, if our company has managed to build a strong data science capability, we have a substantial and sustained advantage over competitors who are having trouble hiring data scientists. Further, top-notch data scientists like to work with other top-notch data scientists, which compounds our advantage.
We also must embrace the fact that data science is in part a craft. Analytical expertise takes time to acquire, and all the great books and video lectures alone will not turn someone into a master. The craft is learned by experience. The most effective learning path resembles that in the classic trades: aspiring data scientists work as apprentices to masters.
This could be in a graduate program with a top applications-oriented professor, in a postdoctoral program, or in industry working with one of the best industrial data scientists. At some point, the apprentice is skilled enough to become a “journeyman,” and will then work more independently on a team or even lead projects of her own.
Many high-quality data scientists happily work in this capacity for their careers. Some small subset become masters themselves, because of a combination of their talent at recognizing the potential of new data science opportunities (more on that in a moment) and their mastery of theory and technique. Some of these then take on apprentices.
Understanding this learning path can help to focus hiring efforts, looking for data scientists who have apprenticed with top-notch masters. It also can be used tactically in a less obvious way: if you can hire one master data scientist, top-notch aspiring data scientists may come to apprentice with her.
In addition to all this, a top-notch data scientist needs to have a strong professional network. We don’t mean a network in the sense of what one might find in an online professional networking system; an effective data scientist needs to have deep connections to other data scientists throughout the data science community.
The reason is simply that the field of data science is immense and there are far too many diverse topics for any individual to master. A top-notch data scientist is a master of some area of technical expertise and is familiar with many others. (Beware of the “jack-of-all-trades, master of none.”) However, we do not want the data scientist’s mastery of some area of technical expertise to turn into the proverbial hammer for which all problems are nails.
A top-notch data scientist will pull in the necessary expertise for the problem at hand. This is facilitated tremendously by strong and deep professional contacts. Data scientists call on each other to help in steering them to the right solutions. The better a professional network is, the better will be the solution. And, the best data scientists have the best connections.
Footnote 1: This is not to say that one should look at the KDD Cup winners as necessarily the best data miners in the world. Many top-notch data scientists have never competed in such a competition; some compete once and then focus their efforts on other things.
Superior Data Science Management
Possibly even more critical to success for data science in business is having good management of the data science team. Good data science managers are especially hard to find.
They need to understand the fundamentals of data science well, possibly even being competent data scientists themselves. Good data science managers also must possess a set of other abilities that are rare in a single individual:
• They need to truly understand and appreciate the needs of the business. What’s more, they should be able to anticipate the needs of the business, so that they can interact with their counterparts in other functional areas to develop ideas for new data science products and services.
• They need to be able to communicate well with and be respected by both “techies” and “suits”; often this means translating data science jargon (which we have tried to minimize in this blog) into business jargon, and vice versa.
• They need to coordinate technically complex activities, such as the integration of multiple models or procedures with business constraints and costs. They often need to understand the technical architectures of the business, such as the data systems or production software systems, in order to ensure that the solutions the team produces are actually useful in practice.
• They need to be able to anticipate the outcomes of data science projects. As we have discussed, data science is more similar to R&D than to any other business activity. Whether a particular data science project will produce positive results is highly uncertain at the outset, and possibly even well into the project.
Elsewhere we discuss how it is important to produce proof-of-concept studies quickly, but neither positive nor negative outcomes of such studies are highly predictive of success or failure of the larger project. They just give guidance to investments in the next cycle of the data mining process.
If we look to R&D management for clues about data science management, we find that there is only one reliable predictor of the success of a research project, and it is highly predictive: the prior success of the investigator.
We see a similar situation with data science projects. There are individuals who seem to have an intuitive sense of which projects will pay off. We do not know of a careful analysis of why this is the case, but experience shows that it is.
As with data science competitions, where we see remarkable repeat performances by the same individuals, we also see individuals repeatedly envisioning new data science opportunities and managing them to great success—and this is particularly impressive as many data science managers never see even one project through to great success.
• They need to do all this within the culture of a particular firm.
Finally, our data science capability may be difficult or expensive for a competitor to duplicate because we can hire data scientists and data science managers better.
This may be due to our reputation and brand appeal with data scientists—a data scientist may prefer to work for a company known as being friendly to data science and data scientists. Or our firm may have a more subtle appeal. So let’s examine in a little more detail what it takes to attract top-notch data scientists.
Attracting and Nurturing Data Scientists and Their Teams
At the beginning of the blog, we noted that the two most important factors in ensuring that our firm gets the most from its data assets are: (i) the firm’s management must think data-analytically, and (ii) the firm’s management must create a culture where data science, and data scientists, will thrive.
As we mentioned above, there can be a huge difference between the effectiveness of a great data scientist and an average data scientist, and between a great data science team and an individually great data scientist. But how can one confidently engage top-notch data scientists? How can we create great teams?
This is a very difficult question to answer in practice. At the time of this writing, the supply of top-notch data scientists is quite thin, resulting in a very competitive market for them.
The best companies at hiring data scientists are the IBMs, Microsofts, and Googles of the world, who clearly demonstrate the value they place on data science via compensation, perks, and/or intangibles, such as one particular factor not to be taken lightly: data scientists like to be around other top-notch data scientists.
One might argue that they need to be around other top-notch data scientists, not only to enjoy their day-to-day work but also because the field is vast and the collective mind of a group of data scientists can bring to bear a much broader array of particular solution techniques.
However, just because the market is difficult does not mean all is lost. Many data scientists want to have more individual influence than they would have at a corporate behemoth.
Many want more responsibility (and the concomitant experience) with the broader process of producing a data science solution. Some have visions of becoming Chief Scientist for a firm and understand that the path to Chief Scientist may be better paved with projects in smaller and more varied firms.
Some have visions of becoming entrepreneurs and understand that being an early data scientist for a startup can give them invaluable experience.
And some simply will enjoy the thrill of taking part in a fast-growing venture: working in a company growing at 20% or 50% a year is much different from working in a company growing at 5% or 10% a year (or not growing at all).
In all these cases, the firms that have an advantage in hiring are those that create an environment for nurturing data science and data scientists. If you do not have a critical mass of data scientists, be creative. Encourage your data scientists to become part of local data science technical communities and global data science academic communities.
A note on publishing
Science is a social endeavor, and the best data scientists often want to stay engaged in the community by publishing their advances.
Firms sometimes have trouble with this idea, feeling that they are “giving away the store” or tipping their hand to competitors by revealing what they are doing. On the other hand, if they do not, they may not be able to hire or retain the very best.
Publishing also has some advantages for the firm, such as increased publicity, exposure, external validation of ideas, and so on. There is no clear-cut answer, but the issue needs to be considered carefully.
Some firms file patents aggressively on their data science ideas, after which academic publication is natural if the idea is truly novel and important.
A firm’s data science presence can be bolstered by engaging academic data scientists. There are several ways of doing this. For those academics interested in practical applications of their work, it may be possible to fund their research programs.
Both of your authors, when working in industry, funded academic programs and essentially extended the data science team that was focusing on, and interacting around, their problems.
The best arrangement (by our experience) is a combination of data, money, and an interesting business problem; if the project ends up being a portion of the Ph.D. thesis of a student in a top-notch program, the benefit to the firm can far outweigh the cost.
Funding a Ph.D. student might cost a firm in the ballpark of $50K/year, which is a fraction of the fully loaded cost of a top data scientist. A key is to have enough understanding of data science to select the right professor—one with the appropriate expertise for the problem at hand.
Another tactic that can be very cost-effective is to take on one or more top-notch data scientists as scientific advisors.
If the relationship is structured such that the advisors truly interact on the solutions to problems, firms that do not have the resources or the clout to hire the very best data scientists can substantially increase the quality of the eventual solutions.
Such advisors can be data scientists at partner firms, data scientists from firms who share investors or board members, or academics who have some consulting time.
A different tack altogether is to hire a third party to conduct the data science. There are various third-party data science providers, ranging from massive firms specializing in business analytics (such as IBM), to data-science-specific consulting firms (such as Elder Research), to boutique data science firms who take on a very small number of clients to help them develop their data science capabilities (such as Data Scientists, LLC).
You can find a large list of data-science service companies, as well as a wide variety of other data science resources, at KDnuggets. A caveat about engaging data science consulting firms is that their interests are not always well aligned with their customers’ interests; this is obvious to seasoned users of consultants, but not to everyone.
Savvy managers employ all of these resources tactically. A chief scientist or empowered manager often can assemble for a project a substantially more powerful and diverse team than most companies can hire.
Examine Data Science Case Studies
Beyond building a solid data science team, how can a manager ensure that her firm is best positioned to take advantage of opportunities for applying data science? Make sure that there is an understanding of and appreciation for the fundamental principles of data science. Empowered employees across the firm often see novel applications.
After gaining command of the fundamental principles of data science, the best way to position oneself for success is to work through many examples of the application of data science to business problems.
Read case studies that actually walk through the data mining process. Formulate your own case studies. Actually mining data is helpful, but even more important is working through the connection between the business problem and the possible data science solutions.
The more different problems you work through, the better you will be at naturally seeing and capitalizing on opportunities for bringing to bear the information and knowledge “stored” in the data; often the same problem formulation from one problem can be applied by analogy to another, with only minor changes.
It is important to keep in mind that the examples we have presented in this blog were chosen or designed for illustration. In reality, the business and data science team should be prepared for all manner of mess and constraints and must be flexible in dealing with them.
Sometimes there is a wealth of data and data science techniques available to be brought to bear. Other times the situation seems more like the critical scene from the movie Apollo 13.
In the movie, a malfunction and explosion in the command module leave the astronauts stranded a quarter of a million miles from Earth, with the CO2 levels rising too rapidly for them to survive the return trip.
In a nutshell, because of the constraints placed by what the astronauts have on hand, the engineers have to figure out how to use a large cubic filter in place of a narrower cylindrical filter (to literally put a square peg in a round hole).
In the key scene, the head engineer dumps out onto a table all the “stuff” that’s there in the command module, and tells his team: “OK, people … we got to find a way to make this fit into the hole for this, using nothing but that.” Real data science problems often seem more like the Apollo 13 situation than a textbook situation.
Consider one such case. For a team targeting consumers with online display advertisements, obtaining an adequate supply of the ideal training data would have been prohibitively expensive. However, data were available at much lower cost from various other distributions and for other target variables.
The team’s very effective solution cobbled together models built from these surrogate data, and “transferred” these models for use on the desired task. The use of these surrogate data allowed the team to operate with a substantially reduced investment in data from the ideal (and expensive) training distribution.
Be Ready to Accept Creative Ideas from Any Source
Once different role players understand the fundamental principles of data science, creative ideas for new solutions can come from any direction—such as from executives examining potential new lines of business, from directors dealing with profit and loss responsibility, from managers looking critically at a business process, and from line employees with detailed knowledge of exactly how a particular business process functions.
Data scientists should be encouraged to interact with employees throughout the business, and part of their performance evaluation should be based on how well they produce ideas for improving the business with data science.
Incidentally, doing so can pay off in unintended ways: the data processing skills possessed by data scientists often can be applied in ways that are not so sophisticated but nevertheless can help other employees without those skills. Often a manager may have no idea that particular data can even be obtained—data that might help the manager directly, without sophisticated data science.
Be Ready to Evaluate Proposals for Data Science Projects
Ideas for improving business decisions through data science can come from any direction. Managers, investors, and employees should be able to formulate such ideas clearly, and decision makers should be prepared to evaluate them. Essentially, we need to be able to formulate solid proposals and to evaluate proposals.
The data mining process provides a framework to direct this. Each stage in the process reveals questions that should be asked both in formulating proposals for projects and in evaluating them:
• Is the business problem well specified? Does the data science solution solve the problem?
• Is it clear how we would evaluate a solution?
• Would we be able to see evidence of success before making a huge investment in deployment?
• Does the firm have the data assets it needs? For example, for supervised modeling, are there actually labeled training data? Is the firm ready to invest in the assets it does not have yet?
Let’s walk through an illustrative example.
Example Data Mining Proposal
Your company has an installed user base of 900,000 current users of your Whiz-bang widget. You now have developed Whiz-bang 2.0, which has substantially lower operating costs than the original.
Ideally, you would like to convert (“migrate”) your entire user base over to version 2.0; however, using 2.0 requires that users master the new interface, and there is a serious risk that in attempting to do so, the customers will become frustrated and not convert, become less satisfied with the company, or in the worst case, switch to your competitor’s popular Boppo widget.
Marketing has designed a brand-new migration incentive plan, which will cost $250 per selected customer. There is no guarantee that a customer will choose to migrate even if she takes this incentive.
An external firm, Big Red Consulting, is proposing a plan to target customers carefully for Whiz-bang 2.0, and given your demonstrated fluency with the fundamentals of data science, you are called in to help assess Big Red’s proposal. Do Big Red’s choices seem correct?
Targeted Whiz-bang Customer Migration—prepared by Big Red Consulting, Inc.
We will develop a predictive model using modern data-mining technology. As discussed in our last meeting, we assume a budget of $5,000,000 for this phase of customer migration; adjusting the plan for other budgets is straightforward. Thus we can target 20,000 customers under this budget. Here is how we will select those customers:
We will use data to build a model of whether or not a customer will migrate given the incentive. The dataset will comprise a set of attributes of customers, such as the number and type of prior customer service interactions, level of usage of the widget, location of the customer, estimated technical sophistication, tenure with the firm, and other loyalty indicators, such as number of other firm products and services in use.
The target will be whether or not the customer will migrate to the new widget if he/she is given the incentive. Using this data, we will build a linear regression to estimate the target variable. The model will be evaluated based on its accuracy on these data; in particular, we want to ensure that the accuracy is substantially greater than if we targeted randomly.
To use the model: for each customer, we will apply the regression model to estimate the target variable. If the estimate is greater than 0.5, we will predict that the customer will migrate; otherwise, we will say the customer will not migrate.
We then will select at random 20,000 customers from those predicted to migrate, and these 20,000 will be the recommended targets.
Flaws in the Big Red Proposal
We can use our understanding of the fundamental principles and other basic concepts of data science to identify flaws in the proposal. Appendix A provides a starting guide for reviewing such proposals, with some of the main questions to ask. However, this blog as a whole really can be seen as a proposal review guide. Here are some of the most egregious flaws in Big Red’s proposal:
Business Understanding
• The target variable definition is imprecise. For example, over what time period must the migration occur?
• The formulation of the data mining problem could be better-aligned with the business problem. For example, what if certain customers (or everyone) were likely to migrate anyway (without the incentive)? Then we would be wasting the cost of the incentive in targeting them.
Data Understanding/Data Preparation
• There aren’t any labeled training data! This is a brand-new incentive. We should invest some of our budget in obtaining labels for some examples. This can be done by targeting a (randomly) selected subset of customers with the incentive.
• If we are worried about wasting the incentive on customers who are likely to migrate without it, we also should observe a “control group” over the period where we are obtaining training data.
This should be easy since everyone we don’t target to gather labels would be a “control” subject. We can build a separate model for migrating or not given no incentive and combine the models in an expected value framework.
Modeling
• Linear regression is not a good choice for modeling a categorical target variable. Rather one should use a classification method, such as tree induction, logistic regression, k-NN, and so on.
Evaluation
• The evaluation shouldn’t be on the training data. Some sort of holdout approach should be used (e.g., cross-validation and/or a staged approach as discussed above).
• Is there going to be any domain-knowledge validation of the model? What if it is capturing some weirdness of the data collection process?
Deployment
• The idea of randomly selecting customers with regression scores greater than 0.5 is not well considered. First, it is not clear that a regression score of 0.5 really corresponds to a probability of migration of 0.5. Second, 0.5 is rather arbitrary in any case.
Third, since our model is providing a ranking (e.g., by the likelihood of migration, or by expected value if we use the more complex formulation), we should use the ranking to guide our targeting: choose the top-ranked candidates, as the budget will allow.
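The expected value formulation and the ranking advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two probability estimates are presumed to come from the "incentive" and "control" models described earlier, and VALUE_OF_MIGRATION, the Customer class, and the sample numbers are hypothetical, not figures from the scenario.

```python
from dataclasses import dataclass

INCENTIVE_COST = 250        # dollars per targeted customer (from the scenario)
BUDGET = 5_000_000          # dollars (from the proposal)
VALUE_OF_MIGRATION = 1_000  # hypothetical value of one migrated customer


@dataclass
class Customer:
    customer_id: str
    p_with_incentive: float     # estimate from a model trained on the targeted sample
    p_without_incentive: float  # estimate from a model trained on the control group


def expected_benefit(c: Customer) -> float:
    """Expected gain from targeting c, relative to leaving c alone."""
    uplift = c.p_with_incentive - c.p_without_incentive
    return uplift * VALUE_OF_MIGRATION - INCENTIVE_COST


def select_targets(customers, budget=BUDGET, cost=INCENTIVE_COST):
    """Rank by expected benefit; keep the top candidates the budget allows."""
    max_targets = budget // cost
    ranked = sorted(customers, key=expected_benefit, reverse=True)
    # Never spend the incentive where the expected benefit is not positive.
    return [c for c in ranked[:max_targets] if expected_benefit(c) > 0]


# Made-up customers, purely for illustration.
customers = [
    Customer("a", 0.90, 0.85),  # likely to migrate anyway
    Customer("b", 0.60, 0.10),
    Customer("c", 0.40, 0.05),
]
targets = select_targets(customers)
print([c.customer_id for c in targets])
```

Customer "a" has the highest chance of migrating but is not selected, because most of that chance exists without the incentive. This is the alignment problem raised in the first group of flaws.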
Of course, this is just one example with a particular set of flaws. A different set of concepts may need to be brought to bear for a different proposal that is flawed in other ways.
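As a concrete companion to the evaluation flaw in this example, here is a minimal sketch of a holdout evaluation by k-fold cross-validation, written in plain Python. The function names and the toy data are assumptions made for illustration. The majority-class "model" is only a stand-in baseline; a real project would pass in a classifier such as a tree or logistic regression.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each example is held out exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(examples, labels, fit, k=5, seed=0):
    """Average held-out accuracy of the classifier returned by fit(train_X, train_y)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k, seed):
        predict = fit([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(examples[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority_class(train_X, train_y):
    """Baseline 'model': always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


# Toy data: 80 non-migrators (0) and 20 migrators (1); features are placeholders.
examples = list(range(100))
labels = [0] * 80 + [1] * 20
baseline_accuracy = cross_validated_accuracy(examples, labels, fit_majority_class)
print(round(baseline_accuracy, 3))
```

A baseline like this also gives the comparison the proposal asks for: any real model should beat it on held-out data, not on the data it was trained on.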
A Firm’s Data Science Maturity
For a firm to realistically plan data science endeavors it should assess, frankly and rationally, its own maturity in terms of data science capability. It is beyond the scope of this blog to provide a self-assessment guide, but a few words on the topic are important.
Firms vary widely in their data science capabilities along many dimensions. One dimension that is very important for strategic planning is the firm’s “maturity,” specifically, how systematic and well-founded are the processes used to guide the firm’s data science projects.
At one end of the maturity spectrum, a firm’s data science processes are completely ad hoc. In many firms, the employees engaged in data science and business analytics endeavors have no formal training in these areas, and the managers involved have little understanding of the fundamental principles of data science and data analytic thinking.
The reader interested in this notion of the maturity of a firm’s capabilities is encouraged to read about the Capability Maturity Model for software engineering, which is the inspiration for this discussion.
A note on “immature” firms
Being “immature” does not mean that a firm is destined to fail. It means that success is highly variable and is much more dependent on luck than in a mature firm. Project success will depend upon the heroic efforts of individuals who happen to have a natural acuity for data-analytic thinking.
An immature firm may implement not-so-sophisticated data science solutions on a large scale or may implement sophisticated solutions on a small scale. Rarely, though, will an immature firm implement sophisticated data science solutions on a large scale.
A firm with a medium level of maturity employs well-trained data scientists, as well as business managers and other stakeholders who understand the fundamental principles of data science.
Both sides can think clearly about how to solve business problems with data science, and both sides participate in the design and implementation of solutions that directly address the problems of the business.
At the high end of maturity are firms that continually work to improve their data science processes (and not just the solutions). Executives at such firms continually challenge the data science team to instill processes that will align their solutions better with the business problems.
At the same time, they realize that pragmatic trade-offs may favor the choice of a suboptimal solution that can be realized today over a theoretically much better solution that won’t be ready until next year.
Data scientists at such a firm should have the confidence that when they propose investments to improve data science processes, their suggestions will be met with open and informed minds. That’s not to say that every such request will be approved, but that the proposal will be evaluated on its own merits in the context of the business.
Note: Data science is neither operations nor engineering.
There is some danger in making an analogy to the Capability Maturity Model from software engineering— the danger that the analogy will be taken too literally. Trying to apply the same sort of processes that work for software engineering, or worse for manufacturing or operations, will fail for data science.
Moreover, misguided attempts to do so will send a firm’s best data scientists out the door before the management even knows what happened. The key is to understand data science processes and how to do data science well, and to work to establish consistency and support. Remember that data science is more like R&D than like engineering or manufacturing.
For instance, management should consistently make available the resources needed for solid evaluation of data science projects early and often. Sometimes this involves investing in data that would not otherwise have been obtained.
Often this involves assigning engineering resources to support the data science team. The data science team should in return work to provide management with evaluations that are as well aligned with the actual business problem(s) as possible.
As a concrete example, consider yet again our telecom churn problem and how firms of varying maturity might address it:
• An immature firm will have (hopefully) analytically adept employees implementing ad hoc solutions based on their intuitions about how to manage churn. These may work well or they may not. In an immature firm, it will be difficult for management to evaluate these choices against alternatives, or to determine when they’ve implemented a nearly optimal solution.
• A firm of medium maturity will have implemented a well-defined framework for testing different alternative solutions. They will test under conditions that mimic as closely as possible the actual business setting—for example, running the latest production data through a testbed platform that compares how different methods “would have done,” and considering carefully the costs and benefits involved.
• A very mature organization may have deployed the exact same methods as the medium-maturity firm for identifying the customers with the highest probability of leaving, or even the highest expected loss if they were to churn.
They would also be working to implement the processes, and gather the data, necessary to judge the effect of the incentives as well, and thereby work towards finding those individuals for whom the incentives will produce the largest expected increase in value (over not giving the incentive).
Such a firm may also be working to integrate such a procedure into an experimentation and/or optimization framework for assessing different offers or different parameters (like the level of discount) to a given offer.
A frank self-assessment of data science maturity is difficult, but it is essential to getting the best out of one’s current capabilities, and to improving one’s capabilities.
The practice of data science can best be described as a combination of analytical engineering and exploration. The business presents a problem we would like to solve. Rarely is the business problem directly one of our basic data mining tasks. We decompose the problem into subtasks that we think we can solve, usually starting with existing tools.
For some of these tasks we may not know how well we can solve them, so we have to mine the data and conduct evaluation to see. If that does not succeed, we may need to try something completely different.
In the process, we may discover knowledge that will help us to solve the problem we had set out to solve, or we may discover something unexpected that leads us to other important successes.
Neither the analytical engineering nor the exploration should be omitted when considering the application of data science methods to solve a business problem. Omitting the engineering aspect usually makes it much less likely that the results of mining data will actually solve the business problem.
Omitting the understanding of the process as one of exploration and discovery often keeps an organization from putting the right management, incentives, and investments in place for the project to succeed.
The Fundamental Concepts of Data Science
Both the analytical engineering and the exploration and discovery become more systematic, and thereby more likely to succeed, when we understand and embrace the fundamental concepts of data science. In this blog, we have introduced a collection of the most important fundamental concepts.
Some of these concepts we made into headliners for the blogs and others were introduced more naturally through the discussions (and not necessarily labeled as fundamental concepts).
These concepts span the process from envisioning how data science can improve business decisions, to applying data science techniques, to deploying the results to improve decision-making. The concepts also undergird a large array of business analytics.
We can group our fundamental concepts roughly into three types:
1. General concepts about how data science fits in the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams, ways for thinking about how data science leads to competitive advantage, ways that competitive advantage can be sustained, and tactical principles for doing well with data science projects.
2. General ways of thinking data-analytically, which help us to gather appropriate data and consider appropriate methods. The concepts include the data mining process, the collection of different high-level data science tasks, as well as principles such as the following.
• Data should be considered an asset, and therefore we should think carefully about what investments we should make to get the best leverage from our asset
• The expected value framework can help us to structure business problems so we can see the component data mining problems as well as the connective tissue of costs, benefits, and constraints imposed by the business environment
• Generalization and overfitting: if we look too hard at the data, we will find patterns; we want patterns that generalize to data we have not yet seen
• Applying data science to a well-structured problem versus exploratory data mining require different levels of effort in different stages of the data mining process
3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science techniques. These include concepts such as the following.
• Identifying informative attributes—those that correlate with or give us information about an unknown quantity of interest
• Fitting a numeric function model to data by choosing an objective and finding a set of parameters based on that objective
• Controlling complexity is necessary to find a good trade-off between generalization and overfitting
• Calculating similarity between objects described by data
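To illustrate the first concept in this last group, identifying informative attributes, here is a small sketch that scores an attribute by information gain (the reduction in entropy of the target after splitting on the attribute). Information gain is one common measure of informativeness and is used here as an assumption for illustration; the attribute names and toy values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: did the customer churn, and two candidate attributes.
churned = ["yes", "no", "yes", "no"]
on_contract = ["no", "yes", "no", "yes"]      # separates the classes perfectly
region = ["east", "east", "west", "west"]     # tells us nothing about churn

gain_contract = information_gain(on_contract, churned)
gain_region = information_gain(region, churned)
print(gain_contract, gain_region)
```

An attribute with higher gain gives more information about the unknown quantity of interest, which is the sense of "informative" used in the list above.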
Once we think about data science in terms of its fundamental concepts, we see the same concepts underlying many different data science strategies, tasks, algorithms, and processes.
As we have illustrated throughout the blog, these principles not only allow us to understand the theory and practice of data science much more deeply, they also allow us to understand the methods and techniques of data science very broadly, because these methods and techniques are quite often simply particular instantiations of one or more of the fundamental principles.
At a high level, we saw how structuring business problems using the expected value framework allows us to decompose problems into data science tasks that we understand better how to solve, and this applies across many different sorts of business problems.
For extracting knowledge from data, we saw that our fundamental concept of determining the similarity of two objects described by data is used directly, for example, to find customers similar to our best customers. It is used for classification and for regression, via nearest-neighbor methods.
It is the basis for clustering, the unsupervised grouping of data objects. It is the basis for finding documents most related to a search query. And it is the basis for more than one common method for making recommendations, for example by casting both customers and movies into the same “taste space,” and then finding movies most similar to a particular customer.
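The reuse of a single similarity computation can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the similarity measure and a made-up two-dimensional "taste space"; the coordinates, names, and the most_similar helper are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def most_similar(target, candidates, k=1):
    """Return the names of the k candidates closest to target."""
    ranked = sorted(candidates, key=lambda name: dist(target, candidates[name]))
    return ranked[:k]


# Hypothetical taste space: (liking for action, liking for romance).
customer = (0.9, 0.1)
movies = {
    "action_film": (1.0, 0.0),
    "romance": (0.0, 1.0),
    "action_comedy": (0.7, 0.4),
}
recommendations = most_similar(customer, movies, k=2)
print(recommendations)
```

The same function would serve for finding customers similar to our best customers, or for picking the neighbors in a nearest-neighbor classifier; only the points passed in would change.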
When it comes to measurement, we see the notion of lift—determining how much more likely a pattern is than would be expected by chance—appearing broadly across data science when evaluating very different sorts of patterns. One evaluates algorithms for targeting advertisements by computing the lift one gets for the targeted population.
One calculates lift for judging the weight of evidence for or against a conclusion. One calculates lift to help judge whether a repeated co-occurrence is interesting, as opposed to simply being a natural consequence of popularity.
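For the co-occurrence case, lift can be computed as the observed rate of two items appearing together divided by the rate expected if they were independent. The sketch below assumes simple counts over a set of transactions; the function name and the numbers are illustrative only.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift of A and B co-occurring: p(A and B) / (p(A) * p(B))."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# 1,000 hypothetical transactions: A appears in 100 and B in 200.
lift_associated = lift(n_both=60, n_a=100, n_b=200, n_total=1000)   # about 3
lift_independent = lift(n_both=20, n_a=100, n_b=200, n_total=1000)  # about 1
print(lift_associated, lift_independent)
```

A lift near 1 means the co-occurrence is about what popularity alone would produce; values well above 1 suggest the pattern may be worth a closer look.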
Understanding the fundamental concepts also facilitates communication between business stakeholders and data scientists, not only because of the shared vocabulary but because both sides actually understand better.
Instead of missing important aspects of a discussion completely, we can dig in and ask questions that will reveal critical aspects that otherwise would not have been uncovered.
For example, let’s say your venture firm is considering investing in a data science-based company producing a personalized online news service. You ask how exactly they are personalizing the news.
They say they use support vector machines. Let’s even pretend that we had not talked about support vector machines in this blog. You should feel confident enough in your knowledge of data science now that you should not simply say “Oh, OK.” You should be able to confidently ask: “What’s that exactly?”
If they really do know what they are talking about, they should give you some explanation based upon our fundamental principles. You also are now prepared to ask, “What exactly are the training data you intend to use?”
Not only might that impress data scientists on their team, but it actually is an important question to be asked to see whether they are doing something credible, or just using “data science” as a smokescreen to hide behind.
You can go on to think about whether you really believe building any predictive model from these data—regardless of what sort of model it is—would be likely to solve the business problem they’re attacking. You should be ready to ask whether you really think they will have reliable training labels for such a task. And so on.