Business Problems and Data Science Solutions
If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built. It is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem.
In this blog, we explain how to uncover the structure of the business problem by using data science and technology. And then we explain how data mining and data science solve business problems.
It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary.
The analytic technologies that we can bring to bear are powerful but they impose certain requirements on the data they use. They often require data to be in a form different from how the data are provided naturally, and some conversion will be necessary.
Therefore a data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.
We will define basic data formats and will only be concerned with data preparation details when they shed light on some fundamental principle of data science or are necessary to present a concrete example.
More generally, data scientists may spend considerable time early in the process defining the variables used later in the process. This is one of the main points at which human creativity, common sense, and business knowledge come into play.
Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables, a contribution that is easy to underestimate.
One very general and important concern during data preparation is to beware of “leaks”. A leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
As an example, when predicting whether at a particular point in time a website visitor would end her session or continue surfing to another page, the variable “total number of web pages visited in the session” is predictive.
However, the total number of web pages visited in the session would not be known until after the session was over—at which point one would know the value for the target variable!
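To make the idea concrete, here is a small, purely illustrative Python sketch (the session numbers and function names are invented for this example) contrasting a leaky feature with one actually available at decision time:

```python
# Invented session data: each tuple is (pages viewed so far when the
# prediction must be made, total pages eventually viewed in the session).
sessions = [(1, 5), (2, 5), (3, 3), (1, 1), (4, 6)]

def leaky_features(pages_so_far, total_pages):
    # LEAK: total_pages is only known once the session is over, which is
    # also exactly when the target variable becomes known.
    return {"pages_so_far": pages_so_far, "total_pages": total_pages}

def safe_features(pages_so_far):
    # Only information actually available at decision time.
    return {"pages_so_far": pages_so_far}

def will_continue(pages_so_far, total_pages):
    # The target: does the visitor view at least one more page?
    return pages_so_far < total_pages

# The safe feature set never peeks at post-session information.
for so_far, total in sessions:
    assert "total_pages" not in safe_features(so_far)
```

The leaky feature would look wonderfully predictive in historical data, precisely because it encodes the answer.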
The modeling stage is the primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques and algorithms that exist because this is the part of the craft where the most science and technology can be brought to bear.
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. If we look hard enough at any dataset we will find patterns, but they may not survive careful scrutiny.
We would like to have confidence that the models and patterns extracted from the data are true regularities and not just idiosyncrasies or sample anomalies.
It is possible to deploy results immediately after data mining but this is inadvisable; it is usually far easier, cheaper, quicker, and safer to test a model first in a controlled laboratory setting.
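At its simplest, such laboratory testing means evaluating on data held out from model building. The following sketch uses synthetic data and a deliberately simple threshold "model"; everything in it is invented for illustration:

```python
import random

random.seed(0)

# Synthetic labeled data: the label depends on x plus noise.
xs = [random.random() for _ in range(400)]
labeled = [(x, 1 if x + random.gauss(0, 0.2) > 0.5 else 0) for x in xs]

# Hold out part of the data: the "lab" test set is never used for fitting.
random.shuffle(labeled)
train, holdout = labeled[:300], labeled[300:]

def accuracy(threshold, data):
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

# "Fit": choose the threshold that maximizes accuracy on training data only.
best = max((i / 100 for i in range(101)), key=lambda t: accuracy(t, train))

train_acc = accuracy(best, train)
holdout_acc = accuracy(best, holdout)
# Report the holdout accuracy, not the (optimistic) training accuracy.
```

The same discipline applies whatever the modeling technique: the estimate of future performance must come from data the model never saw.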
Equally important, the evaluation stage also serves to help ensure that the model satisfies the original business goals. Recall that the primary goal of data science for business is to support decision making and that we started the process by focusing on the business problem we would like to solve.
Usually, a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such. Further, even if a model passes strict evaluation tests "in the lab," there may be external considerations that make it impractical.
For example, a common flaw with detection solutions (such as fraud detection, spam detection, and intrusion monitoring) is that they produce too many false alarms.
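The base-rate arithmetic behind this flaw is worth working through once. With invented but plausible numbers, even a detector that is 99% accurate on each class produces mostly false alarms when fraud is rare:

```python
# Hypothetical numbers: 100,000 accounts of which 0.5% are fraudulent.
total = 100_000
fraud = 500
legit = total - fraud

tpr = 0.99  # the detector catches 99% of fraudulent accounts
fpr = 0.01  # and incorrectly flags 1% of legitimate accounts

true_alarms = tpr * fraud    # 495 real frauds flagged
false_alarms = fpr * legit   # 995 legitimate accounts flagged
precision = true_alarms / (true_alarms + false_alarms)

# About two-thirds of all alarms are false, despite the 99% accuracy.
```

This is why detection systems are evaluated on precision and false-alarm workload, not just overall accuracy.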
Evaluating the results of data mining includes both quantitative and qualitative assessments. Various stakeholders have interests in the business decision-making that will be accomplished or supported by the resultant models.
In many cases, these stakeholders need to “sign off” on the deployment of the models, and in order to do so need to be satisfied by the quality of the model’s decisions.
What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes.
To facilitate such a qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists).
And if the model itself is not comprehensible (e.g., maybe the model is a very complex mathematical formula), how can the data scientists work to make the behavior of the model comprehensible?
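One widely used way to make an opaque model's behavior comprehensible is permutation importance: shuffle one input and see how much the model's decisions change. A purely illustrative sketch, with a stand-in "black box" whose internals we pretend we cannot read:

```python
import random

random.seed(1)

# A stand-in for an opaque "complex mathematical formula": we can call
# it, but we pretend we cannot inspect it.
def black_box(x1, x2):
    return 1 if 0.8 * x1 + 0.1 * x2 > 0.5 else 0

rows = [(random.random(), random.random()) for _ in range(2000)]
labels = [black_box(x1, x2) for x1, x2 in rows]

def accuracy(data):
    return sum(black_box(*r) == y for r, y in zip(data, labels)) / len(labels)

def permutation_importance(col):
    # Shuffle one feature's column across rows; the drop in accuracy
    # indicates how much the model's behavior depends on that feature.
    shuffled = [r[col] for r in rows]
    random.shuffle(shuffled)
    permuted = [(s, r[1]) if col == 0 else (r[0], s)
                for r, s in zip(rows, shuffled)]
    return accuracy(rows) - accuracy(permuted)

# Stakeholders can now see that the model's decisions hinge on x1, even
# without reading the formula.
```

Behavioral summaries like this let stakeholders assess a model's decisions without understanding its internals.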
Finally, a comprehensive evaluation framework is important because getting detailed information on the performance of a deployed model may be difficult or impossible. Often there is only limited access to the deployment environment so making a comprehensive evaluation “in production” is difficult.
Deployed systems typically contain many “moving parts,” and assessing the contribution of a single part is difficult. Firms with sophisticated data science teams wisely build testbed environments that mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.
Nonetheless, in some cases, we may want to extend evaluation into the deployment environment, for example by instrumenting a live system to be able to conduct randomized experiments.
We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making. For example, behavior can change—in some cases, like fraud or spam, in direct response to the deployment of models.
Additionally, the output of the model is critically dependent on the input data; input data can change in format and in substance, often without the data science team being alerted. Raeder et al. present a detailed discussion of system design to help deal with these and other related evaluation-in-deployment issues.
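A minimal sketch of such deployment-time instrumentation: compare live input statistics against a snapshot taken at training time and alert on large shifts. The numbers and threshold here are illustrative assumptions; a real system would track many features and use proper statistical tests.

```python
def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def drift_alert(train_values, live_values, z_threshold=3.0):
    """Alert if the live mean is far from the training mean,
    measured in units of the training standard deviation."""
    m, s = mean(train_values), stdev(train_values)
    if s == 0:
        return mean(live_values) != m
    z = abs(mean(live_values) - m) / s
    return z > z_threshold

# Invented feature values recorded at training time vs. in production.
train = [10, 11, 9, 10, 12, 10, 11, 9]
assert drift_alert(train, [10, 11, 10, 9]) is False  # looks like training data
assert drift_alert(train, [25, 27, 26, 28]) is True  # input has shifted
```

When such an alert fires, the team can investigate whether the world has changed or an upstream data feed has silently changed format.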
In deployment, the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
The clearest cases of deployment involve implementing a predictive model in some information system or business process. In our churn example, a model for predicting the likelihood of churn could be integrated with the business process for churn management—for example, by sending special offers to customers who are predicted to be particularly at risk. A new fraud detection model may be built into a workforce management information system, to monitor accounts and create “cases” for fraud analysts to examine.

As another example, in one data mining project a model was created to diagnose problems in local phone networks and to dispatch technicians to the likely site of the problem. Before deployment, a team of phone company stakeholders requested that the model be tweaked so that exceptions were made for hospitals.
Increasingly, the data mining techniques are deployed. For example, for targeting online advertisements, systems are deployed that automatically build (and test) models in production when a new advertising campaign is presented.
Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for their data science team to manually curate each model individually.
In these cases, it may be best to deploy the data mining phase into production. In doing so, it is critical to instrument the process to alert the data science team of any seeming anomalies and to provide fail-safe operation.
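One simple fail-safe pattern is a promotion gate: a newly (automatically) built "challenger" model replaces the deployed "champion" only if its validation performance clears both an absolute floor and the champion's level. The function name, metric (AUC), and thresholds below are illustrative assumptions, not a standard API:

```python
def promote_or_alert(champion_auc, challenger_auc,
                     min_auc=0.7, max_drop=0.02):
    """Decide what to do with a newly trained challenger model.

    min_auc is an absolute quality floor; max_drop is the largest
    tolerated degradation relative to the deployed champion.
    """
    if challenger_auc < min_auc:
        # Something is likely wrong with the data or the build: alert.
        return "alert: challenger below absolute quality floor"
    if challenger_auc < champion_auc - max_drop:
        # Fail safe: keep serving the existing model.
        return "keep champion: challenger degraded"
    return "promote challenger"

assert promote_or_alert(0.82, 0.84) == "promote challenger"
assert promote_or_alert(0.82, 0.78) == "keep champion: challenger degraded"
assert promote_or_alert(0.82, 0.60).startswith("alert")
```

The key design choice is that the automated pipeline never silently deploys a worse model; degradation either falls back safely or pages a human.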
Deployment can also be much less “technical.” In a celebrated case, data mining discovered a set of rules that could help to quickly diagnose and fix a common error in industrial printing.
The deployment succeeded simply by taping a sheet of paper containing the rules to the side of the printers. Deployment can also be much more subtle, such as a change to data acquisition procedures, or a change to strategy, marketing, or operations resulting from insight gained from mining the data.
Deploying a model into a production system typically requires that the model be recoded for the production environment, usually for greater speed or compatibility with an existing system.
This may incur substantial expense and investment. In many cases, the data science team is responsible for producing a working prototype, along with its evaluation. These are passed to a development team.
Practically speaking, there are risks with “over the wall” transfers from data science to development. It may be helpful to remember the maxim: “Your model is not what the data scientists design, it’s what the engineers build.” From a management perspective, it is advisable to have members of the development team involved early on in the data science project.
They can begin as advisors, providing critical insight to the data science team. Increasingly in practice, these particular developers are “data science engineers”—software engineers who have particular expertise both in the production systems and in data science. These developers gradually assume more responsibility as the project matures.
At some point, the developers will take the lead and assume ownership of the product. Generally, the data scientists should still remain involved in the project into final deployment, as advisors or as developers depending on their skills.
Regardless of whether deployment is successful, the process often returns to the Business Understanding phase. The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution.
A second iteration can yield an improved solution. Just the experience of thinking about the business, the data, and the performance goals often leads to new ideas for improving business performance, and even to new lines of business or new ventures.
In practice, there should be shortcuts back from each stage to each prior one because the process always retains some exploratory aspects, and a project should be flexible enough to revisit prior steps based on discoveries made.
Data Science and Business Strategy
Thinking Data-Analytically, Redux

Managers have to understand the fundamental principles well enough to envision and/or appreciate data science opportunities, to supply the appropriate resources to the data science teams, and to be willing to invest in data and experimentation.
Furthermore, unless the firm has on its management team a seasoned, practical data scientist, often the management must steer the data science team carefully to make sure that the team stays on track toward an eventually useful business solution.
This is very difficult if the managers don’t really understand the principles. Managers need to be able to ask probing questions of a data scientist, who often can get lost in technical details. We need to accept that each of us has strengths and weaknesses, and as data science projects span so much of a business, a diverse team is essential.
Just as we can’t expect a manager necessarily to have deep expertise in data science, we can’t expect a data scientist necessarily to have deep expertise in business solutions.
However, an effective data science team involves collaboration between the two, and each needs to have some understanding of the fundamentals of the other’s area of responsibility.
Just as it would be a Sisyphean task to manage a data science team where the team had no understanding of the fundamental concepts of business, it likewise is extremely frustrating at best, and often a tremendous waste, for data scientists to struggle under management that does not understand basic principles of data science.
For example, it is not uncommon for data scientists to struggle under management that (sometimes vaguely) sees the potential benefit of predictive modeling, but does not have enough appreciation for the process to invest in proper training data or in proper evaluation procedures.
Such a company may “succeed” in engineering a model that is predictive enough to produce a viable product or service but will be at a severe disadvantage to a competitor who invests in doing the data science well.
A solid grounding in the fundamentals of data science has much more far-reaching strategic implications. We know of no systematic scientific study, but broad experience has shown that as executives, managers, and investors increase their exposure to data science projects, they see more and more opportunities in turn.
We see extreme cases in companies like Google and Amazon (there is a vast amount of data science underlying web search, as well as Amazon’s product recommendations and other offerings). Both of these companies eventually built subsequent products offering “big data” and data-science related services to other firms.
Many, possibly most, data-science oriented start-ups use Amazon’s cloud storage and processing services for some tasks. Google’s “Prediction API” is increasing in sophistication and utility (we don’t know how broadly used it is).
Those are extreme cases, but the basic pattern is seen in almost every data-rich firm. Once the data science capability has been developed for one application, other applications throughout the business become obvious. Louis Pasteur famously wrote, “Fortune favors the prepared mind.”
Modern thinking on creativity focuses on the juxtaposition of a new way of thinking with a mind “saturated” with a particular problem. Working through case studies (either in theory or in practice) of data science applications helps prime the mind to see opportunities and connections to new problems that could benefit from data science.
With the increased understanding of the use of data science for helping to solve business problems, one telecommunications firm subsequently applied similar ideas to decisions about how to allocate a massive capital investment to best improve its network, and how to reduce fraud in its burgeoning wireless business.
The progression continued. Data science projects for reducing fraud discovered that incorporating features based on social-network connections (via who-calls-whom data) into fraud prediction models improved the ability to discover fraud substantially.
Next, in telecommunications, such social features were added to models for churn prediction, with equally beneficial results. The ideas diffused to the online advertising industry, and there was a subsequent flurry of development of online advertising based on the incorporation of data on online social connections (at Facebook and at other firms in the online advertising ecosystem).
This progression was driven both by experienced data scientists moving among business problems and by data-science-savvy managers and entrepreneurs, who saw new opportunities for data science advances in the academic and business literature.
Achieving a Competitive Advantage with Data Science
Increasingly, firms are considering whether and how they can obtain competitive advantage from their data and/or from their data science capability. This is important strategic thinking that should not be superficial, so let’s spend some time digging into it.
Data and data science capability are (complementary) strategic assets. Under what conditions can a firm achieve competitive advantage from such an asset? First of all, the asset has to be valuable to the firm. This seems obvious, but note that the value of an asset to a firm depends on the other strategic decisions that the firm has made.
The lesson is that we need to think carefully in the business understanding phase as to how data and data science can provide value in the context of our business strategy, and also whether it would do the same in the context of our competitors’ strategies. This can identify both possible opportunities and possible threats.
We should think both about the data asset(s) and the data science capability. Do we have a unique data asset? If not, do we have an asset the utilization of which is better aligned with our strategy than with the strategy of our competitors? Or are we better able to take advantage of the data asset due to our better data science capability?
The flip side of asking about achieving a competitive advantage with data and data science is asking whether we are at a competitive disadvantage. It may be that the answers to the previous questions are affirmative for our competitors and not for us.
In what follows we will assume that we are looking to achieve a competitive advantage, but the arguments apply symmetrically if we are trying to achieve parity with a data-savvy competitor.
Sustaining Competitive Advantage with Data Science
The next question is: even if we can achieve a competitive advantage, can we sustain it? If our competitors can easily duplicate our assets and capabilities, our advantage may be short-lived.
This is an especially critical question if our competitors have greater resources than we do: by adopting our strategy, they may surpass us.
One strategy for competing based on data science is to plan to always keep one step ahead of the competition: always be investing in new data assets, and always be developing new techniques and capabilities. Such a strategy can provide for an exciting and possibly fast-growing business, but generally few companies are able to execute it.
For example, you must have confidence that you have one of the best data science teams, since the effectiveness of data scientists has a huge variance, with the best being much more talented than the average. If you have a great team, you may be willing to bet that you can keep ahead of the competition. We will discuss data science teams more below.
The alternative to always keeping one step ahead of the competition is to achieve sustainable competitive advantage because competitors cannot replicate the data asset or the data science capability, or can do so only at elevated expense. There are several avenues to such sustainability.
Formidable Historical Advantage
Historical circumstances may have placed our firm in an advantageous position, and it may be too costly for competitors to reach the same position. Amazon again provides an outstanding example. In the “Dotcom Boom” of the 1990s, Amazon was able to sell books below cost, and investors continued to reward the company.
This allowed Amazon to amass tremendous data assets (such as massive data on online consumers’ buying preferences and online product reviews), which then allowed them to create valuable data-based products (such as recommendations and product ratings).
These historical circumstances are gone: it is unlikely today that investors would provide the same level of support to a competitor that was trying to replicate Amazon’s data asset by selling books below cost for years on end (not to mention that Amazon has moved far beyond books).
This example also illustrates that the data products themselves can increase the cost to competitors of replicating the data asset. Consumers value the data-driven recommendations and product reviews/ratings that Amazon provides.
This creates switching costs: competitors would have to provide extra value to Amazon’s customers to entice them to shop elsewhere—either with lower prices or with some other valuable product or service that Amazon does not provide.
Thus, when the data acquisition is tied directly to the value provided by the data, the resulting virtuous cycle creates a catch-22 for competitors: competitors need customers in order to acquire the necessary data, but they need the data in order to provide equivalent service to attract the customers.
Entrepreneurs and investors might turn this strategic consideration around: what historical circumstances now exist that may not continue indefinitely, and which may allow me to gain access to or to build a data asset more cheaply than will be possible in the future? Or which will allow me to build a data science team that would be more costly (or impossible) to build in the future?
Unique Intellectual Property
Our firm may have unique intellectual property. Data science intellectual property can include novel techniques for mining the data or for using the results. These might be patented, or they might just be trade secrets.
In the former case, a competitor either will be unable to (legally) duplicate the solution or will have an increased expense of doing so, either by licensing our technology or by developing new technology to avoid infringing on the patent.
In the case of a trade secret, it may be that the competitor simply does not know how we have implemented our solution. With data science solutions, the actual mechanism is often hidden, with only the result visible.
Unique Intangible Collateral Assets
Our competitors may not be able to figure out how to put our solution into practice. With successful data science solutions, the actual source of good performance (for example with effective predictive modeling) may be unclear. The effectiveness of a predictive modeling solution may depend critically on the problem engineering, the attributes created, the combining of different models, and so on.
It often is not clear to a competitor how performance is achieved in practice. Even if our algorithms are published in detail, many implementation details may be critical to getting a solution that works in the lab to work in production.
Furthermore, success may be based on intangible assets such as a company culture that is particularly suitable for the deployment of data science solutions. For example, a culture that embraces business experimentation and (rigorously) supporting claims with data will naturally be an easier place for data science solutions to succeed.
Alternatively, if developers are encouraged to understand data science, they are less likely to screw up an otherwise top-quality solution. Recall our maxim: Your model is not what your data scientists design, it’s what your engineers implement.
Attracting and Nurturing Data Scientists and Their Teams
At the beginning of the blog, we noted that the two most important factors in ensuring that our firm gets the most from its data assets are: (i) the firm’s management must think data-analytically, and (ii) the firm’s management must create a culture where data science, and data scientists, will thrive.
As we mentioned above, there can be a huge difference between the effectiveness of a great data scientist and an average data scientist, and between a great data science team and an individually great data scientist. But how can one confidently engage top-notch data scientists? How can we create great teams?
This is a very difficult question to answer in practice. At the time of this writing, the supply of top-notch data scientists is quite thin, resulting in a very competitive market for them.
The best companies at hiring data scientists are the IBMs, Microsofts, and Googles of the world, who clearly demonstrate the value they place on data science via compensation, perks, and/or intangibles, such as one particular factor not to be taken lightly: data scientists like to be around other top-notch data scientists.
One might argue that they need to be around other top-notch data scientists, not only to enjoy their day-to-day work but also because the field is vast and the collective mind of a group of data scientists can bring to bear a much broader array of particular solution techniques.
However, just because the market is difficult does not mean all is lost. Many data scientists want to have more individual influence than they would have at a corporate behemoth.
Many want more responsibility (and the concomitant experience) with the broader process of producing a data science solution. Some have visions of becoming Chief Scientist for a firm and understand that the path to Chief Scientist may be better paved with projects in smaller and more varied firms.
Some have visions of becoming entrepreneurs and understand that being an early data scientist for a startup can give them invaluable experience.
And some simply will enjoy the thrill of taking part in a fast-growing venture: working in a company growing at 20% or 50% a year is much different from working in a company growing at 5% or 10% a year (or not growing at all).
In all these cases, the firms that have an advantage in hiring are those that create an environment for nurturing data science and data scientists. If you do not have a critical mass of data scientists, be creative. Encourage your data scientists to become part of local data science technical communities and global data science academic communities.
A note on publishing
Science is a social endeavor, and the best data scientists often want to stay engaged in the community by publishing their advances.
Firms sometimes have trouble with this idea, feeling that they are “giving away the store” or tipping their hand to competitors by revealing what they are doing. On the other hand, if they do not, they may not be able to hire or retain the very best.
Publishing also has some advantages for the firm, such as increased publicity, exposure, external validation of ideas, and so on. There is no clear-cut answer, but the issue needs to be considered carefully.
Some firms file patents aggressively on their data science ideas, after which academic publication is natural if the idea is truly novel and important.
A firm’s data science presence can be bolstered by engaging academic data scientists. There are several ways of doing this. For those academics interested in practical applications of their work, it may be possible to fund their research programs.
Both of your authors, when working in industry, funded academic programs, essentially extending their data science teams with researchers who focused on the firms’ problems and interacted with the in-house teams.
The best arrangement (by our experience) is a combination of data, money, and an interesting business problem; if the project ends up being a portion of the Ph.D. thesis of a student in a top-notch program, the benefit to the firm can far outweigh the cost.
Funding a Ph.D. student might cost a firm in the ballpark of $50K/year, which is a fraction of the fully loaded cost of a top data scientist. A key is to have enough understanding of data science to select the right professor—one with the appropriate expertise for the problem at hand.
Another tactic that can be very cost-effective is to take on one or more top-notch data scientists as scientific advisors.
If the relationship is structured such that the advisors truly interact on the solutions to problems, firms that do not have the resources or the clout to hire the very best data scientists can substantially increase the quality of the eventual solutions.
Such advisors can be data scientists at partner firms, data scientists from firms who share investors or board members, or academics who have some consulting time.
Examine Data Science Case Studies
Beyond building a solid data science team, how can a manager ensure that her firm is best positioned to take advantage of opportunities for applying data science? Make sure that there is an understanding of and appreciation for the fundamental principles of data science. Empowered employees across the firm often see novel applications.
After gaining command of the fundamental principles of data science, the best way to position oneself for success is to work through many examples of the application of data science to business problems.
Read case studies that actually walk through the data mining process. Formulate your own case studies. Actually mining data is helpful, but even more important is working through the connection between the business problem and the possible data science solutions.
The more different problems you work through, the better you will be at naturally seeing and capitalizing on opportunities for bringing to bear the information and knowledge “stored” in the data—often a problem formulation can be applied by analogy to another problem, with only minor changes.
For example, when targeting consumers with online display advertisements, obtaining an adequate supply of the ideal training data would have been prohibitively expensive. However, data were available at much lower cost from various other distributions and for other target variables.
Be Ready to Accept Creative Ideas from Any Source
Once different role players understand the fundamental principles of data science, creative ideas for new solutions can come from any direction—such as from executives examining potential new lines of business, from directors dealing with profit and loss responsibility, from managers looking critically at a business process, and from line employees with detailed knowledge of exactly how a particular business process functions.
Data scientists should be encouraged to interact with employees throughout the business, and part of their performance evaluation should be based on how well they produce ideas for improving the business with data science.
Incidentally, doing so can pay off in unintended ways: the data processing skills possessed by data scientists often can be applied in ways that are not so sophisticated but nevertheless can help other employees without those skills. Often a manager may have no idea that particular data can even be obtained—data that might help the manager directly, without sophisticated data science.
Be Ready to Evaluate Proposals for Data Science Projects
Ideas for improving business decisions through data science can come from any direction. Managers, investors, and employees should be able to formulate such ideas clearly, and decision makers should be prepared to evaluate them. Essentially, we need to be able to formulate solid proposals and to evaluate proposals.
The data mining process provides a framework to direct this. Each stage in the process reveals questions that should be asked both in formulating proposals for projects and in evaluating them:

• Is the business problem well specified? Does the data science solution solve the problem?
• Is it clear how we would evaluate a solution?
• Would we be able to see evidence of success before making a huge investment in deployment?
• Does the firm have the data assets it needs? For example, for supervised modeling, are there actually labeled training data? Is the firm ready to invest in the assets it does not yet have?

Let’s walk through an illustrative example.
Example Data Mining Proposal
Your company has an installed user base of 900,000 current users of your Whiz-bang widget. You now have developed Whiz-bang 2.0, which has substantially lower operating costs than the original.
Ideally, you would like to convert (“migrate”) your entire user base over to version 2.0; however, using 2.0 requires that users master the new interface, and there is a serious risk that in attempting to do so, the customers will become frustrated and not convert, become less satisfied with the company, or in the worst case, switch to your competitor’s popular Boppo widget.
Marketing has designed a brand-new migration incentive plan, which will cost $250 per selected customer. There is no guarantee that a customer will choose to migrate even if she takes this incentive.
We will use data to build a model of whether or not a customer will migrate given the incentive. The dataset will comprise a set of attributes of customers, such as the number and type of prior customer service interactions, level of usage of the widget, location of the customer, estimated technical sophistication, tenure with the firm, and other loyalty indicators, such as number of other firm products and services in use.
The target will be whether or not the customer will migrate to the new widget if given the incentive. Using these data, we will build a linear regression model to estimate the target variable, and we will randomly select customers with model scores greater than 0.5 to receive the incentive. The model will be evaluated based on its accuracy on these data; in particular, we want to ensure that the accuracy is substantially greater than if we targeted randomly.
Data Understanding/Data Preparation
• There aren’t any labeled training data! This is a brand-new incentive. We should invest some of our budget in obtaining labels for some examples. This can be done by targeting a (randomly) selected subset of customers with the incentive.
• If we are worried about wasting the incentive on customers who are likely to migrate without it, we also should observe a “control group” over the period where we are obtaining training data.
This should be easy, since everyone we don’t target when gathering labels would be a “control” subject. We can build a separate model of migrating or not given no incentive, and combine the two models in an expected value framework.
• Linear regression is not a good choice for modeling a categorical target variable. Rather one should use a classification method, such as tree induction, logistic regression, k-NN, and so on.
• The evaluation shouldn’t be on the training data. Some sort of holdout approach should be used (e.g., cross-validation and/or a staged approach as discussed above).
• Is there going to be any domain-knowledge validation of the model? What if it is capturing some weirdness of the data collection process?
• The idea of randomly selecting customers with regression scores greater than 0.5 is not well considered. First, it is not clear that a regression score of 0.5 really corresponds to a probability of migration of 0.5. Second, 0.5 is rather arbitrary in any case.
Third, since our model provides a ranking (e.g., by likelihood of migration, or by expected value if we use the more complex formulation), we should use the ranking to guide our targeting: choose as many of the top-ranked candidates as the budget allows.
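To make these corrections concrete, here is a minimal, self-contained sketch in plain Python. All numbers, names, and data are hypothetical: it generates synthetic labeled data (standing in for the pilot campaign), fits a simple logistic regression by gradient descent in place of linear regression, evaluates on a holdout split rather than on the training data, and targets by rank under the budget rather than by a 0.5 cutoff.

```python
import math
import random

random.seed(0)

# Hypothetical synthetic data: one feature (tenure in years) and a binary
# label (1 = migrated when offered the incentive, 0 = did not). In practice
# the labels would come from a randomized pilot campaign, since no labeled
# training data exist yet for the brand-new incentive.
def make_customer():
    tenure = random.uniform(0, 10)
    p_migrate = 1 / (1 + math.exp(-(tenure - 5)))  # longer tenure, likelier to migrate
    return (tenure, 1 if random.random() < p_migrate else 0)

data = [make_customer() for _ in range(1000)]
train, holdout = data[:800], data[800:]  # evaluate on held-out data, not training data

# Fit a one-feature logistic regression by gradient descent: a classification
# method, unlike linear regression, which is ill suited to a binary target.
w, b = 0.0, 0.0
for _ in range(500):
    gw = gb = 0.0
    for x, y in train:
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= 0.01 * gw / len(train)
    b -= 0.01 * gb / len(train)

# Rank holdout customers by model score and target as many of the top-ranked
# candidates as the $250-per-offer budget allows; no arbitrary 0.5 cutoff.
budget, cost_per_offer = 10_000, 250
k = budget // cost_per_offer  # 40 offers
ranked = sorted(holdout, key=lambda c: w * c[0] + b, reverse=True)
targeted = ranked[:k]

top_rate = sum(y for _, y in targeted) / k
base_rate = sum(y for _, y in holdout) / len(holdout)
print(f"migration rate, top-{k} targeted: {top_rate:.2f}; overall holdout: {base_rate:.2f}")
```

Because the scores are used only to rank customers, the targeting works even if the raw scores are not calibrated probabilities, which is exactly why a 0.5 score need not correspond to a 50% chance of migration.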
Of course, this is just one example with a particular set of flaws. A different set of concepts may need to be brought to bear for a different proposal that is flawed in other ways.
A Firm’s Data Science Maturity
For a firm to realistically plan data science endeavors it should assess, frankly and rationally, its own maturity in terms of data science capability. It is beyond the scope of this blog to provide a self-assessment guide, but a few words on the topic are important.
Firms vary widely in their data science capabilities along many dimensions. One dimension that is very important for strategic planning is the firm’s “maturity”: specifically, how systematic and well founded are the processes used to guide the firm’s data science projects.
At one end of the maturity spectrum, a firm’s data science processes are completely ad hoc. In many firms, the employees engaged in data science and business analytics endeavors have no formal training in these areas, and the managers involved have little understanding of the fundamental principles of data science and data analytic thinking.
Note: The reader interested in this notion of the maturity of a firm’s capabilities is encouraged to read about the Capability Maturity Model for software engineering, which is the inspiration for this discussion.
A note on “immature” firms
Being “immature” does not mean that a firm is destined to fail. It means that success is highly variable and is much more dependent on luck than in a mature firm. Project success will depend upon the heroic efforts of individuals who happen to have a natural acuity for data-analytic thinking.
An immature firm may implement not-so-sophisticated data science solutions on a large scale or may implement sophisticated solutions on a small scale. Rarely, though, will an immature firm implement sophisticated data science solutions on a large scale.
A firm with a medium level of maturity employs well-trained data scientists, as well as business managers and other stakeholders who understand the fundamental principles of data science.
Both sides can think clearly about how to solve business problems with data science, and both sides participate in the design and implementation of solutions that directly address the problems of the business.
At the high end of maturity are firms who continually work to improve their data science processes (and not just the solutions). Executives at such firms continually challenge the data science team to instill processes that will align their solutions better with the business problems.
At the same time, they realize that pragmatic trade-offs may favor the choice of a suboptimal solution that can be realized today over a theoretically much better solution that won’t be ready until next year.
Data scientists at such a firm should have the confidence that when they propose investments to improve data science processes, their suggestions will be met with open and informed minds. That’s not to say that every such request will be approved, but that the proposal will be evaluated on its own merits in the context of the business.
Note: Data science is neither operations nor engineering.
There is some danger in making an analogy to the Capability Maturity Model from software engineering— the danger that the analogy will be taken too literally. Trying to apply the same sort of processes that work for software engineering, or worse for manufacturing or operations, will fail for data science.
Moreover, misguided attempts to do so will send a firm’s best data scientists out the door before management even knows what happened. The key is to understand data science processes, and what it takes to do data science well, and to work to establish consistency and support. Remember that data science is more like R&D than engineering or manufacturing.
As a concrete example, management should consistently make available the resources needed for solid evaluation of data science projects early and often. Sometimes this involves investing in data that would not otherwise have been obtained.
Often this involves assigning engineering resources to support the data science team. The data science team should in return work to provide management with evaluations that are as well aligned with the actual business problem(s) as possible.
As a concrete example, consider yet again our telecom churn problem and how firms of varying maturity might address it:
An immature firm will have (hopefully) analytically adept employees implementing ad hoc solutions based on their intuitions about how to manage churn. These may work well or they may not. In an immature firm, it will be difficult for management to evaluate these choices against alternatives, or to determine when they’ve implemented a nearly optimal solution.
A firm of medium maturity will have implemented a well-defined framework for testing different alternative solutions. They will test under conditions that mimic as closely as possible the actual business setting—for example, running the latest production data through a testbed platform that compares how different methods “would have done,” and considering carefully the costs and benefits involved.
A very mature organization may have deployed the exact same methods as the medium-maturity firm for identifying the customers with the highest probability of leaving, or even the highest expected loss if they were to churn.
They would also be working to implement the processes, and gather the data, necessary to judge the effect of the incentives themselves, and thereby work toward finding those individuals for whom the incentives will produce the largest expected increase in value (over not giving the incentive).

Such a firm may also be working to integrate such a procedure into an experimentation and/or optimization framework for assessing different offers, or different parameters (such as the level of discount) of a given offer.

A frank self-assessment of data science maturity is difficult, but it is essential to getting the best out of one’s current capabilities, and to improving those capabilities.
The practice of data science can best be described as a combination of analytical engineering and exploration. The business presents a problem we would like to solve. Rarely is the business problem directly one of our basic data mining tasks. We decompose the problem into subtasks that we think we can solve, usually starting with existing tools.
For some of these tasks we may not know how well we can solve them, so we have to mine the data and conduct evaluation to see. If that does not succeed, we may need to try something completely different.
In the process, we may discover knowledge that will help us to solve the problem we had set out to solve, or we may discover something unexpected that leads us to other important successes.
Neither analytical engineering nor exploration should be omitted when considering the application of data science methods to solve a business problem. Omitting the engineering aspect usually makes it much less likely that the results of mining data will actually solve the business problem.
Omitting the understanding of the process as one of exploration and discovery often keeps an organization from putting the right management, incentives, and investments in place for the project to succeed.
These concepts span the process from envisioning how data science can improve business decisions, to applying data science techniques, to deploying the results to improve decision-making. The concepts also undergird a large array of business analytics.
Implications for Managing the Data Science Team
It is tempting—but usually a mistake—to view the data mining process as a software development cycle. Indeed, data mining projects are often treated and managed as engineering projects, which is understandable when they are initiated by software departments, with data generated by a large software system and analytics results fed back into it.
Managers are usually familiar with software technologies and are comfortable in managing software projects. Milestones can be agreed upon and success is usually unambiguous.
Software managers might look at the CRISP data mining cycle and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way.
This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based on exploration; it iterates on approaches and strategy rather than on software designs.
Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem. Engineering a data mining solution directly for deployment can be an expensive premature commitment.
Instead, analytics projects should prepare to invest in information to reduce uncertainty in various ways. Small investments can be made via pilot studies and throwaway prototypes. Data scientists should review the literature to see what else has been done and how it has worked.
On a larger scale, a team can invest substantially in building experimental testbeds to allow extensive agile experimentation. If you’re a software manager, this will look more like research and exploration than you’re used to, and maybe more than you’re comfortable with.
Software skills versus analytics skills
Although data mining involves software, it also requires skills that may not be common among programmers. In software engineering, the ability to write efficient, high-quality code from requirements may be paramount. Team members may be evaluated using software metrics such as the amount of code written or the number of bug tickets closed.
In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results.
In building a data science team, these qualities, rather than traditional software engineering expertise, are skills that should be sought.
Other Analytics Techniques and Technologies
Business analytics involves the application of various technologies to the analysis of data. Many of these go beyond this blog’s focus on data-analytic thinking and the principles of extracting useful patterns from data.
Nonetheless, it is important to be acquainted with these related techniques, to understand what their goals are, what role they play, and when it may be beneficial to consult experts in them.
To this end, we present six groups of related analytic techniques. Where appropriate we draw comparisons and contrasts with data mining.
The main difference from these related techniques is that data mining focuses on the automated search for knowledge, patterns, or regularities in data. An important skill for a business analyst is to be able to recognize which sort of analytic technique is appropriate for addressing a particular problem.
The term “statistics” has two different uses in business analytics. First, it is used as a catchall term for the computation of particular numeric values of interest from data (e.g., “We need to gather some statistics on our customers’ usage to determine what’s going wrong here.”) These values often include sums, averages, rates, and so on.
Summary statistics should be chosen with close attention to the business problem to be solved (one of the fundamental principles we will present later), and also with attention to the distribution of the data they are summarizing.
The other use of the term “statistics” is to denote the field of study that goes by that name, which we might differentiate by using the proper noun Statistics. The field of Statistics provides us with a huge amount of knowledge that underlies analytics, and can be thought of as a component of the larger field of Data Science.
For example, Statistics helps us to understand different data distributions and which statistics are appropriate to summarize each. Statistics also helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions.
In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset. Most relevant to this blog, many of the techniques for extracting models or patterns from data have their roots in Statistics.
For example, a preliminary study may suggest that customers in the Northeast have a churn rate of 22.5%, whereas the nationwide average churn rate is only 15%. This may be just a chance fluctuation since the churn rate is not constant; it varies over regions and over time, so differences are to be expected.
This contrasts with the (complementary) process of data mining, which may be seen as hypothesis generation: can we find patterns in data in the first place? Hypothesis generation should then be followed by careful hypothesis testing.
In addition, data mining procedures may produce numerical estimates, and we often also want to provide confidence intervals on these estimates. We will return to this when we discuss the evaluation of the results of data mining.
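As a sketch of the kind of hypothesis test involved, the following uses only the Python standard library to run a one-sided two-proportion z-test on the churn example. The sample sizes are hypothetical, chosen to show how the same 22.5% regional rate can be statistically convincing or not, depending on how much data it rests on.

```python
import math

def churn_z_test(churners, n_region, p_national):
    """One-sided two-proportion z-test: is the regional churn rate
    significantly above the national rate, or a plausible chance fluctuation?"""
    p_hat = churners / n_region
    se = math.sqrt(p_national * (1 - p_national) / n_region)
    z = (p_hat - p_national) / se
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 1 - standard normal CDF
    return z, p_value

# 22.5% churn among 400 Northeast customers vs. a 15% national rate:
z, p = churn_z_test(90, 400, 0.15)
print(f"n=400: z = {z:.2f}, one-sided p-value = {p:.5f}")

# The same 22.5% rate observed on only 40 customers is far less convincing:
z_small, p_small = churn_z_test(9, 40, 0.15)
print(f"n=40:  z = {z_small:.2f}, one-sided p-value = {p_small:.3f}")
```

With 400 customers the difference is very unlikely to be a chance fluctuation; with 40, a rate this high arises by chance often enough that we should not yet conclude the Northeast is genuinely different.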
There are plenty of introductory books on statistics and statistics for business, and any treatment we would try to squeeze in would be either very narrow or superficial.
That said, one statistical term that is often heard in the context of business analytics is “correlation.” For example, “Are there any indicators that correlate with a customer’s later defection?”
As with the term statistics, “correlation” has both a general-purpose meaning (variations in one quantity tell us something about variations in the other), and a specific technical meaning (e.g., linear correlation based on a particular mathematical formula).
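To illustrate the specific technical sense, here is a small pure-Python computation of the Pearson (linear) correlation coefficient; the tenure and loyalty figures are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson (linear) correlation coefficient, the technical sense of 'correlation'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical figures: customer tenure (years) vs. a loyalty score.
tenure = [1, 2, 3, 4, 5, 6]
loyalty = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
r = pearson(tenure, loyalty)
print(f"r = {r:.3f}")  # near +1: the two quantities are strongly linearly related
```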
The notion of correlation will be the jumping-off point for the rest of our discussion of data science for business.
A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system. Many tools are available to answer one-off or repeating queries about data posed by an analyst.
These tools are usually frontends to database systems, based on Structured Query Language (SQL) or a tool with a graphical user interface (GUI) to help formulate queries (e.g., query-by-example, or QBE).
For example, if the analyst can define “profitable” in operational terms computable from items in the database, then a query tool could answer: “Who are the most profitable customers in the Northeast?”
The analyst may then run the query to retrieve a list of the most profitable customers, possibly ranked by profitability. This activity differs fundamentally from data mining in that there is no discovery of patterns or models.
Database queries are appropriate when an analyst already has an idea of what might be an interesting subpopulation of the data and wants to investigate this population or confirm a hypothesis about it.
If those are the people to be targeted with an offer, a query tool can be used to retrieve all of the information about them (“*”) from the database.
Query tools generally have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, and more. Data scientists often become quite adept at writing queries to extract the data they need.
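As a minimal sketch, the following builds an in-memory SQLite database with a hypothetical customers table (all names and figures are illustrative) and runs the “most profitable customers in the Northeast” query, with “profitable” defined operationally as revenue minus cost. Note that nothing is being discovered here; records are simply retrieved and ranked.

```python
import sqlite3

# An in-memory database with a hypothetical customers table; the table and
# column names are illustrative, not a real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY, name TEXT, region TEXT,
    total_revenue REAL, total_cost REAL)""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [(1, "Acme",     "Northeast", 120000, 70000),
     (2, "Globex",   "Northeast",  90000, 30000),
     (3, "Initech",  "Midwest",    80000, 60000),
     (4, "Umbrella", "Northeast",  50000, 45000)])

# "Profitable" defined operationally: revenue minus cost, computed per row.
rows = conn.execute("""
    SELECT name, total_revenue - total_cost AS profit
    FROM customers
    WHERE region = 'Northeast'
    ORDER BY profit DESC
""").fetchall()
for name, profit in rows:
    print(name, profit)
```

The entire burden of defining “profitable” falls on the analyst; the query tool simply computes and sorts what it is told to.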
On-line Analytical Processing (OLAP) provides an easy-to-use GUI for querying large data collections, for the purpose of facilitating data exploration. The idea of “on-line” processing is that the processing is done in real time, so analysts and decision makers can find answers to their queries quickly and efficiently.
Unlike the “ad hoc” querying enabled by tools like SQL, for OLAP the dimensions of analysis must be pre-programmed into the OLAP system. If we’ve foreseen that we would want to explore sales volume by region and time, we could have these dimensions programmed into the system, and drill down into populations, often simply by clicking and dragging and manipulating dynamic charts.
OLAP systems are designed to facilitate manual or visual exploration of the data by analysts. OLAP performs no modeling or automatic pattern finding.
As an additional contrast, unlike with OLAP, data mining tools generally can incorporate new dimensions of analysis easily as part of the exploration. OLAP tools can be a useful complement to data mining tools for discovery from business data.
Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database. Analytical systems can access data warehouses. Data warehousing may be seen as a facilitating technology of data mining.
It is not always necessary, as most data mining does not access a data warehouse, but firms that decide to invest in data warehouses often can apply data mining more broadly and more deeply in the organization.
For example, if a data warehouse integrates records from sales and billing as well as from human resources, it can be used to find characteristic patterns of effective salespeople.
Some of the same methods we discuss in this blog are at the core of a different set of analytic methods, which often are collected under the rubric of regression analysis, and are widely applied in the field of statistics and in other fields founded on econometric analysis. This blog will focus on different issues than those usually encountered in a regression analysis blog or class.
Here we are less interested in explaining a particular dataset than we are in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically, this will involve estimating or predicting values for cases that are not in the analyzed data set.
Therefore, we will spend some time talking about testing patterns on new data to evaluate their generality, and about techniques for reducing the tendency to find patterns specific to a particular set of data, but that do not generalize to the population from which the data come.
The topic of explanatory modeling versus predictive modeling can elicit deep-felt debate, which goes well beyond our focus. What is important is to realize that there is considerable overlap in the techniques used, but that the lessons learned from explanatory modeling do not all apply to predictive modeling.
So a reader with some background in regression analysis may encounter new and even seemingly contradictory lessons.
Machine Learning and Data Mining
The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition.
Machine Learning as a field of study arose as a subfield of Artificial Intelligence, which was concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent’s experience in the world.
Such improvement often involves analyzing data from the environment and making predictions about unknown quantities, and over the years this data analysis aspect of machine learning has come to play a very large role in the field.
As machine learning methods were deployed broadly, the scientific disciplines of Machine Learning, Applied Statistics, and Pattern Recognition developed close ties, and the separation between the fields has blurred.
The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns.
Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.
Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition (how an intelligent agent will use learned knowledge to reason and act in its environment), which are not concerns of Data Mining.
Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is.
As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.
Answering Business Questions with These Techniques
To illustrate how these techniques apply to business analytics, consider a set of questions that may arise and the technologies that would be appropriate for answering them.
These questions are all related but each is subtly different. It is important to understand these differences in order to understand what technologies one needs to employ and what people may be necessary to consult.
1. Who are the most profitable customers?
If “profitable” can be defined clearly based on existing data, this is a straightforward database query. A standard query tool could be used to retrieve a set of customer records from a database. The results could be sorted by cumulative transaction amount or some other operational indicator of profitability.
2. Is there really a difference between profitable customers and the average customer?
This is a question about conjecture or hypothesis (in this case, “There is a difference in value to the company between the profitable customers and the average customer”), and statistical hypothesis testing would be used to confirm or disconfirm it.
Statistical analysis could also derive a probability or confidence bound that the difference is real. Typically, the result would be something like: “The value of these profitable customers is significantly different from that of the average customer, with probability < 5% that the difference is due to random chance.”
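One simple, assumption-light way to carry out such a test is a permutation test. The sketch below uses hypothetical annual-value figures for the two customer groups and shuffles the group labels many times to see how often a mean difference as large as the observed one arises by chance alone.

```python
import random

random.seed(1)

# Hypothetical annual-value figures for sampled "profitable" and "average" customers.
profitable = [980, 1120, 1050, 1230, 990, 1175]
average = [610, 720, 680, 590, 750, 640]

observed = sum(profitable) / len(profitable) - sum(average) / len(average)

# Permutation test: under the null hypothesis the group labels are
# interchangeable, so shuffle the pooled values repeatedly and count how
# often a mean difference at least as large as the observed one appears.
pooled = profitable + average
trials, count = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(profitable)], pooled[len(profitable):]
    diff = sum(a) / len(a) - sum(b) / len(b)
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed mean difference: {observed:.1f}; permutation p-value: {p_value:.4f}")
```

A small p-value here says the observed difference would be very surprising if profitable and average customers were really drawn from the same value distribution.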
3. But who really are these customers? Can I characterize them?
We often would like to do more than just list out the profitable customers. We would like to describe the common characteristics of profitable customers. The characteristics of individual customers can be extracted from a database using techniques such as database querying, which also can be used to generate summary statistics.
A deeper analysis should involve determining what characteristics differentiate profitable customers from unprofitable ones. This is the realm of data science, using data mining techniques for automated pattern finding—which we discuss in depth in the subsequent blogs.
4. Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
These questions could be addressed by data mining techniques that examine historical customer records and produce predictive models of profitability. Such techniques would generate models from historical data that could then be applied to new customers to generate predictions.
Note that this last pair of questions comprises two subtly different data mining questions. The first, a classification question, may be phrased as a prediction of whether a given new customer will be profitable (yes/no, or the probability thereof). The second may be phrased as a prediction of the (numerical) value that the customer will bring to the company. More on that as we proceed.
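The contrast between the two question types can be sketched with a toy k-nearest-neighbor model (k = 3) applied to hypothetical historical records: the same neighbors answer the classification question by majority vote and the regression question by averaging a numeric target.

```python
# Hypothetical historical records:
# (years_as_customer, num_products, profitable?, annual_revenue)
history = [
    (1, 1, 0, 200), (2, 1, 0, 350), (8, 4, 1, 2100), (7, 3, 1, 1800),
    (3, 2, 0, 500), (9, 5, 1, 2600), (6, 3, 1, 1500), (2, 2, 0, 400),
]

def neighbors(x, k=3):
    """The k historical customers most similar to x (squared Euclidean distance)."""
    return sorted(history, key=lambda r: (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2)[:k]

def predict_profitable(x):
    """Classification: majority vote over the neighbors' yes/no labels."""
    return int(sum(r[2] for r in neighbors(x)) >= 2)

def predict_revenue(x):
    """Regression: average of the neighbors' numeric revenue values."""
    vals = [r[3] for r in neighbors(x)]
    return sum(vals) / len(vals)

new_customer = (7, 4)  # 7 years' tenure, 4 products
print(predict_profitable(new_customer))  # 1: predicted profitable
print(predict_revenue(new_customer))     # 1800.0: expected annual revenue
```

The model, features, and numbers are all illustrative; the point is that one target is categorical and the other numeric, which is exactly the classification/regression distinction.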
Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects.
The various fields of study related to data science have developed a set of canonical task types, such as classification, regression, and clustering. Each task type serves a different purpose and has an associated set of solution techniques.
A data scientist typically attacks a new project by decomposing it such that one or more of these canonical tasks is revealed, choosing a solution technique for each, then composing the solutions. Doing this expertly may take considerable experience and skill.
A successful data mining project involves an intelligent compromise between what the data can do (i.e., what they can predict, and how well) and the project goals. For this reason, it is important to keep in mind how data mining results will be used and use this to inform the data mining process itself.
Data mining differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying (which have their own blogs and classes).
Though the boundaries between data mining and related techniques are not always sharp, it is important to know about other techniques’ capabilities and strengths to know when they should be used.
To a business manager, the data mining process is useful as a framework for analyzing a data mining project or proposal. The process provides a systematic organization, including a set of questions that can be asked about a project or a proposed project to help understand whether the project is well conceived or is fundamentally flawed.
We will return to this after we have discussed in detail some more of the fundamental principles themselves—to which we turn now.