Data Science with Python

“Data science” is a very popular term these days, and it gets applied to so many things that its meaning has become very vague. So I’d like to start this blog by giving you a definition of data science and then discussing data science with Python. I’ve found that this definition gets right to the heart of what sets data science apart from other disciplines. Here goes:

 

Data science means doing analytics work that, for one reason or another, requires a substantial amount of software engineering skills.

 

Data Science Fundamentals by Example is an excellent starting point for those interested in pursuing a career in data science. Like any science, the fundamentals of data science are a prerequisite to competency. Without proficiency in mathematics, statistics, data manipulation, and coding, the path to success is rocky at best.

 

Data Science Techniques

Doing data science means implementing flexible, scalable, extensible systems for data preparation, analysis, visualization, and modeling.

 

Many firms are moving away from internally owned, centralized computing systems toward distributed cloud-based services. Distributed hardware and software systems, including database systems, can be expanded more easily as the data management needs of organizations grow. Doing data science means being able to gather data from the full range of database systems, relational and nonrelational, commercial and open source.

 

We employ database query and analysis tools, gathering information across distributed systems, collating information, creating contingency tables, and computing indices of relationship across variables of interest. We use information technology and database systems as far as they can take us, and then we do more, applying what we know about statistical inference and the modeling techniques of predictive analytics.

 

Database Systems

Relational databases have a row-and-column table structure, similar to a spreadsheet. We access and manipulate these data using structured query language (SQL). Because they are transaction-oriented with enforced data integrity, relational databases provide the foundation for sales order processing and financial accounting systems.
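
As a minimal sketch, Python's standard-library sqlite3 module can issue SQL against a small relational table; the orders table and its columns below are hypothetical, purely for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory database for illustration
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "alice", 120.0), (2, "bob", 75.5), (3, "alice", 42.0)])
conn.commit()

# Aggregate sales by customer with SQL
for customer, total in cur.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
    print(customer, total)
conn.close()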

 

Nonrelational databases focus on availability and scalability. They may employ key-value, column-oriented, document-oriented, or graph structures. Some are designed for online or real-time applications, where fast response times are key. Others are well suited for massive storage and offline analysis, with map-reduce providing a key data aggregation tool.

 

Statistical Inference

Statistics are functions of sample data and are more credible when samples are representative of the concerned population. Typically, large random samples, small standard errors, and narrow confidence intervals are preferred. The formal scientific method suggests that we construct theories and test those theories with sample data. The process involves drawing statistical inferences as point estimates, interval estimates, or tests of hypotheses about the population. Whatever the form of inference, we need sample data relating to questions of interest.

 

Classical and Bayesian statistics represent alternative approaches to inference, alternative ways of measuring uncertainty about the world.

 

1. Classical hypothesis testing involves making null hypotheses about population parameters and then rejecting or not rejecting those hypotheses based on sample data. Typical null hypotheses (as the word null would imply) state that there is no difference between proportions or groups, or no relationship between variables.

 

To test a null hypothesis, we compute a special statistic called a test statistic along with its associated p-value. Assuming that the null hypothesis is true, we can derive the theoretical distribution of the test statistic. We obtain a p-value by referring the sample test statistic to this theoretical distribution.

 

The p-value, itself a sample statistic, gives the probability of obtaining a test statistic at least as extreme as the one observed, under the assumption that the null hypothesis is true. Let us assume that the conditions for valid inference have been satisfied. Then, when we observe a very low p-value (0.05, 0.01, or 0.001, for instance), one of the following two things must be true:

 

  • a. An event of very low probability has occurred under the assumption that the null hypothesis is true
  • b. The null hypothesis is false.

 

A low p-value leads us to reject the null hypothesis, and we say the research results are statistically significant. A result can be statistically significant, however, without being practically meaningful.
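
A minimal sketch of a classical test in Python, assuming SciPy is available and using made-up data, computes a two-sample t-test statistic and its p-value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)   # illustrative sample data
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Null hypothesis: the two population means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print('t =', round(float(t_stat), 3), 'p =', round(float(p_value), 4))
if p_value < 0.05:
    print('reject the null hypothesis at the 5% level')
else:
    print('fail to reject the null hypothesis')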

 

2. The Bayesian approach treats parameters as random variables with probability distributions that represent our uncertainty about the world, uncertainty that can be reduced by collecting relevant sample data. Sample data and Bayes’ theorem are used to derive posterior probability distributions for these same parameters, which, in turn, are used to obtain conditional probabilities.
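
A minimal sketch of the Bayesian idea, assuming SciPy and a conjugate Beta-Binomial model with illustrative data, updates a prior into a posterior distribution for a proportion:

from scipy import stats

prior_a, prior_b = 1, 1          # uniform Beta(1, 1) prior on the proportion
successes, failures = 27, 73     # hypothetical sample data

# Bayes' theorem with a conjugate prior gives a Beta posterior
post = stats.beta(prior_a + successes, prior_b + failures)
print('posterior mean:', round(post.mean(), 3))
print('95% credible interval:', [round(v, 3) for v in post.interval(0.95)])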

 

Regression and Classification

Data science involves a search for meaningful relationships between variables. We look for relationships between pairs of continuous variables using scatter plots and correlation coefficients. We look for relationships between categorical variables using contingency tables and the methods of categorical data analysis.

 

We use multivariate methods and multiway contingency tables to examine relationships among many variables. There are two main types of predictive models: regression and classification. Regression is a prediction of a response of meaningful magnitude. Classification involves prediction of a class or category.

 

The most common form of regression is a least-squares regression, also called ordinary least-squares regression, linear regression, or multiple regression. When we use ordinary least-squares regression, we estimate regression coefficients so that they minimize the sum of the squared residuals, where residuals are differences between the observed and predicted response values.
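
A minimal sketch of ordinary least-squares regression, assuming scikit-learn and synthetic data, estimates the coefficients and computes the sum of squared residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=100)   # synthetic response

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)          # observed minus predicted responses
print('coefficient:', round(model.coef_[0], 2), 'intercept:', round(model.intercept_, 2))
print('sum of squared residuals:', round(np.sum(residuals ** 2), 2))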

 

For regression problems, we think of the response as taking any value along the real number line, although in practice, the response may take a limited number of distinct values. The important thing for regression is that the response values have meaningful magnitude.

 

Poisson regression is useful for counts. The response has meaningful magnitude but takes discrete (whole number) values with a minimum value of zero. Log-linear models for frequencies, grouped frequencies, and contingency tables of cross-classified observations fall within this domain.
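
A minimal sketch of Poisson regression, assuming statsmodels and synthetic count data, fits a log-linear model for counts:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(lam=np.exp(0.5 + 0.8 * x))   # counts with a log-linear mean

X = sm.add_constant(x)                        # intercept plus one predictor
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)                           # estimated intercept and slope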

 

Most traditional modeling techniques involve linear models or linear equations. The response or transformed response is on the left-hand side of the linear model. The linear predictor is on the right-hand side. The linear predictor involves explanatory variables and is linear in its parameters. That is, it involves the addition of coefficients or the multiplication of coefficients by the explanatory variables. The coefficients we fit in linear models represent estimates of population parameters.

 

Generalized linear models, as their name would imply, are generalizations of the classical linear regression model. They include models for choices and counts, including logistic regression, multinomial logit models, log-linear models, ordinal logistic models, Poisson regression, and survival data models. To introduce the theory behind these important models, we begin by reviewing the classical linear regression model. Generalized linear models help us model obviously nonlinear relationships between explanatory variables and responses.

 

Linear regression is a special generalized linear model. It has normally distributed responses and an identity link relating the expected value of responses to the linear predictor. Linear regression coefficients may be estimated by ordinary least squares. For other members of the family of generalized linear models, we use maximum likelihood estimation. With the classical linear model, we have an analysis of variance and F-tests. With generalized linear models, we have an analysis of deviance and likelihood ratio tests, which are asymptotic chi-square tests.

 

The method of logistic regression, although called “regression,” is actually a classification method. It involves the prediction of a binary response. Ordinal and multinomial logit models extend logistic regression to problems involving more than two classes. Linear discriminant analysis is another classification method from the domain of traditional statistics.
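
A minimal sketch of logistic regression used as a binary classifier, assuming scikit-learn and synthetic data:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic binary response

clf = LogisticRegression().fit(X, y)
print('predicted classes:', clf.predict(X[:5]))
print('class probabilities:', clf.predict_proba(X[:5]).round(2))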

 

Data Mining and Machine Learning

Machine learning refers to the methods or algorithms that are used as an alternative to traditional statistical methods. When we apply these methods to the analysis of data, the activity is often termed data mining. Recommender systems, collaborative filtering, association rules, optimization methods based on heuristics, as well as a myriad of methods for regression, classification, and clustering are all examples of machine learning.

 

With traditional statistics, we define the model specification prior to working with the data and also make assumptions about the population distributions from which the data have been drawn. Machine learning, however, is data-adaptive: the model specification is defined by applying algorithms to the data. With machine learning, few assumptions are made about the underlying distributions of the data.

 

Cluster analysis is referred to as unsupervised learning to distinguish it from classification, which is supervised learning, guided by known, coded values of a response variable or class. Association rules modeling, frequent itemsets, social network analysis, link analysis, recommender systems, and many multivariate methods employed in data science represent unsupervised learning methods.
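
A minimal sketch of unsupervised learning, assuming scikit-learn, clusters synthetic observations with k-means:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])   # two loose groups of points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print('cluster sizes:', np.bincount(km.labels_))
print('cluster centers:')
print(km.cluster_centers_.round(2))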

 

An important multivariate method, principal component analysis, draws on linear algebra and provides a way to reduce the number of measures or quantitative features we use to describe domains of interest. Long a staple of measurement experts and a prerequisite of factor analysis, principal component analysis has seen recent applications in latent semantic analysis, a technology for identifying important topics across a document corpus.
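
A minimal sketch of principal component analysis, assuming scikit-learn and synthetic features, reduces ten features to three components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))                   # 100 observations, 10 features

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)                     # 100 x 3 representation
print('explained variance ratios:', pca.explained_variance_ratio_.round(3))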

 

Data Visualization

Statistical summaries alone fail to tell the full story of data. To understand data, we must look beyond data tables, regression coefficients, and the results of statistical tests. Visualization tools help us learn from data. We explore data, discover patterns, and identify groups of observations that go together, as well as unusual observations or outliers. Data visualization is critical to the work of data science in the areas of discovery (exploratory data analysis), diagnostics (statistical modeling), and design (presentation graphics).

 

R is particularly strong in data visualization.
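
Since the examples in this blog are in Python, here is a minimal matplotlib sketch of the same kind of exploratory graphics, using synthetic data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=30)                 # distribution of a single variable
axes[0].set_title('histogram')
axes[1].scatter(x, y, s=10)              # relationship between two variables
axes[1].set_title('scatter plot')
plt.tight_layout()
plt.show()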

 

Text Analytics

Text analytics is an important and growing area of predictive analytics. Text analytics draws from a variety of disciplines, including linguistics, communication and language arts, experimental psychology, political discourse analysis, journalism, computer science, and statistics.

 

The output from processes such as crawling, scraping, and parsing is a document collection or text corpus, expressed in natural language. The two primary ways of analyzing a text corpus are the bag-of-words approach and natural language processing.

 

We parse the corpus further, creating commonly formatted expressions, indices, keys, and matrices that are more easily analyzed by computer. This additional parsing is sometimes referred to as text annotation. We extract features from the text, and then use those features in subsequent analyses. Natural language is more than a collection of individual words: natural language conveys meaning.

 

Natural language documents contain paragraphs, paragraphs contain sentences, and sentences contain words. There are grammatical rules, with many ways to convey the same idea, along with exceptions to rules and rules about exceptions. Words used in combination and the rules of grammar comprise the linguistic foundations of text analytics.

 

Linguists study natural language, the words and the rules that we use to form meaningful utterances. “Generative grammar” is a general term for the rules; “morphology,” “syntax,” and “semantics” are more specific terms. Computer programs for natural language processing use linguistic rules to mimic human communication and convert natural language into structured text for further analysis.

 

A key step in text analysis is the creation of a terms-by-documents matrix (sometimes called a lexical table). The rows of this data matrix correspond to words or word stems from the document collection, and the columns correspond to documents in the collection. The entry in each cell of a terms-by-documents matrix could be a binary indicator for the presence or absence of a term in a document, a frequency count of the number of times a term is used in a document, or a weighted frequency indicating the importance of a term in a document.
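
A minimal sketch of building such a matrix, assuming a recent scikit-learn and a made-up three-document corpus (CountVectorizer produces a documents-by-terms matrix, so it is transposed here):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat',
          'the dog sat on the log',
          'cats and dogs']

vec = CountVectorizer()
dtm = vec.fit_transform(corpus)                  # documents x terms (sparse)
tdm = dtm.T                                      # terms x documents
print(vec.get_feature_names_out())               # the terms (columns of dtm)
print(tdm.toarray())                             # frequency counts per document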

 

After being created, the terms-by-documents matrix is like an index, a mapping of document identifiers to terms (keywords or stems) and vice versa. For information retrieval systems or search engines, we might also retain information regarding the specific location of terms within documents.

 

Typical text analytics applications:

1. Spam filtering has long been a subject of interest as a classification problem, and many e-mail users have benefited from the efficient algorithms that have evolved in this area. In the context of information retrieval, search engines classify documents as being relevant to the search or not. Useful modeling techniques for text classification include logistic regression, linear discriminant function analysis, classification trees, and support vector machines. Various ensemble or committee methods may be employed.

 

2. Automatic text summarization is an area of research and development that can help with information management. Imagine a text processing program with the ability to read each document in a collection and summarize it in a sentence or two, perhaps quoting from the document itself.

 

Today’s search engines are providing partial analysis of documents prior to their being displayed. They create automated summaries for fast information retrieval. They recognize common text strings associated with user requests. These applications of text analysis comprise tools of information search that we take for granted as part of our daily lives.

 

3. Sentiment analysis is measurement-focused text analysis. Sometimes called opinion mining, one approach to sentiment analysis is to draw on positive and negative word sets (lexicons, dictionaries) that convey human emotion or feeling. These word sets are specific to the language being spoken and the context of the application.

 

Another approach to sentiment analysis is to work directly with text samples and human ratings of those samples, developing text scoring methods specific to the task at hand. The objective of sentiment analysis is to score text for affect, feelings, attitudes, or opinions.
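
A minimal sketch of the lexicon approach in plain Python, with purely illustrative word lists and text:

positive = {'good', 'great', 'excellent', 'love'}
negative = {'bad', 'poor', 'terrible', 'hate'}

def sentiment_score(text):
    # crude score: (# positive words - # negative words) / # words
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return (pos - neg) / max(len(words), 1)

print(sentiment_score('the service was great and the food was excellent'))
print(sentiment_score('terrible experience, the support was poor'))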

 

Sentiment analysis and text measurement in general hold promise as technologies for understanding consumer opinion and markets. Just as political researchers can learn from the words of the public, press, and politicians, business researchers can learn from the words of customers and competitors. There are customer service logs, telephone transcripts, and sales call reports, along with user group, listserv, and blog postings. And we have ubiquitous social media from which to build document collections for text and sentiment analysis.

 

4. Text measures flow from a measurement model (algorithms for scoring) and a dictionary, both defined by the researcher or analyst. A dictionary in this context is not a traditional dictionary; it is not an alphabetized list of words and their definitions. Rather, the dictionary used to construct text measures is a repository of word lists, such as synonyms and antonyms, positive and negative words, strong and weak sounding words, bipolar adjectives, parts of speech, and so on. The lists come from expert judgments about the meaning of words.

 

A text measure assigns numbers to documents according to rules, with the rules being defined by the word lists, scoring algorithms, and modeling techniques in predictive analytics.

 

Time Series and Market Research Models

Sales and marketing data are organized by observational unit, time, and space. The observational unit is typically an economic agent (individual or firm) or a group of such agents as in an aggregate analysis. It is common to use geographical areas as a basis for aggregation. Alternatively, space (longitude and latitude) can be used directly in spatial data analyses. Time considerations are especially important in macroeconomic analysis, which focuses upon nationwide economic measures.

 

The term time-series regression refers to regression analysis in which the organizing unit of analysis is time. We look at relationships among economic measures organized in time. Much economic analysis concerns time-series regression. Special care must be taken to avoid what might be called spurious relationships, as many economic time series are correlated with one another because they depend upon underlying factors, such as population growth or seasonality. In time-series regression, we use standard linear regression methods.

 

We check the residuals from our regression to ensure that they are not correlated in time. If they are correlated in time (autocorrelated), then we use a method such as generalized least squares as an alternative to ordinary least squares. That is, we incorporate an error data model as part of our modeling process. Longitudinal data analysis or panel data analysis is an example of a mixed data method with a focus on data organized by cross-sectional units and time.

 

Sales forecasts can build on the special structure of sales data as they are found in business. These are data organized by time and location, where location might refer to geographical regions or sales territories, stores, departments within stores, or product lines. Sales forecasts are a critical component of business planning and a first step in the budgeting process. Models and methods that provide accurate forecasts can be of great benefit to management.

 

They help managers to understand the determinants of sales, including promotions, pricing, advertising, and distribution. They reveal the competitive position and market share. There are many approaches to forecasting. Some are judgmental, relying on expert opinion or consensus. There are top-down and bottom-up forecasts and various techniques for combining the views of experts. Other approaches depend on the analysis of past sales data.

 

1. Forecasting by time periods: These may be days, weeks, months, or whatever intervals make sense for the problem at hand. Time dependencies can be noted in the same manner as in traditional time-series models. Autoregressive terms are used in many contexts. Time-construed covariates, such as day of the week or month of the year, can be added to provide additional predictive power.

 

An analyst can work with time-series data, using past sales to predict future sales, noting overall trends and cyclical patterns in the data. Exponential smoothing, moving averages, and various regression and econometric methods may be used with time-series data.
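
A minimal sketch of a moving average and exponential smoothing, assuming pandas and a synthetic monthly sales series:

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
idx = pd.date_range('2023-01-01', periods=24, freq='MS')    # month-start dates
sales = pd.Series(100 + np.arange(24) * 2 + rng.normal(0, 5, 24), index=idx)

moving_avg = sales.rolling(window=3).mean()      # 3-month moving average
smoothed = sales.ewm(alpha=0.3).mean()           # exponentially weighted mean
print(pd.DataFrame({'sales': sales, 'ma3': moving_avg, 'ewm': smoothed}).head())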

 

2. Forecasting by location: Organizing data by location contributes to a model’s predictive power. Location may itself be used as a factor in models. In addition, we can search for explanatory variables tied to location. With geographic regions, for example, we might include consumer and business demographic variables known to relate to sales.

 

Sales dollars per time period is the typical response variable of interest in sales forecasting studies. Alternative response variables include sales volume and time-to-sale. Related studies of market share require information about the sales of other firms in the same product category.

 

When we use the term time-series analysis, however, we are not talking about time-series regression. We are talking about methods that start by focusing on one economic measure at a time and its pattern across time. We look for trends, seasonality, and cycles in that individual time series. Then, after working with that single time series, we look at possible relationships with other time series.

 

If we are concerned with forecasting or predicting the future, as we often are in predictive analytics, then we use methods of time-series analysis. Recently, there has been considerable interest in state space models for time series, which provide a convenient mechanism for incorporating regression components into dynamic time-series models.

 

There are myriad applications of time-series analysis in marketing, including marketing mix models and advertising research models. Along with sales forecasting, these fall under the general class of market response models. Marketing mix models look at the effects of price, promotion, and product placement in retail establishments. These are multiple time-series problems. Advertising research looks for the cumulative effect of advertising on brand and product awareness, as well as sales.

 

Much of this research employs defined measures such as “advertising stock,” which attempt to convert advertising impressions or rating points to a single measure in time. The thinking is that messages are most influential immediately after being received; their influence declines with time but does not disappear completely until many time periods later.

 

Viewers or listeners remember advertisements long after initial exposure to those advertisements. Another way of saying this is to note that there is a carry-over effect from one time period to the next. Needless to say, measurement and modeling on the subject of advertising effectiveness present many challenges for the marketing data scientist.

 

Analytics

Business analytics, or simply analytics, is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions. Business analytics is a process of transforming data into actions through analysis and insights in the context of organizational decision-making and problem-solving.

 

Business analytics has traditionally been supported by various tools such as Microsoft Excel and various Excel add-ins, commercial statistical software packages such as SAS or Minitab, and more-complex business intelligence suites that integrate data with analytical software.

 

Tools and techniques of business analytics are used across many areas in a wide variety of organizations to improve the management of customer relationships, financial and marketing activities, human capital, supply chains, and many other areas. Leading banks use analytics to predict and prevent credit fraud. Manufacturers use analytics for production planning, purchasing, and inventory management. Retailers use analytics to recommend products to customers and optimize marketing promotions.

 

Pharmaceutical firms use it to get life-saving drugs to market more quickly. The leisure and vacation industries use analytics to analyze historical sales data, understand customer behavior, improve website design, and optimize schedules and bookings. Airlines and hotels use analytics to dynamically set prices over time to maximize revenue. Even sports teams are using business analytics to determine both game strategy and optimal ticket prices.

 

Top-performing organizations (those that outperform their competitors) are three times more likely to be sophisticated in their use of analytics than lower performers and are more likely to state that their use of analytics differentiates them from competitors. One of the emerging applications of analytics is helping businesses learn from social media and exploit social media data for strategic advantage.

 

Using analytics, firms can integrate social media data with traditional data sources such as customer surveys, focus groups, and sales data; understand trends and customer perceptions of their products, and create informative reports to assist marketing managers and product designers.

 

Descriptive Analytics

Descriptive analytics describes what happened in the past. Descriptive analytics is the most commonly used and most well-understood type of analytics. Most businesses start with descriptive analytics—the use of data to understand past and current business performance and make informed decisions. These techniques categorize, characterize, consolidate, and classify data to convert it into useful information for the purposes of understanding and analyzing business performance.

 

Descriptive analytics summarizes data into meaningful charts and reports, for example, about budgets, sales, revenues, or cost. This process allows managers to obtain standard and customized reports and then drill down into the data and make queries to understand the impact of an advertising campaign.
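
A minimal sketch of this kind of summary, assuming pandas and hypothetical sales records, groups revenue by region:

import pandas as pd

sales = pd.DataFrame({
    'region': ['east', 'west', 'east', 'south', 'west', 'east'],
    'revenue': [1200, 950, 1100, 700, 1300, 990],
})

# count, total, and average revenue per region
summary = sales.groupby('region')['revenue'].agg(['count', 'sum', 'mean'])
print(summary)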

 

Descriptive analytics also helps companies to classify customers into different segments, which enables them to develop specific marketing campaigns and advertising strategies. For instance, an enterprise manager might want to review business performance to find problems or areas of opportunity and identify patterns and trends in data.

Descriptive analytics involves several techniques such as

  • 1. Visualizing and exploring data
  • 2. Descriptive statistical measures
  • 3. Probability distributions and data modeling
  • 4. Sampling and estimation
  • 5. Statistical inference

 

Predictive Analytics

Predictive analytics seeks to predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time. Predictive analytics models are very popular in predicting the behavior of customers based on the past buying history and perhaps some demographic variables. They typically use multiple variables to predict a particular dependent variable.

 

For example, a marketer might wish to predict the response of different customer segments to an advertising campaign, a T-shirt manufacturer might want to predict next season’s demand for a T-shirt of a specific color and size, or a commodity trader might wish to predict short-term movements in commodity prices.

 

Predictive analytics can predict risk and find relationships in data not readily apparent with traditional analyses. Using advanced techniques, predictive analytics can help detect hidden patterns in large quantities of data, segment and group data into coherent sets, predict behavior, and detect trends.

 

For instance, a bank manager might want to identify the most profitable customers or predict the chances that a loan applicant will default, or alert a credit-card customer to a potentially fraudulent charge.

Predictive analytics involves several techniques such as

  • 1. Trendlines and regression analysis
  • 2. Forecasting techniques
  • 3. Classification techniques using data mining
  • 4. Modeling and analysis using a spreadsheet
  • 5. Monte Carlo simulation and risk analysis

 

Prescriptive Analytics

Prescriptive analytics determines actions to take to make the future happen. Prescriptive analytics is used in many areas of business, including operations, marketing, and finance. Many problems, such as aircraft or employee scheduling and supply chain design, simply involve too many choices or alternatives for a human decision-maker to consider effectively.

 

Randomized testing, in which a test group is compared to a control group with random assignment of subjects to each group, is a powerful method for establishing cause and effect. On comparison of the groups, if one is better than the other with statistical significance, the treatment being tested in the test group should be prescribed. Prescriptive analytics uses optimization to identify the best alternatives to minimize or maximize some objective.

 

The mathematical and statistical techniques of predictive analytics can also be combined with optimization to make decisions that take into account the uncertainty in the data. For instance, a manager might want to determine the best pricing and advertising strategy to maximize revenue, the optimal amount of cash to store in ATMs, or the best mix of investments in a retirement portfolio to manage risk.
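
A minimal sketch of optimization for a prescriptive decision, assuming SciPy and a made-up product-mix problem:

from scipy.optimize import linprog

# Maximize profit 20*x1 + 30*x2 (linprog minimizes, so negate the objective)
c = [-20, -30]
# Resource constraints: labor hours and material units used per unit produced
A_ub = [[1, 2],
        [3, 2]]
b_ub = [40, 60]      # available labor and material

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print('optimal production plan:', res.x.round(2))
print('maximum profit:', round(-res.fun, 2))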

 

Prescriptive analytics involves several techniques such as

  • 1. Linear optimization
  • 2. Integer optimization
  • 3. Decision analysis

 

A useful selection of data analysis techniques:

1. Descriptive statistics and visualization include simple descriptive measures such as the following:

  • a. Averages and measures of variation
  • b. Counts and percentages
  • c. Cross-tabs and simple correlations

 

They are useful for understanding the structure of the data. Visualization is primarily a discovery technique and is useful for interpreting large amounts of data; visualization tools include histograms, box plots, scatter diagrams, and multidimensional surface plots.

 

2. Correlation analysis measures the relationship between two variables. The resulting correlation coefficient shows whether changes in one variable are associated with changes in the other. When comparing the correlation between two variables, the goal is to see if a change in the independent variable is associated with a change in the dependent variable.

 

This information helps in understanding an independent variable’s predictive abilities. Correlation findings, just as regression findings, can be useful in analyzing causal relationships, but they do not by themselves establish causal patterns.

 

3. Cluster analysis seeks to organize information about variables so that relatively homogeneous groups, or “clusters,” can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are similar to one another) and highly externally heterogeneous (members are not like members of other clusters).

 

4. Discriminant analysis is used to predict membership in two or more mutually exclusive groups from a set of predictors when there is no natural ordering on the groups. Discriminant analysis can be seen as the inverse of a one-way multivariate analysis of variance (MANOVA) in that the levels of the independent variable (or factor) for MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of the MANOVA become the predictors for discriminant analysis.

 

5. Regression analysis is a statistical tool that uses the relation between two or more quantitative variables so that one variable (dependent variable) can be predicted from the other(s) (independent variables). But no matter how strong the statistical relations are between the variables, no cause-and-effect pattern is necessarily implied by the regression model. Regression analysis comes in many flavors, including simple linear, multiple linear, curvilinear, and multiple curvilinear regression models, as well as logistic regression, which is discussed next.

 

6. Neural networks (NN) are a class of systems modeled after the human brain. As the human brain consists of millions of neurons that are interconnected by synapses, NN is formed from large numbers of simulated neurons, connected to each other in a manner similar to brain neurons. As in the human brain, the strength of neuron interconnections may change (or be changed by the learning algorithm) in response to a presented stimulus or an obtained output, which enables the network to “learn.”

 

A disadvantage of NN is that building the initial neural network model can be especially time-intensive because input processing almost always means that raw data must be transformed. Variable screening and selection require large amounts of the analysts’ time and skill. Also, for the user without a technical background, figuring out how neural networks operate is far from obvious.

 

7. Case-based reasoning (CBR) is a technology that tries to solve a given problem by making direct use of past experiences and solutions. A case is usually a specific problem that was encountered and solved previously. Given a particular new problem, CBR examines the set of stored cases and finds similar ones. If similar cases exist, their solution is applied to the new problem, and the problem is added to the case base for future reference.

 

A disadvantage of CBR is that the solutions included in the case database may not be optimal in any sense because they are limited to what was actually done in the past, not necessarily what should have been done under similar circumstances. Therefore, using them may simply perpetuate earlier mistakes.

 

8. Decision trees (DT) are like those used in decision analysis where each nonterminal node represents a test or decision on the data item considered. Depending on the outcome of the test, one chooses a certain branch. To classify a particular data item, one would start at the root node and follow the assertions down until a terminal node (or leaf) is reached; at that point, a decision is made. DT can also be interpreted as a special form of a rule set, characterized by their hierarchical organization of rules.

 

A disadvantage of DT is that trees use up data very rapidly in the training process. They should never be used with small data sets. They are also highly sensitive to noise in the data, and they try to fit the data exactly, which is referred to as “overfitting.” Overfitting means that the model depends too strongly on the details of the particular dataset used to create it. When a model suffers from overfitting, it is unlikely to be externally valid (i.e., it would not hold up when applied to a new data set).
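
A minimal sketch of a decision tree classifier, assuming scikit-learn and synthetic data, where a depth limit is one simple guard against overfitting:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print('train accuracy:', round(tree.score(X_train, y_train), 3))
print('test accuracy:', round(tree.score(X_test, y_test), 3))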

 

9. Association rules (AR) are statements about relationships between the attributes of a known group of entities and one or more aspects of those entities that enable predictions to be made about aspects of other entities that are not in the group but possess the same attributes.

 

More generally, AR state a statistical correlation between the occurrences of certain attributes in a data item, or between certain data items in a data set. The general form of an AR is X1, …, Xn => Y [C, S], which means that the attributes X1, …, Xn predict Y with a confidence C and a significance S.
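
A minimal sketch in plain Python computes the confidence and the support (a common measure of how significant, or frequent, a rule is) for a single rule over a toy transaction list:

transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk', 'butter'},
]

antecedent, consequent = {'bread'}, {'milk'}     # the rule: bread => milk
n = len(transactions)
both = sum((antecedent | consequent) <= t for t in transactions)
ante = sum(antecedent <= t for t in transactions)

support = both / n                 # fraction of transactions containing X and Y
confidence = both / ante           # fraction of X-transactions also containing Y
print('support:', support, 'confidence:', round(confidence, 2))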

 

A useful selection of data analysis tasks:


1. Data Summarization gives the user an overview of the structure of the data and is generally carried out in the early stages of a project. This type of initial exploratory data analysis can help to understand the nature of the data and to find potential hypotheses for hidden information. Simple descriptive statistical and visualization techniques generally apply.

 

2. Segmentation separates the data into interesting and meaningful subgroups or classes. In this case, the analyst can hypothesize certain subgroups as relevant for the business question based on prior knowledge or based on the outcome of data description and summarization. Automatic clustering techniques can detect previously unsuspected and hidden structures in data that allow segmentation. Clustering techniques, visualization, and neural nets generally apply.

 

3. Classification assumes that a set of objects—characterized by some attributes or features—belongs to different classes. The class label is a discrete qualitative identifier—for example, large, medium, or small. The objective is to build classification models that assign the correct class to previously unseen and unlabeled objects. Classification models are mostly used for predictive modeling. Discriminant analysis, decision tree, rule induction methods, and genetic algorithms generally apply.

 

4. Prediction is very similar to classification. The difference is that in prediction, the class is not a qualitative discrete attribute but a continuous one. The goal of prediction is to find the numerical value of the target attribute for unseen objects; this problem type is also known as regression, and if the prediction deals with time-series data, then it is often called forecasting. Regression analysis, decision trees, and neural nets generally apply.

 

5. Dependency analysis deals with finding a model that describes significant dependencies (or associations) between data items or events. Dependencies can be used to predict the value of an item given information on other data items. Dependency analysis has close connections with classification and prediction because the dependencies are implicitly used for the formulation of predictive models. Correlation analysis, regression analysis, association rules, case-based reasoning, and visualization techniques generally apply.

 

The example code in this blog is all in Python, except for a few domain-specific languages such as SQL. My goal isn’t to push you to use Python; there are lots of good tools out there, and you can use whichever ones you want.

 

However, I wanted to use one language for all of my examples. This keeps the blog readable, and it also lets readers follow the whole blog while only knowing one language. Of the various languages available, there are two reasons why I chose Python:

 

Python is the most popular language for data scientists. R is its only major competitor, at least when it comes to free tools. I have used both extensively, and I think that Python is flat‐out better (except for some obscure statistics packages that have been written in R and that are rarely needed anyway).

 

I like to say that for any task, Python is the second‐best language. It’s a jack-of-all-trades. If you only need to worry about statistics, or numerical computation, or web parsing, then there are better options out there. But if you need to do all of these things within a single project, then Python is your best option. Since data science is so inherently multidisciplinary, this makes it a perfect fit.

 

As a note of advice, it is much better to be proficient in one language, to the point where you can reliably churn out code that is of high quality, than to be mediocre at several.

 

To install a Python module, pip is the preferred installer program. So, to install the matplotlib module from an Anaconda prompt, run pip install matplotlib. Anaconda is a widely popular open source distribution of Python (and R) for large-scale data processing, predictive analytics, and scientific computing that simplifies package management and deployment. I have worked with other distributions with unsatisfactory results, so I highly recommend Anaconda.

 

Python Fundamentals

Python has several features that make it well suited for learning and doing data science. It’s free, relatively simple to code, easy to understand, and has many useful libraries to facilitate data science problem-solving. It also allows quick prototyping of virtually any data science scenario and demonstration of data science concepts in a clear, easy to understand manner.

 

The goal of this blog is not to teach Python as a whole, but to present, explain, and clarify fundamental features of the language (such as logic, data structures, and libraries) that help prototype, apply, and/or solve data science problems.

 

Python fundamentals are covered with a wide spectrum of activities with associated coding examples as follows:

  • 1. functions and strings
  • 2. lists, tuples, and dictionaries
  • 3. reading and writing data
  • 4. list comprehension
  • 5. generators
  • 6. data randomization
  • 7. MongoDB and JSON
  • 8. visualization

 

Functions and Strings

Python functions are first-class objects, which means they can be passed as parameters, returned as values, assigned to variables, and stored in data structures. Simply put, a function can be treated like any other value. Functions can be either custom or built-in. Custom functions are created by the programmer, while built-in functions are part of the language. Strings are a very popular type, enclosed in either single or double quotes.

 

The following code example defines custom functions and uses built-in ones:

def num_to_str(n):
    return str(n)

def str_to_int(s):
    return int(s)

def str_to_float(f):
    return float(f)

if __name__ == "__main__":
    # hash symbol allows single-line comments
    '''
    triple quotes allow multi-line comments
    '''
    float_num = 999.01
    int_num = 87
    float_str = '23.09'
    int_str = '19'
    string = 'how now brown cow'
    s_float = num_to_str(float_num)
    s_int = num_to_str(int_num)
    i_str = str_to_int(int_str)
    f_str = str_to_float(float_str)
    print (s_float, 'is', type(s_float))
    print (s_int, 'is', type(s_int))
    print (f_str, 'is', type(f_str))
    print (i_str, 'is', type(i_str))
    print ('\nstring', '"' + string + '" has', len(string), 'characters')
    str_ls = string.split()
    print ('split string:', str_ls)
    print ('joined list:', ' '.join(str_ls))

 

A popular coding style is to present library importation and functions first, followed by the main block of code. The code example begins with three custom functions that convert numbers to strings, strings to integers, and strings to floats, respectively. Each custom function returns a built-in function to let Python do the conversion.

 

The main block begins with comments. Single-line comments are denoted with the # (hash) symbol. Multiline comments are denoted with three consecutive single quotes.

 

The next five lines assign values to variables. The following four lines convert each variable to another type. For instance, function num_to_str() converts variable float_num to string type. The next four lines print the converted variables with their associated Python data types. Built-in function type() returns the type of a given object. The remaining four lines print and manipulate a string variable.

 

Lists, Tuples, and Dictionaries

Lists are ordered collections with comma-separated values between square brackets. Indices start at 0 (zero). List items need not be of the same type and can be sliced, concatenated, and manipulated in many ways.

 

The following code example creates a list, manipulates and slices it, creates a new list and adds elements to it from another list, and creates a matrix from two lists:

import numpy as np

if __name__ == "__main__":
    ls = ['orange', 'banana', 10, 'leaf', 77.009, 'tree', 'cat']
    print ('list length:', len(ls), 'items')
    print ('cat count:', ls.count('cat'), ',', 'cat index:', ls.index('cat'))
    print ('\nmanipulate list:')
    cat = ls.pop(6)
    print ('cat:', cat, ', list:', ls)
    ls.insert(0, 'cat')
    ls.append(99)
    print (ls)
    ls[7] = '11'
    print (ls)
    ls.pop(1)
    print (ls)
    ls.pop()
    print (ls)
    print ('\nslice list:')
    print ('1st 3 elements:', ls[:3])
    print ('last 3 elements:', ls[3:])
    print ('start at 2nd to index 5:', ls[1:5])
    print ('start 3 from end to end of list:', ls[-3:])
    print ('start from 2nd to next to end of list:', ls[1:-1])
    print ('\ncreate new list from another list:')
    print ('list:', ls)
    fruit = ['orange']
    more_fruit = ['apple', 'kiwi', 'pear']
    fruit.append(more_fruit)
    print ('appended:', fruit)
    fruit.pop(1)
    fruit.extend(more_fruit)
    print ('extended:', fruit)
    a, b = fruit[2], fruit[1]
    print ('slices:', a, b)
    print ('\ncreate matrix from two lists:')
    # dtype=object allows rows of unequal length in current NumPy versions
    matrix = np.array([ls, fruit], dtype=object)
    print (matrix)
    print ('1st row:', matrix[0])
    print ('2nd row:', matrix[1])

 

The code example begins by importing NumPy, which is the fundamental package (library, module) for scientific computing. It is useful for linear algebra, which is fundamental to data science.

 

Think of Python libraries as giant classes with many methods. The main block begins by creating list ls and printing its length (number of items), the count of 'cat' elements, and the index of the 'cat' element. The code continues by manipulating ls. First, the 7th element (index 6) is popped and assigned to the variable cat.

 

Remember, list indices start at 0. Function pop() removes cat from ls. Second, the cat is added back to ls at the 1st position (index 0) and 99 is appended to the end of the list. Function append() adds an object to the end of a list. Third, the string ‘11’ is substituted for the 8th element (index 7). Finally, the 2nd element and the last element are popped from ls. The code continues by slicing ls. First, print the 1st three elements with ls[:3].

 

Second, print the last three elements with ls[3:]. Third, print from the 2nd element up to, but not including, index 5 with ls[1:5]. Fourth, print starting three elements from the end to the end with ls[-3:]. Fifth, print starting from the 2nd element to next to the last element with ls[1:-1]. The code continues by creating a new list from another.

 

First, create fruit with one element. Second, append list more_fruit to fruit. Notice that append adds list more_fruit as the 2nd element of fruit, which may not be what you want. So, third, pop the 2nd element of fruit and extend more_fruit to fruit. Function extend() unravels a list before it adds it.

 

This way, fruit now has four elements. Fourth, assign 3rd element to a and 2nd element to b and print slices. Python allows the assignment of multiple variables on one line, which is very convenient and concise. The code ends by creating a matrix from two lists—ls and fruit—and printing it. A Python matrix is a two-dimensional (2-D) array consisting of rows and columns, where each row is a list.

 

A tuple is a sequence of immutable Python objects enclosed by parentheses. Unlike lists, tuples cannot be changed. Tuples are convenient with functions that return multiple values.
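
A minimal sketch of a function that returns multiple values packed into a tuple:

def min_max(values):
    return min(values), max(values)    # packed into a tuple on return

low, high = min_max([3, 9, 1, 7])      # unpacked on assignment
print(low, high)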

 

The following code example creates a tuple, slices it, creates a list, and creates a matrix from tuple and list:

import numpy as np

if __name__ == "__main__":
    tup = ('orange', 'banana', 'grape', 'apple', 'grape')
    print ('tuple length:', len(tup))
    print ('grape count:', tup.count('grape'))
    print ('\nslice tuple:')
    print ('1st 3 elements:', tup[:3])
    print ('last 3 elements', tup[3:])
    print ('start at 2nd to index 5', tup[1:5])
    print ('start 3 from end to end of tuple:', tup[-3:])
    print ('start from 2nd to next to end of tuple:', tup[1:-1])
    print ('\ncreate list and create matrix from it and tuple:')
    fruit = ['pear', 'grapefruit', 'cantaloupe', 'kiwi', 'plum']
    matrix = np.array([tup, fruit])
    print (matrix)

 

The code begins by importing NumPy. The main block begins by creating tuple tup and printing its length and the number of grape elements. The code continues by slicing tup. First, print the 1st three elements with tup[:3]. Second, print the last three elements with tup[3:].

 

Third, print from the 2nd element up to, but not including, index 5 with tup[1:5]. Fourth, print starting three elements from the end to the end with tup[-3:]. Fifth, print starting from the 2nd element to next to the last element with tup[1:-1]. The code continues by creating a new fruit list and creating a matrix from tup and fruit.

 

A dictionary is an unordered collection of items identified by a key/value pair. It is an extremely important data structure for working with data. The following example is very simple, but the next section presents a more complex example based on a dataset.

 

The following code example creates a dictionary, deletes an element, adds an element, creates a list of dictionary elements, and traverses the list:

if __name__ == "__main__":
    audio = {'amp':'Linn', 'preamp':'Luxman', 'speakers':'Energy',
             'ic':'Crystal Ultra', 'pc':'JPS', 'power':'Equi-Tech',
             'sp':'Crystal Ultra', 'cdp':'Nagra', 'up':'Esoteric'}
    del audio['up']
    print ('dict "deleted" element;')
    print (audio, '\n')
    print ('dict "added" element;')
    audio['up'] = 'Oppo'
    print (audio, '\n')
    print ('universal player:', audio['up'], '\n')
    dict_ls = [audio]
    video = {'tv':'LG 65C7 OLED', 'stp':'DISH', 'HDMI':'DH Labs',
             'cable':'coax'}
    print ('list of dict elements;')
    dict_ls.append(video)
    for i, row in enumerate(dict_ls):
        print ('row', i, ':')
        print (row)

The main block begins by creating dictionary audio with several elements. It continues by deleting an element with key up and value Esoteric and displaying. Next, a new element with key up and element Oppo is added back and displayed. The next part creates a list with dictionary audio, creates dictionary video, and adds the new dictionary to the list.

 

The final part uses a for loop to traverse the dictionary list and display the two dictionaries. A very useful function that can be used with a loop statement is enumerate(). It adds a counter to an iterable. An iterable is an object that can be iterated. Function enumerate() is very useful because a counter is automatically created and incremented, which means less code.

 

Reading and Writing Data

The ability to read and write data is fundamental to any data science endeavor. All data files are available on the website. The most basic types of data are text and CSV (Comma Separated Values). So, this is where we will start.

 

The following code example reads a text file and cleans it for processing. It then reads the precleansed text file, saves it as a CSV file, reads the CSV file, converts it to a list of OrderedDict elements, and converts this list to a list of regular dictionary elements.

import csv

def read_txt(f):
    with open(f, 'r') as f:
        d = f.readlines()
        return [x.strip() for x in d]

def conv_csv(t, c):
    data = read_txt(t)
    with open(c, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for line in data:
            ls = line.split()
            writer.writerow(ls)

def read_csv(f):
    with open(f, 'r') as f:
        reader = csv.reader(f)
        return list(reader)

def read_dict(f, h):
    input_file = csv.DictReader(open(f), fieldnames=h)
    return input_file

def od_to_d(od):
    return dict(od)

if __name__ == "__main__":
    f = 'data/names.txt'
    data = read_txt(f)
    print ('text file data sample:')
    for i, row in enumerate(data):
        if i < 3:
            print (row)
    csv_f = 'data/names.csv'
    conv_csv(f, csv_f)
    r_csv = read_csv(csv_f)
    print ('\ntext to csv sample:')
    for i, row in enumerate(r_csv):
        if i < 3:
            print (row)
    headers = ['first', 'last']
    r_dict = read_dict(csv_f, headers)
    dict_ls = []
    print ('\ncsv to ordered dict sample:')
    for i, row in enumerate(r_dict):
        r = od_to_d(row)
        dict_ls.append(r)
        if i < 3:
            print (row)
    print ('\nlist of dictionary elements sample:')
    for i, row in enumerate(dict_ls):
        if i < 3:
            print (row)

 

The code begins by importing the CSV library, which implements classes to read and write tabular data in CSV format. It continues with five functions. Function read_txt() reads a text (.txt) file and strips (removes) extraneous characters with a list comprehension, which is an elegant way to define and create a list in Python.

 

List comprehension is covered in the next section. Function conv_csv() converts a text file to a CSV file and saves it to disk. Function read_csv() reads a CSV file and returns it as a list. Function read_dict() reads a CSV file and returns a list of OrderedDict elements.

 

An OrderedDict is a dictionary subclass that remembers the order in which its contents are added. (Since Python 3.7, regular dictionaries also preserve insertion order, but OrderedDict makes the intent explicit.) Finally, function od_to_d() converts an OrderedDict element to a regular dictionary element. Working with a regular dictionary element is much more intuitive in my opinion. The main block begins by reading a text file and cleaning it for processing.

 

However, no processing is done with this cleansed file in the code. It is only included in case you want to know how to accomplish this task. The code continues by converting a text file to CSV, which is saved to disk. The CSV file is then read from disk and a few records are displayed.

 

Next, a headers list is created to store keys for a dictionary yet to be created. List dict_ls is created to hold dictionary elements. The code continues by creating an OrderedDict list r_dict. The OrderedDict list is then iterated so that each element can be converted to a regular dictionary element and appended to dict_ls.

 

A few records are displayed during iteration. Finally, dict_ls is iterated and a few records are displayed. I highly recommend that you take some time to familiarize yourself with these data structures, as they are used extensively in data science applications.

 

List Comprehension

List comprehension provides a concise way to create lists. Its logic is enclosed in square brackets that contain an expression followed by a for clause and can be augmented by more for or if clauses.

The read_txt() function in the previous section included the following list comprehension:
[x.strip() for x in d]
The logic strips extraneous characters from each string in iterable d. In this case, d is a list of strings.
The following code example converts miles to kilometers, manipulates pets, and calculates bonuses with list comprehension:
if __name__ == "__main__":
    miles = [100, 10, 9.5, 1000, 30]
    kilometers = [x * 1.60934 for x in miles]
    print ('miles to kilometers:')
    for i, row in enumerate(kilometers):
        print ('{:>4} {:>8}{:>8} {:>2}'.format(miles[i], 'miles is', round(row, 2), 'km'))
    print ('\npet:')
    pet = ['cat', 'dog', 'rabbit', 'parrot', 'guinea pig', 'fish']
    print (pet)
    print ('\npets:')
    pets = [x + 's' if x != 'fish' else x for x in pet]
    print (pets)
    subset = [x for x in pets if x != 'fish' and x != 'rabbits' and x != 'parrots' and x != 'guinea pigs']
    print ('\nmost common pets:')
    print (subset[1], 'and', subset[0])
    sales = [9000, 20000, 50000, 100000]
    print ('\nbonuses:')
    bonus = [0 if x < 10000 else x * .02 if x >= 10000 and x <= 20000 else x * .03 for x in sales]
    print (bonus)
    print ('\nbonus dict:')
    people = ['dave', 'sue', 'al', 'sukki']
    d = {}
    for i, row in enumerate(people):
        d[row] = bonus[i]
    print (d, '\n')
    print ('{:<5} {:<5}'.format('emp', 'bonus'))
    for k, y in d.items():
        print ('{:<5} {:>6}'.format(k, y))

 

The main block begins by creating two lists—miles and kilometers. The kilometers list is created with list comprehension, which multiplies each mile value by 1.60934. At first, list comprehension may seem confusing, but practice makes it easier over time. The main block continues by printing miles and associated kilometers.

 

Function format() provides sophisticated formatting options. Each mile value is right justified ({:>4}) in a field of at least four characters. The 'miles is' string and each kilometer value are right justified ({:>8}) in fields of at least eight characters. Finally, the 'km' string is right justified ({:>2}) in a field of at least two characters.

 

This may seem a bit complicated at first, but it is really quite logical (and elegant) once you get used to it. The main block continues by creating pet and pets lists. The pets list is created with a list comprehension, which makes a pet plural if it is not a fish. I advise you to study this list comprehension before you go forward because they just get more complex.

 

The code continues by creating a subset list with a list comprehension, which only includes dogs and cats. The next part creates two lists—sales and bonus. The bonus list is created with a list comprehension that calculates the bonus for each sales value. If sales are less than 10,000, no bonus is paid. If sales are between 10,000 and 20,000 (inclusive), the bonus is 2% of sales. Finally, if sales are greater than 20,000, the bonus is 3% of sales.

 

At first, I was confused by this list comprehension but it makes sense to me now. So, try some of your own and you will get the gist of it. The final part creates a people list to associate with each sales value, continues by creating a dictionary to hold bonus for each person, and ends by iterating dictionary elements. The formatting is quite elegant.

 

The header left justifies emp and bonus properly. Each item is formatted so that the person is left justified with up to five characters ({:<5}) and the bonus is right justified with up to six characters ({:>6}).

 

Generators

A generator is a special type of iterator that produces values only as they are needed. This process is known as lazy (or deferred) evaluation, and it can be much faster and far more memory-efficient than building a full list, which is constructed entirely in memory. While regular functions return values, generators yield them. The best way to traverse and access values from a generator is to use a loop. Finally, a list comprehension can be converted to a generator expression by replacing the square brackets with parentheses.

 

The following code example reads a CSV file and creates a list of OrderedDict elements. It then converts the list elements into regular dictionary elements. The code continues by simulating times for list comprehension, generator comprehension, and generators. During simulation, a list of times for each is created. Simulation is the imitation of a real-world process or system over time, and it is used extensively in data science.

import csv, time, numpy as np

def read_dict(f, h):
    input_file = csv.DictReader(open(f), fieldnames=h)
    return (input_file)

def conv_reg_dict(d):
    return [dict(x) for x in d]

def sim_times(d, n):
    # time list comprehension and generator comprehension n times each
    i = 0
    lsd, lsgc = [], []
    while i < n:
        start = time.perf_counter()  # time.clock() was removed in Python 3.8
        [x for x in d]
        time_d = time.perf_counter() - start
        lsd.append(time_d)
        start = time.perf_counter()
        (x for x in d)
        time_gc = time.perf_counter() - start
        lsgc.append(time_gc)
        i += 1
    return (lsd, lsgc)

def gen(d):
    yield (x for x in d)

def sim_gen(d, n):
    # time the generator n times, recreating it after each pass
    i = 0
    lsg = []
    generator = gen(d)
    while i < n:
        start = time.perf_counter()
        for row in generator:
            None
        time_g = time.perf_counter() - start
        lsg.append(time_g)
        i += 1
        generator = gen(d)
    return lsg

def avg_ls(ls):
    return np.mean(ls)

if __name__ == '__main__':
    f = 'data/names.csv'
    headers = ['first', 'last']
    r_dict = read_dict(f, headers)
    dict_ls = conv_reg_dict(r_dict)
    n = 1000
    ls_times, gc_times = sim_times(dict_ls, n)
    g_times = sim_gen(dict_ls, n)
    avg_ls = np.mean(ls_times)
    avg_gc = np.mean(gc_times)
    avg_g = np.mean(g_times)
    gc_ls = round((avg_ls / avg_gc), 2)
    g_ls = round((avg_ls / avg_g), 2)
    print ('generator comprehension:')
    print (gc_ls, 'times faster than list comprehension\n')
    print ('generator:')
    print (g_ls, 'times faster than list comprehension')

 

The code begins by importing csv, time, and numpy libraries. Function read_dict() converts a CSV (.csv) file to a list of OrderedDict elements. Function conv_reg_dict() converts a list of OrderedDict elements to a list of regular dictionary elements (for easier processing). Function sim_times() runs a simulation that creates two lists—lsd and lsgc.

 

List lsd contains n run times for list comprehension and list lsgc contains n run times for generator comprehension. Using simulation provides a more accurate picture of the true time it takes for both of these processes by running them over and over (n times). In this case, the simulation is run 1,000 times (n =1000). Of course, you can run the simulations as many or few times as you wish. 

 

Functions gen() and sim_gen() work together. Function gen() creates a generator. Function sim_gen() simulates the generator n times. I had to create these two functions because yielding a generator requires a different process than creating a generator comprehension. Function avg_ls() returns the mean (average) of a list of numbers. The main block begins by reading a CSV file (the one we created earlier in the blog) into a list of OrderedDict elements and converting it to a list of regular dictionary elements.

 

The code continues by simulating run times of list comprehension and generator comprehension 1,000 times (n = 1000). The 1st simulation calculates 1,000 runtimes for traversing the dictionary list created earlier for both list and generator comprehension and returns a list of those runtimes for each.

 

The 2nd simulation calculates 1,000 runtimes by traversing the dictionary list for a generator and returns a list of those runtimes. The code concludes by calculating the average runtime for each of the three techniques—list comprehension, generator comprehension, and generators—and comparing those averages.

 

The simulations verify that generator comprehension is more than ten times, and generators are more than eight times, faster than list comprehension (runtimes will vary based on your PC). This makes sense because list comprehension stores all of its data in memory, while generators evaluate lazily, only as data is needed. Naturally, the speed advantage of generators becomes more important with big data sets. A single timing run is not reliable on its own, because it reflects whatever the system clock happens to report at that moment; averaging over many simulated runs gives a much more trustworthy comparison.
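
If you would rather not hand-roll the simulation loop, the standard library's timeit module performs the same kind of repeated timing. The sketch below is only an alternative illustration (the data list is made up), not the approach used above:

import timeit

data = [{'first': 'x', 'last': 'y'}] * 1000
lc = timeit.timeit('[x for x in data]', globals=globals(), number=1000)
gc = timeit.timeit('(x for x in data)', globals=globals(), number=1000)
print (round(lc / gc, 2), 'times faster (generator comprehension vs. list comprehension)')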

 

Data Randomization

Data Randomization

A stochastic process is a family of random variables, each mapping a probability space into a state space, indexed by time (whew!). More simply, it is a random process that evolves over time. Data randomization is the process of selecting values from a sample in an unpredictable manner, with the goal of simulating reality. Simulation is what lets us apply data randomization in data science.

 

The previous section demonstrated how simulation can be used to realistically compare iterables (list comprehension, generator comprehension, and generators).

 

In Python, pseudorandom numbers are used to simulate data randomness (reality). They are not truly random because the sequence is generated deterministically from an initial value called the seed (or random seed). The random library implements pseudorandom number generators for various data distributions, and random.seed() initializes (seeds) the generator.
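
A quick illustration of seeding behavior (the seed value 1 is arbitrary):

import random

random.seed(1)    # deterministic: the same seed always yields the same sequence
print (random.randrange(100), random.randrange(100))
random.seed(1)
print (random.randrange(100), random.randrange(100))   # identical to the line above
random.seed()     # nondeterministic: seeded from a system source
print (random.randrange(100), random.randrange(100))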

 

The following code example reads a CSV file and converts it to a list of regular dictionary elements. The code continues by creating a random number used to retrieve a random element from the list. Next, a generator of three randomly selected elements is created and displayed. The code continues by displaying three randomly shuffled elements from the list.

 

The next section of code deterministically seeds the random number generator, which means that all generated random numbers will be the same based on the seed. So, the elements displayed will always be the same ones unless the seed is changed.

 

The code then uses the system’s time to nondeterministically generate random numbers and display those three elements. Next, nondeterministic random numbers are generated by another method and those three elements are displayed. The final part creates a names list so random choice and sampling methods can be used to display elements.

import csv, random, time

def read_dict(f, h):
    input_file = csv.DictReader(open(f), fieldnames=h)
    return (input_file)

def conv_reg_dict(d):
    return [dict(x) for x in d]

def r_inds(ls, n):
    # yield a list of n random indices into ls
    length = len(ls) - 1
    yield [random.randrange(length) for _ in range(n)]

def get_slice(ls, n):
    return ls[:n]

def p_line():
    print ()

if __name__ == '__main__':
    f = 'data/names.csv'
    headers = ['first', 'last']
    r_dict = read_dict(f, headers)
    dict_ls = conv_reg_dict(r_dict)
    n = len(dict_ls)
    r = random.randrange(0, n-1)
    print ('randomly selected index:', r)
    print ('randomly selected element:', dict_ls[r])
    elements = 3
    generator = next(r_inds(dict_ls, elements))
    p_line()
    print (elements, 'randomly generated indices:', generator)
    print (elements, 'elements based on indices:')
    for row in generator:
        print (dict_ls[row])
    x = [[i] for i in range(n-1)]
    random.shuffle(x)
    p_line()
    print ('1st', elements, 'shuffled elements:')
    ind = get_slice(x, elements)
    for row in ind:
        print (dict_ls[row[0]])
    seed = 1
    random_seed = random.seed(seed)
    rs1 = random.randrange(0, n-1)
    p_line()
    print ('deterministic seed', str(seed) + ':', rs1)
    print ('corresponding element:', dict_ls[rs1])
    t = time.time()
    random_seed = random.seed(t)
    rs2 = random.randrange(0, n-1)
    p_line()
    print ('non-deterministic time seed', str(t) + ' index:', rs2)
    print ('corresponding element:', dict_ls[rs2], '\n')
    print (elements, 'random elements seeded with time:')
    for i in range(elements):
        r = random.randint(0, n-1)
        print (dict_ls[r], r)
    random_seed = random.seed()
    rs3 = random.randrange(0, n-1)
    p_line()
    print ('non-deterministic auto seed:', rs3)
    print ('corresponding element:', dict_ls[rs3], '\n')
    print (elements, 'random elements auto seed:')
    for i in range(elements):
        r = random.randint(0, n-1)
        print (dict_ls[r], r)
    names = []
    for row in dict_ls:
        name = row['last'] + ', ' + row['first']
        names.append(name)
    p_line()
    print (elements, 'names with "random.choice()":')
    for row in range(elements):
        print (random.choice(names))
    p_line()
    print (elements, 'names with "random.sample()":')
    print (random.sample(names, elements))

 

The code begins by importing the csv, random, and time libraries. Functions read_dict() and conv_reg_dict() have already been explained. Function r_inds() is a generator that yields a list of n random indices into the dictionary list; one is subtracted from the length because Python lists begin at index zero. Function get_slice() returns the first n elements of a list. Function p_line() prints a blank line. The main block begins by reading a CSV file and converting it into a list of regular dictionary elements.

 

The code continues by creating a random number with random.randrange(), based on the number of indices in the dictionary list, and displays the index and associated dictionary element. Next, next() retrieves a list of three randomly generated indices from the r_inds() generator, and those indices and their associated elements are printed.

 

The next part of the code builds a list of indices, x, and randomly shuffles it. Function get_slice() then takes the first three entries of the shuffled list, and the three corresponding dictionary elements are displayed. The code continues by seeding the random number generator with a fixed number (seed = 1), so the random number generated from this seed will be the same each time the program is run.

 

This means that the dictionary element displayed will also be the same. Next, two methods for creating nondeterministic random numbers are presented: random.seed(t), where t varies with the system time, and random.seed(), which with no argument seeds the generator from a system source. Randomly generated elements are displayed for each method. The final part of the code creates a names list holding just first and last names, so that random.choice() and random.sample() can be used.

 

MongoDB and JSON

JSON

MongoDB is a document-based database classified as NoSQL. NoSQL (Not Only SQL) is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar, and graph formats. MongoDB stores data in JSON-like documents with flexible (optional) schemas.

 

It integrates extremely well with Python. A MongoDB collection is conceptually like a table in a relational database, and a document is conceptually like a row. JSON is a lightweight data-interchange format that is easy for humans to read and write. It is also easy for machines to parse and generate.

 

Database queries against MongoDB are handled with PyMongo. PyMongo is a Python distribution containing tools for working with MongoDB, and it is the recommended way to work with MongoDB from Python. PyMongo was created to leverage the advantages of Python as a programming language and MongoDB as a database. The pymongo library is the official native driver for MongoDB, meaning it talks to the database directly rather than through an intermediate layer. It is a third-party package, so it must be installed (for example, with pip) and imported before use.
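
A minimal sketch of getting PyMongo connected, assuming a MongoDB server is already running on the default local port (the database and collection names are placeholders):

# pip install pymongo
from pymongo import MongoClient

client = MongoClient('localhost', port=27017)   # connect to the local MongoDB server
db = client['test']                             # databases are created lazily on first write
collection = db['names']                        # a collection is roughly analogous to a table
print (client.list_database_names())            # quick check that the connection works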

 

The following code example reads a CSV file and converts it to a list of regular dictionary elements. The code continues by creating a JSON file from the dictionary list and saving it to disk. Next, the code connects to MongoDB and inserts the JSON data. The final part of the code manipulates data in the MongoDB database. First, all data in the collection is queried and a few records are displayed. Second, the cursor is rewound; rewind sets the cursor back to the first record. Finally, various queries are performed.

import json, csv, sys, os
sys.path.append(os.getcwd() + '/classes')
import conn

def read_dict(f, h):
    input_file = csv.DictReader(open(f), fieldnames=h)
    return (input_file)

def conv_reg_dict(d):
    return [dict(x) for x in d]

def dump_json(f, d):
    with open(f, 'w') as f:
        json.dump(d, f)

def read_json(f):
    with open(f) as f:
        return json.load(f)

if __name__ == '__main__':
    f = 'data/names.csv'
    headers = ['first', 'last']
    r_dict = read_dict(f, headers)
    dict_ls = conv_reg_dict(r_dict)
    json_file = 'data/names.json'
    dump_json(json_file, dict_ls)
    data = read_json(json_file)
    obj = conn.conn('test')
    db = obj.getDB()
    names = db.names
    names.drop()
    for i, row in enumerate(data):
        row['_id'] = i
        names.insert_one(row)
    n = 3
    print('1st', n, 'names:')
    people = names.find()
    for i, row in enumerate(people):
        if i < n:
            print (row)
    people.rewind()
    print('\n1st', n, 'names with rewind:')
    for i, row in enumerate(people):
        if i < n:
            print (row)
    print ('\nquery 1st', n, 'names')
    first_n = names.find().limit(n)
    for row in first_n:
        print (row)
    print ('\nquery last', n, 'names')
    length = names.count_documents({})  # Cursor.count() was removed in newer PyMongo
    last_n = names.find().skip(length - n)
    for row in last_n:
        print (row)
    fnames = ['Ella', 'Lou']
    lnames = ['Vader', 'Pole']
    print ('\nquery Ella:')
    query_1st_in_list = names.find( {'first':{'$in':[fnames[0]]}} )
    for row in query_1st_in_list:
        print (row)
    print ('\nquery Ella or Lou:')
    query_1st = names.find( {'first':{'$in':fnames}} )
    for row in query_1st:
        print (row)
    print ('\nquery Lou Pole:')
    query_and = names.find( {'first':fnames[1], 'last':lnames[1]} )
    for row in query_and:
        print (row)
    print ('\nquery first name Ella or last name Pole:')
    query_or = names.find( {'$or':[{'first':fnames[0]}, {'last':lnames[1]}]} )
    for row in query_or:
        print (row)
    pattern = '^Sch'
    print ('\nquery regex pattern:')
    query_like = names.find( {'last':{'$regex':pattern}} )
    for row in query_like:
        print (row)
    pid = names.count_documents({})  # Collection.count() was also removed
    doc = {'_id':pid, 'first':'Wendy', 'last':'Day'}
    names.insert_one(doc)
    print ('\ndisplay added document:')
    q_added = names.find({'first':'Wendy'})
    print (q_added.next())
    print ('\nquery last n documents:')
    q_n = names.find().skip((pid-n)+1)
    for _ in range(n):
        print (q_n.next())
Class conn (stored in the classes directory as conn.py):
class conn:
    from pymongo import MongoClient
    client = MongoClient('localhost', port=27017)
    def __init__(self, dbname):
        self.db = conn.client[dbname]
    def getDB(self):
        return self.db

 

The code begins by importing the json, csv, sys, and os libraries. Next, a path to class conn is appended to sys.path; method getcwd() (from the os library) returns the current working directory, and the classes subdirectory is appended to it. Class conn is then imported. I built this class to simplify connectivity to the database from any program.

 

The code continues with four functions. Functions read_dict() and conv_reg_dict() were explained earlier. Function dump_json() writes JSON data to disk. Function read_json() reads JSON data from disk. The main block begins by reading a CSV file and converting it into a list of regular dictionary elements.

 

Next, the list is dumped to disk as JSON. The code continues by creating a connection to the test database through class conn and assigning the instance to variable obj. You can use any database name you wish, but test is MongoDB's default. Next, the database handle is assigned to db by method getDB() from obj. Collection names is then referenced in MongoDB and assigned to variable names. When prototyping, I always drop the collection before manipulating it.

 

This eliminates duplicate key errors. The code continues by inserting the JSON data into the collection. For each document in a MongoDB collection, I explicitly create primary key values by assigning sequential numbers to _id. MongoDB exclusively uses _id as the primary key identifier for each document in a collection. If you don’t name it yourself, a system identifier is automatically created, which is messy to work with in my opinion.

 

The code continues with the PyMongo query names.find(), which retrieves all documents from the names collection. Three records are displayed just to verify that the query is working. To reuse a cursor that has already been consumed, rewind() must be issued; it sets the cursor back to the first document. The next PyMongo query accesses and displays the first three (n = 3) documents with limit().

 

The next query accesses and displays the last three documents. Next, we move into more complex queries. First, access documents with the first name Ella. Second, access documents with first names Ella or Lou. Third, access the document for Lou Pole. Fourth, access documents with first name Ella or last name Pole. Next, a regular expression is used to access documents with last names beginning with Sch. A regular expression is a sequence of characters that defines a search pattern. Finally, add a new document, display it, and display the last three documents in the collection.

 

Understand the Data: Basic Questions

Questions

Once you have access to the data you’ll be using, it’s good to have a battery of standard questions that you always ask about it. This is a good way to hit the ground running with your analyses, rather than risk analysis paralysis. It is also a good safeguard to identify problems with the data as quickly as possible.

 

A few good generic questions to ask are as follows:

  • How big is the dataset?
  • Is this the entire dataset?
  • Is this data representative enough? For example, maybe data was only collected for a subset of users.

 

  • Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic to a web server might be a single denial-of-service attack.

 

  • Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.
  • Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc.
  • Are the supposedly unique identifiers actually unique? What does it mean if they aren’t?
  • If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?
  • When data entries are blank, where do they come from?
  • How common are blank entries?

 

SOWs (statements of work) will often include an appendix that describes the available data and addresses questions like these. If any of them can't be answered up front, it is common to clear them up in the first round of analysis and make sure that everybody agrees that the answers are sane.

 

The most important question to ask about the data is whether it can solve the business problem that you are trying to tackle. If not, then you might need to look into additional sources of data or modify the work that you are planning.

 

Speaking from personal experience, I have been inclined to neglect these preliminary questions. I am excited to get into the actual analysis, so I’ve sometimes jumped right in without taking the time to make sure that I know what I’m doing.

 

For example, I once had a project where there was a collection of motors and time series data monitoring their physical characteristics: one time series per motor. My job was to find leading indicators of failure, and I started doing this by comparing the last day’s worth of time series for a given motor (i.e., the data taken right before it failed) against its previous data.

 

Well, I realized a couple weeks in that sometimes the time series stopped long before the motor actually failed, and in other cases, the time series data continued long after the motor was dead. The actual times the motors had died were listed in a separate table, and it would have been easy for me to double‐check early on that they corresponded to the ends of the time series.

 

Understand the Data: Data Wrangling

Data Wrangling

Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.

 

Data wrangling is the main area where data scientists need skills that a traditional statistician or analyst doesn’t have. The data is often stored in a special-purpose database that requires specialized tools to access. There could be so much of it that Big Data techniques are required to process it. You might need to use performance tricks to make things run quickly. Especially with messy data, the preprocessing pipelines are often so complex that it is very difficult to keep the code organized.

 

Speaking of messy data, I should tell you this upfront: industrial datasets are always more convoluted than you would think they reasonably should be. The question is not whether the problems exist but whether they impact your work. My recipe for figuring out how a particular dataset is broken includes the following:

 

If the raw data is text, look directly at the plain files in a text editor or something similar. Things such as irregular date formats, irregular capitalizations, and lines that are clearly junk will jump out at you.

 

If there is a tool that is supposed to be able to open or process the data, make sure that it can actually do it. For example, if you have a CSV file, try opening it in something that reads data frames. Did it read all the rows in? If not, maybe some rows have the wrong number of entries. Did the column that is supposed to be a datetime actually get read in as a datetime? If not, then maybe the formatting is irregular.
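
For example, a quick sanity check with pandas might look like this (the file name and column names are placeholders for your own data):

import pandas as pd

df = pd.read_csv('data/transactions.csv', parse_dates=['timestamp'])
print (len(df))              # did all the rows get read in?
print (df.dtypes)            # did the timestamp column actually become a datetime?
print (df.isnull().sum())    # how many blank entries are in each column?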

 

Do some histograms and scatterplots. Are these numbers realistic, given what you know about the real‐life situation? Are there any massive outliers?

 

Take some simple questions that you already know the (maybe approximate) answer to, answer them based on this data, and see if the results agree. For example, you might try to calculate the number of customers by counting how many unique customer IDs there are. If these numbers don’t agree, then you’ve probably misunderstood something about the data.
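
Concretely, a check of this kind might be as small as the following (the column name and the expected figure are hypothetical):

import pandas as pd

df = pd.read_csv('data/transactions.csv')
print ('unique customer IDs:', df['customer_id'].nunique())
# if the business says there are roughly 12,000 customers and this prints 500
# or 5 million, you have probably misunderstood something about the data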

 

Understand the Data: Exploratory Analysis

Once you have the data digested into a usable format, the next step is exploratory analysis. This basically means poking around in the data, visualizing it in lots of different ways, trying out different ways to transform it, and seeing what there is to see. This stage is very creative, and it’s a great place to let your curiosity run a little wild. Feel free to calculate some correlations and similar things, but don’t break out the fancy machine learning classifiers. Keep things simple and intuitive.
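
A first exploratory pass can stay this simple (the dataset and column are hypothetical, and any plotting library you like will do):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/transactions.csv')
print (df.describe())                       # quick summary statistics
print (df.select_dtypes('number').corr())   # simple correlations, nothing fancy
df['purchase_amount'].hist(bins=50)         # eyeball the distribution for outliers or bimodality
plt.show()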

 

There are two things that you typically get out of exploratory analysis:

 

You develop an intuitive feel for the data, including what the salient patterns look like visually. This is especially important if you’re going to be working with similar data a lot in the future.

 

You get a list of concrete hypotheses about what’s going on in the data. Oftentimes, a hypothesis will be motivated by a compelling graphic that you generated: a snapshot of a time series that shows an unmistakable pattern, a scatterplot demonstrating that two variables are related to each other, or a histogram that is clearly bimodal.

 

A common misconception is that data scientists don’t need visualizations. This attitude is not only inaccurate: it is very dangerous. Most machine learning algorithms are not inherently visual, but it is very easy to misinterpret their outputs if you look only at the numbers; there is no substitute for the human eye when it comes to making intuitive sense of things.

 

Extract Features

This stage has a lot of overlap with exploratory analysis and data wrangling. A feature is really just a number or a category that is extracted from your data and describes some entity. For example, you might extract the average word length from a text document or the number of characters in the document. Or, if you have temperature measurements, you might extract the average temperature for a particular location.

 

In practical terms, feature extraction means taking your raw datasets and distilling them down into a table with rows and columns. This is called “tabular data.” Each row corresponds to some real‐world entity, and each column gives a single piece of information (generally a number) that describes that entity. Virtually all analytics techniques, from lowly scatterplots to fancy neural networks, operate on tabular data.
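
As an illustration, here is a sketch of distilling raw text documents into a small feature table (the documents are invented, and the features are the ones mentioned above):

import pandas as pd

docs = {'doc1': 'the pump failed after two weeks',
        'doc2': 'routine maintenance completed, no issues found'}
rows = []
for name, text in docs.items():
    words = text.split()
    rows.append({'doc': name,
                 'num_chars': len(text),
                 'avg_word_length': sum(len(w) for w in words) / len(words)})
features = pd.DataFrame(rows)   # one row per entity, one column per feature
print (features)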

 

Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure.

 

Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers.

 

Sometimes, there is also room for creativity as to what entities you are extracting features about. For example, let’s say that you have a bunch of transaction logs, each of which gives a person’s name and e‐mail address. Do you want to have one row per human or one row per e‐mail address? For many real‐world situations, you want one row per human (in which case, the number of unique e‐mail addresses they have might be a good feature to extract!), but that opens the very thorny question of how you can tell when two people are the same based on their names.

 

Most features that we extract will be used to predict something. However, you may also need to extract the thing that you are predicting, which is also called the target variable. For example, I was once tasked with predicting whether my client’s customers would lose their brand loyalty. There was no “loyalty” field in the data: it was just a log of various customer interactions and transactions. I had to figure out a way to measure “loyalty.”

 

Model

Once features have been extracted, most data science projects involve some kind of machine learning model. Maybe this is a classifier that guesses whether a customer is still loyal, a regression model that predicts a stock’s price on the next day, or a clustering algorithm that breaks customers into different segments.

 

In many data science projects, the modeling stage is quite simple: you just take a standard suite of models, plug your data into each one of them, and see which one works best. In other cases, a lot of care is taken to carefully tune a model and eke out every last bit of performance.
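
A sketch of the "standard suite of models" approach using scikit-learn might look like this (the features and target are synthetic placeholders, and the model list is only an example):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # synthetic tabular features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary target

models = {'logistic regression': LogisticRegression(),
          'decision tree': DecisionTreeClassifier(),
          'random forest': RandomForestClassifier()}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # see which one works best
    print ('{:<20} {:.3f}'.format(name, scores.mean()))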

 

Analyzing the results should really happen at every stage of a data science project, but it becomes especially crucial when examining the output of the modeling stage. If you have identified different clusters, what do they correspond to? Does your classifier work well enough to be useful? Is there anything interesting about the cases in which it fails?

 

This stage is what allows for course corrections in a project and gives ideas for what to do differently if there is another iteration. If your client is a human, it is common to use a variety of models, tuned in different ways, to examine different aspects of your data. If your client is a machine though, you will probably need to zero in on a single, canonical model that will be used in production.

 

Present Results

If your client is a human, then you will probably have to give either a slide deck or a written report describing the work you did and what your results were. You are also likely to have to do this even if your main clients are machines.

 

Communication in slide decks and prose is a difficult, important skill set in itself. But it is especially tricky with data science, where the material you are communicating is highly technical and you are presenting to a broad audience. Data scientists must communicate fluidly with business stakeholders, domain experts, software engineers, and business analysts. These groups tend to have different knowledge bases coming in, different things they will be paying attention to, and different presentation styles to which they are accustomed.

 

Deploy Code

Deploy Code

If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this falls into one of two categories:

 

Batch analytics code. This will be used to redo an analysis similar to the one that has already been done, on data that will be collected in the future. Sometimes, it will produce some human‐readable analytics reports. Other times, it will train a statistical model that will be referenced by other code.

 

Real‐time code. This will typically be an analytical module in a larger software package, written in a high‐performance programming language and adhering to all the best practices of software engineering.

 

There are three typical deliverables from this stage:

 

The code itself.

Some documentation of how to run the code. Sometimes, this is a stand-alone document, often called a “runbook.” Other times, the documentation is embedded in the code itself.

 

Usually, you also need some way to test that the code operates correctly. Especially for real‐time code, this will normally take the form of unit tests. For batch processes, it is sometimes a sample input dataset (designed to exercise all the relevant edge cases) along with what the output should look like.
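
A minimal sketch of what such a unit test might look like, using Python's built-in unittest (the bonus() function is a made-up example that mirrors the bonus rules from earlier in the blog):

import unittest

def bonus(sales):
    # hypothetical analytical function under test
    if sales < 10000:
        return 0
    elif sales <= 20000:
        return sales * .02
    else:
        return sales * .03

class TestBonus(unittest.TestCase):
    def test_edge_cases(self):
        self.assertEqual(bonus(9999), 0)               # below the threshold: no bonus
        self.assertAlmostEqual(bonus(10000), 200.0)    # boundary: 2% kicks in
        self.assertAlmostEqual(bonus(20000), 400.0)    # boundary: still 2%
        self.assertAlmostEqual(bonus(50000), 1500.0)   # above 20,000: 3%

if __name__ == '__main__':
    unittest.main()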

 

In deploying code, data scientists often take on a dual role as full‐fledged software engineers. Especially with very intricate algorithms, it often just isn’t practical to have one person spec it out and another implement the same thing for production.

 

Iterating

Data science is a deeply iterative process, even more so than typical software engineering. This is because in software, you generally know at least roughly what you are ultimately trying to create, even if you take an iterative approach to writing it. In data science, it is usually an open question which features you will end up extracting and which model you will train. For this reason, the data science process should be built around the goal of being able to change things painlessly.

 

My recommendations are as follows:

Try to get preliminary results as quickly as possible after you’ve understood the data: a scatterplot or histogram that shows a clear pattern in the data, or maybe a simple model based on crude preliminary features that nonetheless works. Sometimes an analysis is doomed to failure because there just isn’t much signal in the data. If that is the case, you want to know sooner rather than later, so that you can change your focus.

 

Automate your analysis in a single script so that it’s easy to run the whole thing with one command. This is a point that I’ve learned the hard way: it is really, really easy after several hours at the command line to lose track of exactly what processing you did to get your data into its current form. Keep things reproducible from the beginning.

 

Keep your code modular and broken out into clear stages. This makes it easy to modify, add in, and take out steps as you experiment.
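
A skeleton of that kind of single-command, staged analysis script might look like the following (every function name is a placeholder for your own steps):

# run the whole analysis with one command: python run_analysis.py
def load_data():
    ...    # read the raw data from wherever it lives

def clean(raw):
    ...    # filtering, fixing formats, handling blanks

def extract_features(clean_data):
    ...    # distill entities into a tabular feature set

def model_and_report(features):
    ...    # fit models, produce plots and summaries

def main():
    raw = load_data()
    cleaned = clean(raw)
    features = extract_features(cleaned)
    model_and_report(features)

if __name__ == '__main__':
    main()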

Notice how much of this comes down to considerations of software, not analytics. The code must be flexible enough to solve all manner of problems, powerful enough to do it efficiently, and comprehensible enough to edit quickly if objectives change. Doing this requires that data scientists use flexible, powerful programming languages, which I will discuss in the next blog.