Lecture notes on Data Science

Applied Data Science
Ian Langmore and Daniel Krasner

Contents

I Programming Prerequisites

1 Unix
  1.1 History and Culture
  1.2 The Shell
  1.3 Streams
    1.3.1 Standard streams
    1.3.2 Pipes
  1.4 Text
  1.5 Philosophy
    1.5.1 In a nutshell
    1.5.2 More nuts and bolts
  1.6 End Notes

2 Version Control with Git
  2.1 Background
  2.2 What is Git
  2.3 Setting Up
  2.4 Online Materials
  2.5 Basic Git Concepts
  2.6 Common Git Workflows
    2.6.1 Linear Move from Working to Remote
    2.6.2 Discarding changes in your working copy
    2.6.3 Erasing changes
    2.6.4 Remotes
    2.6.5 Merge conflicts

3 Building a Data Cleaning Pipeline with Python
  3.1 Simple Shell Scripts
  3.2 Template for a Python CLI Utility

II The Classic Regression Models

4 Notation
  4.1 Notation for Structured Data

5 Linear Regression
  5.1 Introduction
  5.2 Coefficient Estimation: Bayesian Formulation
    5.2.1 Generic setup
    5.2.2 Ideal Gaussian World
  5.3 Coefficient Estimation: Optimization Formulation
    5.3.1 The least squares problem and the singular value decomposition
    5.3.2 Overfitting examples
    5.3.3 L2 regularization
    5.3.4 Choosing the regularization parameter
    5.3.5 Numerical techniques
  5.4 Variable Scaling and Transformations
    5.4.1 Simple variable scaling
    5.4.2 Linear transformations of variables
    5.4.3 Nonlinear transformations and segmentation
  5.5 Error Metrics
  5.6 End Notes

6 Logistic Regression
  6.1 Formulation
    6.1.1 Presenter's viewpoint
    6.1.2 Classical viewpoint
    6.1.3 Data generating viewpoint
  6.2 Determining the regression coefficient w
  6.3 Multinomial logistic regression
  6.4 Logistic regression for classification
  6.5 L1 regularization
  6.6 Numerical solution
    6.6.1 Gradient descent
    6.6.2 Newton's method
    6.6.3 Solving the L1 regularized problem
    6.6.4 Common numerical issues
  6.7 Model evaluation
  6.8 End Notes

7 Models Behaving Well
  7.1 End Notes

III Text Data

8 Processing Text
  8.1 A Quick Introduction
  8.2 Regular Expressions
    8.2.1 Basic Concepts
    8.2.2 Unix Command line and regular expressions
    8.2.3 Finite State Automata and PCRE
    8.2.4 Backreference
  8.3 Python RE Module
  8.4 The Python NLTK Library
    8.4.1 The NLTK Corpus and Some Fun things to do

IV Classification

9 Classification
  9.1 A Quick Introduction
  9.2 Naive Bayes
    9.2.1 Smoothing
  9.3 Measuring Accuracy
    9.3.1 Error metrics and ROC Curves
  9.4 Other classifiers
    9.4.1 Decision Trees
    9.4.2 Random Forest
    9.4.3 Out-of-bag classification
    9.4.4 Maximum Entropy

V Extras

10 High(er) performance Python
  10.1 Memory hierarchy
  10.2 Parallelism
  10.3 Practical performance in Python
    10.3.1 Profiling
    10.3.2 Standard Python rules of thumb
    10.3.3 For loops versus BLAS
    10.3.4 Multiprocessing Pools
    10.3.5 Multiprocessing example: Stream processing text files
    10.3.6 Numba
    10.3.7 Cython

What is data science?

With the major technological advances of the last two decades, coupled in part with the internet explosion, a new breed of analyst has emerged. The exact role, background, and skill set of a data scientist are still in the process of being defined, and it is likely that by the time you read this some of what we say will seem archaic. In very general terms, we view a data scientist as an individual who uses current computational techniques to analyze data. Now you might make the observation that there is nothing particularly novel in this, and subsequently ask what has forced the definition.[1]
After all, statisticians, physicists, biologists, finance quants, etc. have been looking at data since their respective fields emerged. One short answer comes from the fact that the data sphere has changed and, hence, a new set of skills is required to navigate it effectively. The exponential increase in computational power has provided new means to investigate the ever growing amount of data being collected every second of the day. What this implies is that any modern data analyst will have to make the time investment to learn the computational techniques necessary to deal with the volume and complexity of today's data. In addition to those of mathematics and statistics, these software skills are domain transferable, and so it makes sense to create a job title that is also transferable. We could also point to the "data hype" created in industry as a culprit for the term data science, with the "science" creating an aura of validity and facilitating LinkedIn headhunting.

[1] William S. Cleveland decided to coin the term data science and wrote "Data Science: An action plan for expanding the technical areas of the field of statistics" [Cle]. His report outlined six points for a university to follow in developing a data analyst curriculum.

What skills are needed?

One neat way we like to visualize the data science skill set is with Drew Conway's Venn Diagram [Con]; see Figure 1.

Figure 1: Drew Conway's Venn Diagram

Math and statistics is what allows us to properly quantify a phenomenon observed in data. For the sake of narrative, let's take a complex deterministic situation, such as whether or not someone will make a loan payment, and attempt to answer this question with a limited number of variables and an imperfect understanding of those variables' influence on the event we wish to predict. With the exception of your friendly real estate agent, we generally acknowledge our lack of soothsayer ability and make statements about the probability of this event. These statements take a mathematical form, for example

    P[makes-loan-payment] = e^(α + β · creditscore),

where the above quantifies the risk associated with this event. Deciding on the best coefficients α and β can be done quite easily by a host of software packages. In fact anyone with decent hacking skills can achieve the goal. Of course, a simple model such as this would convince no one and would call for substantive expertise (more commonly called domain knowledge) to make real progress. In this case, a domain expert would note that additional variables, such as the loan to value ratio and housing price index, are needed as they have a huge effect on payment activity. These variables and many others would allow us to arrive at a "better" model

    P[makes-loan-payment] = e^(α + β · X).        (1)

Finally we have arrived at a model capable of fooling someone! We could keep adding variables until the model almost certainly fits the historic risk quite well. BUT, how do we know that this will allow us to quantify risk in the future?[2] To make some sense of our uncertainty about our model we need to know exactly what (1) means. In particular, did we include too many variables and overfit? Did our method of solving (1) arrive at a good solution or just numerical noise? Most importantly, how appropriate is the logistic regression model to begin with? Answering these questions is often as much an art as a science, but in our experience, sufficient mathematical understanding is necessary to avoid getting lost.

[2] The distinction between uncertainty and risk has been discussed quite extensively by Nassim Taleb [Tal05, Tal10].
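To make the "host of software packages" remark concrete, here is a minimal sketch of fitting a model of this flavor with scikit-learn. The file name loans.csv, the columns creditscore, loan_to_value, and hpi, and the target column made_payment are all hypothetical placeholders, and scikit-learn fits the standard logistic form rather than the bare exponential shorthand written above.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical historical loan data, one row per loan.
    df = pd.read_csv("loans.csv")
    X = df[["creditscore", "loan_to_value", "hpi"]]  # explanatory variables
    y = df["made_payment"]                           # 1 if the payment was made, 0 otherwise

    # Fit P[makes-loan-payment | x]; C controls the amount of regularization.
    model = LogisticRegression(C=1.0)
    model.fit(X, y)

    print(model.intercept_, model.coef_)   # the estimated coefficients (alpha, beta)
    print(model.predict_proba(X)[:5])      # predicted payment probabilities for five loans

Getting numbers out of such a package is the easy part; the questions above about overfitting and model appropriateness are the part that requires the mathematics.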
What is the motivation for, and focus of, this course?

Just as common as the hacker with no domain knowledge, or the domain expert with no statistical know-how, is the traditional academic with meager computing skills. Academia rewards papers containing original theory. For the most part it does not reward the considerable effort needed to produce high quality, maintainable code that can be used by others and integrated into larger frameworks. As a result, the type of code typically put forward by academics is completely unusable in industry, or by anyone else for that matter. It is often not the purpose, or worth the effort, to write production level code in an academic environment. The importance of this cannot be overstated.

Consider a 20 person start-up that wishes to build a smartphone app that recommends restaurants to users. The data scientist hired for this job will need to interact with the company database (they will likely not be handed a neat csv file), deal with falsely entered or inconveniently formatted data, and produce legible reports, as well as a working model for the rest of the company to integrate into its production framework. The scientist may be expected to do this work without much in the way of software support. Now, considering how easy it is to blindly run most predictive software, our hypothetical company will be tempted to use a programmer with no statistical knowledge to do this task. Of course, the programmer will fall into analytic traps such as the ones mentioned above, but that might not deter anyone from being content with the output. This anecdote may seem contrived, but in reality it is something we have seen time and time again. The current world of data analysis calls for a myriad of skills, and clean programming, database interaction, and an understanding of architecture have all become the minimum needed to succeed.

The purpose of this course is to take people with strong mathematical/statistical knowledge and teach them software development fundamentals.[3] This course will cover

- Design of small software packages
- Working in a Unix environment
- Designing software in teams
- Fundamental statistical algorithms such as linear and logistic regression
- Overfitting and how to avoid it
- Working with text data (e.g. regular expressions)
- Time series
- And more...

[3] Our view of what constitutes the necessary fundamentals is strongly influenced by the team at Software Carpentry [Wila].

Part I: Programming Prerequisites

Chapter 1: Unix

    Simplicity is the key to brilliance.
        - Bruce Lee

1.1 History and Culture

The Unix operating system was developed in 1969 at AT&T's Bell Labs. Today Unix lives on through its open source offspring, Linux. This operating system is the dominant force in scientific computing, supercomputing, and web servers. In addition, Mac OS X (which is Unix based) and a variety of user friendly Linux operating systems represent a significant portion of the personal computer market. To understand the reasons for this success, some history is needed.

In the 1960s, MIT, AT&T Bell Labs, and General Electric developed a time-sharing (meaning different users could share one system) operating system called Multics. Multics was found to be too complicated. This "failure" led researchers to develop a new operating system that focused on simplicity. This operating system emphasized ease of communication among many simple programs.
Kernighan and Pike summarized this as "the idea that the power of a system comes more from the relationships among programs than from the programs themselves."

The Unix community was integrated with the Internet and networked computing from the beginning. This, along with the solid fundamental design, could have led to Unix becoming the dominant computing paradigm during the 1980's personal computer revolution. Unfortunately, infighting and poor business decisions kept Unix out of the mainstream. Unix found a second life, not so much through better business decisions, but through the efforts of Richard Stallman and the GNU Project. The goal was to produce a Unix-like operating system that depended only on free software. Free in this case meant "users are free to run the software, share it, study it, and modify it." The GNU Project succeeded in creating a huge suite of utilities for use with an operating system (e.g. a C compiler) but was lacking the kernel (which handles communication between e.g. hardware and software, or among processes). It just so happened that Linus Torvalds had developed a kernel (the "Linux" kernel) in need of good utilities. Together the Linux operating system was born.

1.2 The Shell

Modern Linux distributions, such as Ubuntu, come with a graphical user interface (GUI) every bit as slick as Windows or Mac OS X. Software is easy to install and, with at most a tiny bit of work, all non-proprietary applications work fine. The real power of Unix is realized when you start using the shell.

Figure 1.1: Ubuntu's GUI and CLI

Digression 1: Linux without tears

The easiest way to have access to the bash shell and a modern scientific computing environment is to buy hardware that is pre-loaded with Linux. This way, the hardware vendor takes responsibility for maintaining the proper drivers. Use caution when reading blogs talking about how "easy" it was to get some off-brand laptop computer working with Linux... this could work for you, or you could be left with a giant headache. Currently there are a number of hardware vendors that ship machines with Linux: System76, ZaReason, and Dell (with their "Project Sputnik" campaign). Mac OS X is built on Unix, and also qualifies as a Linux machine of sorts. The disadvantage (of a Mac) is price, and the fact that the package management system (for installing software) that comes with Ubuntu Linux is the cleanest, easiest ever.

The shell allows you to control your computer using commands entered at a keyboard. This sort of interaction is called a command line interface (CLI). "The shell" in our case will refer to the Bourne-again shell, bash. The bash shell provides an interface to your computer's OS along with a number of utilities and minilanguages. We will introduce you to the shell during the Software Carpentry bootcamp. For those unable to attend, we refer you to the online materials.

Why learn the shell?

- The shell provides a number of utilities that allow you to perform tasks such as interacting with your OS or modifying a text file.
- The shell provides a number of minilanguages that allow you to automate these tasks.
- Often programs must communicate with a user or another machine. A CLI is a very simple way to do this. Trust me, you don't want to create a GUI for every script you write.
- Usually the only way to communicate with a remote computer/cluster is using a shell. Because of this, programs and workflows that only work in the shell are common. For this reason alone, a modern scientist must learn to use the shell.
Shell utilities have a common format that is almost always adhered to:

    utilityname options arguments

The utilityname is the name of the utility, such as cut, which picks out a column of a csv file. The options modify the behavior of the program. In the case of cut this could mean specifying how the file is delimited (tabs, spaces, commas, etc.) and which column to pick out. In general, options should in fact be optional, in that the utility will work without them (but may not give the desired behavior). The arguments come last. These are not optional and can often be thought of as the external input to the program. In the case of cut this is the file from which to extract a column. Putting this together, if data.csv looks like:

    name,age,weight
    ian,1,11
    chang,2,22

then

    cut -d, -f2 data.csv        (1.1)

(here cut is the utilityname, -d, -f2 are the options, and data.csv is the argument) produces (more specifically, prints on the terminal screen)

    age
    1
    2

1.3 Streams

A stream is a general term for a sequence of data elements made available over time. This data is processed one element at a time. For example, consider the data file (which we will call data.csv):

    name,age,weight
    ian,1,11
    chang,2,22
    daniel,3,33

This data may exist in one contiguous block in memory/disk or not. In either case, to process this data as a stream, you should view it as a contiguous block that looks like

    name,age,weight\n ian,1,11\n chang,2,22\n daniel,3,33

The special character \n is called a newline character and represents the start of a new line. The command cut -d, -f2 data.csv will pick out the second column of data.csv; in other words, it returns

    age
    1
    2
    3

or, thought of as a stream,

    age\n 1\n 2\n 3

This could be accomplished by reading the file in sequence, starting to store the characters in a buffer once the first comma is hit, then printing when the second comma is hit. Since the newline is such a special character, many languages provide some means for the user to process each line as a separate item. This is a very simple way to think about data processing. This simplicity is advantageous and allows one to scale stream processing to massive scales. Indeed, the popular Hadoop MapReduce implementation requires that all small tasks operate on streams. Another advantage of stream processing is that memory needs are reduced. Programs that are able to read from stdin and write to stdout are known as filters.

1.3.1 Standard streams

While stream is a general term, there are three streaming input and output channels available on (almost) every machine. These are standard input (stdin), standard output (stdout), and standard error (stderr). Together, these standard streams provide a means for a process to communicate with other processes, or a computer to communicate with other machines (see Figure 1.2).

Figure 1.2: Illustration of the standard streams

Standard input is used to allow a process to read data from another source. A Python programmer could read from standard input, then print the same thing to standard output, using

    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

If data is flowing into stdin, then this will result in the same data being written to stdout. If you launch a terminal, then stdout is (by default) connected to your terminal display. So if a program sends something to stdout, it is displayed on your terminal. By default stdin is connected to your keyboard. Stderr operates sort of like stdout, but all information carries the special tag, "this is an error message." Stderr is therefore used for printing error/debugging information.
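Putting these pieces together, the cut example above can itself be written as a short Python filter that reads from stdin and writes to stdout. The sketch below hard-codes the comma delimiter and the second field; the script name cut2.py is only a placeholder.

    import sys

    # A tiny filter: echo the second comma-delimited field of every line.
    # Usage:  python cut2.py < data.csv      (behaves like: cut -d, -f2 data.csv)
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) >= 2:
            sys.stdout.write(fields[1] + "\n")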
1.3.2 Pipes

The standard streams aren't any good if there isn't any way to access them. Unix provides a very simple means to connect the standard output of one process to the standard input of another. This construct is called a pipe and is written with a vertical bar, |. Utilities tied together with pipes form what is known as a pipeline. Consider the following pipeline:

    cat infile.csv | cut -d, -f1 | sort | uniq -c

The above line reads in a text file and prints it to standard out with cat; the pipe "|" redirects this standard out to the standard in of cut. cut in turn extracts the first column and passes the result to sort, which sends its result to uniq. uniq -c counts the number of occurrences of each distinct line. Let's decompose this step-by-step.

First, print infile.csv to stdout (which is, by default, the terminal) using cat:

    cat infile.csv
    ian,1
    daniel,2
    chang,3
    ian,11

Second, pipe this to cut, which will extract the first field (the -f option) in this comma delimited (the -d, option) file:

    cat infile.csv | cut -d, -f1
    ian
    daniel
    chang
    ian

Third, pipe the output of cut to sort:

    cat infile.csv | cut -d, -f1 | sort
    chang
    daniel
    ian
    ian

Fourth, redirect the output of sort to uniq:

    cat infile.csv | cut -d, -f1 | sort | uniq -c
    1 chang
    1 daniel
    2 ian

It is important to note that uniq counts unique occurrences in consecutive lines of text. If we did not sort the input to uniq, we would have

    cat infile.csv | cut -d, -f1 | uniq -c
    1 ian
    1 daniel
    1 chang
    1 ian

uniq processes the text stream one line at a time and does not have the ability to look ahead and see that "ian" will occur a second time.
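The same counting step can also be done in a single Python filter, which keeps the counts in a dictionary and therefore does not need sorted input. This is only a sketch: the script name count_first_field.py is illustrative, and it assumes comma-delimited lines as in infile.csv above.

    import sys
    from collections import Counter

    # Roughly equivalent to:  cut -d, -f1 | sort | uniq -c
    # Count how many times each first field appears on stdin.
    counts = Counter(line.split(",")[0] for line in sys.stdin)
    for name, count in counts.items():
        sys.stdout.write("%7d %s\n" % (count, name))

It would be run as python count_first_field.py < infile.csv, and it illustrates the trade-off: a filter that remembers state (here, a dictionary of counts) can avoid a sort, at the cost of holding the distinct keys in memory.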
1.4 Text

One surprising thing to some Unix newcomers is the degree to which simple plain text dominates. The preferred file format for most data files and streams is just plain text. Why not use a compressed binary format that would be quicker to read/write using a special reader application? The reason is in the question: a special reader application would be needed. As time goes on, many data formats and reader applications come into, and then out of, favor. Soon your special format data file needs a hard to find application to read it.[1] What about communication between processes on a machine? The same situation arises: as soon as more than one binary format is used, it is possible for one of them to become obsolete. Even if both are well supported, every process needs to specify what format it is using. Another advantage of working with text streams is the fact that humans can visually inspect them for debugging purposes. While binary formats live and die on a quick (computer) time-scale, human languages change on the scale of at least a generation. In fact, one summary of the Unix philosophy goes, "This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

[1] Any user of Microsoft Word documents from the 90's should be familiar with the headaches that can arise from this situation.

This, in addition to the fact that programming in general requires manipulation of text files, means that you are required to master decent text processing software. Here is a brief overview of some popular programs:

- Vim is a powerful text editor designed to allow quick editing of files and minimal hand movement.
- Emacs is another powerful text editor. Some people find that it requires users to contort their hands and leads to wrist problems.
- Gedit and Sublime Text are decent text editors available for Linux and Mac. They are not as powerful as Vim/Emacs, but don't require any special skills to use.
- nano is a simple Unix text editor available on any system. If nano doesn't work, try pico.
- sed is a text stream processing command line utility available in your shell. It can do simple operations on one line of text at a time. It is useful because of its speed, and the fact that it can handle arbitrarily large files.
- awk is an old school minilanguage that allows more complex operations than sed. It is often acknowledged that awk syntax is too complex and that learning to write simple Python scripts is a better game plan.

1.5 Philosophy

The Unix culture carries with it a philosophy about software design. The Unix operating system (and its core utilities) can be seen as examples of this. Let's go over some key rules. With the exception of the rule of collaboration, these appeared previously in [Ray04].

1.5.1 In a nutshell

Rule of Simplicity. Design for simplicity. Add complexity only when you must.

Rule of Collaboration. Make programs that work together. Work together with people to make programs.

1.5.2 More nuts and bolts

We can add more rules to the two main rules above, and provide hints as to how they will guide our software development. Our programs will be small, so (hopefully) few compromises will have to be made.

Rule of Simplicity. This is sometimes expressed as K.I.S.S., or "Keep It Simple Stupid." All other philosophical points presented here can be seen as special cases of this. Complex programs are difficult to debug, implement, maintain, or extend. We will keep things simple by, for example: (i) writing CLI utilities that do one thing well, (ii) avoiding objects unless using
