Big Data with Spark (2019)

How Does Big Data Work with Spark?

Spark is the leading Big Data processing technology in the Hadoop ecosystem these days. This blog explains the main Big Data tools, with a focus on Spark using Python.


I will then move on to some of the fundamental concepts that underlie Big Data frameworks and cluster computing in general, including the famed MapReduce (MR) programming paradigm.


What Is Big Data?

“Big Data,” as the term is used today, is a bit of a misnomer. Massive datasets have been around for a long time, and nobody gave them a special name. Even today, the largest datasets around are generally well outside of the “big data” sphere.


They are generated from scientific experiments, especially particle accelerators, and processed on custom-made architectures of software and hardware.


Instead, Big Data refers to several related trends in datasets (one of which is size) and to the technologies for processing them. The datasets tend to have two properties:


They are, as the name suggests, big. There is no special cutoff for when a dataset is “big.” Roughly though, it happens when it is no longer practical to store or process it all on a single computer.


Instead, we use a cluster of computers, anywhere from a handful of them up to many thousands.


The focus is on making our processing scalable so that it can be distributed over a cluster of arbitrary size with various parts of the analysis going on in parallel. The nodes in the cluster can communicate, but it is kept to a minimum.


The second thing about Big Datasets is that they are often “unstructured.” This is a terribly misleading term. It doesn’t mean that there is no structure to the data, but rather that the dataset doesn’t fit cleanly into a traditional relational database of the kind you query with SQL.


Prototypical examples would be images, PDFs, HTML documents, Excel files that aren’t organized into clean rows and columns, and machine-generated log files.


Traditional databases pre-suppose a very rigid structure to the data they contain, and in exchange, they offer highly optimized performance.


In Big Data though, we need the flexibility to process data that come in any format, and we need to be able to operate on that data in ways that are less predefined.


You often pay through the nose for this flexibility when it comes to your software’s runtime since there are very few optimizations that can be prebuilt into the framework.


Big Data requires a few words of caution. The first is that you should be hesitant about using Big Data tools. They’re all the rage these days, so many people are jumping on the bandwagon blindly. But Big Data tools are almost always slower, harder to set up, and more finicky than their traditional counterparts.


This is partly because they’re new technologies that haven’t matured yet, but it’s also inherent to the problems they’re solving: they need to be so flexible to deal with unstructured data, and they need to run on a cluster of computers instead of a stand-alone machine.


So if your datasets will always be small enough to process with a single machine, or you only need operations that are supported by SQL, you should consider doing that instead.


The final word of caution is that even if you are using Big Data tools, you should probably still be using traditional technologies in conjunction with them. For example, I very rarely use Big Data to do machine learning or data visualization.


Typically, I use Big Data tools to extract the relevant features from my data. The extracted features take up much less space compared to the raw dataset, so I can then put the output onto a normal machine and do the actual analytics using something such as Pandas.


I don’t mean to knock Big Data tools. They really are fantastic. It’s just that there is so much hype and ignorance surrounding them: like all tools, they are great for some problems and terrible for others. With those disclaimers out of the way, let’s dive in.


Hadoop: The File System and the Processor


The modern field of Big Data largely started when Google published its seminal paper on MapReduce, a cluster computing framework it had created to process massive amounts of web data.


After reading the paper, an engineer named Doug Cutting decided to write a free, open-source implementation of the same idea.


Google’s MR was written in C++, but he decided to do it in Java. Cutting named this new implementation Hadoop, after his son’s stuffed elephant.


Hadoop caught on like wildfire and quickly became almost synonymous with Big Data. Many additional tools were developed that ran on Hadoop clusters or that made it easier to write MR jobs for Hadoop.


There are two parts to Hadoop. The first is the Hadoop Distributed File System (HDFS). It allows you to store data on a cluster of computers without worrying about what data is on which node.


Instead, you refer to locations in HDFS just as you would for files in a normal directory system. Under the hood, HDFS takes care of which data is stored on which node, keeps multiple copies of the data in case a node fails, and handles other boilerplate.


The second part of Hadoop is the actual MR framework, which reads in data from HDFS, processes it in parallel, and writes its output to HDFS. I’m actually not going to say much about the Hadoop MR framework, because ironically it’s a bit of a dinosaur these days (shows you how quickly Big Data is evolving!).


There is a huge amount of overhead for its MR jobs (most damningly, it always reads its input from disk and writes output to disk, and disk IO is much more time-consuming than just doing things in RAM).


Additionally, it does a really lousy job of integrating with more conventional programming languages. The community’s focus has shifted toward other tools, which still operate on data in HDFS, most notably Spark, and I’ll dwell more on them.


Example PySpark Script


PySpark is the most popular way for Python users to work with Big Data. It is a Python library that plugs into the Spark computational framework and lets you parallelize your computations across a cluster, either from a script or from an interactive Python shell.


The code reads similarly to normal Python, except that there is a SparkContext object whose methods let you access the Spark framework.


This script, whose content I will explain later, uses parallel computing to calculate the number of times every word appears in a text document.

# Create the SparkContext object
from pyspark import SparkConf, SparkContext
conf = SparkConf()
sc = SparkContext(conf=conf)

# Read file lines and parallelize them
# over the cluster in a Spark RDD
lines = open("myfile.txt")
lines_rdd = sc.parallelize(lines)

# Remove punctuation, make lines lowercase
def clean_line(s):
    s2 = s.strip().lower()
    s3 = s2.replace(".", "").replace(",", "")
    return s3

lines_clean = lines_rdd.map(clean_line)

# Break each line into words
words_rdd = lines_clean.flatMap(lambda l: l.split())

# Count words
def merge_counts(count1, count2):
    return count1 + count2

words_w_1 = words_rdd.map(lambda w: (w, 1))
counts = words_w_1.reduceByKey(merge_counts)

# Collect counts and display
for word, count in counts.collect():
    print("%s: %i" % (word, count))


If Spark is installed on your computer and you are in the Spark home directory, you can run this script (saved, say, as wordcount.py) on the cluster with the following command:

bin/spark-submit --master yarn-client wordcount.py

Alternatively, you can run the same computation on just a single machine with the following command:

bin/spark-submit --master local wordcount.py


Spark Overview

Spark is the leading Big Data processing technology these days in the Hadoop ecosystem, having largely replaced traditional Hadoop MR. It is usually more efficient, especially if you are chaining several operations together, and it’s tremendously easier to use.


From a user’s perspective, Spark is just a library that you import when you are using either Python or Scala.


Spark is written in Scala and runs faster when you call it from Scala, but this blog will introduce the Python API, which is called PySpark. The example script at the beginning of this blog was all PySpark. The Spark API itself (names of functions, variables, etc.) is almost identical between the Scala version and the Python version.


The central data abstraction in PySpark is a “resilient distributed dataset” (RDD), which is just a collection of Python objects.


These objects are distributed across different nodes in the cluster, and generally, you don’t need to worry about which ones are on which nodes. They can be strings, dictionaries, integers – more or less whatever you want.


An RDD is immutable, so its contents cannot be changed directly, but it has many methods that return new RDDs. For instance, in the aforementioned example script, we made liberal use of the “map” method.


If you have an RDD called X and a function called f, then X.map(f) will apply f to every element of X and return the results as a new RDD.


RDDs come in two types: keyed and unkeyed. Unkeyed RDDs support operations such as map(), which operate on each element of the RDD independently.


Often though, we want more complex operations, such as grouping all elements that meet some criteria or joining two different RDDs. These operations require coordination between different elements of an RDD, and for these operations, you need a keyed RDD.


If you have an RDD that consists of two-element tuples, the first element is considered the “key” and the second element the “value.” We created a keyed RDD and processed it in the aforementioned script with the following lines:


words_w_1 = words_rdd.map(lambda w: (w, 1))

counts = words_w_1.reduceByKey(merge_counts)


Here words_w_1 will be a keyed RDD, where the keys are the words and the values are all 1. Every occurrence of a word in the dataset will give rise to a different element in words_w_1. The next line uses the reduceByKey method to group all values that share a key together and then condense them down to a single aggregate value.
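The grouping-and-condensing behavior of reduceByKey can be sketched in plain Python (a toy model of the semantics only, not of Spark’s distributed execution; reduce_by_key is my own helper name, not a Spark function):

```python
def reduce_by_key(pairs, merge):
    # Group values by key, then condense each key's values
    # into one aggregate with the `merge` function.
    acc = {}
    for key, value in pairs:
        acc[key] = merge(acc[key], value) if key in acc else value
    return list(acc.items())

words_w_1 = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
counts = reduce_by_key(words_w_1, lambda c1, c2: c1 + c2)
print(counts)  # [('the', 2), ('cat', 1), ('sat', 1)]
```

In real Spark the pairs for each key may live on different nodes, so the merge function must not care about the order in which values arrive.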


I should note that the keyed and unkeyed RDDs are not separate classes in the PySpark implementation. It’s just that certain operations you can call (such as reduceByKey) will assume that the RDD is structured as key-value pairs, and it will fail at runtime if that is not the case.


Besides RDDs, the other key abstraction the user has to be aware of is the SparkContext class, which interfaces with the Spark cluster and is the entry point for Spark operations. Conventionally, the SparkContext in an application will be called sc.


Generally, PySpark operations come in two types:

Calling methods on the SparkContext, which create an RDD. In the example script, we used parallelize() to move data from local space into the cluster as an RDD. There are other methods that will create RDDs from data that is already distributed, by reading it out of HDFS or another storage medium.


Calling methods on RDDs, which either return new RDDs or produce an output of some kind.


Most operations in Spark are what’s called “lazy.”

When you type lines_clean = lines_rdd.map(clean_line), no actual computation gets done. Instead, Spark will just keep track of how the RDD lines_clean is defined.


Similarly, lines_rdd quite possibly doesn’t exist either and is only implicitly defined in terms of some upstream process. As the script runs, Spark is piling up a large dependency structure of RDDs defined in terms of each other, but never actually creating them.


Eventually, you will call an operation that produces some output, such as saving an RDD into HDFS or pulling it down into local Python data structures. At that point, the dominos start falling, and all of the RDDs that you have previously defined will get created and fed into each other, eventually resulting in the final side effect.


By default, an RDD exists only long enough for its contents to be fed into the next stage of processing. If an RDD that you define is never actually needed, then it will never be brought into being.


The problem with lazy evaluation is that sometimes we want to reuse an RDD for a variety of different processes. This brings us to one of the most important aspects of Spark that differentiates it from traditional Hadoop MR: Spark can cache an RDD in the RAM of the cluster nodes so that it can be reused as much as you want.


By default, an RDD is an ephemeral data structure that only exists long enough for its contents to be passed into the next stage of processing, but a cached RDD can be experimented with in real time.


To cache an RDD in memory, you just call the cache() method on it. This method will not actually create the RDD, but it will ensure that the first time the RDD gets created, it is persisted in RAM.
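Lazy evaluation and cache() can be modeled with a tiny pure-Python class (an illustrative sketch only; LazyRDD is not part of Spark, and real RDDs are partitioned across machines):

```python
class LazyRDD:
    """Toy stand-in for a Spark RDD: creating one records a recipe, not data."""
    def __init__(self, compute):
        self._compute = compute   # zero-argument function that builds the data
        self._cached = None
        self._cache_on = False

    def map(self, f):
        # Lazy: just return a new recipe that depends on this one
        return LazyRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cache_on = True     # nothing computed yet, only a flag is set
        return self

    def collect(self):
        # An "action": forces the whole dependency chain to run
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._cache_on:
            self._cached = data   # persisted, so later actions reuse it
        return data

source = LazyRDD(lambda: [1, 2, 3])
doubled = source.map(lambda x: x * 2).cache()  # still no computation here
print(doubled.collect())  # [2, 4, 6] -- computed on demand, then kept
```

A second call to doubled.collect() returns the cached list without recomputing, which is exactly the behavior cache() buys you in Spark.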


There is one other problem with lazy evaluation. Say again that we write the line

lines_clean = lines_rdd.map(clean_line)


But imagine that the clean_line function will fail for some value in lines_rdd. We will not know this at the time: the error will only arise later in the script, when lines_clean is finally forced into being. If you are debugging a script, a trick that I use is to call the count() method on each RDD as soon as it is declared.


The count() method counts the elements in the RDD, which forces the whole RDD to be created and will raise an error if there are any problems.


The count() operation is expensive, and you should certainly not include those steps in code that gets run on a regular basis, but it’s a great debugging tool.


Spark Operations

This section will give you a rundown of the main methods that you will call on the SparkContext object and on RDDs. Together, these methods are everything you will do in a PySpark script that isn’t pure Python.


The SparkContext object has the following methods:

sc.parallelize(my_list): Takes in a list of Python objects and distributes them across the cluster to create an RDD.

sc.textFile("/some/place/in/hdfs"): Takes in the location of text files in HDFS and returns an RDD containing the lines of text.

sc.pickleFile("/some/place/in/hdfs"): Takes a location in HDFS that stores Python objects that have been serialized using the pickle library. Deserializes the Python objects and returns them as an RDD. This is a really useful method.

sc.addFile("myfile.txt"): Copies myfile.txt from the local machine to every node in the cluster, so that they can all use it in their operations.

sc.addPyFile("mylib.py"): Copies mylib.py from the local machine to every node in the cluster, so that it can be imported as a library and used by any node in the cluster.


The main methods you will use on an RDD are as follows:

rdd.map(func): Applies func to every element in the RDD and returns that result as an RDD.

rdd.filter(func): Returns an RDD containing only those elements x of rdd for which func(x) evaluates to True.

rdd.flatMap(func): Applies func to every element in the RDD. func(x) doesn’t return just a single element of the new RDD: it returns a list of new elements, so that one element in the original RDD can turn into many in the new one. Or, an element in the original RDD might result in an empty list and hence no elements of the output RDD.


rdd.take(5): Computes five elements of the RDD and returns them as a Python list. Very useful when debugging, since it only computes those five elements.


rdd.collect(): Returns a Python list containing all the elements of the RDD. Make sure you only call this if the RDD is small enough that it will fit into the memory of a single computer.
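If it helps to see the semantics without a cluster, here are plain-Python analogues of these methods, applied to an ordinary list (illustrative only; real RDD operations are lazy and distributed):

```python
data = [1, 2, 3, 4, 5]

mapped     = [x * 10 for x in data]               # like rdd.map(func)
filtered   = [x for x in data if x % 2 == 1]      # like rdd.filter(func)
flat       = [y for x in data for y in [x] * x]   # like rdd.flatMap(func):
                                                  # one input element yields
                                                  # zero or more outputs
first_3    = data[:3]                             # like rdd.take(3)
everything = list(data)                           # like rdd.collect()

print(filtered)  # [1, 3, 5]
print(flat[:4])  # [1, 2, 2, 3]
```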


Two Ways to Run PySpark

PySpark can be run either by submitting a stand-alone Python script or by opening up an interpreted session where you can enter your Python commands one at a time.

In the previous example, we ran our script by saying bin/spark-submit --master yarn-client, followed by the script’s filename.

The spark-submit command is what we use for stand-alone scripts. If instead, we wanted to open up an interpreter, we would say

bin/pyspark --master yarn-client

This would open up a normal-looking Python terminal, from which we could import the PySpark libraries.


From the perspective of writing code, the key difference between a stand-alone script and an interpreter session is that in the script we had to explicitly create the SparkContext object, which we called sc. It was done with the following lines:

from pyspark import SparkConf, SparkContext
conf = SparkConf()
sc = SparkContext(conf=conf)


If you open up an interpreter though, it will automatically contain the SparkContext object, already named sc. No need to create it manually.

The reason for this difference is that stand-alone scripts often need to set a lot of configuration parameters so that somebody who didn’t write them can still run them reliably.


Calling various methods on the SparkConf object sets those configurations. The assumption is that if you open an interpreter directly, then you will set the configurations yourself from the command line.


Configuring Spark

Configuring Spark

Clusters are finicky things. You need to make sure that every node has the data it needs, the files it relies on, no node gets overloaded, and so on. You need to make sure that you are using the right amount of parallelism because it’s easy to make your code slower by having it be too parallel.


Finally, multiple people usually share a cluster, so the stakes are much higher if you hog resources or crash it (which I have done – trust me, people get irate when the whole cluster dies).


All of this means that you need to have an eye toward how your job is configured. This section will give you the most crucial parts.

All of the configurations can be set from the command line. The ones you are most likely to have to worry about are the following:


Name: A human-readable name to give your process. This doesn’t affect the running, but it will show up in the cluster monitoring software so that your sysadmin can see what resources you’re taking up.


Master: This identifies the “master” process, which deals with parallelizing your job (or running it in local mode). Usually, “yarn-client” will send your job to the cluster for parallel processing, while “local” will run it locally.


There are other masters available sometimes, but local and yarn are the most common. Perhaps surprisingly, the default master is local rather than yarn; you have to explicitly tell PySpark to run in parallel if you want parallelism.


py-files: A comma-separated list of any Python library files that need to be copied to other nodes in the cluster. This is necessary if you want to use that library’s functionality in your PySpark methods because, under the hood, each node in the cluster will need to import the library independently.


Files: A comma-separated list of any additional files that should be put in the working directory on each node. This might include configuration files specific to your task that your distributed functionality depends on.


Num-executors: The number of executor processes to spawn in the cluster. They will typically be on separate nodes. The default is 2.


Executor-cores: The number of CPU cores each executor process should take up. The default is 1.


An example of how this might look at setting parameters from the command line is as follows:

bin/pyspark \
--name my_pyspark_process \
--master yarn-client \
--py-files mylib.py,otherlib.py \
--files myfile.txt,otherfile.txt \
--num-executors 5 \
--executor-cores 2
If instead you want to set them inside a stand-alone script, it will be as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.set("spark.executor.instances", 5)
conf.set("spark.executor.cores", 2)
sc = SparkContext(conf=conf)
Under the Hood


In my mind, PySpark makes a lot more sense when you understand just a little bit about what’s going on under the hood. Here are some of the main points:


When you use the pyspark command, it will actually run the “python” command on your computer and just make sure that it links to the appropriate spark libraries.


This means that any Python libraries you have installed on your main node are available to you within your PySpark script.


When running in cluster mode, the Spark framework cannot run Python code directly. Instead, it will kick off a Python subprocess on each node.


If your code needs libraries or additional files that are not present on a node, then the process that is on that node will fail. This is the reason you must pay attention to what files get shipped around.


Whenever possible, a segment of your data is confined to a single Python process. If you call map(), flatMap(), or a variety of other PySpark operations, each node will operate on its own data. This avoids sending data over the network, and it also means that everything can stay as Python objects.


It is very computationally expensive to have the nodes shift data around between them. Not only do we have to actually move the data around, but also the Python objects must be serialized into a string-like format before we send them over the wire and then rehydrated on the other end.


Operations such as groupByKey() will require serializing data and moving it between nodes. This step in the process is called a “shuffle.” The Python processes are not involved in the shuffle. They just serialize the data and then hand it off to the Spark framework.
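The serialize-and-rehydrate round trip can be illustrated with Python’s pickle module (PySpark’s default serialization is pickle-based, though the real machinery is more involved than this sketch):

```python
import pickle

# One record's round trip during a shuffle, roughly modeled:
record = {"word": "spark", "count": 3}
wire_bytes = pickle.dumps(record)       # serialize before crossing the network
rehydrated = pickle.loads(wire_bytes)   # rebuild on the receiving node

print(rehydrated == record)  # True, but the round trip costs real CPU time
```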


Spark Tips and Gotchas

Here are a few parting tips for using Spark, which I have learned from experience and/or hard lessons:

RDDs of dictionaries make both code and data much more understandable. If you’re working with CSV data, always convert it to dictionaries as your first step. Yeah, it takes up more space because every dictionary has copies of the keys, but it’s worth it.


Store things in pickle files rather than in text files if you’re likely to operate on them later. It’s just so much more convenient.

Use take() while debugging your scripts to see what format your data is in (RDD of dictionaries? Of tuples? etc.).


Running count() on an RDD is a great way to force it to be created, which will bring any runtime errors to the surface sooner rather than later.


Do most of your basic debugging in local mode rather than in distributed mode, since it goes much faster if your dataset is small enough. Plus, you reduce the chances that something will fail because of bad cluster configuration.


If things work fine in local mode but you’re getting weird errors in distributed mode, make sure that you’re shipping the necessary files across the cluster.


If you’re using the --files option from the command line to distribute files across the cluster, make sure that the list is separated by commas rather than colons. I lost two days of my life to that one…


Now that we have seen PySpark in action, let’s step back and consider some of what’s going on here in the abstract.


The MapReduce Paradigm

MapReduce is the most popular programming paradigm for Big Data technologies. It requires programmers to write their code in a way that can be easily parallelized across a cluster of arbitrary size, over an arbitrarily large dataset.


Some variant of MR underlies many of the major Big Data tools, including Spark, and probably will for the foreseeable future, so it’s very important to understand it.


An MR job takes a dataset, such as a Spark RDD, as input. There are then two stages to the job:

Mapping. Every element of the dataset is mapped, by some function, to a collection of key-value pairs. In PySpark, you can do this with the flatMap method.


Reducing. For every unique key, a “reduce” process is kicked off. It is fed all of its associated values one at a time, in no particular order, and eventually, it produces some outputs. You can implement this using the reduceByKey method in PySpark.


And that’s all there is to it: the programmer writes the code for the mapper function, and they write the code for the reducer, and that’s it. No worrying about the size of the cluster, what data is where, and so on.


In the example script I gave, Spark will end up optimizing the code into a single MR job. Here is the code rewritten so as to make that explicit:

def mapper(line):
    l2 = line.strip().lower()
    l3 = l2.replace(".", "").replace(",", "")
    words = l3.split()
    return [(w, 1) for w in words]

def reducer_func(count1, count2):
    return count1 + count2

lines = open("myfile.txt")
lines_rdd = sc.parallelize(lines)
map_stage_out = lines_rdd.flatMap(mapper)
reduce_stage_out = map_stage_out.reduceByKey(reducer_func)


What happens under the hood in an MR job is the following:

The input dataset starts off being distributed across several nodes in the cluster.

Each of these nodes will, in parallel, apply the mapping function to all of its pieces of data to get key-value pairs.


Each node will use the reducer to condense all of its key-value pairs for a particular word into just a single one, representing how often that word occurred in the node’s data. Again, this happens completely in parallel.


For every distinct key that is identified in the cluster, a node in the cluster is chosen to host the reduce process.

Every node will forward each of its partial counts to the appropriate reducer. This movement of data between nodes is often the slowest stage of the whole MR job – even slower than the actual processing.


Every reduce process runs in parallel on all of its associated values, calculating the final word counts.
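The steps above can be simulated, minus the actual parallelism and the combiner, in plain Python (run_mapreduce and its helpers are illustrative names, not Spark API):

```python
from collections import defaultdict

def run_mapreduce(partitions, mapper, reducer):
    # Map stage: each partition applies the mapper to its records
    # (in real MR this happens in parallel across nodes)
    shuffled = defaultdict(list)
    for part in partitions:
        for record in part:
            for key, value in mapper(record):
                shuffled[key].append(value)  # the "shuffle": route by key
    # Reduce stage: fold each key's values, one at a time, in no fixed order
    out = {}
    for key, values in shuffled.items():
        agg = values[0]
        for v in values[1:]:
            agg = reducer(agg, v)
        out[key] = agg
    return out

partitions = [["the cat sat"], ["the dog"]]
counts = run_mapreduce(
    partitions,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda a, b: a + b,
)
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
```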

The overall workflow is displayed in the following diagram:


There is one thing I have done here that breaks from classical MR. I have said that each node uses the reducer to condense all of its key-value pairs for a particular word into just one.


That stage is technically a performance optimization called a “combiner.” I was only able to use a combiner because my reducer was just doing addition, and it doesn’t matter what order you add things up in.


In the most general case, those mapper outputs are not condensed – they are all sent to whichever node is doing the reducing for that word. This puts a massive strain on the bandwidth between the nodes, so you want to use combiners whenever possible.
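A toy calculation shows how much a combiner shrinks the shuffle for word counting (the helper functions below are hypothetical, for illustration only):

```python
from collections import Counter

def pairs_shipped_without_combiner(node_lines):
    # every individual (word, 1) pair crosses the network to a reducer
    return sum(len(line.split()) for lines in node_lines for line in lines)

def pairs_shipped_with_combiner(node_lines):
    # each node first condenses its pairs: one (word, count) per distinct word
    return sum(
        len(Counter(w for line in lines for w in line.split()))
        for lines in node_lines
    )

# Two nodes, each holding a few lines of text
nodes = [["the cat sat", "the cat ran"], ["the dog"]]
print(pairs_shipped_without_combiner(nodes))  # 8
print(pairs_shipped_with_combiner(nodes))     # 6
```

The savings grow with the number of repeated words per node, which is exactly when the shuffle would otherwise be most expensive.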


Performance Considerations


There are several guidelines applicable to any MR framework, including Spark:

If you are going to filter data out, do it as early as possible. This reduces network bandwidth.


The job only finishes when the last reduce process is done, so try to avoid a situation where one reducer is handling most of the key-value pairs. If possible, more reducers mean each one has to handle fewer key-value pairs.


In traditional coding, the name of the game in performance optimization is to reduce the number of steps your code takes. This is usually a secondary concern in MR. The biggest concern instead becomes the time it takes to move data from node to node across the network.


And the number of steps your code takes doesn’t matter so much – instead, it’s how many steps your worst node takes.


There is one other specific optimization with Spark in particular that I should mention, which doesn’t come up all that often but can be a huge deal when it does. Sometimes, reduceByKey is the wrong method to use. In particular, it is very inefficient when your aggregated values are a large, mutable data structure.


Take this dummy code, for example, which takes all occurrences of a word and puts them into a big list:

def mapper(line):
    return [(w, [w]) for w in line.split()]

def red_func(lst1, lst2):
    return lst1 + lst2

result = lines.flatMap(mapper).reduceByKey(red_func)


As I’ve written it, every time red_func is called, it is given two potentially very long lists. It will then create a new list in memory (which takes quite a bit of time) and then delete the original lists. This is horribly abusive to the memory, and I’ve seen jobs die because of it.


Intuitively, what you want to do is keep a big list and just append all the words to it, one at a time, rather than constantly creating new lists.
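The difference between the two strategies is easy to see in plain Python (illustrative functions, not Spark code):

```python
def grow_by_concat(words):
    # what reduceByKey effectively does with list values:
    # every merge builds a brand-new list, copying everything so far
    acc = []
    for w in words:
        acc = acc + [w]   # O(len(acc)) copy on every single word
    return acc

def grow_by_append(words):
    # what aggregateByKey lets you do instead: mutate one list in place
    acc = []
    for w in words:
        acc.append(w)     # amortized O(1), no copying
    return acc

# Same result, very different memory behavior on long word lists
assert grow_by_concat(["a", "b", "c"]) == grow_by_append(["a", "b", "c"])
```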


That can be accomplished with the aggregateByKey method, which is a little more complicated to use than reduceByKey, but much more efficient if you use it correctly. Example code is here:

def update_agg(agg_list, new_word):
    agg_list.append(new_word)
    return agg_list  # same list!

def merge_agg_lists(agg_list1, agg_list2):
    return agg_list1 + agg_list2

pairs = lines.flatMap(lambda line: [(w, w) for w in line.split()])
result = pairs.aggregateByKey([], update_agg, merge_agg_lists)


In this case, each node in the cluster will start off with the empty list as its aggregate for a particular word. Then it will feed that aggregate, along with each instance of the word, into update_agg.


Then update_agg will append the new value to the list, rather than creating a new one, and return the updated list as a result. The function merge_agg_lists still operates the original way, but it is only called a few times, to merge the outputs of the different nodes.




Big Data Analytics

Big data is a popular term that describes the exponential growth, availability, and use of information, both structured and unstructured. Big data continues to gain attention from the high-performance computing niche of the information technology market.


According to International Data Corporation (IDC), “in 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes), growing by a factor of nine in just five years. That’s nearly as many bits of information in the digital universe as stars in the physical universe.”


Big data provides both challenges and opportunities for data miners to develop improved models. With today’s massively parallel in-memory analytical computing appliances, the size of the data is no longer a barrier to what you can analyze.


A key advantage is you are able to analyze more of the population with a broad range of classical and modern analytics.


The goal of data mining is generalization. Rather than rely on sampling, you can isolate hard-to-detect signals in the data, and you can also produce models that generalize better when deployed into operations. You can also more readily detect outliers that often lead to the best insights.


You can also try more configuration options for a specific algorithm, for example, neural network topologies including different activation and combination functions, because the models run in seconds or minutes instead of hours.


Enterprises are also moving toward creating large multipurpose analytical base tables that several analysts can use to develop a plethora of models for risk, marketing, and so on.


Developing standardized analytical tables that contain thousands of candidate predictors and targets supports what is referred to as model harvesting or a model factory. A small team of analysts at Cisco Systems currently builds over 30,000 propensity-to-purchase models each quarter.


This seasoned team of analysts has developed highly repeatable data preparation strategies along with a sound modeling methodology that they can apply over and over.


Customer dynamics also change quickly, as does the underlying snapshot of the historical modeling data. So the analyst often needs to refresh (retrain) models at very frequent intervals. Now more than ever, analysts need the ability to develop models in minutes, if not seconds, vs. hours or days.


Using several champion and challenger methods is critical. Data scientists should not be restricted to using one or two modeling algorithms. Model development (including discovery) is also iterative by nature, so data miners need to be agile when they develop models.

Analytics is moving out of research and into operations. The bottom line is that big data is only getting bigger, and data miners need to significantly reduce the cycle time it takes to go from analyzing big data to creating ready-to-deploy models.


Many applications can benefit from big data analytics. One of these applications is telematics, which is the transfer of data from any telecommunications device or chip. The volume of data that these devices generate is massive.


For example, automobiles have hundreds of sensors. Automotive manufacturers need scalable algorithms to predict vehicle performance and problems on demand.


Insurance companies are also implementing pay-as-you-drive plans, in which a GPS device that is installed in your car tracks the distance driven and automatically transmits the information to the insurer.


More advanced GPS devices that contain integrated accelerometers also capture date, time, location, speed, cornering, harsh braking, and even frequent lane changing. Data scientists and actuaries can leverage this big data to build more profitable insurance premium models. Personalized policies can also be written that reward truly safe drivers.


The smart energy grid is another interesting application area that encourages customer participation. Sensor systems called synchrophasors monitor the health of the grid in real time and collect many data streams per second. The consumption in very short intervals can be modeled during peak and off-peak periods to develop pricing plan models.


Many customers are “peakier” than others and more expensive to service. Segmentation models can also be built to define custom pricing models that decrease usage in peak hours.


Example segments might be “weekday workers,” “early birds, home worker,” and “late-night gamers.”


Many complex problems can now be evaluated better with the rise of big data and sophisticated analytics in distributed, in-memory environments, enabling better decisions within tight time frames.


The underlying optimization methods can now solve problems in parallel through co-location of the data in memory. The data mining algorithms are largely still the same; they are just able to handle more data and are much faster.






The remainder of the blog provides a summary of the most common data mining algorithms. The discussion is broken into two subsections, each with a specific theme: classical data mining techniques and machine learning methods.


The goal is to describe algorithms at a high level so that you can understand how each algorithm fits into the landscape of data mining methods.


Although there are a number of other algorithms and many variations of the techniques, these represent popular methods used today in real-world deployments of data mining systems.


Classical Data Mining Techniques

Classical Data Mining Techniques

Data mining methods draw largely from statistics, machine learning, artificial intelligence, and database systems. By strict definition, statistics is not data mining. Statistical methods were being used long before the term data mining was coined to apply to business applications.


In classical inferential statistics, the investigator proposes some model that may explain the relationship between an outcome of interest (dependent response variable) and explanatory variables (independent variables).


Once a conceptual model has been proposed, the investigator then collects the data with the purpose of testing the model. Testing typically involves the statistical significance of the parameters associated with the explanatory variables.


For these tests to be valid, distributional assumptions about the response or the error terms in the model need to be correct, or at least not violated too severely. Two of the most broadly used statistical methods are multiple linear regression and logistic regression.


Multiple linear regression and logistic regression are also commonly used in data mining. A critical distinction between their inferential applications and their data mining applications is in how one determines the suitability of the model.


A typical data mining application is to predict an outcome, a target in data mining jargon, based on the explanatory variables, inputs, or features in data mining jargon. Because of the emphasis on prediction, the distributional assumptions of the target or errors are much less important.


Often the historical data that is used in data mining model development has a time dimension, such as monthly spending habits for each customer.


The typical data mining approach to account for variation over time is to construct inputs or features that summarize the customer behavior for different time intervals. Common summaries are recency, frequency, and monetary (RFM) value. This approach results in one row per customer in the model development data table.
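As a sketch with hypothetical column names, the RFM summaries described above can be computed from a transaction log with a pandas groupby; the snapshot date and the sample transactions are made up for illustration:

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2019-01-05", "2019-03-01",
                            "2019-02-10", "2019-02-20", "2019-03-10"]),
    "amount": [20.0, 35.0, 10.0, 15.0, 5.0],
})
snapshot = pd.Timestamp("2019-04-01")  # "as of" date for the model table

# One row per customer: recency (days since last purchase),
# frequency (number of purchases), monetary (total spend).
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
).reset_index()
print(rfm)
```

The result is the one-row-per-customer shape that the modeling table needs.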


An alternative approach is to construct multiple rows for each customer, where the rows include the customer behavior for previous months.


The multiple rows for each customer represent a time series of features of the customer’s past behavior. When you have data of this form, you should use a repeated measures or time series cross-sectional model.


Predictive modeling (supervised learning) techniques enable the analyst to identify whether a set of input variables is useful in predicting some outcome variable.


For example, a financial institution might try to determine whether knowledge of an applicant’s income and credit history (input variables) helps predict whether the applicant is likely to default on a loan (outcome variable). Descriptive techniques (unsupervised learning) enable you to identify underlying patterns in a data set.


Model overfitting happens when your model describes random noise or error instead of the true relationships in the data. Albert Einstein once said, “Everything should be made as simple as possible, but not simpler.” This is a maxim to abide by when you develop predictive models.


Simple models that do a good job of classification or prediction are easier to interpret and tend to generalize better when they are scored. You should evaluate your model against holdout data sources to determine whether it overfits.


Another strategy, especially for a small training dataset, is to use k-fold cross-validation, which is a method of estimating how well your model fits based on resampling.


You divide the data into k subsets of approximately the same size. You train your model k times, each time leaving out one of the subsets, which is then used to evaluate the model. You can then measure model stability across the k holdout samples.
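The splitting step can be sketched in plain Python; the fold construction here is illustrative and not tied to any particular modeling package:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split n row indices into k roughly equal, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Toy illustration: "train" k times, each time holding out one fold
# and fitting on the remaining k - 1 folds.
n, k = 10, 5
folds = kfold_indices(n, k)
for held_out in folds:
    train = [i for i in range(n) if i not in held_out]
    # fit the model on `train`, then score it on `held_out` here
    assert len(held_out) == n // k and len(train) == n - n // k
```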


k-Means Clustering

k-Means Clustering

k-Means clustering is a descriptive algorithm that scales well to large data. Cluster analysis has wide application, including customer segmentation, pattern recognition, biological studies, and web document classification.


k-Means clustering attempts to find k partitions in the data, in which each observation belongs to the cluster with the nearest mean. The basic steps for k-means are


  1. Select k observations arbitrarily as initial cluster centroids.
  2. Assign each observation to the cluster that has the closest centroid.
  3. Once all observations are assigned, recalculate the positions of the k centroids.


Repeat steps 2 and 3 until the centroids no longer change. This repetition helps minimize the variability within clusters and maximize the variability between clusters.
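The three steps and the stopping rule above can be sketched in pure Python for two-dimensional points; this is a toy illustration, not an optimized implementation:

```python
import random

def kmeans(points, k, iters=100, seed=1):
    rng = random.Random(seed)
    # Step 1: pick k observations arbitrarily as initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each observation to the closest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda j: (x - centroids[j][0]) ** 2
                                + (y - centroids[j][1]) ** 2)
            clusters[i].append((x, y))
        # Step 3: recalculate each centroid as the mean of its cluster.
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # stop when centroids no longer change
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```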


Note that the observations are divided into clusters so that every observation belongs to at most one cluster.


Some software packages select an appropriate value of k. You should still try to experiment with different values of k that result in good homogeneous clusters that are relatively stable when they are applied to new data.


[Figure: k-means cluster analysis; the cross symbols represent the cluster centroids.]

You should try to select input variables that are both representative of your business problem and predominantly independent.


Outliers tend to dominate cluster formation, so consider removing outliers. Normalization is also recommended to standardize the values of all inputs to a common range.


When using a distance measure, it is extremely important for the inputs to have comparable measurement scales. Most clustering algorithms work well with interval inputs. Most k-means implementations support several dimension encoding techniques for computing distances for nominal and ordinal inputs.


After you group the observations into clusters, you can profile the input variables to help further label each cluster.


One convenient way to profile the clusters is to use the cluster IDs as a target variable, and then use a decision tree and candidate inputs to classify cluster membership.


After the clusters have been identified and interpreted, you might decide to treat each cluster independently. You might also decide to develop separate predictive models for each cluster. Other popular clustering techniques include hierarchical clustering and expectation maximization (EM) clustering.


Association Analysis

Association Analysis

Association analysis (also called affinity analysis or market basket analysis) identifies groupings of products or services that tend to be purchased at the same time or purchased at different times by the same customer.


Association analysis falls within the descriptive modeling phase of data mining. Association analysis helps you answer questions such as the following:


  • What proportion of people who purchase low-fat yogurt and 2% milk also purchase bananas?
  • What proportion of people who have a car loan with a financial institution later obtain a home mortgage from that institution?
  • What percentage of people who purchase tires and wiper blades also get automotive service done at the same shop?
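Questions like these reduce to computing support and confidence over baskets. A minimal sketch with made-up baskets:

```python
# Toy market baskets; support and confidence are the two standard
# measures behind questions like those above.
baskets = [
    {"yogurt", "milk", "bananas"},
    {"yogurt", "milk"},
    {"milk", "bread"},
    {"yogurt", "milk", "bananas", "bread"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # P(rhs in basket | lhs in basket)
    return support(lhs | rhs) / support(lhs)

# "What proportion of people who purchase yogurt and milk also buy bananas?"
print(confidence({"yogurt", "milk"}, {"bananas"}))  # 2/3 here
```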


Multiple Linear Regression


Multiple linear regression models the relationship between two or more inputs (predictors) and a continuous target variable by fitting a linear equation to the training data. The regression model is represented as

E(y) = β0 + β1x1 + β2x2 + ⋯ + βkxk


where E(y) is the expected target value, the xs represent the k model inputs, β0 is the intercept that centers the range of predictions, and the remaining βs are the slope estimates that determine the strength of the trend between each of the k inputs and the target. Simple linear regression has one model input, x.


The method of least squares estimates the intercept and slope parameters by minimizing the sum of squared deviations of the observed values of y from the predicted values of y. The fitted regression can be expressed as

ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk

where the squared error function is

Σ(yi − ŷi)².
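The least squares fit can be sketched with NumPy; the data here is synthetic, generated without noise from known coefficients so the fit can be checked against them:

```python
import numpy as np

# Noiseless toy data generated from y = 2 + 3*x1 - 1*x2, so least
# squares should recover the intercept and slopes exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1]

# Prepend a column of ones so b0 (the intercept) is estimated too.
A = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)  # approximately [2, 3, -1]
```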


Multiple linear regression is advantageous because of familiarity and interpretability. Regression models also generate optimal unbiased estimates for unknown parameters.


Many phenomena cannot be described by the linear relationships between the target variables and the input variables. You can use polynomial regression (adding power terms, including interaction effects) to the model to approximate more complex nonlinear relationships.


The adjusted coefficient of determination (adjusted R2) is a commonly used measure of the goodness of fit of regression models. Essentially, it is the percentage of the variability that is explained by the model relative to the total variability, adjusted for the number of inputs in your model.


A traditional R2 statistic does not penalize the model for the number of parameters, so you almost always end up choosing the model that has the most parameters.


Another common criterion is the root mean square error (RMSE), which indicates the absolute fit of the data to the actual values. You usually want to evaluate the RMSE on holdout validation and test data sources; lower values are better.
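Both measures are straightforward to compute directly; this sketch uses the standard formulas, with adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), on made-up predictions:

```python
import numpy as np

def rmse(y, yhat):
    # Root mean square error: lower is better.
    return np.sqrt(np.mean((y - yhat) ** 2))

def adjusted_r2(y, yhat, p):
    """p = number of model inputs (not counting the intercept)."""
    n = len(y)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1 - ss_res / ss_tot
    # Penalize the fit for the number of parameters in the model.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])  # toy predictions
print(rmse(y, yhat))
print(adjusted_r2(y, yhat, p=1))
```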


Data scientists almost always use stepwise regression selection to fit a subset of the full regression model. Remember that the key goal of predictive modeling is to build a parsimonious model that generalizes well on unseen data. The three stepwise regression methods are as follows:

  • Forward selection enters inputs one at a time until no more significant variables can be entered.
  • Backward elimination removes inputs one at a time until there are no more nonsignificant inputs to remove.
  • Stepwise selection is a combination of forward selection and backward elimination.


Stepwise selection has plenty of critics, but it is reasonable as long as you are not trying to closely evaluate the p-values or the parameter estimates. Most software packages include penalty functions, such as Akaike’s information criterion (AIC) or the Bayesian information criterion (BIC), to choose the best subset of predictors.


All possible subset regression combination routines are also commonly supported in data mining toolkits. These methods should be more computationally feasible for big data as high-performance analytical computing appliances continue to get more powerful.


Shrinkage estimators such as the least absolute shrinkage and selection operator (LASSO) are preferred over true stepwise selection methods. They use information from the full model to provide a hybrid estimate of the regression parameters by shrinking the full model estimates toward the candidate submodel.


Multicollinearity occurs when one input is relatively highly correlated with at least one other input. It is not uncommon in data mining and is not a concern when the goal is prediction.


Multicollinearity tends to inflate the standard errors of the parameter estimates, and in some cases, the sign of the coefficient can switch from what you expect.


In other cases, the coefficients can even be doubled or halved. If your goal is model interpretability, then you want to detect collinearity by using measures such as tolerances and variance inflation factors.


At a minimum, you should evaluate a correlation matrix of the candidate inputs and choose one input over another correlated input based on your business or research knowledge. Other strategies for handling correlated inputs include centering the data or redefining the suspect variable (which is not always possible).


You can also generate principal components, which are orthogonal transformations of the correlated variables that capture p% of the variance of the predictors. The components are a weighted linear combination of the original inputs. The uncorrelated principal components are used as inputs to the regression model.


The first principal component explains the largest amount of variability. The second principal component is orthogonal to the first. You can select a subset of the components that describe a specific percentage of variability in the predictors (say 85%). Keep in mind that principal components handle continuous inputs.


Regression also requires a complete case analysis, so it does not directly handle missing data. If one or more inputs for a single observation have missing values, then this observation is discarded from the analysis. You can replace (impute) missing values with the mean, median, or other measures.


You can also fit a model using the input as the target and the remaining inputs as predictors to impute missing values. Software packages also support creating a missing indicator for each input, where the missing indicator is 1 when the corresponding input is missing, and 0 otherwise.


The missing indicator variables are used as inputs to the model. Missing-value trends across customers can themselves be predictive.

  • Multiple linear regression is predominantly used for continuous targets.
  • One of the best sources on regression modeling is Rawlings (1988).
  • Other important topics addressed include residual diagnostics and outliers.


Logistic Regression

Logistic regression is a form of regression analysis in which the target variable (response variable) is categorical. It is the algorithm in data mining that is most widely used to predict the probability that an event of interest will occur.


Logistic regression can be used to estimate fraud, bad credit status, purchase propensity, part failure status, churn, disease incidence, and many other binary target outcomes. Multinomial logistic regression supports more than two discrete categorical target outcomes.
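A minimal sketch of fitting a binary logistic regression by gradient descent on toy, separable data follows; this is an illustration of the probability model p = 1 / (1 + exp(−(b0 + b1·x))), not a production estimator (real work would use a statistics or ML package):

```python
import math

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]  # binary target, e.g. churn yes/no

b0, b1 = 0.0, 0.0
lr = 0.5
for _ in range(1000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += (p - y)        # gradient of the negative log-likelihood
        g1 += (p - y) * x
    b0 -= lr * g0 / len(xs)  # gradient descent step on the intercept
    b1 -= lr * g1 / len(xs)  # ... and on the slope

def predict(x):
    # Estimated probability that the event of interest occurs.
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(predict(-2.0), predict(2.0))  # near 0 and near 1 on this data
```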


Machine Learning

Machine Learning

Machine learning (ML) algorithms are quantitative techniques for classification, clustering, and prediction, and they are generally applied to large datasets.


Machine learning algorithms also focus on automation, especially the newer methods that handle the data enrichment and variable selection layers. The algorithms commonly have built-in data handling features such as missing value treatment, feature binning, and variable preselection.


The term data mining has been around for at least two decades. Data mining is the application of statistics, machine learning, artificial intelligence, optimization, and other analytical disciplines to actual research or commercial problems. Many ML algorithms draw heavily from statistical learning research.


Characterizing the distribution of variables and the errors from models was central in the works of Fisher, Karl Pearson, and the other seminal thinkers of statistics.


Neural Networks

Neural Networks

Artificial neural networks were originally developed by researchers who were trying to mimic the neurophysiology of the human brain. By combining many simple computing elements (neurons or units) into a highly interconnected system, these researchers hoped to produce complex phenomena such as intelligence.


Although there is controversy about whether neural networks are really intelligent, they are without question very powerful at detecting complex nonlinear relationships in high-dimensional data.


The term network refers to the connection of basic building blocks called neurons. The input units contain the values of the predictor variables. The hidden units do the internal, often highly flexible non-linear computations. The output units compute predictions and compare these with the values of the target.


A very simple network has one input layer that is connected to a hidden unit, which is then connected to an output unit.


You can design very complex networks—software packages naturally make this a lot easier—that contain perhaps several hundred hidden units. You can define hidden layers that enable you to specify different types of transformations.


The primary advantage of neural networks is that they are extremely powerful at modeling nonlinear trends. They are also useful when the relationship among the input variables (including interactions) is vaguely understood.


You can also use neural networks to fit ordinary least squares and logistic regression models in their simplest form.
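A single forward pass through such a network can be sketched in a few lines of NumPy; the layer sizes, random weights, and tanh activation here are arbitrary choices for illustration:

```python
import numpy as np

# A sketch of one forward pass through a network with 3 input units,
# 4 hidden units (tanh activation), and 1 linear output unit.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output weights

def forward(X):
    hidden = np.tanh(X @ W1 + b1)  # flexible nonlinear hidden layer
    return hidden @ W2 + b2        # linear output (a regression prediction)

X = rng.normal(size=(5, 3))        # 5 observations, 3 inputs
print(forward(X).shape)            # one prediction per observation
```

Training would then adjust W1, W2, b1, and b2 to minimize the error between these outputs and the target values.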


Problems with Data Content

Data Content

Duplicate Entries

You should always check for duplicate entries in a dataset. Sometimes, they are important in some real-world way. In those cases, you usually want to condense them into one entry, adding an additional column that indicates how many unique entries there were.
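With pandas, this condensing step might look like the following (toy data, hypothetical column names):

```python
import pandas as pd

# Duplicate rows condensed to one row each, with a count column.
df = pd.DataFrame({"machine": ["A", "A", "B"],
                   "status":  ["ok", "ok", "fail"]})
condensed = (df.groupby(list(df.columns))
               .size()
               .reset_index(name="n_entries"))
print(condensed)
```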


In other cases, the duplication is purely a result of how the data was generated. For example, it might be derived by selecting several columns from a larger dataset, and there are no duplicates if you count the other columns.


Multiple Entries for a Single Entity

This case is a little more interesting than duplicate entries. Often, each real-world entity logically corresponds to one row in the dataset, but some entities are repeated multiple times with different data. The most common cause of this is that some of the entries are out of date, and only one row is currently correct.


In other cases, there actually should be duplicate entries. For example, each “entity” might be a power generator with several identical motors in it. Each motor could give its own status report, and all of them will be present in the data with the same serial number.


Another field in the data might tell you which motor is actually which. In the cases where the motor isn’t specified in a data field, the different rows will often come in a fixed order.


Another case where there can be multiple entries is if, for some reason, the same entity is occasionally processed twice by whatever gathered the data. This happens in many manufacturing settings because they will retool broken components and send them through the assembly line multiple times rather than scrap them outright.


Missing Entries

Missing Entries

Most of the time when some entities are not described in a dataset, they have some common characteristics that kept them out. For example, let’s say that there is a log of all transactions from the past year. We group the transactions by customer and add up the size of the transactions for each customer.


This dataset will have only one row per customer, but any customer who had no transactions in the past year will be left out entirely. In a case such as this, you can join the derived data up against some known set of all customers and fill in the appropriate values for the ones who were missing.
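With pandas, the join-and-fill step might look like this (toy data; customer 3 had no transactions):

```python
import pandas as pd

# Per-customer totals derived from a transaction log; customer 3 had
# no transactions, so it is missing from the derived table.
totals = pd.DataFrame({"customer": [1, 2], "total": [120.0, 45.0]})
all_customers = pd.DataFrame({"customer": [1, 2, 3]})

# Left-join against the known set of all customers, then fill in
# the appropriate value (zero spend) for the missing ones.
fixed = all_customers.merge(totals, on="customer", how="left")
fixed["total"] = fixed["total"].fillna(0.0)
print(fixed)
```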


In other cases, missing data arises because data was never gathered in the first place for some entities. For example, maybe two factories produce a particular product, but only one of them gathers this particular data about them.



NULL entries typically mean that we don’t know a particular piece of information about some entity. The question is: why?

Most simply, NULLs can arise because the data collection process was botched in some way. What this means depends on the context.


When it comes time to do analytics, NULLs cannot be processed by many algorithms. In these cases, it is often necessary to replace the missing values with some reasonable proxy. What you will see most often is that it is guessed from other data fields, or you simply plug in the mean of all the non-null values.
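Both tactics, a missing indicator plus mean imputation, can be sketched with pandas on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, None, 60.0, None, 50.0]})

# Flag which rows were missing (the trend itself can be predictive) ...
df["income_missing"] = df["income"].isna().astype(int)
# ... then plug in the mean of the non-null values as a simple proxy.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```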


In other cases, the NULL values arise because that data was never collected. For example, some measurements might be taken at one factory that produces widgets but not at another. The table of all collected data for all widgets will then contain NULLs for whenever the widget’s factory didn’t collect that data.


For this reason, whether a variable is NULL can sometimes be a very powerful feature. The factory that produced the widget is, after all, potentially a very important determinant for whatever it is you want to predict, independent of whatever other data you gathered.


Huge Outliers

Huge Outliers

Sometimes, a massive outlier in the data is there because there was truly an aberrant event. How to deal with that depends on the context.


Sometimes, the outliers should be filtered out of the dataset. In web traffic, for example, you are usually interested in predicting page views by humans. A huge spike in recorded traffic is likely to come from a bot attack, rather than any activities of humans.


In other cases, outliers just mean missing data. Some storage systems don’t allow the explicit concept of a NULL value, so there is some predetermined value that signifies missing data. If many entries have identical, seemingly arbitrary values, then this might be what’s happening.


Out-of-Date Data

In many databases, every row has a timestamp for when it was entered. When an entry is updated, it is not replaced in the dataset; instead, a new row is put in that has an up-to-date timestamp.


For this reason, many datasets include entries that are no longer accurate and only useful if you are trying to reconstruct the history of the database.
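If each row carries an entity id and an entry timestamp (hypothetical column names here), pandas can keep just the most recent row per entity:

```python
import pandas as pd

# Each update adds a new row with a fresh timestamp; keep only the
# most recent row per entity.
df = pd.DataFrame({
    "id":      [7, 7, 8],
    "address": ["old st", "new st", "main st"],
    "updated": pd.to_datetime(["2018-01-01", "2019-06-01", "2019-02-01"]),
})
latest = (df.sort_values("updated")
            .drop_duplicates("id", keep="last"))
print(latest)
```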


Artificial Entries

Artificial Entries

Many industrial datasets have artificial entries that have been deliberately inserted into the real data. This is usually done for purposes of testing the software systems that process the data.


Irregular Spacings

Many datasets include measurements taken at regular spacings. For example, you could have the traffic to a website every hour or the temperature of a physical object measured at every inch. Most of the algorithms that process data such as this assume that the data points are equally spaced, which presents a major problem when they are irregular.


If the data is from sensors measuring something such as temperature, then typically you have to use interpolation techniques to generate new values at a set of equally spaced points.
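NumPy's interp is one simple way to do linear interpolation onto an equally spaced grid (toy sensor readings here):

```python
import numpy as np

# Irregularly spaced sensor readings ...
t = np.array([0.0, 1.0, 4.0, 6.0])         # measurement times
temp = np.array([20.0, 22.0, 28.0, 32.0])  # temperature readings

# ... resampled onto an equally spaced grid by linear interpolation.
grid = np.arange(0.0, 7.0, 1.0)            # t = 0, 1, 2, ..., 6
resampled = np.interp(grid, t, temp)
print(resampled)
```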


A special case of irregular spacings happens when two entries have identical timestamps but different values. This usually happens because the timestamps are only recorded to finite precision.


If two measurements happen within the same minute, and time is only recorded up to the minute, then their timestamps will be identical.


Formatting Issues

Formatting Is Irregular between Different Tables/Columns

This happens a lot, typically because of how the data was stored in the first place. It is an especially big issue when joinable/groupable keys are irregularly formatted between different datasets.


Extra Whitespace

For such a small issue, it is almost comical how often random whitespace confounds analyses when people try to, say, join the identifier “ABC” against “ABC ” (with a trailing space) for two different datasets. Whitespace is especially insidious because when you print the data to the screen to examine it, the whitespace might be impossible to discern.


In Python, every string object has a strip() method that removes whitespace from the front and end of a string.

The methods lstrip() and rstrip() will remove whitespace only from the front and end, respectively. 

If you pass a character as an argument into the strip functions, only that character will be stripped. For example,

" ABC\t".lstrip() 'ABC\t'
" ABC\t".rstrip() ' ABC'


Irregular Capitalization

Python strings have lower() and upper() methods, which will return a copy of the original string with all letters set to uppercase or lowercase.


Inconsistent Delimiters

Usually, a dataset will have a single delimiter, but sometimes, different tables will use different ones. The most common delimiters you will see are as follows: Commas, Tabs, Pipes (the vertical line “|”).


Irregular NULL Format

There are a number of different ways that missing entries are encoded into CSV files, and they should all be interpreted as NULLs when the data is read in. Some popular examples are the empty string “”, “NA,” and “NULL.” Occasionally, you will see others such as “unavailable” or “unknown” as well.
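pandas’ read_csv accepts a na_values list for exactly this purpose; a small sketch using an in-memory file:

```python
import io
import pandas as pd

raw = "id,score\n1,42\n2,NA\n3,unknown\n4,\n"

# Tell the reader every string that should be treated as NULL.
df = pd.read_csv(io.StringIO(raw),
                 na_values=["NA", "NULL", "unknown", "unavailable"])
print(df["score"].isna().sum())  # 3: the NA, unknown, and empty entries
```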


Invalid Characters

Some data files will randomly have invalid bytes in the middle of them. Some programs will throw an error if you try to open up anything that isn’t valid text. In these cases, you may have to filter out the invalid bytes.


The following Python code will create a byte string called s, which is not valid text. The decode() method takes in two arguments. The first is the text encoding that the bytes should be decoded as (there are several, which I will discuss later in the blog on file formats).


The second is what should be done when such coercion isn’t possible; saying “ignore” means that invalid characters simply get dropped.

s = b"abc\xFF"
print(s)  # note how the last byte isn't a letter
s.decode("ascii", "ignore")  # 'abc'
Weird or Incompatible Datetimes

Datetimes are one of the most frequently mangled types of data field. Some of the date formats you will see are as follows:

  • August 1, 2013
  • AUG 1, ‘13

There is an important way that dates and times are different from other formatting issues. Most of the time you have two different ways of expressing the same information, and a perfect translation is possible from the one to the other. But with dates and times, the information content itself can be different.


For example, you might have just the date, or there could also be a time associated with it. If there is a time, does it go out to the minute, hour, second, or something else? What about time zones?


Most scripting languages include some kind of built-in DateTime data structure, which lets you specify any of these different parameters (and uses reasonable defaults if you don’t specify).


Generally speaking, the best way to approach DateTime data is to get it into the built-in data types as quickly as possible, so that you can stop worrying about string formatting.


The easiest way to parse dates in Python is with a package called dateutil, which works as follows:

import dateutil.parser as p

>>> p.parse("August 13, 1985")
datetime.datetime(1985, 8, 13, 0, 0)

>>> p.parse("2013-8-13")
datetime.datetime(2013, 8, 13, 0, 0)

>>> p.parse("2013-8-13 4:15am")
datetime.datetime(2013, 8, 13, 4, 15)


It takes in a string, uses some reasonable rules to determine how that string is encoding dates and times and coerces it into the DateTime data type. Note that it rounds down – August 13th becomes 12:00 AM on August 13th, and so on.


Operating System Incompatibilities

operating systems

Different operating systems have different file conventions, and sometimes, that is a problem when opening a file that was generated on one OS on a computer that runs a different one.


Probably, the most notable place where this occurs is newlines in text files. In Mac and Linux, a newline is conventionally denoted by the single character “\n.” On Windows, it is often two characters “\r\n.” Many data processing tools check what operating system they are being run on so that they know which convention to use.


Wrong Software Versions

Sometimes, you will have a file of a format that is designed to be handled by a specific software package. However, when you try to open it, a very mystifying error is thrown. This happens, for example, with data compression formats.


Oftentimes the culprit ends up being that the file was originally generated with one version of the software. However, the software has changed in the meantime, and you are now trying to open the file with a different version.


Example­ Formatting Script

Formatting Script

The following script illustrates how you can use hacked-together string formatting to clean up disgusting data and load it into a Pandas DataFrame. Let’s say we have the following data in a file:


Ms. Janice Joplin|65|January 19, 1943

Bob Dylan |74 Years| may 24 1941

Billy Ray Joel|66yo|Feb. 9, 1941


It’s clear to a human looking at the data what it’s supposed to mean, but it’s the kind of thing that might be terrible if you opened it with a CSV file reader.


The following code will take care of the pathologies and make things more explicit. It’s not exactly pretty or efficient, but it gets the job done, it’s easy to understand, and it would be easy to modify if it needed changing:

def get_first_last_name(s):
    INVALID_NAME_PARTS = ["mr", "ms", "mrs", "dr", "jr", "sir"]
    parts = s.lower().replace(".", "").strip().split()
    parts = [p for p in parts if p not in INVALID_NAME_PARTS]
    if len(parts) == 0:
        raise ValueError("Name %s is formatted wrong" % s)
    first, last = parts[0], parts[-1]
    first = first[0].upper() + first[1:]
    last = last[0].upper() + last[1:]
    return first, last

def format_age(s):
    chars = list(s)  # list of characters
    digit_chars = [c for c in chars if c.isdigit()]
    return int("".join(digit_chars))

def format_date(s):
    # Only the months that appear in this file.
    MONTH_MAP = {"jan": "01", "feb": "02", "may": "05"}
    s = s.strip().lower().replace(",", "")
    m, d, y = s.split()
    if len(y) == 2:
        y = "19" + y
    if len(d) == 1:
        d = "0" + d
    return y + "-" + MONTH_MAP[m[:3]] + "-" + d

import pandas as pd
df = pd.read_csv("file.tsv", sep="|",
                 names=["Name", "Age", "Birthdate"])
df["First Name"] = df["Name"].apply(
    lambda s: get_first_last_name(s)[0])
df["Last Name"] = df["Name"].apply(
    lambda s: get_first_last_name(s)[1])
df["Age"] = df["Age"].apply(format_age)
df["Birthdate"] = df["Birthdate"].apply(format_date)
print(df)