Big Data Definition


Big Data refers to several trends in data storage and processing, which have posed new challenges, provided new opportunities, and demanded new solutions.

 

Often, these Big Data problems have required a level of software engineering expertise that typical statisticians and data analysts don't have.

 

Big Data is also an area where low‐level software engineering concerns become especially important for data scientists. It's always important that they think hard about the logic of their code, but normally performance is a strictly secondary concern.

 

In Big Data though, it’s easy to accidentally add several hours to your code’s runtime or even have the code fail several hours in due to a memory error if you do not keep an eye on what’s going on inside the computer.

 

This blog will start with an overview of two pieces of Big Data software that are particularly important: the Hadoop file system, which stores data on clusters, and the Spark cluster computing framework, which can process that data.

 

I will then move on to some of the fundamental concepts that underlie Big Data frameworks and cluster computing in general, including the famed MapReduce (MR) programming paradigm.

 

What Is Big Data?


“Big Data,” as the term is used today, is a bit of a misnomer. Massive datasets have been around for a long time, and nobody gave them a special name. Even today, the largest datasets around are generally well outside of the “big data” sphere.

 

They are generated from scientific experiments, especially particle accelerators, and processed on custom‐made architectures of software and hardware.

 

Instead, Big Data refers to several related trends in datasets (one of which is size) and to the technologies for processing them. The datasets tend to have two properties:

 

They are, as the name suggests, big. There is no special cutoff for when a dataset is “big.” Roughly though, it happens when it is no longer practical to store or process it all on a single computer.

 

Instead, we use a cluster of computers, anywhere from a handful of them up to many thousands.

 

The focus is on making our processing scalable so that it can be distributed over a cluster of arbitrary size with various parts of the analysis going on in parallel. The nodes in the cluster can communicate, but it is kept to a minimum.

 

The second thing about Big Datasets is that they are often “unstructured.” This is a terribly misleading term. It doesn’t mean that there is no structure to the data, but rather that the dataset doesn’t fit cleanly into a traditional relational (SQL) database.

 

Prototypical examples would be images, PDFs, HTML documents, Excel files that aren’t organized into clean rows and columns, and machine‐generated log files.

 

Traditional databases presuppose a very rigid structure to the data they contain, and in exchange, they offer highly optimized performance.

 

In Big Data though, we need the flexibility to process data that come in any format, and we need to be able to operate on that data in ways that are less predefined.

 

You often pay through the nose for this flexibility when it comes to your software’s runtime since there are very few optimizations that can be prebuilt into the framework.

 

Big Data requires a few words of caution. The first is that you should be hesitant about using Big Data tools. They’re all the rage these days, so many people are jumping on the bandwagon blindly. But Big Data tools are almost always slower, harder to set up, and more finicky than their traditional counterparts.

 

This is partly because they’re new technologies that haven’t matured yet, but it’s also inherent to the problems they’re solving: they need to be flexible enough to deal with unstructured data, and they need to run on a cluster of computers instead of a stand-alone machine.

 

So if your datasets will always be small enough to process with a single machine, or you only need operations that are supported by SQL, you should consider doing that instead.

 

The final word of caution is that even if you are using Big Data tools, you should probably still be using traditional technologies in conjunction with them. For example, I very rarely use Big Data to do machine learning or data visualization.

 

Typically, I use Big Data tools to extract the relevant features from my data. The extracted features take up much less space compared to the raw dataset, so I can then put the output onto a normal machine and do the actual analytics using something such as Pandas.

 

I don’t mean to knock Big Data tools. They really are fantastic. It’s just that there is so much hype and ignorance surrounding them: like all tools, they are great for some problems and terrible for others. With those disclaimers out of the way, let’s dive in.

 

Hadoop: The File System and the Processor


The modern field of Big Data largely started when Google published its seminal paper on MapReduce, a cluster computing framework it had created to process massive amounts of web data.

 

After reading the paper, an engineer named Doug Cutting decided to write a free, open-source implementation of the same idea.

 

Google’s MR was written in C++, but he decided to do it in Java. Cutting named this new implementation Hadoop, after his daughter’s stuffed elephant.

 

Hadoop caught on like wildfire and quickly became almost synonymous with Big Data. Many additional tools were developed that ran on Hadoop clusters or that made it easier to write MR jobs for Hadoop.

 

There are two parts to Hadoop. The first is the Hadoop Distributed File System (HDFS). It allows you to store data on a cluster of computers without worrying about what data is on which node.

 

Instead, you refer to locations in HDFS just as you would for files in a normal directory system. Under the hood, HDFS takes care of what data is stored on which node, keeping multiple copies of the data in case a node fails, and handling other boilerplate.

 

The second part of Hadoop is the actual MR framework, which reads in data from HDFS, processes it in parallel, and writes its output to HDFS. I’m actually not going to say much about the Hadoop MR framework, because ironically it’s a bit of a dinosaur these days (shows you how quickly Big Data is evolving!).

 

There is a huge amount of overhead for its MR jobs (most damningly, it always reads its input from disk and writes output to disk, and disk IO is much more time‐consuming than just doing things in RAM).

 

Additionally, it does a really lousy job of integrating with more conventional programming languages. The community’s focus has shifted toward other tools that still operate on data in HDFS, most notably Spark, and those are the tools I’ll dwell on.

 

Example PySpark Script


PySpark is the most popular way for Python users to work with Big Data. It gives you a Python shell and a library, also called pyspark, that lets you plug into the Spark computational framework and parallelize your computations across a cluster.

 

The code reads similarly to normal Python, except that there is a SparkContext object whose methods let you access the Spark framework.

 

This script, whose content I will explain later, uses parallel computing to calculate the number of times every word appears in a text document.

# Create the SparkContext object
from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

# Read file lines and parallelize them
# over the cluster in a Spark RDD
lines = open("myfile.txt")
lines_rdd = sc.parallelize(lines)

# Remove punctuation, make lines lowercase
def clean_line(s):
    s2 = s.strip().lower()
    s3 = s2.replace(".", "").replace(",", "")
    return s3

lines_clean = lines_rdd.map(clean_line)

# Break each line into words
words_rdd = lines_clean.flatMap(lambda l: l.split())

# Count words
def merge_counts(count1, count2):
    return count1 + count2

words_w_1 = words_rdd.map(lambda w: (w, 1))
counts = words_w_1.reduceByKey(merge_counts)

# Collect counts and display
for word, count in counts.collect():
    print("%s: %i" % (word, count))

 

If Spark is installed on your computer and you are in the Spark home directory, you can run this script on the cluster with the following command:

bin/spark-submit --master yarn-client myfile.py

Alternatively, you can run the same computation on just a single machine with the following command:

bin/spark-submit --master local myfile.py

 

Spark Overview


Spark is the leading Big Data processing technology in the Hadoop ecosystem these days, having largely replaced traditional Hadoop MR. It is usually more efficient, especially if you are chaining several operations together, and it’s tremendously easier to use.

 

From a user’s perspective, Spark is just a library that you import when you are using either Python or Scala.

 

Spark is written in Scala and runs faster when you call it from Scala, but this blog will introduce the Python API, which is called PySpark. The example script at the beginning of this blog was all PySpark. The Spark API itself (names of functions, variables, etc.) is almost identical between the Scala version and the Python version.

 

The central data abstraction in PySpark is a “resilient distributed dataset” (RDD), which is just a collection of Python objects.

 

These objects are distributed across different nodes in the cluster, and generally, you don’t need to worry about which ones are on which nodes. They can be strings, dictionaries, integers – more or less whatever you want.

 

An RDD is immutable, so its contents cannot be changed directly, but it has many methods that return new RDDs. For instance, in the aforementioned example script, we made liberal use of the “map” method.

 

If you have an RDD called X and a function called f, then X.map(f) will apply f to every element of X and return the results as a new RDD.

 

RDDs come in two types: keyed and unkeyed. Unkeyed RDDs support operations such as map(), which operate on each element of the RDD independently.

 

Often though, we want more complex operations, such as grouping all elements that meet some criteria or joining two different RDDs. These operations require coordination between different elements of an RDD, and for these operations, you need a keyed RDD.

 

If you have an RDD that consists of two‐element tuples, the first element is considered the “key” and the second element the “value.” We created a keyed RDD and processed it in the aforementioned script with the following lines:

 

words_w_1 = words_rdd.map(lambda w: (w, 1))

counts = words_w_1.reduceByKey(merge_counts)

 

Here words_w_1 will be a keyed RDD, where the keys are the words and the values are all 1. Every occurrence of a word in the dataset will give rise to a different element in words_w_1. The next line uses the reduceByKey method to group all values that share a key together and then condense them down to a single aggregate value.
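
To make this concrete, here is what take() might show at each stage; the RDDs are the ones from the example script, and the exact values shown are just illustrative:

words_w_1.take(5)
# e.g. [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]
counts.take(3)
# e.g. [('the', 2), ('cat', 1), ('saw', 1)]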

 

I should note that the keyed and unkeyed RDDs are not separate classes in the PySpark implementation. It’s just that certain operations you can call (such as reduceByKey) will assume that the RDD is structured as key-value pairs, and it will fail at runtime if that is not the case.

 

Besides RDDs, the other key abstraction the user has to be aware of is the SparkContext class, which interfaces with the Spark cluster and is the entry point for Spark operations. Conventionally, the SparkContext in an application will be called sc.

 

Generally, PySpark operations come in two types:

Calling methods on the SparkContext, which create an RDD. In the example script, we used parallelize() to move data from local space into the cluster as an RDD. There are other methods that will create RDDs from data that is already distributed, by reading it out of HDFS or another storage medium.

 

Calling methods on RDDs, which either return new RDDs or produce an output of some kind.

 

Most operations in Spark are what’s called “lazy.” When you type lines_clean = lines_rdd.map(clean_line) no actual computation gets done. Instead, Spark will just keep track of how the RDD lines_clean is defined.

 

Similarly, lines_rdd quite possibly doesn’t exist either and is only implicitly defined in terms of some upstream process. As the script runs, Spark is piling up a large dependency structure of RDDs defined in terms of each other, but never actually creating them.

 

Eventually, you will call an operation that produces some output, such as saving an RDD into HDFS or pulling it down into local Python data structures. At that point, the dominos start falling, and all of the RDDs that you have previously defined will get created and fed into each other, eventually resulting in the final side effect.

 

By default, an RDD exists only long enough for its contents to be fed into the next stage of processing. If an RDD that you define is never actually needed, then it will never be brought into being.

 

The problem with lazy evaluation is that sometimes we want to reuse an RDD for a variety of different processes. This brings us to one of the most important aspects of Spark that differentiates it from traditional Hadoop MR: Spark can cache an RDD in the RAM of the cluster nodes so that it can be reused as much as you want.

 

By default, an RDD is an ephemeral data structure that only exists long enough for its contents to be passed into the next stage of processing, but a cached RDD can be experimented with in real time.

 

To cache an RDD in memory, you just call the cache() method on it. This method will not actually create the RDD, but it will ensure that the first time the RDD gets created, it is persisted in RAM.
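
As a minimal sketch, reusing the counts RDD from the earlier word-count script:

counts.cache()           # mark counts for persistence in RAM
total = counts.count()   # first action: counts is computed and then cached
top = counts.take(10)    # reuses the cached RDD instead of recomputing everything upstream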

 

There is one other problem with lazy evaluation. Say again that we write the line

lines_clean = lines_rdd.map(clean_line)

 

But imagine that the clean_line function will fail for some value in lines_rdd. We will not know this at the time: the error will only arise later in the script, when lines_clean is finally forced to be created. If you are debugging a script, a trick I use is to call the count() method on each RDD as soon as it is declared.

 

The count() method counts the elements in the RDD, which forces the whole RDD to be created and will raise an error if there are any problems.

 

The count() operation is expensive, and you should certainly not include those steps in code that gets run on a regular basis, but it’s a great debugging tool.
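
For example, while debugging the earlier script you might force each RDD into existence right after defining it; a bad value inside clean_line will then surface at the matching count() call rather than at the very end of the script:

lines_clean = lines_rdd.map(clean_line)
print(lines_clean.count())    # forces lines_clean to be created right now
words_rdd = lines_clean.flatMap(lambda l: l.split())
print(words_rdd.count())      # forces words_rdd to be created right now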

 

Spark Operations


This section will give you a rundown of the main methods that you will call on the SparkContext object and on RDDs. Together, these methods are everything you will do in a PySpark script that isn’t pure Python.

 

The SparkContext object has the following methods:

sc.parallelize(my_list): Takes in a list of Python objects and distributes them across the cluster to create an RDD.

sc.textFile("/some/place/in/hdfs"): Takes in the location of text files in HDFS and returns an RDD containing the lines of text.

sc.pickleFile("/some/place/in/hdfs"): Takes a location in HDFS that stores Python objects that have been serialized using the pickle library. Deserializes the Python objects and returns them as an RDD. This is a really useful method.

 

sc.addFile("myfile.txt"): Copies myfile.txt from the local machine to every node in the cluster, so that they can all use it in their operations.

 

sc.addPyFile("mylib.py"): Copies mylib.py from the local machine to every node in the cluster, so that it can be imported as a library and used by any node in the cluster.
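
Here is a hedged sketch of how these SparkContext methods might be called together; the HDFS paths and file names are placeholders, not paths from the earlier example:

rdd1 = sc.parallelize([{"id": 1}, {"id": 2}])   # local Python list -> RDD
rdd2 = sc.textFile("/data/logs/")               # lines of text already sitting in HDFS
rdd3 = sc.pickleFile("/data/objects/")          # pickled Python objects already in HDFS
sc.addFile("lookup_table.txt")                  # ship a data file to every node
sc.addPyFile("helpers.py")                      # ship a Python module so every node can import it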

 

The main methods you will use on an RDD are as follows:

  • rdd.map(func): Applies func to every element in the RDD and returns that result as an RDD.
  • rdd.filter(func): Returns an RDD containing only those elements x of rdd for which func(x) evaluates to True.
  • rdd.flatMap(func): Applies func to every element in the RDD. func(x) doesn’t return just a single element of the new RDD: it returns a list of new elements, so that one element in the original RDD can turn into many in the new one. Or, an element in the original RDD might result in an empty list and hence no elements of the output RDD (see the short sketch after this list).

 

rdd.take(5): Computes five elements of RDD and returns them as a Python list. Very useful when debugging, since it only computes those five elements.

 

rdd.collect(): Returns a Python list containing all the elements of the RDD. Make sure you only call this if the RDD is small enough that it will fit into the memory of a single computer.

 

rdd.saveAsTextFile("/some/place/in/hdfs"): Saves an RDD in HDFS as a text file. Useful for an RDD of strings.

 

rdd.saveAsPickleFile("/some/place/in/hdfs"): Serializes every object in pickle format and stores them in HDFS. Useful for RDDs of complex Python objects, such as dictionaries and lists.

 

rdd.distinct(): Filters out all duplicates.

rdd1.union(rdd2): Combines elements of rdd1 and rdd2 into a single RDD.

rdd.cache(): Whenever the RDD is actually created, it will be cached in RAM so that it doesn’t have to be re‐created later.

rdd.keyBy(func): This is a simple wrapper for making keyed RDDs, since it is such a common use case. This is equivalent to rdd.map(lambda x: (func(x), x)).

 

rdd1.join(rdd2): This works on two keyed RDDs. If (k, v1) is in rdd1 and (k, v2) is in rdd2, then (k, (v1, v2)) will be in the output RDD.

 

rdd.reduceByKey(func): For every unique key in the keyed rdd, this collects all of its associated values and aggregates them together using func.

 

rdd.groupByKey(): For every unique key in a keyed RDD, this groups all of the values that share that key into a single iterable object. The result is an RDD of (key, iterable-of-values) pairs, which you can then process with a subsequent map().
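
As a small illustration of these RDD methods, here is a hedged sketch on made-up data; the records, department names, and numbers are purely illustrative:

lines_demo = sc.parallelize(["a b", "", "c"])
words_demo = lines_demo.flatMap(lambda s: s.split())
words_demo.collect()                    # ['a', 'b', 'c']: "a b" became two elements, "" became none

people = sc.parallelize([{"name": "ann", "dept": "hr"},
                         {"name": "bob", "dept": "eng"}])
by_dept = people.keyBy(lambda p: p["dept"])              # keyed RDD of (dept, person) pairs
budgets = sc.parallelize([("hr", 100), ("eng", 250)])
joined = by_dept.join(budgets)                           # (dept, (person, budget)) pairs
headcount = by_dept.map(lambda kv: (kv[0], 1)) \
                   .reduceByKey(lambda a, b: a + b)      # (dept, number of people)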

 

Two Ways to Run PySpark

PySpark can be run either by submitting a stand‐alone Python script or by opening up an interpreted session where you can enter your Python commands one at a time.

 

In the previous example, we ran our script by saying bin/spark-submit --master yarn-client myfile.py

 

The spark‐submit command is what we use for stand‐alone scripts. If instead, we wanted to open up an interpreter, we would say

bin/pyspark --master yarn-client

 

This would open up a normal‐looking Python terminal, from which we could import the PySpark libraries.

 

From the perspective of writing code, the key difference between a stand-alone script and an interpreter session is that in the script we had to explicitly create the SparkContext object, which we called sc. It was done with the following lines:

from pyspark import SparkConf, SparkContext
conf = SparkConf()

sc = SparkContext(conf=conf)

 

If you open up an interpreter though, it will automatically contain the SparkContext object and call it sc. No need to create it manually.

The reason for this difference is that stand‐alone scripts often need to set a lot of configuration parameters so that somebody who didn’t write them can still run them reliably.

 

Calling various methods on the SparkConf object sets those configurations. The assumption is that if you open an interpreter directly, then you will set the configurations yourself from the command line.

 

Configuring Spark


Clusters are finicky things. You need to make sure that every node has the data and files it needs, that no node gets overloaded, and so on. You need to make sure that you are using the right amount of parallelism because it’s easy to make your code slower by having it be too parallel.

 

Finally, multiple people usually share a cluster, so the stakes are much higher if you hog resources or crash it (which I have done – trust me, people get irate when the whole cluster dies).

 

All of this means that you need to have an eye toward how your job is configured. This section will give you the most crucial parts.

All of the configurations can be set from the command line. The ones you are most likely to have to worry about are the following:

 

Name: A human‐readable name to give your process. This doesn’t affect the running, but it will show up in the cluster monitoring software so that your sysadmin can see what resources you’re taking up.

 

Master: This identifies the “master” process, which deals with parallelizing your job (or running it in local mode). Usually, “yarn‐client” will send your job to the cluster for parallel processing, while “local” will run it locally.

 

There are other masters available sometimes, but local and yarn are the most common. Perhaps surprisingly, the default master is local rather than yarn; you have to explicitly tell PySpark to run in parallel if you want parallelism.

 

py‐files: A comma-separated list of any Python library files that need to be copied to other nodes in the cluster. This is necessary if you want to use that library’s functionality in your PySpark methods because under the hood, each node in the cluster will need to import the library independently.

 

Files: A comma‐separated list of any additional files that should be put in the working directory on each node. This might include configuration files specific to your task that your distributed functionality depends on.

 

Num‐executors: The number of executor processes to spawn in the cluster. They will typically be on separate nodes. The default is 2.

 

Executor‐cores: The number of CPU cores each executor process should take up. The default is 1.

 

An example of how this might look setting parameters from the command line is as follows:

bin/pyspark \
--name my_pyspark_process \
--master yarn-client \
--py-files mylibrary.py,otherlibrary.py \
--files myfile.txt,otherfile.txt \
--num-executors 5 \
--executor-cores 2
If instead you want to set them inside a stand‐alone script, it will be as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("my_pyspark_process")
conf.set("spark.executor.instances", 5)
conf.set("spark.executor.cores", 2)
sc = SparkContext(conf=conf)
sc.addPyFile("mylibrary.py")
sc.addPyFile("otherlibrary.py")
sc.addFile("myfile.txt")
sc.addFile("otherfile.txt")
Under the Hood

 

In my mind, PySpark makes a lot more sense when you understand just a little bit about what’s going on under the hood. Here are some of the main points:

 

When you use the “pyspark” command, it will actually run the “python” command on your computer and just make sure that it links to the appropriate Spark libraries (and that the SparkContext object is already in the namespace if you’re running in interactive mode).

 

This means that any Python libraries you have installed on your main node are available to you within your PySpark script.

 

When running in cluster mode, the Spark framework cannot run Python code directly. Instead, it kicks off a separate Python process on each node and runs your code there as a subprocess.

 

If your code needs libraries or additional files that are not present on a node, then the process that is on that node will fail. This is the reason you must pay attention to what files get shipped around.

 

Whenever possible, a segment of your data is confined to a single Python process. If you call map(), flatMap(), or a variety of other PySpark operations, each node will operate on its own data. This avoids sending data over the network, and it also means that everything can stay as Python objects.

 

It is very computationally expensive to have the nodes shift data around between them. Not only do we have to actually move the data around, but also the Python objects must be serialized into a string‐like format before we send them over the wire and then rehydrated on the other end.

 

Operations such as groupByKey() will require serializing data and moving it between nodes. This step in the process is called a “shuffle.” The Python processes are not involved in the shuffle. They just serialize the data and then hand it off to the Spark framework.

 

Spark Tips and Gotchas


Here are a few parting tips for using Spark, which I have learned from experience and/or hard lessons:

RDDs of dictionaries make both code and data much more understandable. If you’re working with CSV data, always convert it to dictionaries as your first step. Yeah, it takes up more space because every dictionary has copies of the keys, but it’s worth it.

 

Store things in pickle files rather than in text files if you’re likely to operate on them later. It’s just so much more convenient.

Use take() while debugging your scripts to see what format your data is in (RDD of dictionaries? Of tuples? etc.).

 

Running count() on an RDD is a great way to force it to be created, which will bring any runtime errors to the surface sooner rather than later.

 

Do most of your basic debugging in local mode rather than in distributed mode, since it goes much faster if your dataset is small enough. Plus, you reduce the chances that something will fail because of bad cluster configuration.

 

If things work fine in local mode but you’re getting weird errors in distributed mode, make sure that you’re shipping the necessary files across the cluster.

 

If you’re using the ‐‐files option from the command line to distribute files across the cluster, make sure that the list is separated by commas rather than colons. I lost two days of my life to that one…

 

Now that we have seen PySpark in action, let’s step back and consider some of what’s going on here in the abstract.

 

The MapReduce Paradigm


MapReduce is the most popular programming paradigm for Big Data technologies. It requires programmers to write their code in a way that can be easily parallelized across a cluster of arbitrary size and over an arbitrarily large dataset.

 

Some variant of MR underlies many of the major Big Data tools, including Spark, and probably will for the foreseeable future, so it’s very important to understand it.

 

An MR job takes a dataset, such as a Spark RDD, as input. There are then two stages to the job:

Mapping. Every element of the dataset is mapped, by some function, to a collection of key-value pairs. In PySpark, you can do this with the flatMap method.

 

Reducing. For every unique key, a “reduce” process is kicked off. It is fed all of its associated values one at a time, in no particular order, and eventually, it produces some outputs. You can implement this using the reduceByKey method in PySpark.

 

And that’s all there is to it: the programmer writes the code for the mapper function, and they write the code for the reducer, and that’s it. No worrying about the size of the cluster, what data is where, and so on.

 

In the example script I gave, Spark will end up optimizing the code into a single MR job. Here is the code rewritten so as to make it explicit:

def mapper(line):
    l2 = line.strip().lower()
    l3 = l2.replace(".", "").replace(",", "")
    words = l3.split()
    return [(w, 1) for w in words]

def reducer_func(count1, count2):
    return count1 + count2

lines = open("myfile.txt")
lines_rdd = sc.parallelize(lines)
map_stage_out = lines_rdd.flatMap(mapper)
reduce_stage_out = map_stage_out.reduceByKey(reducer_func)

 

What happens under the hood in an MR job is the following:

The input dataset starts off being distributed across several nodes in the cluster.

Each of these nodes will, in parallel, apply the mapping function to all of its pieces of data to get key-value pairs.

 

Each node will use the reducer to condense all of its key-value pairs for a particular word into just a single one, representing how often that word occurred in the node’s data. Again, this happens completely in parallel.

 

For every distinct key that is identified in the cluster, a node in the cluster is chosen to host the reduce process.

Every node will forward each of its partial counts to the appropriate reducer. This movement of data between nodes is often the slowest stage of the whole MR job – even slower than the actual processing.

 

Every reduce process runs in parallel on all of its associated values, calculating the final word counts.

The overall workflow is displayed in the following diagram:

 

There is one thing I have done here that breaks from classical MR. I have said that each node uses the reducer to condense all of its key-value pairs for a particular word into just one.

 

That stage is technically a performance optimization called a “combiner.” I was only able to use a combiner because my reducer was just doing addition, and it doesn’t matter what order you add things up in.

 

In the most general case, those mapper outputs are not condensed – they are all sent to whichever node is doing the reducing for that word. This puts a massive strain on the bandwidth between the nodes of the cluster, so you want to use combiners whenever possible.

 

Performance Considerations


There are several guidelines applicable to any MR framework, including Spark:

If you are going to filter data out, do it as early as possible. This reduces network bandwidth.

 

The job only finishes when the last reduce process is done, so try to avoid a situation where one reducer is handling most of the key-value pairs. If possible, more reducers mean each one has to handle fewer key-value pairs.

 

In traditional coding, the name of the game in performance optimization is to reduce the number of steps your code takes. This is usually a secondary concern in MR. The biggest concern instead becomes the time it takes to move data from node to node across the network.

 

And the number of steps your code takes doesn’t matter so much – instead, it’s how many steps your worst node takes.

 

There is one other specific optimization with Spark in particular that I should mention, which doesn’t come up all that often but can be a huge deal when it does. Sometimes, reduceByKey is the wrong method to use. In particular, it is very inefficient when your aggregated values are large, mutable data structures.

 

Take this dummy code, for example, which takes all occurrences of a word and puts them into a big list:

def mapper(line):
    return [(w, [w]) for w in line.split()]

def red_func(lst1, lst2):
    return lst1 + lst2

result = lines_rdd.flatMap(mapper).reduceByKey(red_func)

 

As I’ve written it, every time red_func is called, it is given two potentially very long lists. It will then create a new list in memory (which takes quite a bit of time) and then delete the original lists. This is horribly abusive to the memory, and I’ve seen jobs die because of it.

 

Intuitively, what you want to do is keep a big list and just append all the words to it, one at a time, rather than constantly creating new lists.

 

That can be accomplished with the aggregateByKey function, which is a little more complicated to use compared to reduceByKey, but much more efficient if you use it correctly. Example code is here:

def update_agg(agg_list, new_word):
    agg_list.append(new_word)
    return agg_list   # same list!

def merge_agg_lists(agg_list1, agg_list2):
    return agg_list1 + agg_list2

result = lines_rdd.flatMap(mapper).aggregateByKey(
    [], update_agg, merge_agg_lists)

 

In this case, each node in the cluster will start off with the empty list as its aggregate for a particular word. Then it will feed that aggregate, along with each instance of the word, into update_agg.

 

Then update_agg will append the new value to the list, rather than creating a new one, and return the updated list as its result. The function merge_agg_lists still operates the original way, but it is only called a few times, to merge the outputs of the different nodes.

 

BIG DATA


Big data is a popular term that describes the exponential growth, availability, and use of information, both structured and unstructured. Big data continues to gain attention from the high-performance computing niche of the information technology market.

 

According to International Data Corporation (IDC), “in 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes), growing by a factor of nine in just five years. That’s nearly as many bits of information in the digital universe as stars in the physical universe.”

 

Big data provides both challenges and opportunities for data miners to develop improved models. With today’s massively parallel, in-memory analytical computing appliances, the size of the data you can analyze is no longer a limiting factor.

 

A key advantage is you are able to analyze more of the population with a broad range of classical and modern analytics.

 

The goal of data mining is generalization. Rather than relying on sampling, you can isolate hard-to-detect signals in the data, and you can also produce models that generalize better when deployed into operations. You can also more readily detect outliers that often lead to the best insights.

 

You can also try more configuration options for a specific algorithm, for example, neural network topologies including different activation and combination functions, because the models run in seconds or minutes instead of hours.

 

Enterprises are also moving toward creating large multipurpose analytical base tables that several analysts can use to develop a plethora of models for risk, marketing, and so on.

 

Developing standardized analytical tables that contain thousands of candidate predictors and targets supports what is referred to as model harvesting or a model factory. A small team of analysts at Cisco Systems currently builds over 30,000 propensity-to-purchase models each quarter.

 

This seasoned team of analysts has developed highly repeatable data preparation strategies along with a sound modeling methodology that they can apply over and over.

 

Customer dynamics also change quickly, as does the underlying snapshot of the historical modeling data. So the analyst often needs to refresh (retrain) models at very frequent intervals. Now more than ever, analysts need the ability to develop models in minutes, if not seconds, vs. hours or days.

 

Using several champion and challenger methods is critical. Data scientists should not be restricted to using one or two modeling algorithms. Model development (including discovery) is also iterative by nature, so data miners need to be agile when they develop models.

 

Analytics is moving out of research and more into operations. The bottom line is that big data is only getting bigger, and data miners need to significantly reduce the cycle time it takes to go from analyzing big data to creating ready-to-deploy models.

 

Many applications can benefit from big data analytics. One of these applications is telematics, which is the transfer of data from any telecommunications device or chip. The volume of data that these devices generate is massive.

 

For example, automobiles have hundreds of sensors. Automotive manufacturers need scalable algorithms to predict vehicle performance and problems on demand.

 

Insurance companies are also implementing pay-as-you-drive plans, in which a GPS device that is installed in your car tracks the distance driven and automatically transmits the information to the insurer.

 

More advanced GPS devices that contain integrated accelerometers also capture date, time, location, speed, cornering, harsh braking, and even frequent lane changing. Data scientists and actuaries can leverage this big data to build more profitable insurance premium models. Personalized policies can also be written that reward truly safe drivers.

 

The smart energy grid is another interesting application area that encourages customer participation. Sensor systems called synchrophasors monitor the health of the grid in real time and collect many data streams per second. Consumption over very short intervals can be modeled during peak and off-peak periods to develop pricing plan models.

 

Many customers are “peakier” than others and more expensive to service. Segmentation models can also be built to define custom pricing models that decrease usage in peak hours.

 

Example segments might be “weekday workers,” “early birds, home worker,” and “late-night gamers.”

 

There are so many complex problems that can be better evaluated now with the rise of big data and sophisticated analytics in a distributed, in-memory environment to make better decisions within tight time frames.

 

The underlying optimization methods can now solve problems in parallel through co-location of the data in memory. The data mining algorithms are largely still the same; they are just able to handle more data and are much faster.

 

DATA MINING METHODS


The remainder of the blog provides a summary of the most common data mining algorithms. The discussion is broken into two subsections, each with a specific theme: classical data mining techniques and machine learning methods.

 

The goal is to describe algorithms at a high level so that you can understand how each algorithm fits into the landscape of data mining methods.

 

Although there are a number of other algorithms and many variations of the techniques, these represent popular methods used today in real-world deployments of data mining systems.

 

Classical Data Mining Techniques


Data mining methods have largely been drawn from statistics, machine learning, artificial intelligence, and database systems. By strict definition, statistics is not data mining: statistical methods were being used long before the term data mining was coined to apply to business applications.

 

In classical inferential statistics, the investigator proposes some model that may explain the relationship between an outcome of interest (dependent response variable) and explanatory variables (independent variables).

 

Once a conceptual model has been proposed, the investigator then collects the data with the purpose of testing the model. Testing typically involves the statistical significance of the parameters associated with the explanatory variables.

 

For these tests to be valid, distributional assumptions about the response or the error terms in the model need to be correct, or at least not violated too severely. Two of the most broadly used statistical methods are multiple linear regression and logistic regression.

 

Multiple linear regression and logistic regression are also commonly used in data mining. A critical distinction between their inferential applications and their data mining applications is in how one determines the suitability of the model.

 

A typical data mining application is to predict an outcome, a target in data mining jargon, based on the explanatory variables, inputs, or features in data mining jargon. Because of the emphasis on prediction, the distributional assumptions of the target or errors are much less important.

 

Often the historical data that is used in data mining model development has a time dimension, such as monthly spending habits for each customer.

 

The typical data mining approach to account for variation over time is to construct inputs or features that summarize the customer behavior for different time intervals. Common summaries are recency, frequency, and monetary (RFM) value. This approach results in one row per customer in the model development data table.
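
The blog doesn’t show code for this step, but a minimal pandas sketch of RFM-style feature construction might look like the following; the table and its column names are assumptions made for illustration:

import pandas as pd

# Toy transactions table: one row per transaction (assumed schema)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2019-01-05", "2019-03-01", "2019-02-10"]),
    "amount": [120.0, 80.0, 35.0],
})
snapshot = transactions["date"].max()

rfm = transactions.groupby("customer_id").agg(
    recency_days=("date", lambda d: (snapshot - d.max()).days),  # recency
    frequency=("date", "count"),                                 # frequency
    monetary=("amount", "sum"),                                  # monetary value
).reset_index()   # one row per customer, ready for the model development table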

 

An alternative approach is to construct multiple rows for each customer, where the rows include the customer behavior for previous months.

 

The multiple rows for each customer represent a time series of features of the customer’s past behavior. When you have data of this form, you should use a repeated measures or time series cross-sectional model.

 

Predictive modeling (supervised learning) techniques enable the analyst to identify whether a set of input variables is useful in predicting some outcome variable.

 

For example, a financial institution might try to determine whether knowledge of an applicant’s income and credit history (input variables) helps predict whether the applicant is likely to default on a loan (outcome variable). Descriptive techniques (unsupervised learning) enable you to identify underlying patterns in a data set.

 

Model overfitting happens when your model describes random noise or error instead of the true relationships in the data. Albert Einstein is credited with saying, “Everything should be made as simple as possible, but not simpler.” That is a maxim to abide by when you develop predictive models.

 

Simple models that do a good job of classification or prediction are easier to interpret and tend to generalize better when they are scored. You should evaluate your model on holdout data sources to determine whether it overfits.

 

Another strategy, especially for a small training dataset, is to use k-fold cross-validation, which is a method of estimating how well your model fits based on resampling.

 

You divide the data into k subsets of approximately the same size. You train your model k times, each time leaving out one of the subsets, which is then used to evaluate the model. You can then measure model stability across the k holdout samples.
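
A minimal sketch of k-fold cross-validation using scikit-learn; the library, the toy data, and the choice of logistic regression as the model are my assumptions, not the blog’s:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5)                  # toy inputs
y = np.random.randint(0, 2, 200)            # toy binary target
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))           # evaluate on the held-out fold
print(np.mean(scores), np.std(scores))      # average fit and stability across the k folds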

 

k-Means Clustering


k-Means clustering is a descriptive algorithm that scales well to large data. Cluster analysis has wide application, including customer segmentation, pattern recognition, biological studies, and web document classification.

 

k-Means clustering attempts to find k partitions in the data, in which each observation belongs to the cluster with the nearest mean. The basic steps for k-means are

 

  1. Select k observations arbitrarily as initial cluster centroids.
  2. Assign each observation to the cluster that has the closest centroid.
  3. Once all observations are assigned, recalculate the positions of the k centroids.

 

Repeat steps 2 and 3 until the centroids no longer change. This repetition helps minimize the variability within clusters and maximize the variability between clusters.

 

Note that the observations are divided into clusters so that every observation belongs to at most one cluster.
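
The steps above map almost directly onto code. Here is a minimal NumPy sketch of the loop; it is not an optimized implementation, and it assumes no cluster ever ends up empty:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: arbitrary initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # step 2: assign to the closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # stop when centroids no longer change
            break
        centroids = new_centroids                               # step 3: recalculate the centroids
    return labels, centroids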

 

Some software packages select an appropriate value of k. You should still try to experiment with different values of k that result in good homogeneous clusters that are relatively stable when they are applied to new data.

 

In a plot of a k-means cluster analysis, cross symbols represent the cluster centroids.

You should try to select input variables that are both representative of your business problem and predominantly independent.

 

Outliers tend to dominate cluster formation, so consider removing outliers. Normalization is also recommended to standardize the values of all inputs from their original dynamic ranges into a common range.

 

When using a distance measure, it is extremely important for the inputs to have comparable measurement scales. Most clustering algorithms work well with interval inputs. Most k-means implementations support several dimension encoding techniques for computing distances for nominal and ordinal inputs.

 

After you group the observations into clusters, you can profile the input variables to help further label each cluster.

 

One convenient way to profile the clusters is to use the cluster IDs as a target variable, and then use a decision tree and candidate inputs to classify cluster membership.

 

After the clusters have been identified and interpreted, you might decide to treat each cluster independently. You might also decide to develop separate predictive models for each cluster. Other popular clustering techniques include hierarchical clustering and expectation maximization (EM) clustering.

 

Association Analysis


Association analysis (also called affinity analysis or market basket analysis) identifies groupings of products or services that tend to be purchased at the same time or purchased at different times by the same customer.

 

Association analysis falls within the descriptive modeling phase of data mining. Association analysis helps you answer questions such as the following:

 

  • What proportion of people who purchase low-fat yogurt and 2% milk also purchase bananas?
  • What proportion of people who have a car loan with a financial institution later obtain a home mortgage from that institution?
  • What percentage of people who purchase tires and wiper blades also get automotive service done at the same shop?
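
Questions like these come down to computing support and confidence over baskets of items. A tiny pure-Python sketch on made-up baskets:

baskets = [{"yogurt", "milk", "bananas"},
           {"milk", "bananas"},
           {"yogurt", "milk"}]

has_pair = sum(1 for b in baskets if {"yogurt", "milk"} <= b)
has_all = sum(1 for b in baskets if {"yogurt", "milk", "bananas"} <= b)

support = has_all / len(baskets)      # how common the full combination is overall
confidence = has_all / has_pair       # proportion of yogurt-and-milk buyers who also buy bananas
print(support, confidence)            # 0.33..., 0.5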

 

Multiple Linear Regression


Multiple linear regression models the relationship between two or more inputs (predictors) and a continuous target variable by fitting a linear model to the training data. The regression model is represented as

E(y) = β0 + β1x1 + β2x2 + ... + βkxk

 

where E(y) is the expected target value, the xs represent the k model inputs, β0 is the intercept that centers the range of predictions, and the remaining βs are the slope estimates that determine the strength of the trend between each of the k inputs and the target. Simple linear regression has one model input x.

 

The method of least squares is used to estimate the intercept and the parameter estimates as the values that minimize the sum of squared deviations of the observed values y from the predicted values ŷ. The fitted regression can be expressed as

ŷ = b0 + b1x1 + b2x2 + ... + bkxk

where the squared error function being minimized is Σ(yi − ŷi)².
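
A minimal NumPy sketch of computing these least squares estimates on toy data; np.linalg.lstsq performs the minimization:

import numpy as np

X = np.random.rand(100, 3)                                        # k = 3 inputs
y = 2.0 + X @ np.array([1.5, -0.7, 3.0]) + np.random.normal(0, 0.1, 100)

X_design = np.column_stack([np.ones(len(X)), X])                  # column of 1s for the intercept b0
coefs, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)         # minimizes the squared error function
b0, b = coefs[0], coefs[1:]                                       # intercept and slope estimates
y_hat = X_design @ coefs                                          # predicted values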

 

Multiple linear regression is advantageous because of familiarity and interpretability. Regression models also generate optimal unbiased estimates for unknown parameters.

 

Many phenomena cannot be described by linear relationships between the target variable and the input variables. You can add polynomial terms to the model (power terms, including interaction effects) to approximate more complex nonlinear relationships.

 

The adjusted coefficient of determination (adjusted R2) is a commonly used measure of the goodness of fit of regression models. Essentially, it is the percentage of the variability that is explained by the model relative to the total variability, adjusted for the number of inputs in your model.

 

A traditional R2 statistic does not penalize the model for the number of parameters, so you almost always end up choosing the model that has the most parameters.

 

Another common criterion is the root mean square error (RMSE), which indicates the absolute fit of the predicted values to the actual values. You usually want to evaluate the RMSE on holdout validation and test data sources; lower values are better.
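
Both measures are simple to compute by hand. A short sketch continuing from the least squares example above, with n observations and k inputs:

n, k = len(y), X.shape[1]
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes the model for extra parameters
rmse = np.sqrt(ss_res / n)                      # lower is better, ideally measured on holdout data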

 

Data scientists almost always use stepwise regression selection to fit a subset of the full regression model. Remember that the key goal of predictive modeling is to build a parsimonious model that generalizes well on unseen data. The three stepwise regression methods are as follows:

  • Forward selection enters inputs one at a time until no more significant variables can be entered.
  • Backward elimination removes inputs one at a time until there are no more nonsignificant inputs to remove.
  • Stepwise selection is a combination of forward selection and backward elimination.

 

Stepwise selection has plenty of critics, but it is reasonable enough as long as you are not trying to closely evaluate the p-values or the parameter estimates. Most software packages include penalty functions, such as Akaike’s information criterion (AIC) or the Bayesian information criterion (BIC), to choose the best subset of predictors.

 

All-possible-subsets regression routines are also commonly supported in data mining toolkits. These methods should become more computationally feasible for big data as high-performance analytical computing appliances continue to get more powerful.

 

Shrinkage estimators such as the least absolute shrinkage and selection operator (LASSO)  are preferred over true stepwise selection methods. They use information from the full model to provide a hybrid estimate of the regression parameters by shrinking the full model estimates toward the candidate submodel.
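
A minimal scikit-learn sketch of the LASSO as a shrinkage-based alternative to stepwise selection; the data is toy data, and the alpha value (which controls how hard the coefficients are shrunk toward zero) is arbitrary:

import numpy as np
from sklearn.linear_model import Lasso

X = np.random.rand(100, 10)                      # ten candidate inputs, only two actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.05).fit(X, y)
print(lasso.coef_)                               # several coefficients are shrunk exactly to zero,
                                                 # which drops those inputs from the model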

 

Multicollinearity occurs when one input is relatively highly correlated with at least one other input. It is not uncommon in data mining and is not a concern when the goal is prediction.

 

Multicollinearity tends to inflate the standard errors of the parameter estimates, and in some cases, the sign of the coefficient can switch from what you expect.

 

In other cases, the coefficients can even be doubled or halved. If your goal is model interpretability, then you want to detect collinearity by using measures such as tolerances and variance inflation factors.

 

At a minimum, you should evaluate a correlation matrix of the candidate inputs and choose one input over another correlated input based on your business or research knowledge. Other strategies for handling correlated inputs include centering the data or redefining the suspect variable (which is not always possible).

 

You can also generate principal components, which are orthogonal transformations of the original (possibly correlated) inputs that capture p% of the variance of the predictors. The components are weighted linear combinations of the original inputs. The uncorrelated principal components are used as inputs to the regression model.

 

The first principal component explains the largest amount of variability. The second principal component is orthogonal to the first. You can select a subset of the components that describe a specific percentage of variability in the predictors (say 85%). Keep in mind that principal components handle continuous inputs.

 

Regression also requires a complete case analysis, so it does not directly handle missing data. If one or more inputs for a single observation have missing values, then this observation is discarded from the analysis. You can replace (impute) missing values with the mean, median, or other measures.

 

You can also fit a model using the input as the target and the remaining inputs as predictors to impute missing values. Software packages also support creating a missing indicator for each input, where the missing indicator is 1 when the corresponding input is missing, and 0 otherwise.

 

The missing indicator variables are used as inputs to the model. Missing value trends across customers can themselves be predictive.
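
A brief pandas sketch of mean imputation plus a missing indicator; the column name is a placeholder:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000, np.nan, 72000, np.nan]})
df["income_missing"] = df["income"].isna().astype(int)    # 1 when the original input was missing
df["income"] = df["income"].fillna(df["income"].mean())   # replace missing values with the mean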

 

  • Multiple linear regression is predominantly used for continuous targets.
  • One of the best references on regression modeling is Rawlings (1988).
  • Other important topics it addresses include residual diagnostics and outliers.

 

Logistic Regression

Logistic regression is a form of regression analysis in which the target variable (response variable) is categorical. It is the algorithm in data mining that is most widely used to predict the probability that an event of interest will occur.

 

Logistic regression can be used to estimate fraud, bad credit status, purchase propensity, part failure status, churn, disease incidence, and many other binary target outcomes. Multinomial logistic regression supports more than two discrete categorical target outcomes.
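
A hedged scikit-learn sketch of estimating event probabilities with logistic regression; the inputs and the labels are made up for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(500, 4)                 # e.g. tenure, spend, complaints, usage (toy values)
y = np.random.randint(0, 2, 500)           # 1 = event of interest occurred, 0 = it did not

clf = LogisticRegression().fit(X, y)
event_prob = clf.predict_proba(X)[:, 1]    # estimated probability that the event occurs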

 

Decision Trees


A decision tree is another type of analytic approach developed independently in the statistics and artificial intelligence communities. The tree represents a segmentation of the data that is created by applying a series of simple rules.

 

Each rule assigns an observation to a segment based on the value of one input. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.

 

The original segment contains the entire data set and is called the root node of the tree. A node with all its successors forms a branch of the node that created it. The final nodes are called leaves.

 

For each leaf, a decision is made and applied to all observations in the leaf. The type of decision depends on the context. In predictive modeling, the decision is simply the predicted value or the majority class value.

 

The decision tree partitions the data by recursively searching for candidate input variable thresholds at which to split a node and choosing the input and split point that lead to the greatest improvement in the prediction.
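
A small scikit-learn sketch of this recursive partitioning; the data is toy data, and max_depth simply limits how deep the hierarchy of segments grows:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(300, 2)
y = (X[:, 0] > 0.5).astype(int)              # toy target driven mostly by the first input

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))                     # the rules: each leaf is a segment with its own decision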

 

Classification and regression trees, chi-square automatic interaction detector, and C5.0 are the most well-known decision tree algorithms. Each of these algorithms supports one of several splitting criteria, such as the following:

 

  • Variance reduction for interval targets (CART™)
  • F-test for interval targets (CHAID)
  • Gini or entropy reduction (information gain) for a categorical target (CART™ and C5.0)
  • Chi-square for nominal targets (CHAID)

 

Machine Learning


Machine learning (ML) algorithms are quantitative techniques used for applications that are focused on classification, clustering, and prediction and are generally used for large datasets.

 

Machine learning algorithms also focus on automation, especially the newer methods that handle the data enrichment and variable selection layer. The algorithms commonly have built-in data handling features such as treating missing values, binning features, and preselecting variables.

 

The term data mining has been around for at least two decades. Data mining is the application of statistics, machine learning, artificial intelligence, optimization, and other analytical disciplines to actual research or commercial problems. Many ML algorithms draw heavily from statistical learning research.

 

Characterizing the distribution of variables and the errors from models was central in the works of Fisher, Karl Pearson, and the other seminal thinkers of statistics.

 

Neural Networks


Artificial neural networks were originally developed by researchers who were trying to mimic the neurophysiology of the human brain. By combining many simple computing elements (neurons or units) into a highly interconnected system, these researchers hoped to produce complex phenomena such as intelligence.

 

Although there is controversy about whether neural networks are really intelligent, they are without question very powerful at detecting complex nonlinear relationships in high-dimensional data.

 

The term network refers to the connection of basic building blocks called neurons. The input units contain the values of the predictor variables. The hidden units do the internal, often highly flexible non-linear computations. The output units compute predictions and compare these with the values of the target.

 

A very simple network has one input layer that is connected to a hidden unit, which is then connected to an output unit.

 

You can design very complex networks—software packages naturally make this a lot easier—that contain perhaps several hundred hidden units. You can define hidden layers that enable you to specify different types of transformations.

 

The primary advantage of neural networks is that they are extremely powerful at modeling nonlinear trends. They are also useful when the relationship among the input variables (including interactions) is vaguely understood.

 

You can also use neural networks to fit ordinary least squares and logistic regression models in their simplest form.
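
A minimal scikit-learn sketch of a network with a single hidden layer; the data, layer size, and iteration limit are arbitrary choices:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(400, 3)
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)   # a nonlinear toy relationship

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)
print(net.score(X, y))                       # the hidden units let it capture the nonlinearity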

 

Problems with Data Content


Duplicate Entries

You should always check for duplicate entries in a dataset. Sometimes, they are important in some real‐world way. In those cases, you usually want to condense them into one entry, adding an additional column that indicates how many unique entries there were.

 

In other cases, the duplication is purely a result of how the data was generated. For example, the dataset might have been derived by selecting several columns from a larger dataset, where there are no duplicates once you count the other columns.
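
A short pandas sketch of condensing duplicates into one row plus a count column; the column names are placeholders:

import pandas as pd

df = pd.DataFrame({"user": ["a", "a", "b"], "page": ["/home", "/home", "/buy"]})
deduped = (df.groupby(["user", "page"])
             .size()
             .reset_index(name="n_entries"))   # one row per unique entry, plus how many times it appeared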

 

Multiple Entries for a Single Entity

This case is a little more interesting than duplicate entries. Often, each real-world entity logically corresponds to one row in the dataset, but some entities are repeated multiple times with different data. The most common cause of this is that some of the entries are out of date, and only one row is currently correct.

 

In other cases, there actually should be duplicate entries. For example, each “entity” might be a power generator with several identical motors in it. Each motor could give its own status report, and all of them will be present in the data with the same serial number.

 

Another field in the data might tell you which motor is actually which. In the cases where the motor isn’t specified in a data field, the different rows will often come in a fixed order.

 

Another case where there can be multiple entries is if, for some reason, the same entity is occasionally processed twice by whatever gathered the data. This happens in many manufacturing settings because they will retool broken components and send them through the assembly line multiple times rather than scrap them outright.

 

Missing Entries

Missing Entries

Most of the time when some entities are not described in a dataset, they have some common characteristics that kept them out. For example, let’s say that there is a log of all transactions from the past year. We group the transactions by customer and add up the size of the transactions for each customer.

 

This dataset will have only one row per customer, but any customer who had no transactions in the past year will be left out entirely. In a case such as this, you can join the derived data up against some known set of all customers and fill in the appropriate values for the ones who were missing.
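
As a rough sketch of that join (the table names and columns here are made up for illustration), you can left-join the aggregated data onto the full customer list and fill the gaps with zero:

import pandas as pd

# Hypothetical inputs: every known customer, plus totals derived from the log
all_customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
totals = pd.DataFrame({"customer_id": [1, 3],
                       "total_spend": [250.0, 80.0]})

# Customers with no transactions get NaN from the join; fill those with 0
merged = all_customers.merge(totals, on="customer_id", how="left")
merged["total_spend"] = merged["total_spend"].fillna(0.0)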

 

In other cases, missing data arises because data was never gathered in the first place for some entities. For example, maybe two factories produce a particular product, but only one of them gathers this particular data about them.

 

NULLs

NULL entries typically mean that we don’t know a particular piece of information about some entity. The question is: why?

Most simply, NULLs can arise because the data collection process was botched in some way. What this means depends on the context.

 

When it comes time to do analytics, NULLs cannot be processed by many algorithms. In these cases, it is often necessary to replace the missing values with some reasonable proxy. What you will see most often is that it is guessed from other data fields, or you simply plug in the mean of all the non‐null values.
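
For example, here is a minimal Pandas sketch (with a made-up column) of the mean-imputation approach, along with a flag recording which values were originally missing:

import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [2.0, np.nan, 3.5, np.nan]})  # hypothetical column

# Crude but common: plug in the mean of the non-null values
df["weight_filled"] = df["weight"].fillna(df["weight"].mean())

# Keeping a record of which entries were NULL is often useful in its own right
df["weight_was_null"] = df["weight"].isnull()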

 

In other cases, the NULL values arise because that data was never collected. For example, some measurements might be taken at one factory that produces widgets but not at another. The table of all collected data for all widgets will then contain NULLs for whenever the widget’s factory didn’t collect that data.

 

For this reason, whether a variable is NULL can sometimes be a very powerful feature. The factory that produced the widget is, after all, potentially a very important determinant for whatever it is you want to predict, independent of whatever other data you gathered.

 

Huge Outliers

Huge Outliers

Sometimes, a massive outlier in the data is there because there was truly an aberrant event. How to deal with that depends on the context.

 

Sometimes, the outliers should be filtered out of the dataset. In web traffic, for example, you are usually interested in predicting page views by humans. A huge spike in recorded traffic is likely to come from a bot attack, rather than any activities of humans.

 

In other cases, outliers just mean missing data. Some storage systems don’t allow the explicit concept of a NULL value, so there is some predetermined value that signifies missing data. If many entries have identical, seemingly arbitrary values, then this might be what’s happening.
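
If you do identify such a sentinel value, one reasonable fix is to convert it into a proper missing value. A quick sketch, assuming the (made-up) sentinel is -9999:

import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": [71.2, 69.8, -9999.0, 70.5]})

# Treat the arbitrary placeholder as a genuine missing value
df["temperature"] = df["temperature"].replace(-9999.0, np.nan)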

 

Out‐of‐Date Data

In many databases, every row has a timestamp for when it was entered. When an entry is updated, it is not replaced in the dataset; instead, a new row is put in that has an up‐to‐date timestamp.

 

For this reason, many datasets include entries that are no longer accurate and only useful if you are trying to reconstruct the history of the database.

 

Artificial Entries

Artificial Entries

Many industrial datasets have artificial entries that have been deliberately inserted into the real data. This is usually done for purposes of testing the software systems that process the data.

 

Irregular Spacings

Many datasets include measurements taken at regular spacings. For example, you could have the traffic to a website every hour or the temperature of a physical object measured at every inch. Most of the algorithms that process data such as this assume that the data points are equally spaced, which presents a major problem when they are irregular.

 

If the data is from sensors measuring something such as temperature, then typically you have to use interpolation techniques to generate new values at a set of equally spaced points.
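
A minimal sketch of linear interpolation with NumPy, using made-up sensor readings, looks like this:

import numpy as np

# Hypothetical readings taken at irregular times (in seconds)
times = np.array([0.0, 1.1, 2.7, 6.0])
temps = np.array([70.1, 70.4, 71.0, 72.3])

# Resample onto an evenly spaced grid by linear interpolation
even_times = np.arange(0.0, 6.0, 0.5)
even_temps = np.interp(even_times, times, temps)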

 

A special case of irregular spacings happens when two entries have identical timestamps but different numbers. This usually happens because the timestamps are only recorded to finite precision.

 

If two measurements happen within the same minute, and time is only recorded up to the minute, then their timestamps will be identical.

 

Formatting Issues

Formatting Is Irregular between Different Tables/Columns

This happens a lot, typically because of how the data was stored in the first place. It is an especially big issue when joinable/groupable keys are irregularly formatted between different datasets.

 

Extra Whitespace

For such a small issue, it is almost comical how often stray whitespace confounds analyses when people try to, say, join the identifier "ABC " (with a trailing space) from one dataset against "ABC" from another. Whitespace is especially insidious because when you print the data to the screen to examine it, the whitespace might be impossible to discern.

 

In Python, every string object has a strip() method that removes whitespace from the front and end of a string. The methods lstrip() and rstrip() will remove whitespace only from the front and end, respectively. If you pass a character as an argument into the strip functions, only that character will be stripped. For example,

"ABC\t".strip()
'ABC'
" ABC\t".lstrip() 'ABC\t'
" ABC\t".rstrip() ' ABC'
"ABC".strip("C")
'AB'

 

Irregular Capitalization

Python strings have lower() and upper() methods, which return a copy of the original string with all letters set to lowercase or uppercase, respectively.

 

Inconsistent Delimiters

Usually, a dataset will have a single delimiter, but sometimes, different tables will use different ones. The most common delimiters you will see are as follows: Commas, Tabs, Pipes (the vertical line “|”).

 

Irregular NULL Format

There are a number of different ways that missing entries are encoded into CSV files, and they should all be interpreted as NULLs when the data is read in. Some popular examples are the empty string “”, “NA,” and “NULL.” Occasionally, you will see others such as “unavailable” or “unknown” as well.
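
Pandas lets you declare these markers up front when reading the file; anything in the list gets turned into a proper missing value. The filename and the exact list of markers below are just for illustration:

import pandas as pd

df = pd.read_csv("myfile.csv",
                 na_values=["", "NA", "NULL", "unavailable", "unknown"])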

 

Invalid Characters

Some data files will randomly have invalid bytes in the middle of them. Some programs will throw an error if you try to open up anything that isn’t valid text. In these cases, you may have to filter out the invalid bytes.

 

The following Python code will create a string called s, which is not validly formatted text. The decode() method takes in two arguments. The first is the text format that the string should be coerced into (there are several, which I will discuss later in the blog on file formats).

 

The second is what should be done when such coercion isn’t possible; saying “ignore” means that invalid characters simply get dropped.

s = "abc\xFF"
print s # Note how last character isn’t a letter abc□
s.decode("ascii", "ignore")
u'abc'
Weird or Incompatible Datetimes

Datetimes are one of the most frequently mangled types of data field. Some of the date formats you will see are as follows:

August 1, 2013
AUG 1, '13
2013-08-13

 

There is an important way that dates and times are different from other formatting issues. Most of the time you have two different ways of expressing the same information, and a perfect translation is possible from the one to the other. But with dates and times, the information content itself can be different.

 

For example, you might have just the date, or there could also be a time associated with it. If there is a time, does it go out to the minute, hour, second, or something else? What about time zones?

 

Most scripting languages include some kind of built‐in DateTime data structure, which lets you specify any of these different parameters (and uses reasonable defaults if you don’t specify).

 

Generally speaking, the best way to approach DateTime data is to get it into the built‐in data types as quickly as possible, so that you can stop worrying about string formatting.

 

The easiest way to parse dates in Python is with a package called dateutil, which works as follows:

import dateutil.parser as p

p.parse("August 13, 1985")
# datetime.datetime(1985, 8, 13, 0, 0)

p.parse("2013-8-13")
# datetime.datetime(2013, 8, 13, 0, 0)

p.parse("2013-8-13 4:15am")
# datetime.datetime(2013, 8, 13, 4, 15)

 

It takes in a string, uses some reasonable rules to determine how that string is encoding dates and times and coerces it into the DateTime data type. Note that it rounds down – August 13th becomes 12:00 AM on August 13th, and so on.

 

Operating System Incompatibilities

operating systems

Different operating systems have different file conventions, and sometimes, that is a problem when opening a file that was generated on one OS on a computer that runs a different one.

 

Probably, the most notable place where this occurs is newlines in text files. In Mac and Linux, a newline is conventionally denoted by the single character “\n.” On Windows, it is often two characters “\r\n.” Many data processing tools check what operating system they are being run on so that they know which convention to use.

 

Wrong Software Versions

Sometimes, you will have a file of a format that is designed to be handled by a specific software package. However, when you try to open it, a very mystifying error is thrown. This happens, for example, with data compression formats.

 

Oftentimes the culprit ends up being that the file was originally generated with one version of the software. However, the software has changed in the meantime, and you are now trying to open the file with a different version.

 

Example Formatting Script

Formatting Script

The following script illustrates how you can use hacked‐together string formatting to clean up disgusting data and load it into a Pandas DataFrame. Let’s say we have the following data in a file:

Name|Age|Birthdate

Ms. Janice Joplin|65|January 19, 1943

Bob Dylan |74 Years| may 24 1941

Billy Ray Joel|66yo|Feb. 9, 1941

 

It’s clear to a human looking at the data what it’s supposed to mean, but it’s the kind of thing that might be terrible if you opened it with a CSV file reader.

 

The following code will take care of the pathologies and make things more explicit. It’s not exactly pretty or efficient, but it gets the job done, it’s easy to understand, and it would be easy to modify if it needed changing:

def get_first_last_name(s):
    INVALID_NAME_PARTS = ["mr", "ms", "mrs",
                          "dr", "jr", "sir"]
    parts = s.lower().replace(".", "").strip().split()
    parts = [p for p in parts
             if p not in INVALID_NAME_PARTS]
    if len(parts) == 0:
        raise ValueError(
            "Name %s is formatted wrong" % s)
    first, last = parts[0], parts[-1]
    first = first[0].upper() + first[1:]
    last = last[0].upper() + last[1:]
    return first, last

def format_age(s):
    chars = list(s)  # list of characters
    digit_chars = [c for c in chars if c.isdigit()]
    return int("".join(digit_chars))

def format_date(s):
    MONTH_MAP = {
        "jan": "01", "feb": "02", "may": "05"}
    s = s.strip().lower().replace(",", "")
    m, d, y = s.split()
    if len(y) == 2: y = "19" + y
    if len(d) == 1: d = "0" + d
    return y + "-" + MONTH_MAP[m[:3]] + "-" + d

import pandas as pd
df = pd.read_csv("file.tsv", sep="|")
df["First Name"] = df["Name"].apply(
    lambda s: get_first_last_name(s)[0])
df["Last Name"] = df["Name"].apply(
    lambda s: get_first_last_name(s)[1])
df["Age"] = df["Age"].apply(format_age)
df["Birthdate"] = pd.to_datetime(
    df["Birthdate"].apply(format_date))
print df

 

Visualizations and Simple Metrics

Visualizations

A rule of thumb for data science deliverables is this: if there isn’t a picture, then you’re doing it wrong.

Typically, a good analytics project starts (after cleaning and understanding the data) with exploratory visualizations that help you develop hypotheses and get a feel for the data, and it ends with carefully manicured figures that make the final results visually obvious. The actual number crunching is hidden in the middle, sometimes almost as an aside.

 

I’ve had a number of projects where there was never even any actual machine learning: people needed to know whether there was a signal in the data and which directions were most promising for further work (which would potentially include machine learning), and graphics showed that more clearly than a number ever could.

 

This fact is very underappreciated outside of the data analysis community. Many people think of data scientists as numerical badasses, working black magic from a command line.

 

But that’s just not the way the human brain processes data, generates hypotheses, or develops familiarity with an area. Pictures are plans A–C for everything except the last stages of statistically validating results.

 

I’ve often joked that if humans were able to visualize things in a thousand dimensions, then my job as a data scientist would consist entirely of generating and looking at scatterplots.

 

This blog will take you through several of the most important visualizations. You’ve probably seen most of this before, but it’s always good to revisit the basics.

 

We will also cover some exploratory metrics (such as correlations), which capture, in crude numerical form, some of the patterns that are clear from a good visual.

 

There are many techniques not covered in this blog, and you would do well to learn them. However, my experience is that these core ones will cover most of your needs.

 

I strongly recommend memorizing the syntax for basic visualizations in your programming language of choice. In exploratory analysis especially, it’s useful to be able to chug through various ways of visualizing your data without needing to consult a reference on the syntax.

 

There are, however, still times when we need a number. There are two reasons for this:

Our eyes can trick us, so it’s important to have a cold hard statistic too.

 

Often, you don’t have time to sift through every possible picture, and you need some way to put a number on it so that the computer can make decisions of some sort automatically (even if the decision is only which pictures are worth your time to look at).

 

Besides visualization techniques, this blog will cover some standard statistical metrics that strive to capture, in numerical form, some of the meaning that you can get out of a picture.

 

A Note on Python’s Visualization Tools

Visualization Tools

The main visualization tool for Python is a library called matplotlib. While matplotlib is powerful and flexible, it is probably the weakest link in Python’s technology stack.

 

The graphs can be a bit cartoonish, in some ways the syntax is nonintuitive, and the interactivity (zooming in, etc.) leaves something to be desired. Most of the appearance issues can be fixed by tweaking a graphic’s configuration, but the default settings are not great.

 

I’m sticking with matplotlib for this blog because it is by far the most standard tool, it is sufficient for most data science (especially if you learn some of the ways you can make the plots look prettier), and it integrates well with the other libraries.

 

But there are other libraries out there that are gaining ground, especially browser‐based ones such as Bokeh.

 

Example code in this blog will use Pandas whenever possible. However, Pandas’ visualizations are a wrapper around matplotlib, and sometimes, we have to use matplotlib directly.

 

Typically, you make an image by calling the plot() method on a Pandas object, and Pandas does all the image formatting under the hood. Then you use matplotlib’s pyplot module for things such as setting the title and the final act of either displaying the image or saving it to a file.
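
As a small sketch of that division of labor (the dataframe and column name here are hypothetical), the plotting call comes from Pandas while the labeling and saving come from pyplot:

from matplotlib import pyplot as plt
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3, 5.8]})  # stand-in data

df["sepal_length"].plot(kind="hist")     # Pandas draws the figure
plt.title("Sepal length distribution")   # pyplot handles titles, etc.
plt.savefig("sepal_length.png")          # or plt.show() to display it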

 

Example Code

Example Code

To illustrate the visualization techniques we discuss in this blog, we will apply them to the famous Iris dataset, which you may have seen in a statistics textbook.

 

It describes physical measurements taken of flower specimens, drawn from three different species of iris. There are 150 data points, 50 from each species, and each data point gives the length and width of the petals and sepals.

 

The following code sets the stage for all of the example code in this blog. It imports the relevant libraries and creates a DataFrame containing the sample dataset (which comes built‐in to scikit‐learn):

import pandas as pd
from matplotlib import pyplot as plt
import sklearn.datasets

def get_iris_df():
    ds = sklearn.datasets.load_iris()
    df = pd.DataFrame(ds['data'],
        columns=ds['feature_names'])
    code_species_map = dict(zip(
        range(3), ds['target_names']))
    df['species'] = [code_species_map[c]
        for c in ds['target']]
    return df

df = get_iris_df()

 

Interlude: Feature Extraction Ideas

Before we jump into specific machine learning techniques, I want to come back to feature extraction. A machine learning analysis will be only as good as the features that you plug into it.

 

The best features are the ones that carefully reflect the thing you are studying, so you’re likely going to have to bring a lot of domain expertise to your problems.

 

However, I can give some of the “usual suspects”: classical ways to extract features from data that apply in a wide range of contexts and are at the very least worth taking a look at. This interlude will go over several of them and lead to some discussion about applying them in real contexts.

 

Standard Features

Here are several types of feature extraction that are real classics, along with some of the real‐world considerations of using them:

 

Is_null: One of the simplest, and surprisingly effective, features is just whether the original data entry is missing. This is often because the entry is null for an important reason.

 

For example, maybe some data wasn’t gathered for widgets produced by a particular factory. Or, with humans, maybe demographic data is missing because some demographic groups are less likely to report it.
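
Extracting this feature is a one-liner in Pandas; the column name here is made up:

import pandas as pd

df = pd.DataFrame({"income": [52000.0, None, 48000.0]})

# 1 if the raw entry was missing, 0 otherwise
df["income_is_null"] = df["income"].isnull().astype(int)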

 

Dummy variables: A categorical variable is one that can take on a finite number of values. A column for a US state, for example, has 50 possible values. A dummy variable is a binary variable that says whether the categorical column is a particular value.

 

Then you might have a binary column that says whether or not a state is Washington, another column that says whether it is Texas, and so on.

 

This is also called one‐hot encoding because every row in your dataset will have 1 in exactly one of the dummy variables for the states. There are two big issues to consider when using dummy variables:

 

You might have a LOT of categories, some of which are very rare. In this case, it’s typical to pick some threshold and only have dummy variables for the more common values, then have another dummy variable that will be 1 for anything else.

 

Often, you only learn what the possible values are by looking at training data, and then you will have to extract the same dummy features from other data (maybe testing data) later on. In that case, you will have to have some protocol for dealing with entries that were not present in the training data.
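
Here is a hedged Pandas sketch of both ideas: rare categories get lumped into an "OTHER" bucket before the dummies are created. The data, the threshold, and the bucket name are all arbitrary choices for illustration:

import pandas as pd

df = pd.DataFrame({"state": ["WA", "TX", "WA", "NY", "WY"]})

# Keep dummy columns only for values that appear at least twice
counts = df["state"].value_counts()
common = counts[counts >= 2].index
df["state_grouped"] = df["state"].where(df["state"].isin(common), "OTHER")

dummies = pd.get_dummies(df["state_grouped"], prefix="state")
df = pd.concat([df, dummies], axis=1)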

 

Ranks: A blunt‐force way to correct for outliers in a column of data is to sort the values and instead use their ordinal ranks. There are two big problems with this: It’s an expensive computation since the whole list must be sorted, and it cannot be done in parallel if your data is distributed across a cluster.

 

Ranks are a huge problem when it comes to testing/training data. If you rank all your points before dividing into training/testing, then information about the testing data will be implicit in the training data: a huge no‐no.

 

A workaround is to give each testing data point the rank that it would have had in the training data, but this is computationally expensive.
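
In Pandas, computing ranks for a single in-memory column is simple (the data here is made up); the caveats above are about doing it at scale and across train/test splits:

import pandas as pd

s = pd.Series([5.0, 1.0, 250.0, 3.0])   # note the large outlier

# The outlier 250.0 simply becomes rank 4
ranks = s.rank()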

 

Binning: Both of the problems associated with ranks can be addressed by choosing several histogram bins of roughly equal size that your data can be put into. You might have a bin for anything below the 25th percentile, another for the 25th to the 50th percentile, and so on.

 

Then rather than a percentile rank for your data points, just say which bin they fall into. The downside is that this takes away the fine resolution that you get with percentile ranks.
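
A quick sketch with Pandas, using made-up values and quartile bins:

import pandas as pd

s = pd.Series([5.0, 1.0, 250.0, 3.0, 7.0, 2.0, 9.0, 4.0])

# Four bins of roughly equal size, labeled 0 through 3
bins = pd.qcut(s, 4, labels=False)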

 

Logarithms: It is common to take the logarithm of a raw number and use that as a feature. It dampens down large outliers and increases the prominence of small values. If your data contains any 0s, it’s common to add 1 before taking the log.
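
NumPy’s log1p computes log(1 + x) directly, which handles the zeros; the values below are made up:

import numpy as np

x = np.array([0, 3, 10, 10000])   # hypothetical counts, including zeros

# log(1 + x) dampens the outlier and leaves the zeros well defined
log_x = np.log1p(x)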

 

Features That Involve Grouping

Grouping

Oftentimes, a dataset will include multiple rows for a single entity that we are describing. For example, our dataset might have one row per transaction and a column that says the customer we had the transaction with, but we are trying to extract features about the customers.

 

In these cases, we have to aggregate the various rows for a given customer in some way. Several brute‐force aggregate metrics you could use include the following (a Pandas sketch follows the list):

 

  • The number of rows
  • The min, max, mean, median, and so on, for a particular column
  • If a column is nonnumerical, the number of distinct entries that it contains
  • If a column is nonnumerical, the number of entries that were identical to the most common entry
  • The correlation between two different columns.
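
Here is the promised sketch, computing a few of these aggregates per customer from a made-up transaction table (the named-aggregation syntax assumes a reasonably recent version of Pandas):

import pandas as pd

# Hypothetical transaction log: one row per transaction
txns = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 25.0, 5.0, 5.0, 40.0],
    "store":    ["N", "S", "N", "N", "E"],
})

features = txns.groupby("customer").agg(
    n_rows=("amount", "size"),
    mean_amount=("amount", "mean"),
    max_amount=("amount", "max"),
    n_stores=("store", "nunique"),
)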

 

Preview of More Sophisticated Features

Later blogs will talk about fancier methods for feature extraction. Here is a quick list of some of the most interesting ones:

 

If your data point is an image, you can extract some measure of the degree to which it resembles some other image. The classical way to do this is called Principal Component Analysis (PCA). It also works for numerical arrays of time series data or sensor measurements.

 

You can cluster your data and use the cluster of each point as a categorical feature.

 

If the data is text, you can extract the frequency of each word. The problem with this is that it often gives you prohibitively many features, and some additional method may be required to condense them down.

 

Defining the Feature You Want to Predict

 Predict traffic

Finally, it’s worth noting that you may find yourself extracting the most important feature of all: the feature that you’re using machine learning to try and predict. To show you how this might work, here are several examples from my own career:

 

I had to predict human traffic to a collection of websites. However, the traffic logs were polluted by bots, and it became a separate problem to filter out the bot traffic and then estimate how well our cleaned traffic corresponded to flesh‐and‐blood humans.

 

We did this by comparing our traffic estimates to those from Google for a few select sites; sometimes we were over, and sometimes we were under, but we concluded that it was a good enough match to move forward with the project.

 

I have studied customer “churn,” that is when customers take their business elsewhere. Intuitively, the “ground truth” is a feeling of loyalty that exists in the minds of customers, and you have to figure out how to gauge that based on their purchasing behavior.

It’s hard to distinguish between churn and the customer just not needing your services for a time period.

 


When you are trying to predict events based on time series data, you often have to predict whether or not an event is imminent at some point in time. This requires deciding how far in the future an event can be before it is considered “imminent.”

 

Alternatively, you can have a continuous‐valued number that says how long until the next event, possibly having it top out at some maximum value so as to avoid outliers (or you could take the logarithm of the time until the next event – that would dampen outliers too).

 

Data Encodings and File Formats

Coming from a background of academic physics, my first years in data science were one big exercise in discovering new data formats that I probably should have already known about.

 

It was a bit demoralizing at the time, so let me make something clear up front: people are always dreaming up new data types and formats, and you will forever be playing catch‐up on them.

 

However, there are several formats that are common enough you should know them. It seems that every new format that comes out is easily understood as a variation of a previous format, so you’ll be on good footing going forward. There are also some broad principles that underlie all formats, and I hope to give you a flavor of them.

 

First, I will talk about specific file formats that you are likely to encounter as a data scientist. This will include sample code for parsing them, discussions about when they are useful, and some thoughts about the future of data formats.

 

For the second half of the blog, I will switch gears to a discussion of how data is laid out in the physical memory of a computer. This will involve peeking under the hood of the computer to look at performance considerations and give you a deeper understanding of the file formats we just discussed.

 

This section will come in handy when you are dealing with particularly gnarly data pathologies or writing code that aims for speed when you are chugging through a dataset.

 

Typical File Format Categories

File Format

There are many, many different specific file formats out there. However, they fall under several broad categories. This section will go over the most important ones for a data scientist. The list is not exhaustive, and neither are the categories mutually exclusive, but this should give you a broad lay of the land.

 

Text Files

Most raw data files seen by data scientists are, at least in my experience, text files. CSV files, JSON, XML, and web pages are all text formats, and pulls from databases, data from the web, and log files generated by machines are all typically text as well.

 

The advantage of a text file is that it is readable by a human being, meaning that it is very easy to write scripts that generate it or parse it. Text files work best for data with a relatively simple format.

 

There are limitations though. In particular, text is a notoriously inefficient way to store numbers. The string "938238234232425123" takes up 18 bytes, but the number it represents would be stored in memory as 8 bytes.

 

Not only is this a cost in storage, but the number must also be converted from text into a native binary format before a machine can operate on it.
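
You can see the size difference directly; the struct module packs the number into its native 8-byte binary form:

import struct

n = 938238234232425123
print(len(str(n)))               # 18 bytes when written out as text
print(len(struct.pack("q", n)))  # 8 bytes as a native 64-bit integer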

 

Dense Numerical Arrays

Dense Numerical Arrays

If you are storing large arrays of numbers, it is much more space and performance‐efficient to store them in something such as the native format that computers use for processing numbers.

 

Most image files or sound files consist mostly of dense arrays of numbers, packed adjacent to each other in memory. Many scientific datasets fall into this category too. In my experience, you don’t see these datasets as often in data science, but they do come up.

 

Program‐Specific Data Formats

Many computer programs have their own specialized file format. This category would include things such as Excel files, db files, and similar formats. Typically, you will need to look up a tool to open one of these files.

 

In my experience, opening them often takes a while computationally, since there are often a lot of bells and whistles built into the program that may or may not be present in this particular dataset.

 

This makes it a pain to reparse them every time you rerun your analysis scripts – often, it takes much longer than the actual analysis does. What I typically do is make CSV versions of them right up front and use those as the input to my analyses.

 

Compressed or Archived Data

Archived Data

Many data files, when stored in a particular format, take up a lot more space than the data they contain logically needs;

 

for example, if most lines in a large text file are exactly the same or a dense numerical array consists mostly of 0s. In these cases, we want to compress the large file into a smaller one, so that it can be stored and transferred more easily.

 

A related problem is when we have a large collection of files that we want to condense into a single file for easier management, often called data archiving. There are a variety of ways that we can encode the raw data into these more manageable forms.

 

There is a lot more to data compression than just reducing the size. A “perfect” algorithm would have the following properties:

It generally reduces the size of the data, easing storage requirements.

If it can’t compress the data much (or at all), then at least it doesn’t balloon it to take up much MORE space.

 

You can decompress it quickly. If you do this really well, it might take you less time to load the compressed data compared to the raw data itself, even with the decompression step. This is because decompression in RAM can be fairly quick, but it takes a long time to pull extra data off the disk.

 

You can decompress it “one line at a time,” rather than loading the entire file. This helps you deal with corrupt data and typically makes decompression go faster since you’re operating on less data at a time.

 

You can recompress it quickly.

In the real world, there is a wide range of compression algorithms available, which balance these interests in a lot of different ways. Compression becomes especially important in Big Data settings, where datasets are typically large and reloaded from disk every time the code runs.

 

CSV Files

CSV Files

CSV files are the workhorse data format for data science. “CSV” usually stands for “comma‐separated value,” but it really should be “character‐separated value” since characters other than commas do get used.

 

Sometimes, you will see “.tsv” if tabs are used or “.psv” if pipes (the “|” character) are used. More often though, in my experience, everything gets called CSV regardless of the delimiter.

 

CSV files are pretty straightforward conceptually – just a table with rows and columns. There are a few complications you should be aware of though:

 

Headers. Sometimes, the first line gives names for all the columns, and sometimes, it gets right into the data.

 

Quotes. In many files, the data elements are surrounded in quotes or another character. This is done largely so that commas (or whatever the delimiting character is) can be included in the data fields.

 

Nondata rows. In many file formats, the data itself is CSV, but there are a certain number of nondata lines at the beginning of the file. Typically, these encode metadata about the file and need to be stripped out when the file is loaded into a table.

 

Comments. Many CSV files will contain human‐readable comments, as source code does. Typically, these are denoted by a single character, such as the # in Python.

Blank lines. They happen.

Lines with the wrong number of columns. These happen too.

The following Python code shows how to read a basic CSV file into a data frame using Pandas:

import pandas

df = pandas.read_csv("myfile.csv")

If your CSV file has weird complexities associated with it, then read_csv has a number of optional arguments that let you deal with them. Here is a more complicated call to read_csv:

import pandas
df = pandas.read_csv("myfile.csv",
    sep="|",            # the delimiter. Default is the comma
    header=None,        # the file has no header row
    quotechar='"',
    compression="gzip",
    comment='#'
)

In my work, the optional arguments I use most are sep and header.

 

JSON Files

JSON is probably my single favorite data format, for its dirt simplicity and flexibility. It is a way to take hierarchical data structures and serialize them into a plain text format. Every JSON data structure is either of the following:

 

An atomic type, such as a number, a string, or a Boolean.

A JSONObject, which is just a map from strings to JSON data structures. This is similar to a Python dictionary, except that the keys in a JSONObject must be strings.

 

An array of JSON data structures. This is similar to a Python list. 

Here is an example of some valid JSON, which encodes a JSONObject map with a lot of substructures:

{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021‐3100"
},
"children":["alice","john",{"name":"alice","birth_order":2}],
"spouse": null
}

 

Note a few things about this example:

The fact that I’ve made it all pretty with the newlines and indentations is purely to make it easier to read. This could have all been on one long line and any JSON parser would parse it equally well. A lot of programs for viewing JSON will automatically format it in this more legible way.

 

The overall object is conceptually similar to a Python dictionary, where the keys are all strings and the values are JSON objects. The overall object could have been an array too though.

 

A difference between JSON objects and Python dictionaries is that all the field names have to be strings. In Python, the keys can be any hashable type.

 

The fields in the object can be ordered arrays, such as “children.” These arrays are analogous to Python lists.

You can mix and match types in the object, just as in Python.

You can have Boolean types. Note though that they are declared in lower case.

  • There are also numerical types.
  • A null value is supported.
  • You can nest objects arbitrarily deeply.

 

Parsing JSON is a cinch in Python. You can either “load” a JSON string into a Python object (a dictionary at the highest level, with JSON arrays mapping to Python lists, etc.) or “dump” a Python dictionary into a JSON string.

 

The JSON string can either be a Python string or be stored in a file, in which case you write from/to a file object. The code looks as follows:

>>> import json
>>> json_str = """{"name": "Field", "height": 6.0}"""
>>> my_obj = json.loads(json_str)
>>> my_obj
{u'name': u'Field', u'height': 6.0}
>>> str_again = json.dumps(my_obj)

 


Historically, JSON was invented as a way to serialize objects from the JavaScript language. Think of the keys in a JSONObject as the names of the members in an object. However, JSON does NOT support notions such as pointers, classes, and functions.

 

XML Files

XML is similar to JSON: a text‐based format that lets you store hierarchical data in a format that can be read by both humans and machines. However, it’s significantly more complicated than JSON – part of the reason that JSON has been eclipsing it as a data transfer standard on the web.

 

Let’s jump in with an example:

<GroupOfPeople>
<person gender="male">
<Name>Field Cady</Name>
<Profession>Data Scientist</Profession>
</person>
<person gender="female">
<Name>Ryna</Name>
<Profession>Engineer</Profession>
</person>
</GroupOfPeople>

 

Everything enclosed in angle brackets is called a “tag.” Every section of the document is bookended by a matching pair of tags, which tell you what type of section it is. The closing tag contains a slash “/” after the “<”.

 

The opening tag can contain other pieces of information about the section – in this case, “gender” is such an attribute. Because you can have whatever tag names or additional attributes you like, XML lends itself to making domain‐specific description languages.

 

XML sections must be fully nested into each other, so something such as the following is invalid:

<a><b></a></b>

because the “b” section begins in the middle of the “a” section but doesn’t end until the “a” is already over.

 

For this reason, it is conventional to think of an XML document as a tree structure. Every nonleaf node in the tree corresponds to a pair of opening/closing tags, of some type and possibly with some attributes, and the leaf nodes are the actual data.

 

Sometimes, we want the start and end tag of a section to be adjacent to each other. In this case, there is a little bit of syntactic sugar, where you put the closing “/” before the closing angle bracket. So,

<foo a="bar"></foo>
is equivalent to
<foo a="bar"/>

 

A big difference between JSON and XML is that the content in XML is ordered. Every node in the tree has its children in a particular order – the order in which they come in the document. They can be of any types and come in any order, but there is AN order.

 

Processing XML is a little more finicky than processing JSON, in my experience. This is for two reasons:

 

It’s easier to refer to a named field in a JSON object than to search through all the children of an XML node and find the one you’re looking for.

 

XML nodes often have additional attributes, which are handled separately from the node’s children.

This isn’t inherent to the data formats, but in practice, JSON tends to be used in small snippets, for smaller applications where the data has a regular structure.

 

So, you typically know exactly how to extract the data you’re looking for. In contrast, XML is liable to be a massive document with many parts, and you have to sift through the whole thing.

 

In Python, the XML library offers a variety of ways of processing XML data. The simplest is the ElementTree sublibrary, which gives us direct access to the parse tree of the XML.

 

It is shown in this code example, where we parse an XML string into a tree of Element objects, access and modify the data, and then re-encode it back into an XML string:

import xml.etree.ElementTree as ET
xml_str = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""
>>> root = ET.fromstring(xml_str)
>>> root.tag
'data'
>>> root[0]          # gives the zeroth child
<Element 'country' at 0x1092d4410>
>>> root.attrib      # dictionary of the node's attributes
{}
>>> root.getchildren()
[<Element 'country' at 0x1092d4410>, <Element 'country' at 0x1092d47d0>, <Element 'country' at 0x1092d4910>]
>>> del root[0]      # deletes the zeroth child from the tree
>>> modified_xml_str = ET.tostring(root)

 

The “right” way to manage XML data is called the “Document Object Model.” It is a little more standardized across programming languages and web browsers, but it is also more complicated to master. The ElementTree is fine for simple applications and capable of doing whatever you need it to do.

 

HTML Files

HTML Files

By far the most important variant of XML is HTML, the language for describing pages on the web. Practically speaking, the definition of “valid” HTML is that your web browser will parse it as intended.

 

There are differences between browsers, some intentional and some not, and that’s why the same page might look different in Chrome and Internet Explorer.

 

But browsers have largely converged on a standard version of HTML (the most recent official standard is HTML5), and to a first approximation, that standard is a variant of XML. Many web pages could be parsed with an XML parser library.

 

I mentioned in the last section that XML can be used to create domain‐specific languages, each of which is defined by its own set of valid tags and their associated attributes. This is the way HTML works. Some of the more notable tags are given in the following table:

 

  • <a>: Hyperlink. Example: Click <a href="www.google.com">here</a> to go to Google
  • <img>: Image. Example: <img src="smiley.gif">
  • <h1>–<h6>: Headings of text. Example: <h1>The Title</h1>
  • <div>: Division. It doesn’t get rendered but helps to organize the document. Often, the “class” attribute is used to associate the contents of the division with a desired style of text formatting. Example: <div class="main-text">My body of text</div>
  • <ul> and <li>: Unordered lists (usually rendered as bulleted lists) and list items. Example: Here is a list: <ul><li>Item One</li><li>Item Two</li></ul>

 

The practical problem with processing HTML data is that, unlike JSON or even XML, HTML documents tend to be extremely messy. They are often individually made, edited by humans, and tweaked until they look “just right.”

 

This means that there is almost no regularity in structure from one HTML document to the next, so the tools for processing HTML lean toward combing through the entire document to find what it is you’re looking for.

 

The default HTML tool for Python is the HTMLParser class, which you use by creating a subclass that inherits from it. An HTMLParser works by walking through the document, performing some action each time it hits a start or an end tag or another piece of text.

 

These actions will be user‐defined methods on the class, and they work by modifying the parser’s internal state.

 

When the parser has walked through the entire document, its internal state can be queried for whatever it is you were looking for. One very important note is that it’s up to the user to keep track of things such as how deeply nested you are within the document’s sections.

 

To illustrate, the following code will pull down the HTML for a Wikipedia page, step through its content, and count all hyperlinks that are embedded in the body of the text (i.e., they are within paragraph tags):

 

from HTMLParser import HTMLParser
import urllib

TOPIC = "Dangiwa_Umar"
url = "https://en.wikipedia.org/wiki/%s" % TOPIC

class LinkCountingParser(HTMLParser):
    in_paragraph = False
    link_count = 0
    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
        elif tag == 'a' and self.in_paragraph:
            self.link_count += 1
    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

html = urllib.urlopen(url).read()
parser = LinkCountingParser()
parser.feed(html)
print "there were", parser.link_count, \
    "links in the article"

 

Tar Files

Tar is the most popular example of an “archive file” format. The idea is to take an entire directory full of data, possibly including nested subdirectories, and combine it all into a single file that you can send in an e‐mail, store somewhere, or whatever you want.

 

There are a number of other archive file formats, such as ISO, but in my experience, tar is the most common example.

 

A similar idea shows up in the Java world: compiled Java classes are bundled into JAR files, which combine many class files (plus metadata) into a single archive. Strictly speaking, JAR files use the ZIP format rather than Tar, but the archiving idea is the same.

 

Tarring a directory doesn’t actually compress the data – it just combines the files into one file that takes up about as much space as the data did originally.

 

So in practice, Tar files are almost always then zipped. GZipping, in particular, is popular. The “.tgz” file extension is used as a shorthand for “.tar.gz”, that is, the directory has been put into a Tar file, which was then compressed using the GZIP algorithm.

 

Tar files are typically opened from the command line, such as the following:

$ # This will expand the contents of
$ # my_directory.tar into the local directory
$ tar -xvf my_directory.tar

$ # This command will untar and unzip
$ # a directory which has been tarred and gzipped
$ tar -zxf file.tar.gz

$ # This command will tar the Homework3 directory
$ # into the file ILoveHomework.tar
$ tar -cf ILoveHomework.tar Homework3
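
You can also create and inspect Tar files from Python with the standard tarfile module; the paths below are hypothetical:

import tarfile

# Pack a directory into a gzipped tar archive
with tarfile.open("archive.tar.gz", "w:gz") as tar:
    tar.add("my_directory")

# List the archive's contents without extracting anything
with tarfile.open("archive.tar.gz", "r:gz") as tar:
    print(tar.getnames())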

 

GZip Files

Gzip is the most common compression format that you will see on Unix‐like systems such as Mac and Linux. Often, it’s used in conjunction with Tar to archive the contents of an entire directory. Encoding data with gzip is comparatively slow, but the format has the following advantages:

  • It compresses data super well.
  • Data can be decompressed quickly.
  • It can also be decompressed one line at a time, in case you want to operate on only part of the data without decompressing the whole file.

 

Under the hood, gzip runs on a compression algorithm called DEFLATE. A compressed gzip file is broken into blocks. The first part of each block contains some data about the block, including how the rest of the block is encoded (it will be some type of Huffman code, but you don’t need to worry about the details of those).

 

Once the gzip program has parsed this header, it can read the rest of the block 1 byte at a time. This means there is minimal RAM being used up, so all the decompression can go on near the top of the RAM cache and hence proceed at breakneck speed.
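
From Python, the standard gzip module takes advantage of this: you can stream a compressed file one line at a time without ever holding the whole decompressed file in memory. The filename and the processing function below are stand-ins:

import gzip

with gzip.open("logs.txt.gz", "r") as f:
    for line in f:
        handle_line(line)   # stand-in for whatever processing you need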

 

The typical commands for gzipping/unzipping from the shell are simple:

$ gunzip myfile.txt.gz   # creates the raw file myfile.txt
$ gzip myfile.txt        # compresses the file into myfile.txt.gz

However, you can typically also just double‐click on a file – most operating systems can open gzip files natively.

 

Zip Files

Zip files are very similar to Gzip files. In fact, they even use the same DEFLATE algorithm under the hood! There are some differences though, such as the fact that ZIP can compress an entire directory rather than just individual files.

 

Zipping and unzipping files is as easy with ZIP as with GZIP:

$ # This puts several files into a single zip file
$ zip filename.zip input1.txt input2.txt resume.doc pic1.jpg

$ # This will open the zip file and put
$ # all of its contents into the current directory
$ unzip filename.zip
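
The same operations are available from Python via the standard zipfile module; the filenames here match the hypothetical ones above:

import zipfile

# Put several files into a single zip archive
with zipfile.ZipFile("filename.zip", "w") as z:
    z.write("input1.txt")
    z.write("resume.doc")

# Unpack everything into the current directory
with zipfile.ZipFile("filename.zip") as z:
    z.extractall()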

 

Image Files: Rasterized, Vectorized, and/or Compressed

Image files can be broken down into two broad categories: rasterized and vectorized. Rasterized files break an image down into an array of pixels and encode things such as the brightness or color of each individual pixel.

 

Sometimes, the image file will store the pixel array directly, and other times, it will store some compressed version of the pixel array. Almost all machine‐generated data will be rasterized.

 

Vectorized files, on the other hand, are a mathematical description of what the image should look like, complete with perfect circles, straight lines, and so on. They can be scaled to any size without losing resolution. Vectorized files are more likely to be company logos, animations, and similar things.

 

The most common vectorized image format you’re likely to run into is SVG, which is actually just an XML file under the hood (as I mentioned before, XML is great for domain‐specific languages!). However, in daily work as a data scientist, you’re most likely to encounter rasterized files.

 

A rasterized image is an array of pixels that, depending on the format, can be combined with metadata and then possibly subjected to some form of compression (sometimes using the DEFLATE algorithm, such as GZIP). There are several considerations that differentiate between the different formats available:

 

Lossy versus lossless. Many formats (such as BMP and PNG) encode the pixel array exactly – these are called lossless. But others (such as JPEG) allow you to reduce the size of the file by degrading the resolution of your image.

 

Grayscale versus RGB. If images are black‐and‐white, then you only need one number per pixel. But if you have a colored image, then there needs to be some way to specify the color. Typically, this is done by using RGB encoding, where a pixel is specified by how much red, how much green, and how much blue it contains.

 

Transparency. Many images allow pixels to be partly transparent. The “alpha” of a pixel ranges from 0 to 1, with 0 being completely transparent and 1 being completely opaque.

 

Some of the most important image formats you should be aware of are as follows:

 

JPEG. This is probably the single most important one in web traffic, prized for its ability to massively compress an image with almost invisible degradation. It is a lossy compression format, stores RGB colors, and does not allow for transparency.

 

PNG. This is maybe the next most ubiquitous format. It is lossless and allows for transparent pixels. Personally, I find the transparent pixels make PNG files super useful when I’m putting together slide decks.

 

TIFF. TIFF files are not common on the Internet, but they are a frequent format for storing high‐resolution pictures in the context of photography or science. They can be lossy or lossless.

 

The following Python code will read an image file. It takes care of any decompression or format‐specific stuff under the hood and returns the image as a NumPy array of integers.

 

It will be a three‐dimensional array, with the first two dimensions corresponding to the normal width and height.

 

The image is read in as RGB by default, and the third dimension of the array indicates whether we are measuring the red, green, or blue content. The integers themselves will range from 0 to 255 since each is encoded with a single byte.

from scipy.ndimage import imread
img = imread('mypic.jpg')

 

If you want to read the image as grayscale, you can pass mode="F" and get a two‐dimensional array. If you instead want to include the alpha opacity as a fourth value for each pixel, pass in mode="RGBA".
