Data Science Programming (Best Tutorial 2019)


Programming Languages for Data Science Tutorial 2019

There are many programming language options available for data scientists. This tutorial explains the most popular ones, such as Python, R, and Scala, with examples.



Most of the example code in this tutorial is in Python, for a number of reasons. In my opinion, it is the best programming language available for general-purpose use, but that’s largely a matter of personal taste.


It is also a very popular choice among data scientists, who feel like it balances the flexibility of a conventional scripting language with the numerical muscles of a good mathematics package (at least, when it’s paired with its scientific computing libraries).


Python was developed by Guido van Rossum and first released in 1991. The language itself is a high-level scripting language, with functionality similar to Perl and Ruby and with an unusually clean and self-consistent syntax.

Outside of the core language, Python has several open-source technical computing libraries that make it a powerful tool for analytics.



R

Aside from Python, R is probably the most popular programming language among data scientists. Python is a scripting language designed for computer programmers, which has been augmented with libraries for technical computing.


In contrast, R was designed by and for statisticians, and it is natively integrated with graphics capabilities and extensive statistical functions. It is based on S, which was developed at Bell Labs in 1976.


R was brilliant for its time and a huge step up from the Fortran routines that it was competing with. In fact, many of Python’s technical computing libraries are just ripping off the good ideas in R.


But almost 40 years later, R is showing its age. Specifically, there are areas where the syntax is very clunky, the support for strings is terrible, and the type system is antiquated.


In my mind, the main reason to use R is just that there are so many special libraries that have been written for it over the years, and Python has not covered all the little special use cases yet.


I no longer use R for my own work, but it is a major force in the data science community and will continue to be for the foreseeable future. In the statistics community, R is still the lingua franca. You should know about it, even if you don’t use it yourself.


MATLAB and Octave

The data science community skews strongly toward open-source software, so good proprietary programs such as MATLAB often get less credit than they deserve.


Developed and sold by the MathWorks Corporation, MATLAB is an excellent package for numerical computing. It has a more consistent (and, in my opinion, nicer) syntax compared to R and more numerical muscle compared to Python.


A lot of people coming from physics or mechanical/electrical engineering backgrounds are well-versed in MATLAB. It is not as well-suited to large software frameworks or string-based data munging, but it is best-in-class for numerical computing.


If you like MATLAB’s syntax, but don’t like paying for software, then you could also consider Octave. It is an open-source version of MATLAB. It doesn’t capture all of MATLAB’s functionality and certainly doesn’t have the same support infrastructure, but it’s a fine option.




SAS

SAS (Statistical Analysis Software) is a proprietary statistics framework that dates back to the 1960s. Similar to R, there is a tremendous amount of entrenched legacy code written in SAS and a wide range of functionality that has been built into it.


However, the language itself feels very alien to somebody used to modern languages. SAS can be great for the business statistics applications in which it is so popular, but I don’t recommend it for general-purpose data science.




Scala

Scala is an up-and-coming language that shows a lot of promise. It is not currently a suitable general-purpose tool for data scientists because it doesn’t have the library support for analytics and visualizations. However, that could easily change in the same way that it did with Python.


Scala is similar to Java under the hood but has a much simpler syntax with a lot of powerful features borrowed from other languages (especially functional languages). It works both for general-purpose scripting and for large-scale production software. Many of the most popular Big Data technologies are written in Scala.


Python Crash Course

This section will give a quick tutorial on the Python language. My goal is to get you up-and-running quickly with the basics of the language, especially so that you can understand the example code in the blog.


The tutorial is far from exhaustive. There are many aspects of Python that I don’t discuss, and in particular, I ignore most of its many built-in libraries. Some of this material will be covered later in the blog when it becomes relevant.


The next section will give you an introduction to Python’s technical libraries, which elevate it from a solid scripting language to a one-stop-shop for data science.


A Note on Versions

There are a number of versions of the Python language out there. As of this writing, the Python 2.7 series is by far the most popular among data scientists, mainly because all of the numerical computing libraries work with it.


In 2008, Python 3.0 was released, and it broke backward compatibility with the 2.x series. This was a big deal because the Python community tends to be very careful about keeping things mutually consistent.


This blog is written assuming Python 2.7, but most of what I say applies equally well to 3.x. Several of the key places where 3.x differs are as follows:

Print is treated as a function. So you would say

>>> print("hello world")

instead of

>>> print "hello world"

Division is true division even on integers, so 3/2 equals 1.5 rather than 1. If you want integer division, use the // operator.


The separate str and unicode classes are gone. All strings are Unicode now, and you can use the bytes type if you want to manipulate raw bytes directly.
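As a quick illustration of both changes (this snippet assumes it is run under a Python 3 interpreter):

```python
# Python 3 semantics: / is true division, // is integer division
print(3 / 2)    # 1.5
print(3 // 2)   # 1

# All strings are Unicode; raw bytes live in the bytes type
text = "hello"
raw = text.encode("utf-8")    # a bytes object
print(raw.decode("utf-8"))    # back to a str
```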

“Hello World” Script


A common way to learn a new programming language is to first write a “hello world!” program: this is a program that just prints the text “hello world!” to the screen.


If you can write it and run it, then you know you have your software environment set up correctly and you know how to use it. After that point, you’re ready to roll with the serious code.


There are two ways you can run Python code, and I’ll walk you through hello world in both of them. Either you can open up the Python interpreter and enter your commands one at a time, which is very useful for exploring data and experimenting with what you want to do, or you can put your code into a file and run it all at once.


To run code in the interpreter on a Mac or Linux system, do the following:


Go to the command terminal.

Type “python” and press enter.

This will display the command prompt >>>. 

Type print 'hello world!' and press enter.

The phrase "hello world!" should print on the screen.

The whole thing should appear as follows:

>>> print 'hello world!'
hello world!


Congratulations! You’ve just run a line of Python code.

Press Ctrl-d to close the interpreter.


The process is very similar if you are working in a Windows environment. In place of the command terminal, you are likely to use PowerShell, the Windows equivalent of a bash terminal.


For editing your source code, Visual Studio is a powerful IDE that is ubiquitous among Windows programmers. Personally, I tend to write my scripts in plain text editors if it’s at all practical, but especially for larger codebases, a good IDE becomes invaluable.


More Complicated Script


Ok, now that you’ve got Python running, let’s jump into the deep end. Here is a more complicated Python script. It has a data structure that describes the employees of a company. It goes through the employee records, gives each one a 5% raise, and updates the record with the name of the state they live in.


It then prints out information describing the updated employee data. Don’t worry if you can’t read the whole thing right now: I’ll explain what all the parts are. After we walk through this script, I’ll give a more comprehensive overview of Python’s data types and how to work with them; the script doesn’t show it all.

SALARY_RAISE_FACTOR = 0.05
STATE_CODE_MAP = {'WA': 'Washington', 'TX': 'Texas'}

def update_employee_record(rec):
    old_sal = rec['salary']
    new_sal = old_sal * (1 + SALARY_RAISE_FACTOR)
    rec['salary'] = new_sal
    state_code = rec['state_code']
    rec['state_name'] = STATE_CODE_MAP[state_code]

input_data = [
    {'employee_name': 'Susan', 'salary': 100000.0,
     'state_code': 'WA'},
    {'employee_name': 'Ellen', 'salary': 75000.0,
     'state_code': 'TX'}]

for rec in input_data:
    update_employee_record(rec)
    name = rec['employee_name']
    salary = rec['salary']
    state = rec['state_name']
    print name + ' now lives in ' + state
    print '    and makes $' + str(salary)

If you run this script, you will see the following output:

Susan now lives in Washington
    and makes $105000.0
Ellen now lives in Texas
    and makes $78750.0


The first line of the script defines the variable SALARY_RAISE_FACTOR to be the decimal number 0.05.


The next line defines what’s called a dict (short for dictionary) called STATE_CODE_MAP, which maps the postal abbreviations of several states to their full names. A dictionary maps “keys” to “values,” and it is enclosed within curly braces. There are commas between each key/value pair, and the key and value are separated by a colon.


The keys in a dict are usually strings, but they can also be numbers, tuples, or any other hashable type (more on hashability later). The values can be any Python object whatsoever, and different values can have different types.


But in this case, the values are all strings. Dicts are one of Python’s three main “container” data types (i.e., it contains other data), the other two being lists and tuples.


Next up, the line

def update_employee_record(rec):

says that we are defining a function called update_employee_record that takes in a single argument and that the argument is called rec within the scope of this function. In our code, rec will always be a dict, but we have not specified that in the function declaration.


You can pass an integer, string, or anything else into update_employee_record. In this case, it so happens that we later do operations to rec that will fail if it’s not a dictionary (or something that behaves like one), but Python won’t know anything is amiss until the operation fails.
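To see this in action, here is a small sketch (the function body is a stand-in for the real script’s logic): the bad argument is accepted, and nothing fails until the dict operation actually runs.

```python
def update_employee_record(rec):
    # Assumes rec behaves like a dict; Python checks nothing up front
    rec['salary'] = rec['salary'] * 1.05

update_employee_record({'salary': 100.0})   # works fine

try:
    update_employee_record(42)   # an int has no ['salary'] field
except TypeError:
    print("failed only when the dict operation ran")
```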


Here we come to the most famous gotcha of Python. The rest of the body of the function is all indented the same way: exactly four spaces. It could have been two spaces, or a tab, or any other whitespace combination, but it must be consistent.


Consistency such as this is good practice in any programming language since it makes the code easier to read, but Python requires it. This is the single most controversial thing about Python, and it can get confusing if you’re in a situation where tabs and spaces are getting mixed.


In the body of the function, when we say

old_sal = rec['salary']

we are pulling out the “salary” field in rec. This is how you get data out of a dict. By doing this, we are also tacitly assuming that there is a “salary” field: the code will throw an error if there isn’t one.

Later, when we say

rec['salary'] = new_sal

we are assigning to the “salary” field, creating the field if there isn’t one and overwriting it if there’s already one there.
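A quick sketch of both behaviors, including the KeyError you get when reading a field that doesn’t exist (the field names here are illustrative):

```python
rec = {'salary': 100000.0}

old_sal = rec['salary']        # read an existing field
rec['salary'] = 105000.0       # overwrite it
rec['state_name'] = 'Texas'    # create a brand-new field

try:
    rec['bonus']               # reading a missing field raises KeyError
except KeyError:
    print("no 'bonus' field in rec")
```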


The input_data variable is a list. Lists can contain elements of any type, but in this case, they are all dictionaries. Note that in this case, the values in the dictionaries are not all the same type: some are strings, but there is also a float field.


In the last part of the script, the line for rec in input_data: will loop over all of the elements of input_data in order, executing the body of the loop for each one. As with function definitions, the body of the loop must be indented consistently.


The print statement here deserves a special mention. When we say

print ' and makes $' + str(salary)

there are three things going on:

str(salary) takes the salary, which is a float like 75000.0, and returns a string like “75000.0”. str() is a function that takes most Python objects and returns a string representation of them.


Adding two strings with + just concatenates them. Adding a string to a float would have given an error, which is why we had to say str(salary). The print statement in Python is a little weird: most built-in functions in Python are called with parentheses, but print doesn’t use them. This rare inconsistency was remedied in Python 3.0.


Atomic Data Types

Python has five main atomic data types that you’ll have to worry about. If you have used a programming language in the past, they should mostly sound pretty familiar:

  • int: A mathematical integer
  • float: A floating-point number
  • bool: A true/false flag
  • string: A piece of text with arbitrarily many characters (possibly zero)
  • NoneType: A special type with only a single value, None. It is often used as a placeholder when data is missing or some process has failed.

Declaring a variable that is an int or a float is very straightforward:

my_integer = 2
my_other_integer = 2 + 3
my_float = 2.0
Boolean values are similarly uncomplicated:
my_true_bool = True
my_false_bool = False
this_is_true = (0 < 100)
this_is_false = (0 > 100)


NoneType is special. The only value it can take is called None, and this is often used as a placeholder when a variable should exist, but you don’t want it to have a meaningful value. Functions that fail in some way will also often return None to signify that there was a problem.
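For example, a simple search function (a hypothetical one, just for illustration) might return None when it finds nothing, and the caller checks for that with is None:

```python
def find_first_negative(numbers):
    # Return the first negative number, or None if there isn't one
    for x in numbers:
        if x < 0:
            return x
    return None

result = find_first_negative([3, 1, 4])
if result is None:
    print("no negative number found")
```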


Comments and Docstrings


There are two kinds of comments in Python:

Those denoted by a # character, such as this:

# This whole line is a comment
a = 5   # and the last part of this line is too

Strings that take up a line (or more) in your code but aren’t assigned to a variable.


It’s common practice to have a string at the beginning of a Python file that describes what the file does and how to use it. Such a string is called a docstring. If you import the file as a library, then that library will have a field called __doc__ that acts as built-in documentation.


These things come in handy! A function can also have a docstring, such as this:

def sqr(x):
    "This function just squares its input"
    return x * x


Complex Data Types

Python has three main data containers: lists, tuples, and dicts. There is also one called a set that you will use less often. Each of them contains other data structures, hence the name.


The first thing to know about containers in Python is that, unlike many other languages, you can mix and match the types. A list can consist entirely of ints. But it can also contain tuples, dictionaries, user-defined types, and even other lists.


All of Python’s container types are classes, in the object-oriented sense. However, they also all act as functions, which try to coerce their arguments into the appropriate type. For example,

my_list = ["a", "b", "c"]
my_set = set(my_list)
my_tuple = tuple(my_list)

will create a list and then create a set and a tuple that contain identical data.




Lists

A list is just what it sounds like: an ordered list of variables. The following code shows basic usage:

my_list = ["a", "b", "c"]
print my_list[0]     # prints "a"
my_list[0] = "A"     # changes that element of the list
my_list.append("d")  # adds a new element to the end
# List elements can be ANYTHING
mixed_list = ["A", 5.7, "B", [1,2,3]]


There is a special operation called a list comprehension, which lets us create one list from another by applying the same operation to all of its elements (and possibly filtering out some of those elements):

original_list = [1,2,3,4,5,6,7,8]
squares = [x*x for x in original_list]
squares_of_evens = [x*x for x in original_list
                    if x%2==0]


If you have not seen it before, there is one very important convention with list indexing that can be confusing at first: the first element in the list is element number 0.


The second is element number 1, and so on. There are reasons (some of them historical) for this convention, and if it mystifies you, you’re not alone. But you will have to get used to it with Python.


If you want to select a subset of a list, then you can do it with a colon:

my_list = ["a", "b", "c"]
first_two_elements = my_list[0:2]


The first number says which index to start at, and the second number says which index to stop just short of. If the first number is omitted, then you will start at the beginning. If the second is omitted, you will continue to the end. So we can say

my_list = ["a", "b", "c"]
first_two_elements = my_list[:2]
last_two_elements = my_list[1:]


You can also use negative indices. –1 will refer to the last element in the list, –2 to the one before, etc. So we can say

my_list = ["a", "b", "c"]

all_but_last_element = my_list[:-1]


Strings and Lists


For complex string manipulation, one of the most flexible methods you can call on a string is split(). It breaks the string on whitespace and returns the pieces as a list. Alternatively, you can pass another string as an argument, and it will split on that string instead. It works like this:

>>> "ABC DEF".split()
['ABC', 'DEF']
>>> "ABC \tDEF".split()
['ABC', 'DEF']
>>> "ABC \tDEF".split(' ')
['ABC', '\tDEF']
>>> "ABCABD".split("AB")
['', 'C', 'D']

The inverse of split() is the join() method. It is called on a string, and you pass in a list of other strings, which are all concatenated into one string using the original string as a delimiter. For example,

>>> ",".join(["A", "B", "C"])
'A,B,C'


You might have noticed that the syntax for selecting characters in a string is the same as that for selecting elements in a list.


In general, it is called “slice notation,” and it is possible to create other Python objects that use the same notation. Most generally, a slice takes in a start index, an end index, and how big the spacing should be. For example,

>>> start, end, count_by = 1, 7, 2
>>> "ABCDEFG"[start:end:count_by]
'BDF'




Tuples

A tuple is conceptually a list that cannot be modified (no changing the elements, no adding or removing elements). Having them may seem redundant, but tuples are much more efficient than lists in some cases, and they play a central role in the operation of Python under the hood.


There are also several things that, for technical reasons, you can do with tuples that you can’t with lists. The most obvious of these is that the keys in a dictionary cannot be lists, but they can be tuples.
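A quick sketch of a tuple working as a dictionary key where a list fails (the example data is made up):

```python
# Tuples are immutable, so they can be dictionary keys
coords_to_city = {(47.6, -122.3): "Seattle"}
print(coords_to_city[(47.6, -122.3)])

# Lists cannot be keys: this raises TypeError
try:
    bad = {[47.6, -122.3]: "Seattle"}
except TypeError:
    print("lists can't be dictionary keys")
```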

my_tuple = (1, 2, "hello world")

print my_tuple[0] # prints 1

my_tuple[1] = 5 # This will give an error!


There is one important piece of syntactic sugar to know that is often used with tuples. Oftentimes, we want to give names to the different fields in a tuple, and it is clunky to explicitly define a new variable for each of them. In these cases, we can do multiple assignments as follows:

my_tuple = (1, 2)
zeroth_field, first_field = my_tuple



Dictionaries

A dictionary is a structure that takes in a key and returns a value. The keys for a dictionary are usually strings, but they can also be any other atomic data type or tuples (but they can’t be lists or dictionaries).


The values can be anything at all – integers, other dictionaries, external libraries, etc. In defining a dictionary, you use curly braces, with a colon separating the key and its value:


my_dict = {"January": 1, "February": 2}
print my_dict["January"]   # prints 1
my_dict["March"] = 3       # add a new element
my_dict["January"] = "Start of the year"   # overwrite old value


As an interesting note, the Python language itself is largely built out of dictionaries (or slight variations of them). The namespace that stores all of your variables, for example, is a dictionary mapping the variables’ names to the objects themselves.
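You can peek at this yourself: the built-in globals() returns the actual dictionary that holds your top-level variables. A small sketch:

```python
my_variable = 42

# globals() is the dict mapping variable names to objects
namespace = globals()
print(namespace['my_variable'])

# Writing to that dict is the same as assigning a variable
namespace['another_variable'] = "hello"
print(another_variable)
```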


You can also create a dictionary by passing in a list of tuples to the function dict(), and you can create a list of tuples by calling the items() method on a dictionary:

pairs = [("one",1), ("two",2)]

as_dict = dict(pairs)

same_as_pairs = as_dict.items()




Sets

A set is somewhat similar to a dictionary with only keys and no values. It stores a collection of unique objects of atomic types. You can add new values to a set, which does nothing if the value is already in it, and you can query the set to see whether a value is in it. A simple shell session shows how this works:

>>> s = set()
>>> 5 in s
False
>>> s.add(5)
>>> 5 in s
True
>>> s.add(5)   # does nothing
Defining Functions
A function in Python is defined and called as follows:
def my_function(x):
    y = x + 1
    y_sqrd = y * y
    return y_sqrd

five_plus_one_sqrd = my_function(5)


This is a so-called “pure function,” meaning that it takes some input, returns an output, and does nothing else. A function can also have side effects, such as printing something to the screen or operating on a file.


In our example script earlier, modifying the input dictionary was a side effect. If no return value is specified, the function will return None.
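A minimal sketch of a side-effect-only function (the names are illustrative): it modifies its argument and, having no return statement, returns None.

```python
def give_raise(rec):
    # Side effect: modifies the dict passed in; no return statement
    rec['salary'] = rec['salary'] * 1.05

employee = {'salary': 100.0}
result = give_raise(employee)
print(employee['salary'])   # the dict was modified in place
print(result)               # prints None
```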


You can also define optional arguments in a function, using this syntax (note that raise itself is a reserved word in Python, so we call the function raise_to):

def raise_to(x, n=2):
    return pow(x, n)

two_sqrd = raise_to(2)
two_cubed = raise_to(2, n=3)
If the function you are defining contains only one line and has no side effects, you can also define it using a so-called lambda expression:

sqr = lambda x: x*x
five_sqrd = sqr(5)

Assigning a lambda expression to "sqr" is equivalent to the normal syntax for function definitions. The term “lambda” is a reference to the Lisp programming language, which defines functions using the “lambda” keyword in a similar way.


Lambda functions are mostly used if you’re passing a one-off function as an argument to another function, and there’s no need to pollute the namespace with a new function name. For example,

def apply_to_evens(a_list, a_func):
    return [a_func(x) for x in a_list if x%2==0]

my_list = [1,2,3,4,5]
sqrs_of_evens = apply_to_evens(my_list, lambda x: x*x)


Functions such as this, which are defined on the fly and never given an actual name, are called “anonymous functions.” They can be very handy in data science, especially in Big Data.
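Another common case is passing a lambda as the key argument to the built-in sorted(). A small sketch:

```python
words = ["banana", "fig", "apple"]

# Sort by length rather than alphabetically, using a throwaway lambda
by_length = sorted(words, key=lambda w: len(w))
print(by_length)   # ['fig', 'apple', 'banana']
```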


For Loops and Control Structures


The control structure you will use most in practice is looping over a list, as follows:

my_list = [1, 2, 3]
for x in my_list:
    print "the number is ", x


If you are iterating over a list of tuples (as you might if you’re working with a dictionary), you can use the shorthand tuple notation I mentioned previously:

for key, value in my_dict.items():
    print "the value for ", key, " is ", value


More generally, any data structure that allows for-loops such as this is called “iterable.” Lists are the most prominent iterable data type, but they are far from the only one.
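For instance, strings and dictionaries are both iterable:

```python
# Iterating over a string yields its characters one at a time
for ch in "abc":
    print(ch)

# Iterating over a dict yields its keys
my_dict = {"a": 1, "b": 2}
for key in my_dict:
    print(key)
```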


If statements are handled this way:

if i < 3:
    print "i is less than three"
elif i < 5:
    print "i is between 3 and 5"
else:
    print "i is 5 or greater"

You don’t see it that often in practice, but Python also allows for while-loops, like this:

i = 0
while i < 5:
    print "i is still less than five"
    i = i + 1


Exception Handling


If Python code fails, sometimes we want the script to be prepared for that and act accordingly (rather than just dying). That is done with a try/except block, illustrated here:

try:
    lines = input_text.split("\n")
    print "tenth line was: ", lines[9]
except IndexError:
    print "There were < 10 lines"



Libraries

To import functionality from an existing library, you use any of the following syntaxes:

from my_lib import f1, f2   # f1 & f2 in namespace
import other_lib as ol      # ol.f1 is the f1 func
from other_lib import *     # f1 is in namespace


Generally, the first and second methods of importing a library make for the most readable code; if you import * from several libraries and then call f1 later on in your code, it’s not obvious which library f1 came from.


To write your own library, just write a .py file in which your functions, classes, or other objects are defined. It can then be imported using the aforementioned syntax. Just make sure that your library is in the directory you are running your code from or in some other place that Python can find it.


Classes and Objects


Strictly speaking, everything in Python (and I mean everything – integers, functions, classes, imported libraries, etc.) is what’s called an object. However, most of the language is built around a few high-powered classes (such as lists and dictionaries) that do most of the heavy lifting, so it’s common to use Python only as a scripting language.


However, if you want to define your own classes, you can do it this way:

class Dog:
    def __init__(self, name):
        self.name = name
    def respond_to_command(self, command):
        if command == self.name:
            self.speak()
    def speak(self):
        print "bark bark!!"

fido = Dog("fido")
fido.respond_to_command("spot")   # does nothing
fido.respond_to_command("fido")   # prints bark bark


Here __init__ is a special function that gets called whenever an instance of the class is created. It does all of the initial setup required for the object.


The one thing that throws a lot of people off is the “self” keyword that gets passed in as the first argument of every function in the class. When I call fido.respond_to_command, the “self” argument in respond_to_command refers to fido himself, that is, the Dog object whose method is being called.


This allows us to refer specifically to fido’s data elements, such as self.name.

In many object-oriented languages, just saying “name” in respond_to_command would implicitly refer to fido’s name, but Python requires that it be explicit. The “self” keyword is similar to the keyword “this” that you will see in languages such as C++.




GOTCHA: Hashable and Unhashable Types


When I first started learning Python, there was one big gotcha that I ran into. It caused me a lot of grief for a few days as I tried to figure out why my code was failing, and I would like to spare you my pain. Python’s data types fall into two categories:


Hashable types. This includes ints, floats, strings, tuples, and a few more obscure ones. These are generally low-level data types, and instances of them are immutable.


Unhashable types. These include lists, dictionaries, and most other complex objects. Generally, unhashable types are for larger objects with an internal structure that can be modified.


The biggest difference between hashable and unhashable types is illustrated in this shell session:

a = 5        # a is a hashable int
b = a        # b points to a COPY of a
a = a + 1
print b      # prints 5; b has NOT been incremented

A = []       # A is an UNhashable list
B = A        # B points to the SAME list as A
A.append(5)
print B      # prints [5]; B reflects the change to A

When I say b = a, a copy of the hashable int is made in memory, and the variable name b is set to point to it. But when I’m using unhashable lists and say B = A, the variable B is set to point to the exact same list!


If I had truly wanted to make a copy of A, so that appending to A didn’t affect B, I could have said something like the following:

>>> B = [x for x in A]


which would have constructed a new list in memory. If A was a list of integers, then A and B would be incapable of stepping on each other’s toes: they would have their own separate copies of the numbers.


However, if the elements of A were themselves unhashable types, then B would be distinct from A, but they would be pointing to the same objects. For example,

A = [{}, {}]          # list of dicts
B = [x for x in A]    # a new list, but the same dict objects
A[0]["name"] = "bob"
print B[0]            # prints {'name': 'bob'}



The other thing about hashable types is that the keys in a dictionary must be hashable.
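A short sketch of that rule in action:

```python
d = {}
d[(1, 2)] = "tuple keys are fine"    # tuples are hashable
d["name"] = "string keys are fine"   # so are strings

try:
    d[[1, 2]] = "this fails"         # lists are unhashable
except TypeError:
    print("can't use a list as a dictionary key")
```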


Python’s Technical Libraries


Python was designed mostly as a tool for software engineers, but there is an excellent suite of libraries available that make it a first-class environment for technical computing, competing with the likes of MATLAB and R. The main ones, which will be covered in this blog, are as follows:


Pandas: This is the big one for you to know. It stores and operates on data in data frames, very efficiently and with a sleek, intuitive API.


NumPy: This is a library for dealing with numerical arrays in ways that are fast and memory efficient, but it’s clunky and low level for a user. Under the hood, Pandas operates on NumPy arrays.


Scikit-learn: This is the main machine learning library, and it operates on NumPy arrays. You can take Pandas objects, turn them into NumPy arrays, and then plug them into scikit-learn.


Matplotlib: This is the big plotting and visualization library. Similar to NumPy, it is low level and a bit clunky to use directly. Pandas provides human-friendly wrappers that call Matplotlib routines.


SciPy: This provides a suite of functions that perform fancy numerical operations on NumPy arrays.


These aren’t the only technical computing libraries available in Python, but they’re by far the most popular, and together they form a cohesive, powerful tool suite.


NumPy is the most fundamental library; it defines the core numerical arrays that everything else operates on.


However, most of your actual code (especially data munging and feature extraction) will be working within Pandas, only switching to the other libraries as needed. The rest of this blog will be a quick crash course on the basic data structures of Pandas.


Data Frames

The central kind of object in Pandas is called a DataFrame, which is similar to SQL tables or R data frames. A data frame is a table with rows and columns, where each column holds data of a particular type (such as integers, strings, or floats).


DataFrames make it easy and efficient to apply a function to every element in a column or to calculate aggregates such as the sum of a column. Some of the basic operations on data frames are shown in this code:

import pandas as pd

# Making a data frame from a dictionary that maps
# column names to their values
df = pd.DataFrame({
    "name": ["Bob", "Alex", "Janice"],
    "age": [60, 25, 33]})

# Reading a DataFrame from a file
other_df = pd.read_csv("myfile.csv")

# Making new columns from old ones is really easy
df["age_plus_one"] = df["age"] + 1
df["age_times_two"] = 2 * df["age"]
df["age_squared"] = df["age"] * df["age"]
df["over_30"] = (df["age"] > 30)   # this col is bools

# The columns have various built-in aggregate functions
total_age = df["age"].sum()
median_age = df["age"].quantile(0.5)

# You can select several rows of the DataFrame
# and make a new DataFrame out of them
df_below50 = df[df["age"] < 50]

# Apply a custom function to a column
df["age_squared"] = df["age"].apply(lambda x: x*x)


One important thing about DataFrames is the notion of an index. This is basically a name (not necessarily unique) that is given to every row of the data frame. By default, the indexes are just the line numbers (starting at 0), but you can set the index to be other columns if you like:

df = pd.DataFrame({
    "name": ["Bob", "Alex", "Jane"],
    "age": [60, 25, 33]})
print df.index   # prints 0-2, the line numbers

# Create a DataFrame containing the same data,
# but where name is the index
df_w_name_as_ind = df.set_index("name")
print df_w_name_as_ind.index   # prints their names

# Get the row for Bob
bobs_row = df_w_name_as_ind.ix["Bob"]
print bobs_row["age"]   # prints 60



Series

Besides DataFrames, the other big data structure in Pandas is the Series. Really, I’ve already shown them to you: a column in a data frame is a Series.


Conceptually, a Series is just an array of data objects, all the same type, with an index associated. The columns of a DataFrame are Series objects that all happen to share the same index.


The following code shows you some of the basic Series operations, independent of their function in DataFrames:

>>> # import Pandas. I always alias it as pd
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])  # make Series from list
>>> s  # display the values in s; note the index is to the far left
0    1
1    2
2    3
dtype: int64
>>> s + 2  # Add a number to each element of s
0    3
1    4
2    5
dtype: int64
>>> s.index  # you can access the index directly
Int64Index([0, 1, 2], dtype='int64')
>>> # Adding two Series will add corresponding elements to each other
>>> s + pd.Series([4, 4, 5])
0    5
1    6
2    8
dtype: int64


Now technically, I lied to you a minute ago when I said that a Series object’s elements all have to be the same type. They have to be the same type if you want all the performance benefits of Pandas, but we have actually already seen a Series object that mixes its types:

>>> bobs_row = df_w_name_as_ind.loc["Bob"]
>>> type(bobs_row)
<class 'pandas.core.series.Series'>
>>> bobs_row
age                60
age_plus_one       61
age_times_two     120
age_squared      3600
over_30          True
Name: Bob, dtype: object


So we can see that this row of a data frame was actually a Series object. But instead of int64 or something similar, its type is “object.” This means that under the hood, it’s not storing a low-level integer representation or anything similar; it’s storing a reference to an arbitrary Python object.


Joining and Grouping


So far we’ve focused on the following DataFrame operations:

  • Creating data frames
  • Adding new columns that are derived from basic operations to existing columns
  • Using simple conditions to select rows in a DataFrame
  • Aggregating columns
  • Setting columns to function as an index, and using the index to pull out rows of the data


This section discusses two more advanced operations: joining and grouping.

These may be familiar to you from working with SQL.


Joining is used if you want to combine two separate data frames into a single frame containing all the data. We take two data frames, match up rows that have a common index, and combine them into a single frame. This shell session shows it:

>>> df_w_age = pd.DataFrame({
...     "name": ["Tom", "Tyrell", "Claire"],
...     "age": [60, 25, 33]})
>>> df_w_height = pd.DataFrame({
...     "name": ["Tom", "Tyrell", "Claire"],
...     "height": [6.2, 4.0, 5.5]})
>>> joined = df_w_age.set_index("name").join(
...     df_w_height.set_index("name"))
>>> print(joined)
        age  height
name
Tom      60     6.2
Tyrell   25     4.0
Claire   33     5.5
>>> print(joined.reset_index())
     name  age  height
0     Tom   60     6.2
1  Tyrell   25     4.0
2  Claire   33     5.5
The other thing we often want to do is to group the rows based on some property and aggregate each group separately. This is done with the groupby() function, the use of which is shown here:
>>> df = pd.DataFrame({
...     "name": ["Tom", "Tyrell", "Claire"],
...     "age": [60, 25, 33],
...     "height": [6.2, 4.0, 5.5],
...     "gender": ["M", "M", "F"]})
>>> # use built-in aggregates on the numeric columns
>>> print(df.groupby("gender")[["age", "height"]].mean())
         age  height
gender
F       33.0     5.5
M       42.5     5.1
>>> medians = df.groupby("gender")[["age", "height"]].quantile(0.5)
>>> # Use a custom aggregation function
>>> def agg(ddf):
...     return pd.Series({
...         "name": max(ddf["name"]),
...         "oldest": max(ddf["age"]),
...         "mean_height": ddf["height"].mean()})
...
>>> print(df.groupby("gender").apply(agg))
          name  oldest  mean_height
gender
F       Claire      33          5.5
M       Tyrell      60          5.1


I first encountered the Scala language in 2010. I was working on a graduate training programme for an international banking organization when I was asked to give an hour's talk to the graduates on Scala.


At that point, I had heard the name mentioned but had no idea what it was. I, therefore, did some reading, installed the IDE being used and tried out some examples—and was hooked.


Since then I have trained a wide range of people in Scala, used it to develop large commercial systems and written a blog on Scala and Design Patterns. I am still hooked on it and find myself discovering new aspects to the language and the environment on almost every Scala project I am involved with.


Compiling and Executing Scala


It is useful and instructive to actually look at what happens when you compile and execute a Scala program. If your code is compiled successfully, the compiler will generate class files representing your Scala code such as Hello.class and Hello$.class.


The files are located (by default) under the out directory of your project; for example, on a Mac you can see them by listing that directory.


The .class files represent the compiled version of the Scala object Hello.
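The Hello object itself is not shown in this extract; a minimal version, sufficient to produce exactly these two class files, might look like the following sketch (the greeting text is my own):

```scala
// Compiling this file produces Hello.class (holding the static forwarder
// for main) and Hello$.class (holding the singleton instance itself)
object Hello {
  def greeting: String = "Hello Scala World"

  def main(args: Array[String]): Unit =
    println(greeting)
}
```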


So what has happened here? We haven't created any form of executable; rather, we have created a set of class files. This is the effect of compiling your Scala code: the compiler compiles the .scala files into .class files. These, in turn, run on the Virtual Machine (this is what runs when you type scala at the command prompt).


The Virtual Machine (actually the JVM plus the Scala libraries) reads the class files and then runs them.


This may involve interpreting the class files, compiling the class files to native code or a combination of the two depending upon the JVM being used. If we take the original approach the JVM interprets the class files. Thus the class files run on the JVM.


The JVM can be viewed as a virtual computer (that is one that exists only in software). Thus, Scala runs on a software computer. This software computer must then execute on the underlying host machine.


This means that in effect the JVM has two aspects to it: the part that interprets the compiled class files and a back end that must be ported to different platforms as required.


This means that although the Scala programs you write are indeed “write once, run anywhere” the JVMs they run on need to be ported to each platform on which Scala will execute.


Although your Scala programs do not need to be re-written to run on Unix, NT, Linux, etc., the JVM does. This means that different JVMs on different platforms can (and do) have different bugs in them.


It is therefore essential to test your Scala programs on all platforms that they are to be used on. In reality, Scala is actually “write once, run anywhere, test everywhere” that you will use your Scala programs.


Note that there are multiple languages that can be compiled to JVM bytecodes (Scala is just one example; others include Kotlin, Groovy and Python via Jython), but the bytecode language was originally designed for Java and thus directly supports the Java language.


As Scala was not the original source language for bytecodes there must be a mapping from the concepts in Scala to these bytecodes.


Thus one Scala concept may generate one, two or three different implementations at the bytecode level. In this case, our hello world object results in two bytecode files being created, called Hello.class and Hello$.class.


For the most part, you can ignore these details; they will typically become relevant only if you need to integrate Java into a Scala application (or Scala code into a Java application).


Why Have Automatic Memory Management?


Any discussion of Scala needs to consider how Scala handles memory. One of the many advantages of languages such as Java, C#, and Scala over languages such as C++ is that they automatically manage memory allocation and reuse.


It is not uncommon to hear C++ programmers complaining about spending many hours attempting to track down a particularly awkward bug only to find it was a problem associated with memory allocation or pointer manipulation.


Similarly, a regular problem for C++ developers is that of memory creep, which occurs when memory is allocated but is not freed up. The application either uses all available memory or runs out of space and produces a runtime error.


Most of the problems associated with memory allocation in languages such as C occur because programmers must concentrate not only on the (often complex) application logic but also on memory management.


They must ensure that they allocate only the memory that is required and de-allocate it when it is no longer required. This may sound simple, but it is no mean feat in a large complex application.


An interesting question to ask is “why do programmers have to manage memory allocation?”. There are few programmers today who would expect to have to manage the registers being used by their programs, although 20 or 30 years ago the situation was very different.


One answer to the memory management question, often cited by those who like to manage their own memory, is that “it is more efficient, you have more control, it is faster and leads to more compact code”.


Of course, if you wish to take these comments to their extreme, then we should all be programming in assembler. This would enable us all to produce faster, more efficient and more compact code than that produced by Pascal or C++.


The point about high-level languages, however, is that they are more productive, introduce fewer errors, are more expressive and are efficient enough (given modern computers and compiler technology).


The memory management issue is somewhat similar. If the system automatically handles the allocation and de-allocation of memory, then the programmer can concentrate on the application logic.


This makes the programmer more productive, removes problems due to poor memory management and, when implemented efficiently, can still provide acceptable performance.


The Class Person


The following listings define a simple class Person that has a first name, a last name and an age (these examples were first considered in the introduction).


A Person is constructed by providing the first and last names and the age; setters and getters are provided for each.

Here is the Java class:

class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }
    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }
    public String getFirstName() {
        return this.firstName;
    }
    public void setLastName(String lastName) {
        this.lastName = lastName;
    }
    public String getLastName() {
        return this.lastName;
    }
    public void setAge(int age) {
        this.age = age;
    }
    public int getAge() {
        return this.age;
    }
}
And here is the equivalent Scala class:
class Person(
var firstName: String,
var lastName: String,
var age: Int)


Certainly, the Java class is longer than the Scala class, but look at the Java class and ask yourself: "How many times have you written something like that?" Most of this Java class is boilerplate code.


In fact, it is so common that tools such as Eclipse allow us to create the boilerplate code automatically, which may mean that we do not have to type much more in the Java case than in the Scala case.


However, when I come back and have to read this code, I may have to wade through a lot of such boilerplate in order to find the actual functionality of interest. In the Scala case, the fact that this boilerplate is handled by the language means that I can focus on what the class actually does.
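To see the point concretely, here is a brief, hypothetical use of the Scala Person class; because the constructor parameters are declared as var, Scala generates the getters and setters for us:

```scala
class Person(
  var firstName: String,
  var lastName: String,
  var age: Int)

val p = new Person("John", "Smith", 32)
p.age = 33            // uses the generated setter
println(p.firstName)  // uses the generated getter
```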


Actually, the Object-Oriented side of Scala is both more sophisticated than that in either Java or C# and also different in nature. For example, many people have found the distinction between the static side of a class and the instance side of a class confusing. Scala does away with this distinction by not including the static concept.


Instead, it allows the user to define singleton objects. If these singleton objects have the same name as a class and are in the same source file as the class, then they are referred to as companion objects.


Companion objects then have a special relationship with the class that allows them to access the internals of a class (private fields and methods) and can provide the Scala equivalent of static behavior.
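As a sketch of this relationship (the Counter name and factory method here are my own, not from the text), a class and its companion object might look like this:

```scala
// The constructor is private: only the companion object can call it
class Counter private (val count: Int) {
  def increment: Counter = new Counter(count + 1)
}

// Companion object: same name, same source file as the class.
// It can access the class's private members, giving the Scala
// equivalent of static factory behaviour in Java.
object Counter {
  def apply(start: Int): Counter = new Counter(start)
}

val c = Counter(5).increment  // no 'new' needed: calls Counter.apply
```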


The class hierarchy in Scala is based on single inheritance of classes but allows multiple traits to be mixed into any given class. A trait is a structure within the Scala language that is neither a class nor an interface (note that Scala does not have interfaces even though it compiles to Java bytecodes).


It can, however, be combined with classes to create new types of classes and objects. As such a Trait can contain data, behavior, functions, type declarations, abstract members, etc. but cannot be instantiated itself.
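A small illustrative sketch (the names are my own): a trait combining an abstract member and concrete behaviour, mixed into a class:

```scala
// A trait cannot be instantiated itself, but can be mixed into classes
trait Named {
  val name: String                       // abstract member
  def describe: String = s"I am $name"   // concrete behaviour
}

class Dog(val name: String) extends Named

val d = new Dog("Rex")
println(d.describe)
```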


A Hybrid Language


If all Scala did was provide the ability to program functionally all that would do is provide yet another functional programming language. However, it is the fact that Scala mixes the two paradigms that allow us to create software solutions that are both concise and expressive.


The Object-Oriented paradigm has been such a success because it can be used to model concepts and entities within problem domains. When this is combined with the ability to treat functions as first-class entities we obtain a very powerful combination.


For example, we can now create classes that will hold data (including other objects) and define behaviors in terms of methods but which can easily and naturally be given functions that can be applied to the members of that object.

val numbers = List(1, 2, 3, 4, 5)
In this case I have created a list of integers (note that this is a List of Ints, as the type has been inferred by Scala) that is stored in the variable numbers.
val filtered = numbers.filter((n: Int) => n < 3)


I have then applied a function to each of the elements of the list. This function is an anonymous function that takes an Int (and stores that Int in the variable n). It then tests to see if the value of n is less than 3.


If it is, it returns true; otherwise it returns false. The filter method uses the function passed to it to determine whether each value should be included in the result or not.


This means that the variable filtered will hold a list of integers where each integer is less than 3. Note that this is again a List of Ints, as Scala has once more inferred the type.


The output from these statements is shown below:

List(1, 2, 3, 4, 5)

List(1, 2)
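Incidentally, the anonymous function above can also be written with Scala's placeholder syntax, in which the underscore stands for each element and the parameter type is inferred:

```scala
val numbers = List(1, 2, 3, 4, 5)
// Equivalent to numbers.filter((n: Int) => n < 3)
val filtered = numbers.filter(_ < 3)
println(filtered)  // prints List(1, 2)
```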

The Scala IDE


You will need a Scala environment on your local machine in order to develop, compile, test and run Scala applications. As Scala is a JVM language this also means that you must have a Java environment on your machine.


In theory, you could install each of these components yourself and use whatever editor you wished (including Emacs or TextEdit). However, the easiest way to get started with Scala is to install one of the Scala IDEs available free on the Web.


There are several to choose from with the IntelliJ IDEA and Eclipse-based Scala IDEs being the most popular.

The IntelliJ IDE provides full support for Scala (as well as Java). Support for Scala is built into the IntelliJ IDEA Ultimate version; however, an additional plugin must be installed to use Scala with the IntelliJ IDEA Community Edition (the free version).


We will step through installing the IntelliJ IDEA Community Edition and then add the Scala Plugin to it.

  • You can download IntelliJ IDEA (described by JetBrains as "The Java IDE for Professional Developers") from the JetBrains website


Create a New Package


To create a new package, select your src directory under your module and from the right mouse menu select New -> Package.

This will display the new Package Wizard (note it says Java but is being reused for Scala packages in the Scala IDE). You can use whatever name you wish, although you should note that a Scala package name is a series of names separated by '.', typically prefixed by the domain of the organization creating the code.


Once you have provided a package name, click 'OK'. You will now see the new package in the Project View of the IDE, under the Project heading.


Defining the Class


The simple class you just created now needs to be expanded to represent a company. The class must have the following information:

  • The name of the company
  • The address of the head office of the company
  • The phone number of the company
  • The company registration number
  • The company VAT number

The address of the company could be a separate type including county and postcode/zip code. However, we will keep things simple for the moment.


The fields of the company will all be of type String and will have some form of default value, for example the empty string, represented by "".


String is referred to as a type as it represents a concept within the programming language. As such, a string is zero or more characters that respond to certain operations, such as substring and length.
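For example, a quick sketch of those two operations:

```scala
val s = "Company"
val len = s.length           // number of characters: 7
val sub = s.substring(0, 3)  // characters 0 to 2: "Com"
println(len)
println(sub)
```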


Update your definition of the Company class so that it resembles the following listing:

class Company {
  var name = ""
  var address = ""
  var telephone = "0000"
  var registrationNumber = "000"
  var vatNumber = "xxxx"
  var postcode = "xxx xxx"
}


Adding Behaviour

We can also add some behaviour to this class by providing a print method that will print out the Company details in an appropriate format. This method will not return a value, as it will be used to print information about the Company to the console.

class Company {
  var name = ""
  var address = ""
  var telephone = "0000"
  var registrationNumber = "000"
  var vatNumber = "xxxx"
  var postcode = "xxx xxx"
  def print() = println(s"Company name $name at $address")
}


Note that we have used the single-line form of defining a method; this is not the only option, and you could experiment with other formats once you have this version working.
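For instance, as a variant of my own (not from the text), the same method could be written in the multi-line block form with an explicit result type:

```scala
class Company {
  var name = ""
  var address = ""

  // Block form of the print method, with an explicit Unit result type
  def print(): Unit = {
    println(s"Company name $name at $address")
  }
}

val c = new Company()
c.name = "Acme"
c.address = "London"
c.print()  // prints "Company name Acme at London"
```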


Test Application


You will then see that there are actually three options available at this point: Class, Object and Trait. Select the Object option and provide a name for the Object (I am using CompanyTestApp). You should now see a new tab in the central code editor area of the IDE.


This contains the skeleton of the CompanyTestApp object. It is not yet an application. Modify the declaration of the object so that it extends the App trait, so that you now have:


object CompanyTestApp extends App {
}



Remember as you are using the App trait, you do not need to define the main method declaration—you only need to add what the application needs to do. In our case, we will create a new instance of the Company class and print out its details:

object CompanyTestApp extends App {
  println("Starting CompanyTestApp")
  val company = new Company()
  company.print()
  println("Done CompanyTestApp")
}


Recall that we do not need to define the type of the value that will hold our company reference (this will be inferred by Scala), but that new instances are created using the keyword new. We can now run this application using the right mouse menu on the file CompanyTestApp in the Project View (Run).


If you run it now, the printed company details will be largely empty. This is because the Company object does not yet have any data defined by you. We will now add that data:

object CompanyTestApp extends App {
  println("Starting CompanyTestApp")
  val company = new Company()
  // Set up the company information
  company.name = "John Sys"
  company.address = "Coldharbour Street, London"
  company.telephone = "123456"
  company.registrationNumber = "99999999"
  company.vatNumber = "BB112233AA"
  company.postcode = "BS16 1QY"
  company.print()
  println("Done CompanyTestApp")
}


In the above example, we have populated the fields defined within the Company instance with suitable data. If you now rerun this application, you should see more comprehensible output in the Run console.