Learning Data mining with Python

Published Date:22-07-2017
Data Mining with Python (Working draft)
Finn Årup Nielsen
May 8, 2015

Chapter 1 Introduction

1.1 Other introductions to Python?

Although we cover a bit of introductory Python programming in chapter 2, you should not regard this book as a Python introduction: several free introductory resources exist. First and foremost is the official Python Tutorial at http://docs.python.org/tutorial/. Beginning programmers with little or no programming experience may want to look into the book Think Python, available from http://www.greenteapress.com/thinkpython/ or as a book [1], while more experienced programmers can start with Dive Into Python, available from http://www.diveintopython.net/ (footnote 1). Kevin Sheppard's presently 381-page Introduction to Python for Econometrics, Statistics and Data Analysis covers both Python basics and Python-based data analysis with Numpy, SciPy, Matplotlib and Pandas, and it is not just relevant for econometrics [2]. Developers already well-versed in standard Python development but lacking experience with Python for data mining can begin with chapter 3. Readers in need of an introduction to machine learning may take a look at Marsland's Machine learning: An algorithmic perspective [3], which uses Python for its examples.

1.2 Why Python for data mining?

Researchers have noted a number of reasons for using Python in the data science area (data mining, scientific computing) [4, 5, 6]:

1. Programmers regard Python as a clear and simple language with high readability. Even non-programmers may not find it too difficult. The simplicity exists both in the language itself and in the encouragement to write clear and simple code that is prevalent among Python programmers. Contrast this with, e.g., Perl, where short-form variable names allow you to write condensed code but also require you to remember nonintuitive variable names. A Python program may also be 2 to 5 times shorter than corresponding programs written in Java, C++ or C [7, 8].

2. Platform-independent.
Python will run on the three main desktop computing platforms, Mac, Linux and Windows, as well as on a number of other platforms.

3. Interactive program. With Python you get an interactive prompt with a REPL (read-eval-print loop), as in Matlab and R. The prompt facilitates the exploratory programming that is convenient for many data mining tasks, while you can still develop complete programs in an edit-run-debug cycle. The Python derivatives IPython and IPython Notebook are particularly suited for interactive programming.

4. General purpose language. Python is a general purpose language that can be used for a wide variety of tasks beyond data mining, e.g., user applications, system administration, gaming, web development, and the presentation and recording of psychological experiments. This is in contrast to Matlab and R. To see how well Python with its modern data mining packages compares with R, take a look at Carl J. V.'s blog posts on Will it Python? (footnote 2) and his GitHub repository, where he reproduces in Python the R data analyses from the book Machine Learning for Hackers.

Footnote 1: For further free websites for learning Python see http://www.fromdev.com/2014/03/python-tutorials-resources.html.

5. Python with its BSD-style license falls in the group of free and open source software. Although some large Python development environments may have associated license costs for commercial use, the basic Python development environment may be set up and run with no licensing cost. Indeed on some systems, e.g., many Linux distributions, basic Python comes readily installed. The Python Package Index provides a large set of packages that are also free software.

6. Large community. Python has a large community and has become more popular. Several indicators testify to this. The PYPL index (PopularitY of Programming Language) bases its programming language ranking on Google search volume provided by Google Trends and puts Python in third position after Java and PHP.
According to PYPL the popularity of Python has grown since 2004. TIOBE constructs another indicator, which ranks Python 6th. This indicator is "based on the number of skilled engineers world-wide, courses and third party vendors" (footnote 3). Python is also among the leading programming languages in terms of StackOverflow tags and GitHub projects (footnote 4). Furthermore, in 2014 Python was the most popular programming language at top-ranked United States universities for teaching introductory programming [9].

7. Quality: The Coverity company finds that Python code has errors among its 400,000 lines of code, but that the error rate is very low compared to other open source software projects. They found 0.005 defects per KLoC [10].

8. IPython Notebook: With the browser-based interactive notebook, where code, textual and plotting results, and documentation may be interleaved in a cell-based environment, the IPython Notebook represents an interesting approach that you will typically not find in many other programming languages. Exceptions are the commercial systems Maple and Mathematica, which have notebook interfaces. IPython Notebook runs locally in a web browser. The Notebook files are JSON files that can easily be shared and rendered on the Web. The obvious advantages of the IPython Notebook have led other languages to use the environment. The IPython Notebook can be changed to use the Julia language as the computational backend, i.e., instead of writing Python code in the code cells of the notebook you write Julia code. With appropriate extensions the IPython Notebook can intermix R code.

1.3 Why not Python for data mining?

Why shouldn't you use Python?

1. Not well-suited to mobile phones and other portable devices. Although Python surely can run on mobile phones and there exists at least one (dated) book on 'Mobile Python' [11], Python has not caught on for development of mobile apps. Several mobile app development frameworks exist, with Kivy mentioned as the leading contender.
Developers can also use Python in mobile contexts for the backend of a web-based system and for data mining data collected at the backend.

2. Does not run 'natively' in the browser. Javascript entirely dominates as the language in web browsers. Various ways exist to mix Python and web browser programming (footnote 5). The Pyjamas project with its Python-to-Javascript compiler allows you to write web browser client code in Python and compile it to Javascript, which the web browser then runs. There are several other of these stand-alone Javascript compilers in 'various states of development', as it is called: PythonJS, Pyjaco, Py2JS. Other frameworks use in-browser implementations, one of them being Brython, which enables the front-end engineer to write Python code in an HTML script tag if the page includes the brython.js Javascript library via the HTML script tag. It supports core Python modules and has access to the DOM API, but not, e.g., the scientific Python libraries written in C. Unfortunately, Brython scripts run considerably slower than scripts implemented directly in Javascript or executed by an ordinary Python implementation [12].

Footnote 2: http://slendermeans.org/pages/will-it-python.html.
Footnote 3: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html.
Footnote 4: http://www.dataists.com/2010/12/ranking-the-popularity-of-programming-langauges/.
Footnote 5: See https://wiki.python.org/moin/WebBrowserProgramming

3. Concurrent programming. Standard Python has no direct way of utilizing several CPUs in the language. Multithreading capabilities can be obtained with the threading package, but the individual threads will not run concurrently on different CPUs in the standard Python implementation. This implementation has the so-called 'Global Interpreter Lock' (GIL), which only allows a single thread to execute Python code at a time. This is to ensure the integrity of the data. A way to get around the GIL is to spawn new processes with the multiprocessing package or just the subprocess module.

4. Installation friction.
You may run into problems when building, distributing and installing your software. There are various ways to bundle Python software, e.g., with the setuptools package. Based on a configuration file, setup.py, where you specify, e.g., the name, author and dependencies of your package, setuptools can build a file to distribute with the commands python setup.py bdist or python setup.py bdist_egg. The latter command will build a so-called Python Egg file containing all the Python files you specified. The user of your package can install your Python files based on the configuration and content of that file. The installation will still need to download and install the dependencies you have specified in the setup.py file before the user can use your code. If your user does not have Python, the installation tools and a C compiler installed, it is likely that s/he will find it a considerable task to install your program. Various tools exist to make distribution easier by integrating the distributed files into one self-contained downloadable file. These tools are cx_Freeze, PyInstaller, py2exe (for Windows), py2app (for OS X) and pynsist.

5. Speed. Python will typically perform slower than a compiled language such as C++, and Python typically performs poorer than Julia, the programming language designed for technical computing. Various Python implementations and extensions, such as PyPy, Numba and Cython, can speed up the execution of Python code, but even then Julia can perform faster: Andrew Tulloch has reported performance ratios between 1.1 and 300 in Julia's favor for isotonic regression algorithms (footnote 6). The slowness of Python means that Python libraries tend to be developed in C, while, e.g., well-performing Julia libraries may be developed in Julia itself (footnote 7).
Speeding up Python often means modifying Python code with, e.g., specialized decorators, but a proof-of-concept system, Bohrium, has shown that a Python extension may require only little change in 'standard' array-processing code to speed up Python considerably [13]. It may, however, be worth noting that a program's performance can vary as much or more between programmers as between Python, Java and C++ [7].

Footnote 6: http://tullo.ch/articles/python-vs-julia/
Footnote 7: Mike Innes, Performance matters more than you think.

1.4 Components of the Python language and software

Figure 1.1: The Python hierarchy.

1. The Python language keywords. At its most basic level Python contains a set of keywords for definition (of, e.g., functions, anonymous functions and classes with def, lambda and class, respectively), for control structures (e.g., if and for), exceptions, assertions and returning values (yield and return). If you want to have a peek at all the keywords, the keyword module makes their names available in the keyword.kwlist variable. Python 2 has 31 keywords, while Python 3 has 33.

2. Built-in classes and functions. An ordinary implementation of Python makes a set of classes and functions available at program start without the need for a module import. Examples include the function for opening files (open), classes for built-in data types (e.g., float and str) and data manipulation functions (e.g., sum, abs and zip). The __builtins__ module makes these classes and functions available, and you can see a listing of them with dir(__builtins__) (footnote 8). You will find it non-trivial to get rid of the built-in functions, e.g., if you want to restrict the ability of untrusted code to call the open function, cf. sandboxing Python.

3. Built-in modules. Built-in modules contain extra classes and functions built into Python but not immediately accessible. You will need to import these with import to use them.
The sys built-in module contains a list of all the built-in modules: sys.builtin_module_names. Among the built-in modules are the system-specific parameters and functions module (sys), a module with mathematical functions (math), the garbage collection module (gc) and a module with many handy iterator functions that are good to be acquainted with (itertools). The set of built-in modules varies between implementations of Python. In one of my installations I count 46 modules, which include the __builtin__ module and the current working module __main__.

4. Python Standard Library (PSL). An ordinary installation of Python makes a large set of modules with classes and functions available to the programmer without the need for extra installation. The programmer only needs to write a one-line import statement to have access to exported classes, functions and constants in such a module. You can see which Python (byte-compiled) source file is associated with the import via the __file__ property of the module, e.g., after import os you can see the filename with os.__file__. Built-in modules do not have this property set in the standard implementation of Python. On a typical Linux system you might find the PSL modules in a directory with a name like /usr/lib/python3.2/. One of my installations has just above 200 PSL modules.

Footnote 8: There are some silly differences between __builtin__ and __builtins__. For Python 3 use __builtins__.

5. Python Package Index (PyPI), also known as the CheeseShop, is the central archive for Python packages, available from https://pypi.python.org. The index reports that it contains more than 42393 packages as of April 2014. They range from popular packages, such as lxml and requests, over large web frameworks, such as Django, to strange packages, such as absolute, a package with the sole purpose of implementing a function that computes the absolute value of a number (this functionality is already built in with the abs function).
You will often need to install the packages yourself, unless you use one of the large development frameworks, such as Enthought and Anaconda, or the package is already installed via your system. If you have the pip program up and running, then installation of packages from PyPI is relatively easy: from the terminal (outside Python) you write pip install packagename, which will download, possibly compile, install and set up the package. If you are unsure of the package name, you can write pip search query and pip will return a list of packages matching the query. Once you have installed the package you will be able to use it in Python with import packagename. If parts of the software you are installing are written in C, then pip install will require a C compiler to build the library files. If a compiler is not readily available you can download and install a precompiled binary package, if one is available. Otherwise some systems, e.g., Ubuntu and Debian, distribute a large set of the most common packages from PyPI in precompiled versions, e.g., the Ubuntu/Debian packages for lxml and requests are called python-lxml and python-requests. On a typical Linux system you will find the packages installed under directories such as /usr/lib/python2.7/dist-packages/.

6. Other Python components. From time to time you will find that not all packages are available from the Python Package Index. Often these packages come with a setup.py that allows you to install the software. If the bundle of Python files does not even have a setup.py file, you can download it and put it in a directory of your own choosing. The python program will not be able to discover the path to the code, so you will need to tell it. In Linux and Windows you can set the environment variable PYTHONPATH to a colon-separated (Linux) or semicolon-separated (Windows) list of directories with the Python code. Windows users may also set PYTHONPATH from the 'Advanced' system properties.
Alternatively the Python developer can set the sys.path attribute from within Python. This variable contains the paths as strings in a list, and the developer can append a new directory to it.

GitHub user Vinta provides a good curated list of important Python frameworks, libraries and software at https://github.com/vinta/awesome-python.

1.5 Developing and running Python

1.5.1 Python, pypy, IPython ...

Various implementations for running or translating Python code exist: CPython, IPython, IPython Notebook, PyPy, Pyston, IronPython, Jython, Pyjamas, Cython, Nuitka, Micro Python, etc. CPython is the standard reference implementation and the one that you will usually work with. It is the one you start up when you write python at the command line of the operating system.

The PyPy implementation pypy usually runs faster than standard CPython. Unfortunately PyPy does not (yet) support some of the central Python packages for data mining, numpy and scipy, although some work on the issue has apparently gone on since 2012. If you have code with critical timing performance that does not rely on parts unsupported by PyPy, then pypy is worth looking into.

Another JIT-based (and LLVM-based) Python is Dropbox's Pyston. As of April 2014 it "'works', though doesn't support very much of the Python language, and currently is not very useful for end-users" and "seems to have better performance than CPython but lags behind PyPy" (footnote 9). Though interesting, these programs are not yet so relevant in data mining applications.

Some individuals and companies have assembled binary distributions of Python and many Python packages together with an integrated development environment (IDE). These systems may be particularly relevant for users without a compiler to compile C-based Python packages, e.g., many Windows users. Python(x,y) is a Windows- and scientific-oriented Python distribution with the Spyder integrated development environment. WinPython is a similar system.
You will find many relevant data mining packages included in WinPython, e.g., pandas, IPython, numexpr, as well as a tool to install, uninstall and upgrade packages.

Continuum Analytics distributes Anaconda and Enthought distributes Enthought Canopy, both systems targeted at scientists, engineers and other data analysts. Available for the Windows, Linux and Mac platforms, they include what you can almost expect of such data mining environments, e.g., numpy, scipy, pandas, nltk, networkx. Enthought Canopy is only free for academic use. The basic Anaconda is 'completely free', while Continuum Analytics provides some 'add-ons' that are only free for academic use. Yet another prominent commercial-grade distribution of Python and Python packages is ActivePython. It seems less geared towards data mining work. Windows users who do not use these systems and cannot compile C may take a look at Christoph Gohlke's large list of precompiled binaries assembled at http://www.lfd.uci.edu/~gohlke/pythonlibs/.

1.5.2 IPython Notebook

IPython Notebook is a system that intermixes an editor, Python interactive sessions and output, similar to Mathematica. It is browser-based, and when you install newer versions of IPython you have it available and can start it from the command line outside Python with the command ipython notebook. You will get a web server running on your local computer, with the IPython Notebook prompt available when you point your browser to its default address. You edit directly in the browser in what IPython Notebook calls 'cells', where you enter lines of Python code. The cells can readily be executed, e.g., via the shift+return keyboard shortcut. Plots either appear in a new window or, if you set %matplotlib inline, in the same browser window as the code. You can intermix code and plots with cells of text in the Markdown format.
The entire session with input, text and output will be stored in a special JSON file format with the .ipynb extension, ready for distribution. You can also export part of the session with the source code as an ordinary Python source .py file.

Although great for interactive data mining, IPython Notebook is perhaps less suitable for more traditional software development where you work with multiple reusable modules and testing frameworks.

1.5.3 Python 2 vs. Python 3

Python is in a transition phase between the old Python version 2 and the new Python version 3 of the language. Python 2 is scheduled to survive until 2020, and yet in 2014 developers responded in a survey that they still wrote more 2.x code than 3.x code [14]. Python code written for one version may not necessarily work for the other version, as changes have occurred in often-used keywords, classes and functions such as print, range, xrange, long, open and the division operator. Check out http://python3wos.appspot.com/ to get an overview of which popular modules support Python 3. 3D scientific visualization lacks good Python 3 support. The central packages, mayavi and the VTK wrapper, are still not available for Python 3 as of March 2015.

Some Linux distributions still default to Python 2, while also enabling the installation of Python 3, making it accessible as python3 according to PEP 394 [15]. Although many of the major data mining Python libraries are now available for Python 3, it might still be a good idea to stick with Python 2, while keeping Python 3 in mind, by not writing code that requires a major rewrite when porting to Python 3. The idea of writing in the intersection of Python 2 and Python 3 has been called 'Python X' (footnote 10). One part of this approach uses the __future__ module, importing relevant features, e.g., __future__.division and __future__.print_function, like:

    from __future__ import division, print_function, unicode_literals

This scheme will change Python 2's division operator '/' from integer division to floating point division and print from a keyword to a function.

Python X adherence might be particularly inconvenient for string-based processing, but the module six provides further help on the issue. To test whether a variable is a general string in Python 2, you would test whether the variable is an instance of the basestring built-in type, to capture both byte-based strings (Python 2 str type) and Unicode strings (Python 2 unicode type). However, Python 3 has no basestring. Instead you test with the Python 3 str class, which contains Unicode strings. A constant in the six module, six.string_types, captures this difference and is an example of how the six module can help with writing portable code. The following code, testing the string type of a variable, will work in both Python 2 and 3:

    import six

    if isinstance(my_variable, six.string_types):
        print('my_variable is a string')
    else:
        print('my_variable is not a string')

Most data mining packages are now available for Python 3. Unfortunately the central 3D visualization package Mayavi is not available yet.

Footnote 9: https://github.com/dropbox/pyston.
Footnote 10: Stephen A. Goss, Python 3 is killing Python, https://medium.com/deliciousrobots/5d2ad703365d/.

1.5.4 Editing

For editing you should have an editor that understands the basic elements of the Python syntax, e.g., to help you make correct indentation, which is an essential part of the Python syntax. A large number of Python-aware editors exist (footnote 11), e.g., Emacs and the editors in the Spyder and Eric IDEs. Commercial IDEs, such as PyCharm and Wing IDE, also have good Python editors.

For autocompletion Python has the jedi module, which various editors can use through a plugin. Programmers can also call it directly from a Python program.
IPython and Spyder feature autocompletion.

For collaborative programming (pair programming or physically separated programming) it is worth noting that the collaborative document editor Gobby has support for Python syntax highlighting and Pythonic indentation. It features chat, but has no features beyond simple editing, e.g., you will not find the support for direct execution, style checking and debugging that you will find in Spyder. The Rudel plugin for Emacs supports the Gobby protocol.

Footnote 11: See https://stackoverflow.com/questions/81584/what-ide-to-use-for-python for an overview of features.

1.5.5 Python in the cloud

A number of websites enable programmers to upload their Python code and run it from the website. Google App Engine is perhaps the most well-known. With the Google App Engine Python SDK developers can develop and test web applications locally before uploading them to the Google site. Data persistence is handled by a specific Google App Engine datastore. It has an associated query language called GQL, resembling SQL. The web application may be constructed with the Webapp2 framework and templating via Jinja2. Further information is available in the book Programming Google App Engine [16]. There are several other websites for running Python in the cloud: pythonanywhere, Heroku, PiCloud and StarCluster. The freemium service Pythonanywhere provides you, e.g., with a MySQL database, the traditional data mining packages, the Flask web framework and web access to the server access and error logs.

1.5.6 Running Python in the browser

Some systems allow you to run Python from the web browser without the need for local installation. Typically, the browser itself does not run Python; instead a web service submits the Python code to a backend system that runs the code and returns the result. Such systems may allow for quick and collaborative Python development.
The company Runnable provides such a service through the URL http://runnable.com, where users may write Python code directly in the browser and let the system execute it and return the result. The cloud service Wakari (https://wakari.io/) lets users work on and share cloud-based IPython Notebook sessions. It is a cloud version of Continuum Analytics' Anaconda.

The Skulpt implementation of Python runs in a browser, and a demonstration of it runs from its homepage, http://www.skulpt.org/. It is used by several other websites, e.g., CodeSkulptor, http://www.codeskulptor.org. Codecademy is a web service aimed at learning to code. Python features among the programming languages supported, and a series of interactive introductory tutorials run from the URL http://www.codecademy.com/tracks/python. The Online Python Tutor uses its interactive environment to demonstrate with program visualization how the variables in Python change as the program is executed [17]. This may serve novices learning Python well, but also more experienced programmers when they debug. pythonanywhere (https://www.pythonanywhere.com) also has coding in the browser.

Code Golf from http://codegolf.com/ invites users to compete by solving coding problems with the smallest number of characters. The contestants cannot see each other's contributions. Another Python code challenge website is Check IO, see http://www.checkio.org.

Such services have less relevance for data mining, e.g., Runnable will not allow you to import numpy, but they may be an alternative way to learn Python. CodeSkulptor, implementing a subset of Python 2, allows the programmer to import the modules numeric, simplegui, simplemap and simpleplot for rudimentary matrix computations and plotting of numerical data. At Plotly (https://plot.ly) users can collaboratively construct plots, and Python coding with Numpy features as one of the methods to build the plots.
Chapter 2 Python

2.1 Basics

Two functions in Python are important to know: help and dir. help shows the documentation for the input argument, e.g., help(open) shows the documentation for the open built-in function, which reads and writes files. help works for most elements of Python: modules, classes, variables, methods, functions, ..., but not keywords. dir will show a list of methods, constants and attributes for a Python object, and since most elements in Python are objects (but not keywords) dir will work, e.g., dir(list) shows the methods associated with the built-in list datatype of Python. One of the methods of the list object is append. You can see its documentation with help(list.append).

Indentation is important in Python, actually essential: it is what determines the block structure, so indentation limits the scope of control structures as well as class and function definitions. Four spaces is the default indentation. Although Python will work with other numbers of spaces and with tabs for indentation, you should generally stay with four spaces.

2.2 Datatypes

Table 2.1 displays Python's basic data types together with the central data types of the Numpy and Pandas modules. The data types in the first part of the table are the built-in data types readily available when python starts up. The data types in the second part are Numpy data types discussed in chapter 3, specifically in section 3.1, while the data types in the third part of the table are from the Pandas package discussed in section 3.3.

An instance of a data type is converted to another type by instancing the other class, e.g., turn the float 32.2 into the string '32.2' with str(32.2) or the string 'abc' into the list ['a', 'b', 'c'] with list('abc'). Not all conversion combinations work, e.g., you cannot convert an integer to a list. It results in a TypeError.

2.2.1 Booleans (bool)

A Boolean bool is either True or False.
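Truth values are mostly used in conditions, where they are combined with and, or and not. As a minimal sketch (the helper name third_char_is_e is hypothetical and only for illustration), and stops evaluating as soon as the result is known, so the index lookup is never attempted on a string that is too short:

```python
def third_char_is_e(s):
    # 'and' short-circuits: s[2] is evaluated only when len(s) > 2 holds,
    # so short strings never trigger an IndexError.
    return len(s) > 2 and s[2] == 'e'


print(third_char_is_e('the'))
print(third_char_is_e('to'))

# Objects of other types can be tested for truth with bool().
falsy = [bool(0), bool(''), bool(None)]
truthy = [bool(-1), bool('0'), bool([0])]
print(falsy, truthy)
```

The sketch returns a plain bool in both branches because each operand of and is itself a comparison.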
The keywords or, and and not should be used with Python's Booleans, not the bitwise operators (e.g., | and &). Although the bitwise operators work for bool, they evaluate the entire expression, which fails, e.g., for the code (len(s) > 2) & (s[2] == 'e') that checks whether the third character in the string is an 'e': for strings shorter than 3 characters an indexing error is produced, as the second part of the expression is evaluated regardless of the value of the first part. The expression should instead be written (len(s) > 2) and (s[2] == 'e').

Values of other types that evaluate to False are, e.g., 0, None, '' (the empty string), [] (the empty list), () (the empty tuple), {}, 0.0 and b'' (the empty byte string), while values evaluating to True are, e.g., 1, -1, [1, 2], '0', [0], b'\x00' and 0.000000000000001.

Built-in type  Operator  Mutable  Example                  Description
bool                     No       True                     Boolean
bytearray                Yes      bytearray(b'\x01\x04')   Array of bytes
bytes          b''       No       b'\x00\x17\x02'
complex                  No       (1+4j)                   Complex number
dict           {:}       Yes      {'a': True, 45: 'b'}     Dictionary, indexed by, e.g., strings
float                    No       3.1                      Floating point number
frozenset                No       frozenset({1, 3, 4})     Immutable set
int                      No       17                       Integer
list           []        Yes      [1, 3, 'a']              List
set            {}        Yes      {1, 2}                   Set with unique elements
slice          :         No       slice(1, 10, 2)          Slice indices
str            "" or ''  No       "Hello"                  String
tuple          (,)       No       (1, 'Hello')             Tuple

Numpy type     Char      Mutable  Example                  Description
array                    Yes      np.array([1, 2])         One-, two-, or many-dimensional
matrix                   Yes      np.matrix([[1, 2]])      Two-dimensional matrix
bool                              np.array(1, 'bool_')     Boolean, one byte long
int                               np.array(1)              Default integer, same as C's long
int8           b                  np.array(1, 'b')         8-bit signed integer
int16          h                  np.array(1, 'h')         16-bit signed integer
int32          i                  np.array(1, 'i')         32-bit signed integer
int64          l, p, q            np.array(1, 'l')         64-bit signed integer
uint8          B                  np.array(1, 'B')         8-bit unsigned integer
float                             np.array(1.)             Default float
float16        e                  np.array(1, 'e')         16-bit half precision floating point
float32        f                  np.array(1, 'f')         32-bit precision floating point
float64        d                                           64-bit double precision floating point
float128       g                  np.array(1, 'g')         128-bit floating point
complex                                                    Same as complex128
complex64                                                  Single precision complex number
complex128                        np.array(1+1j)           Double precision complex number
complex256                                                 2 x 128-bit precision complex number

Pandas type              Mutable  Example                  Description
Series                   Yes      pd.Series([2, 3, 6])     One-dimensional (vector-like)
DataFrame                Yes      pd.DataFrame([[1, 2]])   Two-dimensional (matrix-like)
Panel                    Yes      pd.Panel([[[1, 2]]])     Three-dimensional (tensor-like)
Panel4D                  Yes      pd.Panel4D([[[[1]]]])    Four-dimensional

Table 2.1: Basic built-in and Numpy and Pandas datatypes. Here import numpy as np and import pandas as pd. Note that Numpy has a few more datatypes, e.g., a time delta datatype.

2.2.2 Numbers (int, float and Decimal)

In standard Python integer numbers are represented with the int type, floating-point numbers with float and complex numbers with complex. Decimal numbers can be represented via classes in the decimal module, particularly the decimal.Decimal class. In the numpy module there are datatypes where the number of bytes representing each number can be specified.

Python 2 has long, which is for long integers. In Python 2 int(12345678901234567890) will switch to a variable with the long datatype. In Python 3 long has been subsumed in int, so int in this version can represent arbitrarily long integers, while the long type has been removed. A workaround to define long in Python 3 is simply long = int.

2.2.3 Strings (str)

Strings may be instanced with either single or double quotes. Multiline strings are instanced with either three single or three double quotes. The style of quoting makes no difference in terms of data type.

    >>> s = "This is a sentence."
    >>> t = 'This is a sentence.'
    >>> s == t
    True
    >>> u = """This is a sentence."""
    >>> s == u
    True

The issue of multibyte Unicode and byte strings yields complexity.
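The distinction between text strings and byte strings can be made concrete with a small example. This is a minimal sketch that assumes Python 3, where str holds Unicode characters and bytes holds raw bytes; the sample word and the choice of UTF-8 are illustrative assumptions, not taken from the text:

```python
# Assumes Python 3: str is Unicode text, bytes is raw 8-bit data.
text = '\u00c5rup'                  # four characters, the first is non-ASCII
encoded = text.encode('utf-8')      # explicit conversion from str to bytes

print(type(text).__name__, len(text))        # number of characters
print(type(encoded).__name__, len(encoded))  # number of bytes

# Decoding with the same codec restores the original text.
round_trip = encoded.decode('utf-8')
print(round_trip == text)
```

The character count and the byte count differ because the first character needs two bytes in UTF-8, which is the kind of mismatch that makes mixed text and byte handling error-prone.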
Indeed Python 2 and Python 3 differ (unfortunately) considerably in their definition of what is a Unicode string and what is a byte string.

The triple double quotes are by convention used for docstrings. When Python prints out a string it uses single quotes, unless the string itself contains a single quote.

2.2.4 Dictionaries (dict)

A dictionary (dict) is a mutable data structure where values can be indexed by a key. The value can be of any type, while the key must be hashable, which the built-in immutable types are (a tuple only when all its elements are hashable). It means that, e.g., strings, integers, tuple and frozenset can be used as dictionary keys. Dictionaries can be instanced with dict or with curly braces:

    >>> dict(a=1, b=2)                     # strings as keys, integers as values
    {'a': 1, 'b': 2}
    >>> {1: 'january', 2: 'february'}      # integers as keys
    {1: 'january', 2: 'february'}
    >>> a = dict()                         # empty dictionary
    >>> a[('Friston', 'Worsley')] = 2      # tuple of strings as keys
    >>> a
    {('Friston', 'Worsley'): 2}

Dictionaries may also be created with dictionary comprehensions, here an example with a dictionary of lengths of method names for the float object:

    >>> {name: len(name) for name in dir(float)}
    {'__int__': 7, '__repr__': 8, '__str__': 7, 'conjugate': 9, ...}

Iterations over the keys of the dictionary are immediately available via the object itself or via the dict.keys method. Values can be iterated with the dict.values method and both keys and values can be iterated with the dict.items method.

Dictionary access shares some functionality with object attribute access. Indeed the attributes are accessible as a dictionary in the __dict__ attribute:

    >>> class MyDict(dict):
    ...     def __init__(self):
    ...         self.a = None
    >>> my_dict = MyDict()
    >>> my_dict.a
    >>> my_dict.a = 1
    >>> my_dict.__dict__
    {'a': 1}
    >>> my_dict['a'] = 2
    >>> my_dict
    {'a': 2}

In the Pandas library (see section 3.3) columns in its pandas.DataFrame object can be accessed both as attributes and as keys, though only as attributes if the key name is a valid Python identifier, e.g., strings with spaces or other special characters cannot be attribute names.
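The dictionary behaviour described in this section can be sketched with the standard library alone. The word list and variable names below are made up for illustration; the sketch iterates over keys, values and items and shows that a tuple is accepted as a key while a list is not.

```python
# Dictionary comprehension: word -> length of the word.
lengths = {word: len(word) for word in ['int', 'float', 'complex']}

# Iterating over the dictionary itself yields the keys,
# just as iterating over lengths.keys() does.
keys = sorted(lengths)
also_keys = sorted(lengths.keys())

# values() yields the values, items() yields (key, value) pairs.
total = sum(lengths.values())
pairs = sorted(lengths.items())

# A tuple of strings is hashable and can be used as a key ...
coauthors = {}
coauthors[('Friston', 'Worsley')] = 2

# ... whereas a list is mutable, not hashable, and is rejected.
try:
    coauthors[['Friston', 'Worsley']] = 2
    list_key_accepted = True
except TypeError:
    list_key_accepted = False
```

Sorting the keys and items here only serves to make the result independent of the dictionary ordering, which differs between Python versions.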
The addict package provides a similar functionality as in Pandas:

    >>> from addict import Dict
    >>> paper = Dict()
    >>> paper.title = 'The functional anatomy of verbal initiation'
    >>> paper.authors = 'Nathaniel-James, Fletcher, Frith'
    >>> paper
    {'authors': 'Nathaniel-James, Fletcher, Frith',
     'title': 'The functional anatomy of verbal initiation'}
    >>> paper['authors']
    'Nathaniel-James, Fletcher, Frith'

The advantage of accessing dictionary content as attributes is probably mostly related to ease of typing and readability.

2.2.5 Dates and times

There are various means to handle dates and times in Python. Python provides the datetime module with the datetime.datetime class (the class is confusingly called the same as the module). The datetime.datetime class records date, hours, minutes, seconds, microseconds and time zone information, while datetime.date only handles dates. As an example consider computing the number of days from 15 January 2001 to 24 September 2014. datetime.date makes such a computation relatively straightforward:

    >>> from datetime import date
    >>> date(2014, 9, 24) - date(2001, 1, 15)
    datetime.timedelta(5000)
    >>> str(date(2014, 9, 24) - date(2001, 1, 15))
    '5000 days, 0:00:00'

i.e., 5000 days from the one date to the other. A function in the dateutil module converts dates and times represented as strings to datetime.datetime objects, e.g., dateutil.parser.parse('2014-09-18') returns datetime.datetime(2014, 9, 18, 0, 0).

Numpy also has a datatype to handle dates, enabling easy date computation on multiple time data, e.g., below we compute the number of days from a starting date to two given dates:

    >>> import numpy as np
    >>> start = np.array(['2014-09-01'], 'datetime64')
    >>> dates = np.array(['2014-12-01', '2014-12-09'], 'datetime64')
    >>> dates - start
    array([91, 99], dtype='timedelta64[D]')

Here the computation defaults to representing the time differences in days.
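The date arithmetic above can be checked with a small sketch that only uses the standard library. The variable names are illustrative, and datetime.strptime is used here as a standard-library stand-in for dateutil.parser.parse, with the difference that the format string has to be stated explicitly.

```python
from datetime import date, datetime, timedelta

first = date(2001, 1, 15)
last = date(2014, 9, 24)

# Subtracting two dates gives a timedelta; its days attribute is an int.
difference = last - first
number_of_days = difference.days

# Adding a timedelta to a date goes the other way.
recovered = first + timedelta(days=5000)

# Parse an ISO-formatted date string into a datetime.datetime object.
parsed = datetime.strptime('2014-09-18', '%Y-%m-%d')
```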
A datetime.datetime object can be turned into an ISO 8601 string format with the datetime.datetime.isoformat method, but simply using str may be easier:

    >>> from datetime import datetime
    >>> str(datetime.now())
    '2015-02-13 12:21:22.758999'

To get rid of the part with microseconds use the replace method:

    >>> str(datetime.now().replace(microsecond=0))
    '2015-02-13 12:22:52'

2.2.6 Enumeration

Python 3.4 has an enumeration datatype (symbolic members) with the enum.Enum class. In previous versions of Python enumerations were just implemented as integers, e.g., in the re regular expression module you would have a flag such as re.IGNORECASE set to the integer value 2. For older versions of Python the enum34 pip package can be installed, which contains a Python 3.4 compatible enum module.

Below is a class called Grade derived from enum.Enum and used as a label for the quality of an apple, where there are three fixed options for the quality:

    from enum import Enum

    class Grade(Enum):
        good = 1
        bad = 2
        ok = 3

After the definition

    >>> apple = {'quality': Grade.good}
    >>> apple['quality'] is Grade.good
    True

Outside the builtins the module collections provides a few extra interesting general container datatypes (classes). collections.Counter can, e.g., be used to count the number of times each word occurs in a word list, while collections.deque can act as a ring buffer.

2.3 Functions and arguments

Functions are defined with the keyword def, and the return statement specifies which object the function should return, if any. The function can be specified to have multiple positional and keyword (named) input arguments, and optional input arguments with default values can also be specified. As with control structures, indentation marks the scope of the function definition.

Functions can be called recursively, but they are usually slower than their iterative counterparts and there is by default a recursion depth limit of 1000.
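The points about default values, keyword arguments and the recursion limit can be illustrated with a short sketch. The functions power_sum, factorial_recursive and factorial_iterative are hypothetical examples written for this purpose; the recursion limit is read from sys.getrecursionlimit() rather than assumed to be 1000, since it can be changed.

```python
import sys


def power_sum(values, exponent=1, scale=1.0):
    """Return scale times the sum of each value raised to exponent."""
    return scale * sum(value ** exponent for value in values)


# Optional arguments can be left out, given by position or given by name.
plain = power_sum([1, 2, 3])
squares = power_sum([1, 2, 3], 2)
half_squares = power_sum([1, 2, 3], scale=0.5, exponent=2)


def factorial_recursive(n):
    """Factorial by recursion: one nested call per factor."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_iterative(n):
    """Factorial with a loop: no nested calls."""
    result = 1
    for factor in range(2, n + 1):
        result *= factor
    return result


# Recursing deeper than the interpreter's limit raises an exception
# (RecursionError in Python 3, which is a subclass of RuntimeError).
limit = sys.getrecursionlimit()
try:
    factorial_recursive(limit + 10)
    hit_limit = False
except RuntimeError:
    hit_limit = True

# The iterative version handles the same input without trouble.
large = factorial_iterative(limit + 10)
```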
2.3.1 Anonymous functions with lambdas

One-line anonymous functions can be defined with the lambda keyword, e.g., the definition of the polynomial f(x) = 3x^2 - 2x - 2 could be done with a compact definition like f = lambda x: 3*x**2 - 2*x - 2. The variable before the colon is the input argument and the expression after the colon is the returned value. After the definition we can call the function f like an ordinary function, e.g., f(3) will return 19.

Functions can be manipulated like Python's other objects, e.g., we can return a function from a function. Below the polynomial function returns a function with fixed coefficients:

    def polynomial(a, b, c):
        return lambda x: a*x**2 + b*x + c

    f = polynomial(3, -2, -2)
    f(3)

2.3.2 Optional function arguments

The * and ** prefixes can be used to catch multiple optional positional and keyword arguments, where the standard names are *args and **kwargs. This trick is widely used in the Matplotlib plotting package. An example is shown below where a user function called plot_dirac is defined which calls the standard Matplotlib plotting function (matplotlib.pyplot.plot with the alias plt.plot), so that we can call plot_dirac with the linewidth keyword and pipe it further on to the Matplotlib function to control the width of the line that we are plotting:

    import matplotlib.pyplot as plt

    def plot_dirac(location, *args, **kwargs):
        print(args)
        print(kwargs)
        plt.plot([location, location], [0, 1], *args, **kwargs)

    plot_dirac(2)
    # Older Matplotlib versions needed plt.hold(True) here; plt.hold has
    # since been removed and holding the plot is the default behavior.
    plot_dirac(3, linewidth=3)
    plot_dirac(-2, 'r')
    plt.axis((-4, 4, 0, 2))
    plt.show()

In the first call to plot_dirac, args and kwargs will be empty, i.e., an empty tuple and an empty dictionary. In the second call print(kwargs) will show {'linewidth': 3} and in the third call we get ('r',) from the print(args) statement.
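The forwarding of *args and **kwargs can also be tried without Matplotlib. In the sketch below draw_line is a made-up stand-in for a plotting function that merely returns its arguments, and draw_dirac and capture are likewise hypothetical names; only the argument passing mirrors the plot_dirac example.

```python
def capture(*args, **kwargs):
    """Return the extra arguments exactly as the function receives them."""
    return args, kwargs


def draw_line(x, y, color='b', linewidth=1):
    """Stand-in for a plotting function: report what would be drawn."""
    return {'x': x, 'y': y, 'color': color, 'linewidth': linewidth}


def draw_dirac(location, *args, **kwargs):
    # args is a tuple of extra positional arguments and kwargs a dictionary
    # of extra keyword arguments; both are passed on unchanged.
    return draw_line([location, location], [0, 1], *args, **kwargs)


nothing = capture()                  # empty tuple and empty dictionary
something = capture('r', linewidth=3)

plain = draw_dirac(2)
thick = draw_dirac(3, linewidth=3)   # keyword argument piped through
red = draw_dirac(-2, 'r')            # positional argument piped through
```

Because draw_dirac never names color or linewidth itself, new options added to draw_line later become available through draw_dirac without changing it.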
The above polynomial function can be changed to accept a variable number of positional arguments so polynomials of any order can be returned from the polynomial construction function:

    def polynomial(*args):
        expons = range(len(args))[::-1]
        return lambda x: sum(coef * x**expon for coef, expon in zip(args, expons))

    f = polynomial(3, -2, -2)
    f(3)    # Returned result is 19
    f = polynomial(-2)
    f(3)    # Returned result is -2

2.4 Object-oriented programming

Almost everything in Python is an object, e.g., integers, strings and other data types, functions, class definitions and class methods are objects. These objects have associated methods and attributes, and some of the default methods and functions follow a specific naming pattern with pre- and postfixed double underscores. Table 2.2 gives an overview of some of the methods and attributes in an object. As always the dir function lists all the methods defined for the object.

Figure 2.1 shows another overview of the methods in the common built-in data types in a formal concept analysis lattice graph. The graph is constructed with the concepts module, which uses the graphviz module and the Graphviz program. The plot shows, e.g., that int and bool define the same methods (their implementations are of course different), that __format__ and __str__ are defined by all data types and that __contains__ and __len__ are available for set, dict, list, tuple and str, but not for bool, int and float.

Developers can define their own classes with the class keyword. The class definitions can take advantage of multiple inheritance. Methods of the defined class are added to the class with the def keyword in the indented block of the class. New classes may be derived from built-in data types, e.g., below a new integer class is defined with a definition for the length method:

    >>> class Integer(int):
    ...     def __len__(self):
    ...         return 1
    >>> i = Integer(3)
    >>> len(i)
    1

Figure 2.1: Overview of methods and attributes in the common Python 2 built-in data types plotted as a formal concept analysis lattice graph.
Only a small subset of methods and attributes is shown.

Method         Operator        Description
__init__       ClassName()     Constructor, called when an instance of a class is made
__del__        del             Destructor
__call__       object_name()   The method called when the object is a function, i.e., 'callable'
__getitem__    []              Get element: a.__getitem__(2) the same as a[2]
__setitem__    [] =            Set element: a.__setitem__(1, 3) the same as a[1] = 3
__contains__   in              Determine if element is in container
__str__                        Method used for print keyword/function
__abs__        abs()           Method used for absolute value
__len__        len()           Method called for the len (length) function
__add__        +               Add two objects, e.g., add two numbers or concatenate two strings
__iadd__       +=              Addition with assignment
__div__        /               Division (in Python 2 integer division for int by default)
__floordiv__   //              Integer division with floor rounding
__pow__        **              Power for numbers, e.g., 3 ** 4 = 3^4 = 81
__and__        &               Method called for the and operator '&'
__eq__         ==              Test for equality
__lt__         <               Less than
__le__         <=              Less than or equal
__xor__        ^               Exclusive or. Works bitwise for integers and binary for Booleans
...

Attribute      Description
__class__      Class of object, e.g., <type 'list'> (Python 2), <class 'list'> (Python 3)
__doc__        The documentation string, e.g., used for help()

Table 2.2: Class methods and attributes. These names are available with the dir function, e.g., an_integer = 3; dir(an_integer).

2.4.1 Objects as functions

Any object can be turned into a function by defining the __call__ method.
Here we derive a new class from the str data type/class, defining the __call__ method to split the string into words and return a word indexed by the input argument:

    class WordsString(str):
        def __call__(self, index):
            return self.split()[index]

After instancing the WordsString class with a string we can call the object to let it return, e.g., the fifth word:

    >>> s = WordsString("To suppose that the eye with all its inimitable contrivances")
    >>> s(4)
    'eye'

Alternatively we could have defined an ordinary method with a name such as word and called the object as s.word(4), a slightly longer notation, but perhaps more readable and intuitive for the user of the class compared to the surprising use of the __call__ method.

2.5 Modules and import

"A module is a file containing Python definitions and statements." (footnote 1) The file should have the extension .py. A Python developer should group classes, constants and functions into meaningful modules with meaningful names. To use a module in another Python script, module or interactive session it should be imported (footnote 2) with the import statement. For example, to import the os module write:

    import os

The file associated with the module is available in the __file__ attribute; in the example that would be os.__file__. While standard Python 2 (CPython) does not make this attribute available for builtin modules, it is available in Python 3 and in this case links to the os.py file.

Individual classes, attributes and functions can be imported via the from keyword, e.g., if we only need the os.listdir function from the os module we could write:

    from os import listdir

This import variation will make the os.listdir function available as listdir.

If the package contains submodules then they can be imported via the dot notation, e.g., if we want names from the tokenization part of the NLTK library we can include that submodule with:

    import nltk.tokenize

The imported modules, classes and functions can be renamed with the as keyword.
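The import variations just described can be compared side by side with standard library modules only. This is a small sketch; the alias osp for os.path is chosen for illustration and is not an established convention claimed by the text.

```python
import os                 # whole module, names used with the os. prefix
from os import listdir    # one name, usable without prefix
import os.path as osp     # submodule via dot notation, renamed with as

# The different import styles give access to the very same objects.
same_function = listdir is os.listdir
same_module = osp is os.path

# File that the module was loaded from.
module_file = os.__file__

# The imported function works the same under either name.
entries = listdir(os.curdir)
```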
By convention several data mining modules are aliased to specific names:

    import numpy as np
    import matplotlib.pyplot as plt
    import networkx as nx
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

With these aliases Numpy's sin function will be available under the name np.sin.

Import statements should occur before the imported name is used. They are usually placed at the top of the file, but this is only a style convention. Imports of names from the special __future__ module must be at the very top of the file (only the module docstring and comments may precede them). The style checking tool flake8 will help with checking conventions for imports, e.g., it will complain about an unused import, i.e., if a module is imported but the names in it are never used in the importing module. The flake8-import-order flake8 extension even pedantically checks the ordering of the imports.

2.5.1 Submodules

If a package consists of a directory tree then subdirectories can be used as submodules. For older versions of Python it is necessary to have an __init__.py file in each subdirectory before Python recognizes the subdirectories as submodules. Here is an example of a module, imager, which contains three submodules in two subdirectories:

    /imager
        __init__.py
        /io
            __init__.py
            jpg.py
        /process
            __init__.py
            factorize.py
            categorize.py

Footnote 1: 6. Modules in The Python Tutorial.
Footnote 2: Unless built-in.

Provided that the module imager is available in the path (sys.path) the jpg module will now be available for import as

    import imager.io.jpg

Relative imports can be used inside the package. Relative imports are specified with single or double dots in much the same way as directory navigation, e.g., a relative import of the categorize and jpg modules from the factorize.py file can read:

    from . import categorize
    from ..io import jpg

Some developers encourage the use of relative imports because it makes refactoring easier. On the other hand, relative imports can cause problems if circular import dependencies between the modules appear.
In this latter case absolute imports work around the problem.

Name clashes can appear: In the above case the io directory shares its name with the io module of the standard library. If the file imager/__init__.py writes 'import io' it is not immediately clear to the novice programmer whether it is the standard library version of io or the imager module version that Python imports. In Python 3 it is the standard library version. The same is the case in Python 2 if the 'from __future__ import absolute_import' statement is used. To get the imager module version, imager.io, a relative import can be used:

    from . import io

Alternatively, an absolute import with import imager.io will also work.

2.5.2 Globbing import

In interactive data mining one sometimes imports everything from the pylab module with 'from pylab import *'. pylab is actually a part of Matplotlib (as matplotlib.pylab) and it imports a large number of functions and classes from the numerical and plotting packages of Python, i.e., numpy and matplotlib, so the definitions are readily available for use in the namespace without module prefix. Below is an example where a sinusoid is plotted with Numpy and Matplotlib functions:

    from pylab import *

    t = linspace(0, 10, 1000)
    plot(t, sin(2 * pi * 3 * t))
    show()

Some argue that the massive import of definitions with 'from pylab import *' pollutes the namespace and should not be used. Instead they argue you should use explicit imports, like:

    from numpy import linspace, pi, sin
    from matplotlib.pyplot import plot, show

    t = linspace(0, 10, 1000)
    plot(t, sin(2 * pi * 3 * t))
    show()

Or alternatively you should use a prefix, here with an alias:

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(0, 10, 1000)
    plt.plot(t, np.sin(2 * np.pi * 3 * t))
    plt.show()

This last example makes it clearer where the individual functions come from, probably making large Python code files more readable.
With 'from pylab import *' it is not immediately clear where the load function comes from, in this case the numpy.lib.npyio module, where the function reads pickle files. Similarly named functions in different modules can have different behavior. Jake Vanderplas pointed to this nasty example:

    >>> start = -1
    >>> sum(range(5), start)
    9
    >>> from numpy import *
    >>> sum(range(5), start)
    10

Here the built-in sum function behaves differently than numpy.sum as their interpretations of the second argument differ: the built-in function treats it as the start value of the summation, while numpy.sum treats it as the axis to sum over.

2.5.3 Coping with Python 2/3 incompatibility

There are a number of modules that have changed their name between Python 2 and 3, e.g., ConfigParser/configparser, cPickle/pickle and cStringIO/StringIO/io. Exception handling and aliasing can be used to make code Python 2/3 compatible:

    try:
        import ConfigParser as configparser
    except ImportError:
        import configparser

    try:
        from cStringIO import StringIO
    except ImportError:
        try:
            from StringIO import StringIO
        except ImportError:
            from io import StringIO

    try:
        import cPickle as pickle
    except ImportError:
        import pickle

After these imports you will, e.g., have the configuration parser module available as configparser.

2.6 Persistency

How do you store data between Python sessions? You could write your own file reading and writing functions or, perhaps better, rely on Python functions in the many different modules. The Python Standard Library (PSL) supports comma-separated values files (csv in PSL, or the third-party csvkit, which will handle UTF-8 encoded data) and JSON (json). PSL also has several XML modules, but developers may well prefer the faster lxml module, not only for XML, but also for HTML [18].

2.6.1 Pickle and JSON

Python also has its own special serialization format called pickle. This format can store not only data but also objects with methods, e.g., it can store a trained machine learning classifier as an object, and indeed you can discover that the nltk package stores a trained part-of-speech tagger as a pickled file.
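A round trip through both formats can be sketched in a few lines with the json and pickle modules of the standard library. The record below is invented for illustration; the sketch serializes to strings and bytes in memory rather than to files.

```python
import json
import pickle

# A made-up record with a string, a tuple and a Boolean.
record = {'title': 'An example paper', 'pages': (11, 19), 'cited': True}

# JSON is a text format with a small set of types: the tuple comes
# back as a list after the round trip.
json_text = json.dumps(record)
from_json = json.loads(json_text)

# Pickle is a Python-specific binary format that restores the original
# Python types. Only unpickle data that comes from a trusted source.
pickled = pickle.dumps(record)
from_pickle = pickle.loads(pickled)
```

JSON is the safer choice for exchanging plain data with other programs, while pickle is convenient when Python-specific objects have to survive between sessions.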
The power of pickle is also its downside: Pickle can embed dangerous code such as system calls that could erase your entire
