What is Data Science
Data science lies at the intersection of computer science, statistics, and substantive application domains. From computer science come machine learning and high-performance computing technologies for dealing with scale.
From statistics comes a long tradition of exploratory data analysis, significance testing, and visualization. From application domains in business and the sciences come challenges worthy of battle, and evaluation standards to assess when they have been adequately conquered.
Why data science, and why now? I see three reasons for this sudden burst of activity:
New technology makes it possible to capture, annotate, and store vast amounts of social media, logging, and sensor data. After you have amassed all this data, you begin to wonder what you can do with it.
Computing advances make it possible to analyze data in novel ways and at ever increasing scales. Cloud computing architectures give even the little guy access to vast power when they need it. New approaches to machine learning have led to amazing advances in longstanding problems, like computer vision and natural language processing.
Prominent technology companies (like Google and Facebook) and quantitative hedge funds (like Renaissance Technologies and TwoSigma) have proven the power of modern data analytics.
Success stories applying data to such diverse areas as sports management and election forecasting have served as role models to bring data science to a large popular audience.
This data science ecosystem has a series of tools that you use to build your solutions. This environment is undergoing a rapid advancement in capabilities, and new developments are occurring every day.
I will explain the tools I use in my daily work to perform practical data science. Next, I will discuss the following basic data methodologies.
A data lake is a storage repository for a massive amount of raw data. It stores data in native format, in anticipation of future requirements. You will acquire insights from this blog on why this is extremely important for practical data science and engineering solutions.
While a schema-on-write data warehouse stores data in predefined databases, tables, and records structures, a data lake uses a less restricted schema-on-read-based architecture to store data. Each data element in the data lake is assigned a distinctive identifier and tagged with a set of comprehensive metadata tags.
A data lake is typically deployed using distributed data object storage, to enable the schema-on-read structure. This means that business analytics and data mining tools access the data without a complex schema. Using a schema-on-read methodology enables you to load your data as is and start to get value from it instantaneously.
For deployment to the cloud, it is a cost-effective solution to use Amazon’s Simple Storage Service (Amazon S3) to store the base data for the data lake.
I will demonstrate the feasibility of using cloud technologies to provision your data science work. It is, however, not necessary to access the cloud to follow the examples in this blog, as they can easily be processed using a laptop.
Data vault modeling, designed by Dan Linstedt, is a database modeling method that is intentionally structured to be in control of the long-term historical storage of data from multiple operational systems.
The data vaulting processes transform the schema-on-read data lake into a schema-on-write data vault. The data vault is designed into the schema-on-read query request and then executed against the data lake.
I have also seen the results stored in a schema-on-write format, to persist the results for future queries. At this point, I expect you to understand only the rudimentary structures required to formulate a data vault.
The structure is built from three basic data structures: hubs, links, and satellites. Let’s examine the specific data structures, to clarify why they are compulsory.
Hubs contain a list of unique business keys with low propensity to change. They contain a surrogate key for each hub item and metadata classification of the origin of the business key.
Associations or transactions between business keys are modeled using link tables. These tables are essentially many-to-many join tables, with specific additional metadata.
The link is a singular relationship between hubs to ensure the business relationships are accurately recorded to complete the data model for the real-life business.
Hubs and links form the structure of the model but store no chronological or descriptive characteristics of the data. These characteristics are stored in separate tables known as satellites.
Satellites are the structures that store comprehensive levels of the information on business characteristics and are normally the largest volume of the complete data vault data structure.
The appropriate combination of hubs, links, and satellites helps the data scientist to construct and store prerequisite business relationships. This is a highly in-demand skill for a data modeler.
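The hub, link, and satellite roles described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a full data vault standard: every table name, column name, and sample row is hypothetical.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per unique business key, plus a surrogate key and origin metadata.
CREATE TABLE hub_customer (
    customer_sk   INTEGER PRIMARY KEY,
    customer_key  TEXT NOT NULL UNIQUE,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_sk    INTEGER PRIMARY KEY,
    product_key   TEXT NOT NULL UNIQUE,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);
-- Link: many-to-many association between business keys held in hubs.
CREATE TABLE link_purchase (
    purchase_sk   INTEGER PRIMARY KEY,
    customer_sk   INTEGER NOT NULL REFERENCES hub_customer(customer_sk),
    product_sk    INTEGER NOT NULL REFERENCES hub_product(product_sk),
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);
-- Satellite: descriptive and chronological attributes of a hub entry.
CREATE TABLE sat_customer_details (
    customer_sk   INTEGER NOT NULL REFERENCES hub_customer(customer_sk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_sk, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', 'crm', '2024-01-01')")
conn.execute("INSERT INTO hub_product VALUES (1, 'PROD-001', 'erp', '2024-01-01')")
conn.execute("INSERT INTO link_purchase VALUES (1, 1, 1, 'web-shop', '2024-01-02')")
# History is appended: a change of city adds a satellite row instead of updating one.
conn.executemany(
    "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)",
    [(1, "2024-01-01", "Ada Example", "Edinburgh"),
     (1, "2024-06-01", "Ada Example", "Glasgow")],
)

# Current view: business key joined to its most recent descriptive attributes.
current = conn.execute("""
    SELECT h.customer_key, s.city
    FROM hub_customer AS h
    JOIN sat_customer_details AS s ON s.customer_sk = h.customer_sk
    WHERE s.load_date = (SELECT MAX(load_date)
                         FROM sat_customer_details
                         WHERE customer_sk = h.customer_sk)
""").fetchall()
history_rows = conn.execute("SELECT COUNT(*) FROM sat_customer_details").fetchone()[0]
print(current, history_rows)
```

The hub stays stable while the satellite grows, which is why satellites normally hold most of the volume.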
Data Warehouse Bus Matrix
The Enterprise Bus Matrix is a data warehouse planning tool and model created by Ralph Kimball and used by numerous people worldwide over the last 40+ years. The bus matrix and architecture build upon the concept of conformed dimensions that are interlinked by facts.
The data warehouse is a major component of the solution required to transform data into actionable knowledge. This schema-on-write methodology supports business intelligence against actionable knowledge.
Basic Data Methodologies
Schema-on-Write and Schema-on-Read
There are two basic methodologies that are supported by the data processing tools. Following is a brief outline of each methodology and its advantages and drawbacks.
A traditional relational database management system (RDBMS) requires a schema before you can load the data. To retrieve data from your structured data schemas, you may have been running standard SQL queries for a number of years.
Benefits include the following:
In traditional data ecosystems, tools assume schemas and can only work once the schema is described, so there is only one view on the data.
The approach is extremely valuable in articulating relationships between data points, so there are already relationships configured.
It is an efficient way to store “dense” data.
All the data is in the same data store.
On the other hand, schema-on-write isn’t the answer to every data science problem.
The downsides of this approach include the following:
Its schemas are typically purpose-built, which makes them hard to change and maintain.
It generally loses the raw/atomic data as a source for future analysis.
It requires considerable modeling/implementation effort before being able to work with the data.
If a specific type of data can’t be stored in the schema, you can’t effectively process it from the schema.
At present, schema-on-write is a widely adopted methodology to store data.
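To make the schema-on-write trade-offs concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, field names, and sample records are hypothetical. The schema must exist before loading, a field the schema does not know about is lost, and a record that does not fit is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-on-write: the table definition must exist before any row is loaded.
conn.execute("""
    CREATE TABLE reading (
        sensor_id   TEXT NOT NULL,
        temperature REAL NOT NULL
    )
""")

records = [
    {"sensor_id": "s1", "temperature": 21.5},
    {"sensor_id": "s2", "temperature": 19.0, "humidity": 40},  # extra field
    {"sensor_id": "s3"},                                       # missing field
]

loaded, rejected = 0, []
for rec in records:
    try:
        # Only the columns named in the schema are written; "humidity" is dropped.
        conn.execute(
            "INSERT INTO reading (sensor_id, temperature) VALUES (?, ?)",
            (rec["sensor_id"], rec.get("temperature")),
        )
        loaded += 1
    except sqlite3.IntegrityError:
        # The NOT NULL constraint refuses records that do not fit the schema.
        rejected.append(rec)

average = conn.execute("SELECT AVG(temperature) FROM reading").fetchone()[0]
print(loaded, len(rejected), average)
```

Querying is simple and fast once the data is in, but the humidity value and the incomplete record are no longer available for later analysis.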
Schema-on-read is the alternative data storage methodology, and it does not require a schema before you can load the data. Fundamentally, you store the data with minimum structure. The essential schema is applied during the query phase.
Benefits include the following:
It provides flexibility to store unstructured, semi-structured, and disorganized data.
It allows for unlimited flexibility when querying data from the structure.
Leaf-level data is kept intact and untransformed for reference and use for the future.
The methodology encourages experimentation and exploration.
It increases the speed of generating fresh actionable knowledge.
It reduces the cycle time between data generation and the availability of actionable knowledge.
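The schema-on-read idea can be sketched in a few lines of standard-library Python. The raw lines, the field names, and the `read_with_schema` helper are all hypothetical. The point is that the same untouched raw data can be read through different schemas at query time.

```python
import json

# Raw events stored as-is, one JSON document per line (hypothetical data).
raw_lines = [
    '{"sensor_id": "s1", "temperature": 21.5}',
    '{"sensor_id": "s2", "temperature": 19.0, "humidity": 40}',
    '{"sensor_id": "s3"}',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> type) while reading, not while storing."""
    for line in lines:
        doc = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = doc.get(field)
            # Missing fields become None instead of causing a load failure.
            row[field] = cast(value) if value is not None else None
        yield row

# Two different "views" over the same raw data, defined only at query time.
temperature_view = {"sensor_id": str, "temperature": float}
humidity_view = {"sensor_id": str, "humidity": int}

temps = list(read_with_schema(raw_lines, temperature_view))
humid = list(read_with_schema(raw_lines, humidity_view))
print(temps)
print(humid)
```

Nothing is rejected or discarded at load time; the cost is that every reader has to decide how to handle missing or unexpected fields.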
I recommend a hybrid between schema-on-read and schema-on-write ecosystems for effective data science and engineering. I will discuss in detail why this specific ecosystem is the optimal solution when I cover the functional layer’s purpose in data science processing.
Data Science Processing Tools
Now that I have introduced data storage, the next step involves processing tools to transform your data lakes into data vaults and then into data warehouses. These tools are the workhorses of the data science and engineering ecosystem. Following are the recommended foundations for the data tools I use.
Apache Spark is an open source cluster computing framework. Originally developed at the AMP Lab of the University of California, Berkeley, the Spark code base was donated to the Apache Software Foundation, which now maintains it as an open source project. This tool is evolving at an incredible rate.
IBM is committing more than 3,500 developers and researchers to work on Spark-related projects and formed a dedicated Spark technology center in San Francisco to pursue Spark-based innovations.
SAP, Tableau, and Talend now support Spark as part of their core software stack.
Cloudera, Hortonworks, and MapR distributions support Spark as a native interface.
Spark offers an interface for programming distributed clusters with implicit data parallelism and fault-tolerance. Spark is a technology that is becoming a de-facto standard for numerous enterprise-scale processing applications.
I discovered the following modules using this tool as part of my technology toolkit.
Spark Core is the foundation of the overall development. It provides distributed task dispatching, scheduling, and basic I/O functionalities. This enables you to offload the comprehensive and complex running environment to the Spark Core.
This ensures that the tasks you submit are completed as anticipated. The distributed nature of the Spark ecosystem enables you to use the same processing request on a small Spark cluster, then on a cluster of thousands of nodes, without any code changes.
Spark SQL is a component on top of the Spark Core that presents a data abstraction called DataFrames. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames.
This feature of Spark enables ease of transition from your traditional SQL environments into the Spark environment. I have recognized its advantage when you want to enable legacy applications to offload the data from their traditional relational-only data storage to the data lake ecosystem.
Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
The process of streaming is the primary technique for importing data from the data source to the data lake.
Streaming is becoming the leading technique to load from multiple data sources. I have found that there are connectors available for many data sources. There is a major drive to build even more improvements on connectors, and this will improve the ecosystem even further in the future.
MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework used on top of the Spark Core by means of the distributed memory-based Spark architecture.
In Spark 2.0, the DataFrame-based spark.ml library was introduced to replace the RDD-based data processing model. It is planned that by the introduction of Spark 3.0, only DataFrame-based models will exist.
Common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including
Dimensionality reduction techniques, such as singular value decomposition (SVD) and principal component analysis (PCA)
Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
Collaborative filtering techniques, including alternating least squares (ALS)
Classification and regression: support vector machines, logistic regression, linear regression, decision trees, and naive Bayes classification
Cluster analysis methods, including k-means and latent Dirichlet allocation (LDA)
Optimization algorithms, such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
Feature extraction and transformation functions
GraphX is a powerful graph-processing application programming interface (API) for the Apache Spark analytics engine that can draw insights from large data sets. GraphX provides outstanding speed and capacity for running massively parallel and machine-learning algorithms.
The introduction of the graph-processing capability enables the processing of relationships between data entries with ease.
Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It delivers efficient resource isolation and sharing across distributed applications. The software enables resource sharing in a fine-grained manner, improving cluster utilization.
The Enterprise version of Mesos is Mesosphere Enterprise DC/OS. This runs containers elastically, and data services support Kafka, Cassandra, Spark, and Akka.
In a microservices architecture, I aim to construct services from granular processing units that communicate through lightweight protocols across the layers.
The toolkit and runtime methods shorten development of large-scale data-centric applications for processing. Akka is an actor-based message-driven runtime for running concurrency, elasticity, and resilience processes.
The use of high-level abstractions such as actors, streams, and futures makes it easier to build granular processing units for data science and engineering.
The use of actors enables the data scientist to spawn a series of concurrent processes by using a simple processing model that employs a messaging technique and specific predefined actions/behaviors for each actor. This way, the actor can be controlled and limited to perform the intended tasks only.
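The actor model described above can be illustrated without Akka itself. The following is a minimal sketch in standard-library Python, assuming a hypothetical `Actor` class with a mailbox and a single predefined behaviour. It is not the Akka API, only the same idea at small scale.

```python
import queue
import threading

class Actor:
    """A minimal actor: private state, a mailbox, and one handler thread."""

    def __init__(self, behaviour):
        self._mailbox = queue.Queue()
        self._behaviour = behaviour      # function(state, message) -> new state
        self.state = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Send a message asynchronously; callers never modify the state directly."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish its queued messages, then wait for it to exit."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:          # sentinel value: shut down
                break
            # Messages are handled one at a time, so no locks are needed.
            self.state = self._behaviour(self.state, message)

def add_behaviour(state, message):
    # The only action this actor is allowed to perform: add numbers.
    return state + message

counter = Actor(add_behaviour)
for n in range(1, 101):
    counter.tell(n)
counter.stop()
print(counter.state)
```

Because the behaviour is fixed when the actor is created, the actor is limited to the intended task, and concurrency comes from running many such actors side by side.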
Apache Cassandra is a large-scale distributed database supporting multi data center replication for availability, durability, and performance.
I use DataStax Enterprise (DSE) mainly to accelerate my own ability to deliver real-time value at epic scale, by providing a comprehensive and operationally simple data management layer with a unique always-on architecture built on Apache Cassandra.
The standard Apache Cassandra open source version works just as well, minus some extra management it does not offer as standard. I will just note that, for graph databases, as an alternative to GraphX, I am currently also using DataStax Enterprise Graph.
Apache Kafka is a high-scale messaging backbone that enables communication between data processing entities. The Apache Kafka streaming platform, consisting of Kafka Core, Kafka Streams, and Kafka Connect, is the foundation of the Confluent Platform.
The Confluent Platform is the main commercial supporter for Kafka (see confluent.io/). Most of the Kafka projects I am involved with now use this platform. Kafka components empower the capture, transfer, processing, and storage of data streams in a distributed, fault-tolerant manner throughout an organization in real time.
At the core of the Confluent Platform is Apache Kafka. Confluent extends that core to make configuring, deploying and managing Kafka less complex.
Kafka Streams is an open source solution that you can integrate into your application to build and execute powerful stream-processing functions.
This ensures Confluent-tested and secure connectors for numerous standard data systems. Connectors make it quick and stress-free to start setting up consistent data pipelines. These connectors are completely integrated with the platform, via the schema registry.
Kafka Connect enables the data processing capabilities that accomplish the movement of data into the core of the data solution from the edge of the business ecosystem.
Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability, reliability, and stress-free management. It combines the speed of search with the power of analytics, via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data.
R programming language
R is a programming language and software environment for statistical computing and graphics. The R language is widely used by data scientists, statisticians, data miners, and data engineers for developing statistical software and performing data analysis.
The capabilities of R are extended through user-created packages using specialized statistical techniques and graphical procedures. A core set of packages is contained within the core installation of R, with additional packages accessible from the Comprehensive R Archive Network (CRAN).
Knowledge of the following packages is a must:
• sqldf (data frames using SQL): This function reads a file into R while filtering data with an sql statement. Only the filtered part is processed by R, so files larger than those R can natively import can be used as data sources.
• forecast (forecasting of time series): This package provides forecasting functions for time series and linear models.
• dplyr (data aggregation): Tools for splitting, applying, and combining data within R
• stringr (string manipulation): Simple, consistent wrappers for common string operations
• RODBC, RSQLite, and Cassandra database connection packages: These are used to connect to databases, manipulate data outside R, and enable interaction with the source system.
• lubridate (time and date manipulation): Makes dealing with dates easier within R
• ggplot2 (data visualization): Creates elegant data visualizations, using the grammar of graphics. This is a super-visualization capability.
• reshape2 (data restructuring): Flexibly restructures and aggregates data, using just two functions: melt and dcast (or acast).
• randomForest (random forest predictive models): Leo Breiman and Adele Cutler’s random forests for classification and regression
• gbm (generalized boosted regression models): Yoav Freund and Robert Schapire’s AdaBoost algorithm and Jerome Friedman’s gradient boosting machine
I will provide examples that demonstrate the basic ideas and engineering behind the framework and the tools.
Please note that there are many other packages in CRAN, which is growing on a daily basis. Investigating the different packages to improve your capabilities in the R environment is time well spent.
Scala is a general-purpose programming language. Scala supports functional programming and a strong static type system. Many high-performance data science frameworks are constructed using Scala, because of its amazing concurrency capabilities. Parallelizing masses of processing is a key requirement for large data sets from a data lake.
Scala is emerging as the de-facto programming language used by data-processing tools. I provide guidance on how to use it, in the course of this blog. Scala is also the native language for Spark, and it is useful to master this language.
Python is a high-level, general-purpose programming language created by Guido van Rossum and released in 1991. It is important to note that it is an interpreted language.
Python has a design philosophy that emphasizes code readability. Python uses a dynamic type system and automatic memory management and supports multiple programming paradigms (object-oriented, imperative, functional programming, and procedural).
Thanks to its worldwide success, it has a large and comprehensive standard library. The Python Package Index (PyPI) supplies thousands of third-party modules ready for use for your data science projects. I provide guidance on how to use it, in the course of this blog.
I suggest that you also install Anaconda. It is an open source distribution of Python that simplifies package management and deployment of features (see www.continuum.io/downloads).
MQTT (MQ Telemetry Transport)
MQTT stands for MQ Telemetry Transport. It is an extremely simple and lightweight publish/subscribe messaging protocol.
It was intended for constrained devices and low-bandwidth, high-latency, or unreliable networks. This protocol is perfect for machine-to-machine- (M2M) or Internet-of-things-connected devices.
MQTT-enabled devices include handheld scanners, advertising boards, footfall counters, and other machines.
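The publish/subscribe pattern that MQTT is built around can be sketched in plain Python. The `Broker` class below is a hypothetical in-memory stand-in, not a real MQTT broker or client library, and the topic names are made up. It only shows how publishers and subscribers are decoupled through topics.

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for a message broker (not a real MQTT broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic; the publisher is never referenced."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Deliver a payload to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)

broker = Broker()
received = []
# A dashboard subscribes to the footfall counter at a store entrance.
broker.subscribe("store/entrance/footfall", lambda t, p: received.append((t, p)))

# The device publishes small payloads without knowing who is listening.
delivered = broker.publish("store/entrance/footfall", b"17")
ignored = broker.publish("store/exit/footfall", b"3")
print(delivered, ignored, received)
```

A real deployment would use an MQTT broker and client library over the network, but the device-side logic stays this small, which is what suits the protocol to constrained devices.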
Let’s begin by constructing a customer. I have created a fictional company for which you will perform the practical data science as you progress through this blog. You can execute your examples in either a Windows or Linux environment. You only have to download the desired example set.
On Windows, I suggest that you create a directory called c:\VKHCG to process all the examples in this blog. Next, from GitHub, download and unzip the DS_VKHCG_Windows.zip file into this directory.
On Linux, I suggest that you create a directory called ./VKHCG to process all the examples in this blog. Then, from GitHub, download and untar the DS_VKHCG_Linux.tar.gz file into this directory.
Warning If you change this directory to a new location, you will be required to change everything in the sample scripts to this new location, to get the maximum benefit from the samples.
These files are used to create the sample company’s script and data directory, which I will use to guide you through the processes and examples in the rest of the blog.
It’s Now Time to Meet Your Customer
Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a hypothetical medium-size international company. It consists of four sub-companies: Vermeulen PLC, Krennwallner AG, Hillman Ltd, and Clark Ltd.
Vermeulen PLC is a data processing company that processes all the data within the group companies, as part of their responsibility to the group. The company handles all the information technology aspects of the business.
This is the company for which you have just been hired to be the data scientist. Best of luck with your future.
The company supplies
Networks, servers, and communication systems
Internal and external websites
Data analysis business activities
For the purposes of this blog, I will explain what other technologies you need to investigate at every section of the framework, but the examples will concentrate only on specific concepts under discussion, as the overall data science field is more comprehensive than the few selected examples.
By way of examples, I will assist you in building a basic Data Science Technology Stack and then advise you further with additional discussions on how to get the stack to work at scale.
The examples will show you how to process the following business data:
A number of handy data science algorithms
I will explain how to
Create a network routing diagram using geospatial analysis
Build a directed acyclic graph (DAG) for the schedule of jobs, using graph theory
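As a first taste of the job-scheduling idea, here is a minimal sketch using the standard-library graphlib module (Python 3.9 or later). The job names and their dependencies are hypothetical. A topological sort of the DAG gives an order in which every job runs after the jobs it depends on.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each key depends on the jobs in its set.
job_dependencies = {
    "retrieve": set(),
    "assess": {"retrieve"},
    "process": {"assess"},
    "transform": {"process"},
    "report": {"transform", "assess"},
}

def build_schedule(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

schedule = build_schedule(job_dependencies)
print(schedule)

# A cycle means the graph is not a DAG, so no valid schedule exists.
try:
    build_schedule({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

Checking for cycles before scheduling is worthwhile, because a circular dependency between jobs would otherwise cause the schedule to stall.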
If you want to have a more detailed view of the company’s data, take a browse at these data sets in the company’s sample directory (./VKHCG/01-Vermeulen/00-RawData).
Krennwallner AG is an advertising and media company that prepares advertising and media content for the customers of the group.
Advertising on billboards
Advertising and content management for online delivery
Event management for key customers
Via a number of technologies, it records who watches what media streams. The specific requirement we will elaborate is how to identify the groups of customers who should see specific media content. I will explain how to
Pick content for specific billboards
Understand online website visitors’ data per country
Plan an event for top-10 customers at Neuschwanstein Castle
If you want to have a more in-depth view of the company’s data, have a glance at the sample data sets in the company’s sample directory (./VKHCG/02-Krennwallner/00-RawData).
The Hillman company is a supply chain and logistics company. It provisions a worldwide supply chain solution to the businesses, including
The principal requirement that I will expand on through examples is how you design the distribution of a customer’s products purchased online.
Through the examples, I will follow the product from factory to warehouse and warehouse to the customer’s door.
I will explain how to
Plan the locations of the warehouses within the United Kingdom
Plan shipping rules for best-fit international logistics
Choose what the best parking option is for shipping containers for a given set of products
Create an optimal delivery route for a set of customers in Scotland
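To give a feel for the delivery-route problem, here is a minimal sketch that tries every possible visiting order and keeps the shortest round trip. The depot, the customer locations, and the rounded coordinates are assumptions for illustration, distances are straight-line great-circle distances instead of road distances, and the brute-force search is only practical for a handful of stops.

```python
from itertools import permutations
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

# Approximate city-centre coordinates; depot and customers are hypothetical.
depot = (55.95, -3.19)  # Edinburgh
customers = {
    "Glasgow": (55.86, -4.25),
    "Aberdeen": (57.15, -2.09),
    "Inverness": (57.48, -4.22),
    "Dundee": (56.46, -2.97),
}

def route_length(stops):
    """Total length of depot -> stops in the given order -> depot."""
    points = [depot] + [customers[s] for s in stops] + [depot]
    return sum(haversine_km(p, q) for p, q in zip(points, points[1:]))

# Exhaustive search: 4 customers means only 24 candidate routes.
best_route = min(permutations(customers), key=route_length)
best_km = route_length(best_route)
print(best_route, round(best_km, 1))
```

The number of candidate routes grows factorially with the number of customers, so larger problems need heuristics or a dedicated optimization library.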
If you want to have a more detailed view of the company’s data, browse the data sets in the company’s sample directory (./VKHCG/03-Hillman/00-RawData).
The Clark company is a venture capitalist and accounting company that processes the following financial responsibilities of the group:
Venture capital management
Forex (foreign exchange) trading
I will use the financial aspects of the group companies to explain how you apply practical data science and data engineering to common problems for the hypothetical financial data.
I will explain to you how to prepare
A simple forex trading planner
Gross profit for sales
Gross profit after tax for sales
Return on capital employed (ROCE)
Accounts receivable days
Accounts payable days
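Several of the measures listed above reduce to short formulas. The sketch below uses commonly taught definitions, which may differ in detail from the ones applied later to the sample data, and all figures are made up for illustration. Gross profit after tax is left out because its definition depends on the tax treatment assumed.

```python
def gross_profit(revenue, cost_of_sales):
    """Gross profit = revenue minus cost of sales."""
    return revenue - cost_of_sales

def roce(ebit, total_assets, current_liabilities):
    """Return on capital employed = EBIT / (total assets - current liabilities)."""
    return ebit / (total_assets - current_liabilities)

def receivable_days(accounts_receivable, revenue, days=365):
    """Average number of days customers take to pay."""
    return accounts_receivable / revenue * days

def payable_days(accounts_payable, cost_of_sales, days=365):
    """Average number of days the company takes to pay its suppliers."""
    return accounts_payable / cost_of_sales * days

# Hypothetical annual figures for one group company.
figures = {
    "revenue": 1_000_000,
    "cost_of_sales": 600_000,
    "ebit": 150_000,
    "total_assets": 900_000,
    "current_liabilities": 150_000,
    "accounts_receivable": 100_000,
    "accounts_payable": 60_000,
}

gp = gross_profit(figures["revenue"], figures["cost_of_sales"])
capital_return = roce(figures["ebit"], figures["total_assets"], figures["current_liabilities"])
ar_days = receivable_days(figures["accounts_receivable"], figures["revenue"])
ap_days = payable_days(figures["accounts_payable"], figures["cost_of_sales"])
print(gp, capital_return, ar_days, ap_days)
```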
Five years ago, VKHCG consolidated its processing capability by transferring the concentrated processing requirements to Vermeulen PLC to perform data science as a group service.
This resulted in the other group companies sustaining 20% of the group business activities; however, 90% of the data processing of the combined group’s business activities was reassigned to the core team.
Vermeulen has since consolidated Spark, Python, Mesos, Akka, Cassandra, Kafka, Elasticsearch, and MQTT (MQ Telemetry Transport) processing into a group service provider and processing entity.
I will use R or Python for the data processing in the examples. I will also discuss the complementary technologies and advise you on what to consider and request for your own environment.
Note The complementary technologies are used regularly in the data science environment. Although I cover them briefly, that does not make them any less significant.
VKHCG uses the R processing engine to perform data processing in 80% of the company business activities, and the other 20% is done by Python.
Therefore, we will prepare an R and a Python environment to perform the examples. I will quickly advise you on how to obtain these additional environments if you require them for your own specific business requirements.
I will briefly cover the technologies that we are not using in the examples but that are known to be beneficial.
Scala is popular in the data science community, as it supports massively parallel processing in an at-scale manner. You can install the language from the following core site: www.scala-lang.org/download/. Cheat sheets and references are available to guide you to resources to help you master this programming language.
Note Many of my larger clients are using Scala as their strategic development language.
Apache Spark is a fast and general engine for large-scale data processing that is at present the fastest-growing processing engine for large-scale data science projects. You can install the engine from the following core site: spark.apache.org/.
For large-scale projects, I use the Spark environment within DataStax Enterprise, Hortonworks, Cloudera, and MapR.
Note Spark is now the most sought-after common processing engine for at-scale data processing, with support increasing by the day. I recommend that you master this engine if you want to advance your career in data science at-scale.
Apache Mesos abstracts CPU, memory, storage, and additional computation resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to effortlessly build and run processing solutions effectively.
It is industry proven to scale to tens of thousands of nodes. This empowers the data scientist to run massive parallel analysis and processing in an efficient manner.
The processing environment is available from the following core site: mesos.apache.org/. I want to give Mesosphere Enterprise DC/OS an honorable mention, as I use it for many projects.
Note Mesos is a cost-effective processing approach supporting growing dynamic processing requirements in an at-scale processing environment.
Akka supports building powerful concurrent and distributed applications to perform massive parallel processing while sharing the common processing platform at-scale. You can install the engine from the following core site: akka.io. I use Akka processing within the Mesosphere Enterprise DC/OS environment.
The Apache Cassandra database offers scalability and high availability without compromising performance. It has linear scalability and proven fault tolerance, and it is widely used by numerous big companies. You can install the engine from the following core site: http://cassandra.apache.org/.
I use Cassandra processing within the Mesosphere Enterprise DC/OS environment and DataStax Enterprise for my Cassandra installations.
Note I recommend that you consider Cassandra as an at-scale database, as it supports the data science environment with stable data processing capability.
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and impressively fast. You can install the engine from the following core site: http://kafka.apache.org/.
I use Kafka processing within the Mesosphere Enterprise DC/OS environment, to handle the ingress of data into my data science environments.
Note I advise that you look at Kafka as a data transport, as it supports the data science environment with a robust data collection facility.
Message Queue Telemetry Transport
Message Queue Telemetry Transport (MQTT) is a machine-to-machine (M2M) and Internet of things connectivity protocol. It is an especially lightweight publish/subscribe messaging transport.
It enables connections to locations where a small code footprint is essential, and a lack of network bandwidth is a barrier to communication. See http://mqtt.org/ for details.
Note This protocol is common in sensor environments, as it provisions the smaller code footprint and lower bandwidths that sensors demand.
Now that I have covered the items you should know about but are not going to use in the examples, let’s look at what you will use.
The examples require the following environment. The two setups required within VKHCG’s environment are Python and R.
Python is a high-level programming language created by Guido van Rossum and first released in 1991. Its reputation is growing, as today, various training institutes are covering the language as part of their data science prospectus.
I suggest you install Anaconda, to enhance your Python development. It is an open source distribution of Python that simplifies package management and deployment of features (see http://www.continuum.io/downloads).
I use an Ubuntu desktop and server installation to perform my data science (see http://www.ubuntu.com/), as follows:
sudo apt-get install python3 python3-pip python3-setuptools
If you want to use CentOS/RHEL, I suggest you employ the following install process:
sudo yum install python3 python3-pip python3-setuptools
If you want to use Windows, I suggest you employ the following install process. Download the software from http://www.python.org/downloads/windows/.
Is Python3 Ready?
Once the installation is completed, you must test your environment as follows:
On success, you should see a response like this
Congratulations, Python is now ready.
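One way to confirm the interpreter version from inside Python itself is sketched below. The `is_python3_ready` helper and its default minimum version are assumptions for illustration, so adjust the minimum to whatever your own environment requires.

```python
import sys

def is_python3_ready(minimum=(3, 0)):
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum

# Print the interpreter version and whether it meets the minimum.
version_text = sys.version.split()[0]
print(version_text, "ready" if is_python3_ready() else "too old")
```

Save this to a file and run it with the same python3 command you plan to use for the examples, so that you test the interpreter you will actually be working with.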
One of the most important features of Python is its libraries, which are extensively available and make it stress-free to include verified data science processes into your environment.
To investigate extra packages, I suggest you review the PyPI—Python Package Index (https://pypi.python.org/).
You have to set up a limited set of Python libraries to enable you to complete the examples.
Warning Please ensure that you have verified all the packages you use. Remember: Open source is just that—open. Be vigilant!
Pandas provides a high-performance set of data structures and data-analysis tools for use in your data science.
On Ubuntu, install this by using
sudo apt-get install python-pandas
On CentOS/RHEL, install this by using
sudo yum install python-pandas
With pip, install this by using
pip install pandas
More information on Pandas development is available at http://pandas.pydata.org/.
I suggest following the cheat sheet (https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf), to guide you through the basics of using Pandas.
I will explain, via examples, how to use these Pandas tools.
Note I suggest that you master this package, as it will support many of your data loadings and storing processes, enabling overall data science processing.
Matplotlib is a Python 2D and 3D plotting library that can produce various plots, histograms, power spectra, bar charts, error charts, scatterplots, and many more advanced visualizations of your data science results.
On Ubuntu, install this by using
sudo apt-get install python-matplotlib
On CentOS/RHEL, install this by using
sudo yum install python-matplotlib
With pip, install this by using
pip install matplotlib
Explore http://matplotlib.org/ for more details on the visualizations that you can accomplish with exercises included in these packages.
Note I recommend that you spend time mastering your visualization skills. Without these skills, it is nearly impossible to communicate your data science results.
NumPy is the fundamental package for scientific computing, based on a general homogeneous multidimensional array structure with processing tools. Explore http://www.numpy.org/ for further details.
I will use some of the tools in the examples but suggest you practice with the general tools, to assist you with your future in data science.
SymPy is a Python library for symbolic mathematics. It assists you in simplifying complex algebra formulas before including them in your code. Explore http://www.sympy.org for details on this package’s capabilities.
Scikit-Learn is an efficient set of tools for data mining and data analysis. It provides support for data science classification, regression, clustering, dimensionality reduction, and preprocessing for feature extraction and normalization. This tool supports both supervised learning and unsupervised learning processes.
I will use many of the processes from this package in the examples. Explore http://scikit-learn.org for more details on this wide-ranging package. Congratulations. You are now ready to execute the Python examples. Now, I will guide you through the second setup for the R environment.
R is the core processing engine for statistical computing and graphics. Download the software from http://www.r-project.org/ and follow the installation guide for the specific R installation you require.
On Ubuntu, install this by using
sudo apt-get install r-base
On CentOS/RHEL, install this by using
sudo yum install R
On Windows, install the software that matches your environment from https://cran.r-project.org/bin/windows/base/.
VKHCG uses the RStudio development environment for its data science and engineering within the group.
RStudio produces a stress-free R ecosystem containing a code editor, debugging, and a visualization toolset. Download the relevant software from http://www.rstudio.com/ and follow the installation guide for the specific installation you require.
On Ubuntu, install this by using
sudo dpkg -i *.deb
On CentOS/RHEL, install this by using
wget https://download1.rstudio.org/rstudio-1.0.143-x86_64.rpm
sudo yum install --nogpgcheck rstudio-1.0.143-x86_64.rpm
I suggest the following additional R packages to enhance the default R environment.
The data.table package enables you to work with data files more effectively. I suggest that you practice using data.table processing, to enable you to process data quickly in the R environment and empower you to handle data sets that are up to 100GB in size.
The documentation is available at https://cran.r-project.org/web/packages/data.table/data.table.pdf. See https://CRAN.R-project.org/package=data.table for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use the following command:
The readr package enables the quick loading of text data into the R environment. The documentation is available at https://cran.r-project.org/web/packages/readr/readr.pdf.
See https://CRAN.R-project.org/package=readr for up-to-date information on the package.
To install the package, I advise you to open your RStudio IDE and use the following command:
I suggest that you practice by importing and exporting different file formats, to understand the workings of this package and master the process. I also suggest that you investigate the following functions in depth in the readr package:
spec_delim(): Supports getting the specifications of the file without reading it into memory
read_delim(): Supports reading of delimited files into the R environment
write_delim(): Exports data from an R environment to a file on disk
JSON Lite Package
This package enables you to process JSON files easily, as it is an optimized JSON parser and generator specifically for statistical data.
The documentation is at https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf.
See https://CRAN.R-project.org/package=jsonlite for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use the following command:
I also suggest that you investigate the following functions in the package:
fromJSON(): This enables you to import directly into the R environment from a JSON data source.
prettify(): This improves human readability by formatting the JSON with indentation and line breaks.
minify(): Removes all the JSON indentation/whitespace to make the JSON machine readable and optimized
toJSON(): Converts R data into JSON formatted data
read_json(): Reads JSON from a disk file
write_json(): Writes JSON to a disk file
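To make the roles of these functions concrete, here is a minimal sketch of the same operations using Python's standard `json` module. It is an analogy, not jsonlite itself, and the record is made-up illustration data.

```python
import json

# Made-up record, purely for illustration.
record = {"sensor": "s1", "readings": [1, 2, 3]}

# Compact form with no extra whitespace (comparable in spirit to minify()).
compact = json.dumps(record, separators=(",", ":"))

# Indented form for human readers (comparable in spirit to prettify()).
pretty = json.dumps(record, indent=2)

# Parse JSON text back into native data (comparable in spirit to fromJSON()).
restored = json.loads(compact)

print(compact)
print(pretty)
```

Both text forms carry the same data; only the whitespace differs, which is why parsing either one gives back the original record.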
Visualization of data is a significant skill for the data scientist. The ggplot2 package supports you with an environment in which to build complex graphics for your data.
It builds detailed graphics from layered components, an approach known as “The Grammar of Graphics.”
The documentation is located at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.
To install the package, I suggest that you open your RStudio IDE and use the following command:
I recommend that you master this package to empower you to transform your data into a graphic you can use to demonstrate to your business the value of the results.
The packages we now have installed will support the examples.
Amalgamation of R with Spark
I want to discuss an additional package because I see its mastery as a major skill you will require to work with current and future data science.
The sparklyr package interfaces the R environment with the distributed Spark environment and supplies an interface to Spark’s built-in machine-learning algorithms. A number of my customers are using Spark as the standard interface to their data environments.
Understanding this collaboration empowers you to support the processing of at-scale environments, without major alterations in the R processing code.
The documentation is at https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf.
See https://CRAN.R-project.org/package=sparklyr for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use the following command:
sparklyr is a direct R interface for Apache Spark that provides a complete dplyr back end.
Once the filtering and aggregation of Spark data sets is completed downstream in the at-scale environment, the package imports the data into the R environment for analysis and visualization.
Computer Science, Data Science, and Real Science
Computer scientists, by nature, don't respect data. They have traditionally been taught that the algorithm was the thing, and that data was just meat to be passed through a sausage grinder.
So to qualify as an effective data scientist, you must first learn to think like a real scientist. Real scientists strive to understand the natural world, which is a complicated and messy place.
By contrast, computer scientists tend to build their own clean and organized virtual worlds and live comfortably within them. Scientists obsess about discovering things, while computer scientists invent rather than discover.
People's mindsets strongly color how they think and act, causing misunderstandings when we try to communicate outside our tribes. So fundamental are these biases that we are often unaware we have them. Examples of the cultural differences between computer science and real science include:
Data vs. method centrism:
Scientists are data-driven, while computer scientists are algorithm-driven. Real scientists spend enormous amounts of effort collecting data to answer their question of interest. They invent fancy measuring devices, stay up all night tending to experiments, and devote most of their thinking to how to get the data they need.
By contrast, computer scientists obsess about methods: which algorithm is better than which other algorithm, which programming language is best for a job, which program is better than which other program. The details of the data set they are working on seem comparatively unexciting.
Concern about results: Real scientists care about answers. They analyze data to discover something about how the world works. Good scientists care about whether the results make sense because they care about what the answers mean.
By contrast, bad computer scientists worry about producing plausible-looking numbers. As soon as the numbers stop looking grossly wrong, they are presumed to be right. This is because they are personally less invested in what can be learned from a computation, as opposed to getting it done quickly and efficiently.
Real scientists are comfortable with the idea that data has errors. In general, computer scientists are not. Scientists think a lot about possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived from them.
Good programmers use strong data-typing and parsing methodologies to guard against formatting errors, but the concerns here are different.
Aspiring data scientists must learn to think like real scientists. Your job is going to be to turn numbers into insight. It is important to understand the why as much as the how.
To be fair, it benefits real scientists to think like data scientists as well. New experimental technologies enable measuring systems on a vastly greater scale than ever possible before, through technologies like full-genome sequencing in biology and full-sky telescope surveys in astronomy. With the new breadth of view comes new levels of vision.
Traditional hypothesis-driven science was based on posing a specific hypothesis about the world and then generating the specific data needed to confirm or deny it.
This is now augmented by data-driven science, which instead focuses on generating data on a previously unheard-of scale or resolution, in the belief that new discoveries will come as soon as one is able to look at it. Both ways of thinking will be important to us:
Given a problem, what available data will help us answer it?
Given a data set, what interesting problems can we apply it to?
There is another way to capture this basic distinction between software engineering and data science. It is that software developers are hired to build systems, while data scientists are hired to produce insights.
This may be a point of contention for some developers. There exists an important class of engineers who wrangle the massive distributed infrastructures necessary to store and analyze, say, financial transaction or social media data at the full scale of Facebook or Twitter.
These engineers are building tools and systems to support data science, even though they may not personally mine the data they wrangle. Do they qualify as data scientists?
This is a fair question, one I will finesse a bit so as to maximize the potential readership of this blog. But I do believe that the better such engineers understand the full data analysis pipeline, the more likely they will be able to build powerful tools capable of providing important insights.
A major goal of this blog is providing big data engineers with the intellectual tools to think like big data scientists.
Asking Interesting Questions from Data
Good data scientists develop an inherent curiosity about the world around them, particularly in the associated domains and applications they are working on. They enjoy talking shop with the people whose data they work with. They ask them questions:
What is the coolest thing you have learned about this field? Why did you get interested in it? What do you hope to learn by analyzing your data set? Data scientists always ask questions.
Good data scientists have wide-ranging interests. They read the newspaper every day to get a broader perspective on what is exciting. They understand that the world is an interesting place.
Knowing a little something about everything equips them to play in other people's backyards. They are brave enough to get out of their comfort zones a bit, and driven to learn more once they get there.
Software developers are not really encouraged to ask questions, but data scientists are. We ask questions like:
What things might you be able to learn from a given data set? What do you/your people really want to know about the world? What will it mean to you once you find out?
Computer scientists traditionally do not really appreciate data. Think about the way algorithm performance is experimentally measured. Usually, the program is run on “random data” to see how long it takes.
They rarely even look at the results of the computation, except to verify that it is correct and efficient. Since the “data” is meaningless, the results cannot be important. In contrast, real data sets are a scarce resource, which required hard work and imagination to obtain.
Becoming a data scientist requires learning to ask questions about data, so let's practice. Each of the subsections below will introduce an interesting data set. After you understand what kind of information is available, try to come up with, say, five interesting questions you might explore/answer with access to this data set.
Properties of Data
This blog is about techniques for analyzing data. But what is the underlying material that we will be studying? This section provides a brief taxonomy of the properties of data, so we can better appreciate and understand what we will be working on.
Structured vs. Unstructured Data
Certain datasets are nicely structured, like the tables in a database or spreadsheet program. Others record information about the state of the world, but in a more heterogeneous way. Perhaps it is a large text corpus with images and links like Wikipedia, or the complicated mix of notes and test results appearing in personal medical records.
Generally speaking, this blog will focus on dealing with structured data. Data is often represented by a matrix, where the rows of the matrix represent distinct items or records, and the columns represent distinct properties of these items. For example, a data set about U.S. cities might contain one row for each city, with columns representing features like state, population, and area.
When confronted with an unstructured data source, such as a collection of tweets from Twitter, our first step is generally to build a matrix to structure it. A bag of words model will construct a matrix with a row for each tweet, and a column for each frequently used vocabulary word. Matrix entry M[i, j] then denotes the number of times tweet i contains word j.
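The bag of words construction can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: words are split on whitespace after lowercasing, the `min_count` threshold stands in for "frequently used", and the two tweets are made up.

```python
from collections import Counter


def bag_of_words(documents, min_count=1):
    """Build a vocabulary and a document-by-word count matrix.

    Tokenization is a plain lowercase whitespace split (an assumption);
    words seen fewer than `min_count` times overall are dropped.
    """
    tokenized = [doc.lower().split() for doc in documents]
    totals = Counter(word for tokens in tokenized for word in tokens)
    vocabulary = sorted(w for w, c in totals.items() if c >= min_count)
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Row i holds the count of each vocabulary word in document i.
        matrix.append([counts[w] for w in vocabulary])
    return vocabulary, matrix


# Made-up tweets, purely for illustration.
tweets = ["data science is fun", "science needs data and more data"]
vocabulary, M = bag_of_words(tweets)
print(vocabulary)
print(M)
```

Each row of `M` is one tweet and each column one vocabulary word, so `M[1][vocabulary.index("data")]` is the number of times the second tweet uses the word "data".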
Quantitative vs. Categorical Data
Quantitative data consists of numerical values, like height and weight. Such data can be incorporated directly into algebraic formulas and mathematical models, or displayed in conventional graphs and charts.
By contrast, categorical data consists of labels describing the properties of the objects under investigation, like gender, hair color, and occupation. This descriptive information can be every bit as precise and meaningful as numerical data, but it cannot be worked with using the same techniques.
Categorical data can usually be coded numerically. For example, gender might be represented as male = 0 or female = 1. But things get more complicated when there are more than two possible values per feature, especially when there is not an implicit order between them.
We may be able to encode hair colors as numbers by assigning each shade a distinct value like gray hair = 0, red hair = 1, and blond hair = 2. However, we cannot really treat these values as numbers, for anything other than simple identity testing.
Does it make any sense to talk about the maximum or minimum hair color? What is the interpretation of my hair color minus your hair color?
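One common workaround, which the text above does not itself describe, is one-hot encoding: each category gets its own 0/1 column, so no artificial order or arithmetic is implied. The sketch below is a minimal illustration; the function name `one_hot` and the sample values are assumptions.

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator rows.

    Returns the sorted list of categories and one row per input value,
    with a 1 in the column of that value's category.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up hair colors, purely for illustration.
hair = ["gray", "red", "blond", "red"]
categories, encoded = one_hot(hair)
print(categories)
print(encoded)
```

With this representation, the only operation that makes sense on a column is the identity test the text mentions: is this person's hair red or not.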
Most of what we do in this blog will revolve around numerical data. But keep an eye out for categorical features, and methods that work for them. Classification and clustering methods can be thought of as generating categorical labels from numerical data and will be a primary focus in this blog.
Data Science Glossary
Accuracy (Rate): A commonly used metric for evaluating a classification system across all of the classes it predicts. It denotes the proportion of data points predicted correctly. It works well for balanced datasets but can be misleading in many other cases, such as heavily imbalanced classes.
Anomaly Detection: A data science methodology that focuses on identifying abnormal data points. These belong to a class of interest and are generally significantly fewer than the data points of any other class of the dataset. Anomaly detection is sometimes referred to as novelty detection.
Area Under Curve (AUC) metric: A metric for a binary classifier’s performance based on the ROC curve. It takes into account the confidence of the classifier and is generally considered a robust performance index.
Artificial Creativity: An application of AI where the AI system emulates human creativity in a variety of domains, including painting, poetry, music composition, and even problem-solving.
Artificial Intelligence (AI): A field of computer science dealing with the emulation of human intelligence using computer systems and its applications in a variety of domains. AI application in data science is a noteworthy and important factor in the field and has been since the 2000s.
Artificial Neural Network (ANN): A graph-based artificial intelligence system which implements the universal approximator idea. Although ANNs started as a machine learning system focusing on predictive analytics, they have expanded over the years to include a large variety of tasks. They are composed of a series of nodes called neurons, which are organized in layers.
The first layer corresponds to all the inputs, the final layer to all the outputs, and the intermediary layers to a series of meta-features the ANN creates, each having a corresponding weight. ANNs are stochastic in nature, so every time they are trained over a set of data, the weights are noticeably different.
Association Rules: Empirical rules derived from a set of data aimed at connecting different entities in that data. Usually, the data is unlabeled, and this methodology is part of data exploration.
Autoencoder: An artificial neural network system designed to represent codings in a very efficient manner. Autoencoders are a popular artificial intelligence system used for dimensionality reduction.
Big Data: Datasets that are so large and/or complex that they are virtually impossible to process with traditional data processing systems. Challenges include querying, analysis, capture, search, sharing, storage, transfer, and visualization.
The ability to process big data can lead to decisions that are more confident, more cost-effective, and less risky, and to greater operational efficiency overall.
Binning: Also known as discretization, binning refers to the transformation of a continuous variable into a discrete one.
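As a minimal sketch of binning, the function below uses equal-width bins, which is only one of several possible binning schemes and is an assumption here. The function name and the sample heights are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1.

    Bins have equal width between the minimum and maximum value;
    the maximum itself is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up heights in centimeters, purely for illustration.
heights = [150, 160, 170, 180, 190]
bins = equal_width_bins(heights, 2)
print(bins)
```

The continuous heights become a discrete variable with two levels, which can then be treated like any other categorical feature.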
Bootstrapping: A resampling method for performing sensitivity analysis, using the same sample repeatedly in order to get a better generalization of the population it represents and provide an estimate of the stability of the metric we have based on this sample.
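A minimal sketch of the idea, assuming the metric of interest is the sample mean: resample the same data with replacement many times and look at how much the metric varies. The sample values, the number of resamples, and the fixed seed are illustrative assumptions.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `sample`.

    Each resample draws len(sample) points with replacement.
    A fixed seed keeps the sketch reproducible.
    """
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n))
            for _ in range(n_resamples)]


# Made-up measurements, purely for illustration.
sample = [2.1, 2.5, 1.9, 3.0, 2.4, 2.2]
means = bootstrap_means(sample)

# The spread of the resampled means estimates the stability of the metric.
spread = statistics.stdev(means)
print(round(statistics.fmean(means), 3), round(spread, 3))
```

A small spread suggests the mean is stable with respect to which points happened to land in the sample; a large one suggests caution.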
Bug (in programming): An issue with an algorithm or its implementation. The process of fixing them is called debugging.
Business Intelligence (BI): A sub-field of data analytics focusing on basic data analysis of business-produced data for the purpose of improving the function of a business. BI is not the same as data science, though it does rely mainly on statistics as a framework.
Butterfly Effect: A phenomenon studied in chaos theory where a minute change in the original inputs of a system yields a substantial change in its outputs. Originally the butterfly effect only applied to highly complex systems (e.g. weather forecasts), but it has been observed in other domains, including data science.
Chatbot: An artificial intelligence system that emulates a person on a chat application. A chatbot takes its input as text, processes it in an efficient manner, and yields a reply in text format. A chatbot may also carry out simple tasks based on its inputs. It can reply with a question in order to clarify the objective involved.
Classification: A very popular data science methodology under the predictive analytics umbrella. Classification aims to solve the problem of assigning a label (class) to a data point based on pre-existing knowledge of categorized data available in the training set.
Cloud (computing): A model that enables easy, on-demand access to a network of shareable computing resources that can be configured and customized to the application at hand. The cloud is a very popular resource in large-scale data analytics and a common resource for data science applications.
Clustering: A data exploration methodology that aims to find groupings in the data, yielding labels based on these groupings. Clustering is very popular when processing unlabeled data, and in some cases, the labels it provides are used for classification afterward.
Computer Vision: An application of artificial intelligence where a computer is able to discern a variety of visual inputs and effectively “see” many different real-world objects in real-time. Computer vision is an essential component of all modern robotics systems.
Confidence: A metric that aims to reflect the probability that another metric is correct. Usually, it takes values between 0 and 1 (inclusive). Confidence is linked to statistics but it lends itself to heuristics and machine learning systems as well.
Confidentiality: The aspect of information security that has to do with keeping privileged information accessible to only those who should have access to it. Confidentiality is linked to privacy, though it encompasses other things, such as data anonymization and data security.
Confusion Matrix: A k-by-k matrix depicting the hits and misses of a classifier for a problem involving k classes.
For a binary problem (involving two classes only), the matrix is comprised of various combinations of hits (trues) and misses (falses) referred to as true positives (cases of value 1 predicted as 1), true negatives (cases of value 0 predicted as 0), false positives (cases of value 0 predicted as 1), and false negatives (cases of value 1 predicted as 0). The confusion matrix is the basis for many evaluation metrics.
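The four cells of a binary confusion matrix, and a few metrics built on them, can be sketched as follows. This is a minimal illustration that assumes labels are coded as 0 and 1; the label lists are made up.

```python
def binary_confusion(actual, predicted):
    """Count hits and misses for labels coded as 0 and 1."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1   # value 1 predicted as 1
        elif a == 0 and p == 0:
            counts["tn"] += 1   # value 0 predicted as 0
        elif a == 0 and p == 1:
            counts["fp"] += 1   # value 0 predicted as 1
        else:
            counts["fn"] += 1   # value 1 predicted as 0 (given 0/1 labels)
    return counts


# Made-up labels, purely for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = binary_confusion(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, accuracy, precision, recall)
```

Accuracy uses all four cells, while precision and recall each look at only one kind of miss, which is why they are reported per class.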
Correlation (coefficient): A metric of how closely related two continuous variables are in a linear manner.
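As a minimal sketch, the function below computes the Pearson correlation coefficient, which is the usual choice for linear relationships but is an assumption here, since the entry does not name a specific coefficient. The input lists are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation numerators.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values, purely for illustration.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(rising, falling)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. The function is undefined when either variable is constant.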
Cost Function: A function for evaluating the amount of damage the total of all misclassifications amounts to, based on individual costs pre-assigned to different kinds of errors. A cost function is a popular performance metric for complex classification problems.
Cross-entropy: A metric of how the addition of a variable affects the entropy of another variable.
Dark Data: Unstructured data, or any form of data where information is unusable. Dark data constitutes the majority of available data today.
Data Anonymization: The process of changing the data so that it cannot be used to identify any particular individual via the data that corresponds to him or her.
Data Analytics: A general term to describe the field involving data analysis as its main component. Data analytics is more general than data science, although the two terms are often used interchangeably.
Data Analyst: Anyone performing basic data analysis, usually using statistical approaches only, without any applicability on larger and/or more complex datasets. Data analysts usually rely on a spreadsheet application and/or basic statistics software for their work.
Data Anonymization: The process of removing or hiding personally identifiable information (PII) from the data analyzed.
Data Cleansing: An important part of data preparation, it involves removing corrupt or otherwise problematic data (e.g. unnecessary outliers) to ensure a stronger signal. After data cleansing, data starts to take the form of a dataset.
Data Discovery: The part of the data modeling stage in the data science pipeline that has to do with pinpointing patterns in the data that may lead to building a more relevant and more accurate model in the stages that follow.
Data Engineering: The first stage of the data science pipeline, responsible for cleaning, exploring, and processing the data so that it can become structured and useful in a model developed in the following stage of the pipeline.
Data Exploration: The part of the data engineering stage in the data science pipeline that has to do with getting a better understanding of the data through plots and descriptive statistics, as well as other methods, such as clustering.
The visuals produced here are for the benefit of the data scientists involved, and may not be used in the later parts of the pipeline.
Data Frame: A data structure similar to a database table that is capable of containing different types of variables and performing advanced operations on its elements.
Data Governance: Managing data (particularly big data) in an efficient manner so that it is stored, transferred, and processed effectively. This is done with frameworks like Hadoop and Spark.
Data Learning: A crucial step in the data science pipeline, focusing on training and testing a model for providing insights and/or being part of a data product. Data learning is in the data modeling stage of the pipeline.
Data Mining: The process of finding patterns in data, usually in an automated way. Data mining is a data exploration methodology.
Data Modeling: A crucial stage in the data science pipeline, involving the creation of a model through data discovery and data learning.
Data Point: A single row in a dataset, corresponding to a single record of a database.
Data Preparation: A part of the data engineering stage in the data science pipeline focusing on setting up the data for the stages that follow. Data preparation involves data cleansing and normalization, among other things.
Data Representation: A part of the data engineering stage in the data science pipeline, focusing on using the most appropriate data types for the variables involved, as well as the coding of the relevant information in a set of features.
Data Science: The interdisciplinary field undertaking data analytics work on all kinds of data, with a focus on big data, for the purpose of mining insights and/or building data products.
Data Security: An aspect of confidentiality that involves keeping data secure from dangers and external threats (e.g. malware).
Data Structure: A collection of data points in a structured form used in programming as well as various parts of the data science pipeline.
Data Visualization: A part of the data science pipeline focusing on generating visuals (plots) of the data, the model’s performance, and the insights found. The visuals produced here are mainly for the stakeholders of the project.
Database: An organized system for storing and retrieving data using a specialized language. The data can be structured or unstructured, corresponding to SQL and NoSQL databases. Accessing databases is a key process for acquiring data for a data science project.
Dataset: A structured data collection, usually directly usable in a data science model. Datasets may still have a lot to benefit from data engineering.
Deep Learning (DL): An artificial intelligence methodology employing large artificial neural networks to tackle highly complex problems. DL systems require a lot of data in order to yield a real advantage in terms of performance.
Dimensionality Reduction: A fairly common method in data analytics aiming to reduce the number of variables in a dataset. This can be accomplished either with meta-features, each one condensing the information of a number of features or with the elimination of several features of low quality.
Discretization: See binning.
Encryption: The process of turning comprehensible and/or useful data into gibberish using a reversible process (encryption system) and a key. The latter is usually a password, a passphrase, or a whole file. Encryption is a key aspect of data security.
Ensemble: A set of predictive analytics models bundled together in order to improve performance. An ensemble can be comprised of a set of models of the same category, but it can also consist of different model types.
Entropy: A metric of how much disorder exists in a given variable. This is defined for all kinds of variables.
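A minimal sketch for a discrete variable, assuming the Shannon definition of entropy measured in bits (the entry does not specify a formula). The sample lists are made up.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the observed values of a variable."""
    counts = Counter(values)
    n = len(values)
    # Sum of -p * log2(p) over the observed relative frequencies p.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up samples, purely for illustration.
ordered = shannon_entropy(["a", "a", "a", "a"])
mixed = shannon_entropy(["a", "b", "a", "b"])
print(ordered, mixed)
```

A variable that always takes the same value has zero entropy, while an even split between two values has one bit, the maximum for a binary variable.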
Error Rate: Denotes the proportion of data points predicted incorrectly. Good for balanced datasets.
Ethics: A code of conduct for a professional. In data science, ethics revolves around things like data security, privacy, and proper handling of the insights derived from the data analyzed.
Experiment (data science related): A process involving the application of the scientific method on a data science question or problem.
F1 Metric: A popular performance metric for classification systems defined as the harmonic mean of precision and recall, and just like them, corresponds to a particular class.
In cases of unbalanced datasets, it is more meaningful than the accuracy rate. F1 belongs to a family of similar metrics, each a function of precision (P) and recall (R) of the form $F_\beta = (1 + \beta^2) \frac{P \cdot R}{\beta^2 P + R}$, where β is a coefficient that sets the relative importance of recall versus precision in the particular aggregation metric $F_\beta$ (β = 1 weights the two equally).
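The Fβ family from this entry can be sketched directly from its formula. The function name `f_beta`, the zero-denominator convention, and the sample precision and recall values are assumptions made for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    Returns 0.0 when the denominator is zero (a convention assumed here).
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Made-up precision and recall values, purely for illustration.
f1 = f_beta(1.0, 0.5)            # harmonic mean of precision and recall
f2 = f_beta(1.0, 0.5, beta=2.0)  # beta > 1 pulls the score toward recall
print(f1, f2)
```

With perfect precision but mediocre recall, raising β lowers the score, because larger β values give recall more weight.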
False Negative: In a binary classification problem, it is a data point of class 1, predicted as class 0. See confusion matrix for more context.
False Positive: In a binary classification problem, it is a data point of class 0, predicted as class 1. See confusion matrix for more context.
Feature: A processed variable capable of being used in a data science model.
Features are generally the columns of a dataset.
Fitness Function: An essential part of most artificial intelligence systems, particularly those related to optimization. It depicts how close the system is getting to the desired outcome and helps it adjust its course accordingly.
Functional Programming: A programming paradigm where the programming language is focused on functions rather than objects or processes, thereby eliminating the need of a global variable space. Scripts of functional languages are modular and easy to debug.
Fusion: Usually used in conjunction with feature (e.g. feature fusion), this relates to the merging of a set of features into a single meta-feature that encapsulates all, or at least most, of the information in those features.
Fuzzy Logic: An artificial intelligence methodology that involves a flexible approach to the states a variable takes. For example, instead of having the states “hot” and “cold” in the variable “temperature,” Fuzzy Logic allows for different levels of “hotness” making for a more human kind of reasoning. For more information about Fuzzy Logic check out MathWorks’ webpage on the topic: http://bit.ly/2sBVQ3M.
Generalization: A key characteristic of a data science model where the system is able to handle data beyond its training set in a reliable way.
Git: A version control system that is popular among developers and data scientists alike. Unlike some other systems, Git is decentralized, making it more robust.
GitHub: A cloud-based hosting service for Git repositories, accessible through a web browser.
Graph Analytics: A data science methodology making use of Graph Theory to tackle problems through the analysis of the relationships among the entities involved.
Hadoop: An established data governance framework for both managing and storing big data on a local computer cluster or in a cloud setting.
HDFS: Short for Hadoop Distributed File System, HDFS enables the storage and access of data across several computers for easier processing through a data governance system (not just Hadoop).
Hypothesis: An educated guess related to the data at hand about a number of scenarios, such as the relationship between two variables or the difference between two samples. Hypotheses are tested via experiments to determine their validity.
Heuristic: An empirical metric or function that aims to provide some useful tool or insight, to facilitate a data science method or project.
IDE: Short for Integrated Development Environment, an IDE greatly facilitates the creation and running of scripts as well as their debugging.
Index of Discernibility: A family of heuristics created by the author that aim to evaluate features (and in some cases individual data points) for classification problems.
Information Distillation: A stage of the data science pipeline which involves the creation of data products and/or the deliverance of insights and visuals based on the analysis conducted in the project.
Insight: A non-obvious and useful piece of information derived from the use of a data science model on some data.
Internet of Things (IoT): A new technological framework that enables all kinds of devices (even common appliances) to have Internet connectivity. This greatly increases the amount of data collected and usable in various aspects of everyday life.
Julia: A modern programming language that supports the functional programming paradigm and combines characteristics of both high-level and low-level languages. Its ease of use, high speed, scalability, and ample selection of packages make it a robust language well-suited for data science.
Jupyter: A popular browser-based IDE for various data science languages, such as Python and Julia.
Kaggle: A data science competition site focusing on the data modeling part of the pipeline. It also has a community and a job board.
K-fold Cross Validation: A fundamental data science experiment technique for building a model and ensuring that it has a reliable generalization potential.
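The splitting step of K-fold cross validation can be sketched as follows: the data points are divided into k folds, and each fold serves once as the testing portion while the rest is used for training. The function name k_fold_indices is an assumption, and the sketch uses contiguous folds for simplicity; in practice the points are usually shuffled first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_points: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_points:
        raise ValueError("k must be between 2 and the number of data points")
    indices = list(range(n_points))
    # The first (n_points % k) folds receive one extra point.
    fold_sizes = [n_points // k + (1 if i < n_points % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

A model would be trained and evaluated once per fold, and the k scores averaged to estimate how well it generalizes.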
Labels: A set of values corresponding to the points of a dataset, providing information about the dataset’s structure. The latter takes the form of classes, often linked to classification applications. The variable containing the labels is typically used as the target variable of the dataset.
Layer: A set of neurons in an artificial neural network. Inner layers are usually referred to as hidden layers and consist mainly of meta-features created by the system.
Library: See package.
Machine Learning (ML): A set of algorithms and programs that aim to learn from data without relying on conventional statistical methods. ML is fast, and some of its methods are significantly more accurate than the corresponding statistical ones, while making fewer assumptions about the data.
There is a noticeable overlap between ML and artificial intelligence systems designed for data science.
Mean Squared Error (MSE): A popular metric for evaluating the performance of regression systems by taking the difference between each prediction and the corresponding value of the target variable (the error), squaring it, and averaging the results over all data points. The model with the smallest such error is usually considered the optimal one.
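A minimal sketch of the MSE computation, assuming the targets and predictions are given as two equal-length sequences of numbers (the function name is chosen for this example):

```python
from typing import Sequence


def mean_squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Because the errors are squared, a single large miss influences the score far more than several small ones.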
Mentoring: The process of someone knowledgeable and adept in a field sharing his experience and advice with others newer to the field. Mentoring can be a formal endeavor or something circumstantial, depending on the commitment of the people involved.
Metadata: Data about a piece of data. Examples of metadata are timestamps, geolocation data, data about the data’s creator, and notes.
Meta-features (super features or synthetic features): High-quality features that encapsulate large amounts of information, usually represented in a series of conventional features. Meta-features are either synthesized in an artificial intelligence system or created through dimensionality reduction.
Monte Carlo Simulation: A simulation technique for estimating probabilities around a phenomenon, without making assumptions about the phenomenon itself. Monte Carlo simulations have a variety of applications, from estimating functions to sensitivity analysis.
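A classic toy example of a Monte Carlo simulation is estimating π from random points: the probability that a uniformly random point in the unit square falls inside the quarter-circle of radius 1 is π/4. The function name and the fixed seed below are assumptions made so the sketch is repeatable.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside the unit quarter-circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


pi_estimate = estimate_pi(100_000)
```

The estimate tightens slowly as the number of samples grows, which is typical of Monte Carlo methods.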
Natural Language Processing (NLP): A text analytics methodology focusing on categorizing the various parts of speech for a more in-depth analysis of the text involved.
Neuron: A fundamental component of an artificial neural network, usually representing an input (feature), a meta-feature, or an output. Neurons are organized in layers.
Non-negative Matrix Factorization (NMF or NNMF): An algebraic technique for splitting a matrix containing only positive values and zeros into a couple of matrices that correspond to meaningful data, useful for recommender systems.
Normalization: The process of transforming a variable so that it is of the same range as the other variables in a dataset. This is done through statistical methods primarily and is part of the data engineering stage in the data science pipeline.
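Two common ways to bring a variable into a comparable range are min-max scaling and z-score standardization. The sketch below shows both using only the Python standard library; the function names, the sample values, and the handling of constant variables are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: assumed convention
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Center values on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant variable: assumed convention
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
```

Min-max scaling guarantees a fixed range but is sensitive to extreme values, while z-scores have no fixed range but are less distorted by a single outlier.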
NoSQL Database: A database designed for unstructured data. Such a database is also able to handle structured data, as NoSQL stands for Not Only SQL.
Novelty Detection: See anomaly detection.
Object-Oriented Programming (OOP): A programming paradigm where all structures, be it data or code, are handled as objects. In the case of data, objects can have various fields (referred to as attributes), while when referring to code, objects can have various procedures (referred to as methods).
Optimization: An artificial intelligence process aimed at finding the best value of a function (usually referred to as the fitness function), given a set of restrictions. Optimization is key in all modern data science systems.
Outlier: An abnormal data point, often holding a particular significance. Outliers are not always extreme values, as they can exist near the center of the dataset as well.
Over-fitting: Making the model too specialized to a particular dataset. Its main characteristic is a great performance for the training set and poor performance for any other dataset.
Package: A set of programs designed for a specific set of related tasks, sharing the same data structures and freely available to the users of a given programming language.
Packages may require other packages in order to function, which are called dependencies. Once installed, the package can be imported in the programming language and used in scripts.
Paradigm: An established way of doing things, as well as the set of similar methodologies in a particular field. Paradigms change very slowly, but when they do, they are accompanied by a change of mindset and often new scientific theory.
Pipeline: Also known as workflow, it is a conceptual process involving a variety of steps, each of which can comprise several other processes.
A pipeline is essential for organizing the tasks needed to perform any complex procedure (often non-linear) and is very applicable in data science (this application is known as the data science pipeline).
Population: The theoretical total of all the data points for a given dataset. As this is not accessible, an approximate representation of the population is used through sampling.
Precision: A performance metric for classification systems focusing on a particular class. It is defined as the ratio of the true positives of that class over the total number of predictions related to that class.
Predictive Analytics: A set of methodologies of data science related to the prediction of certain variables. It includes a variety of techniques, such as classification, regression, time-series analysis, and more. Predictive analytics is a key data science methodology.
Privacy: An aspect of confidentiality that involves keeping certain pieces of information private.
Recall: A performance metric for classification systems focusing on a particular class. It is defined as the ratio of the true positives of that class over the total number of data points that actually belong to that class.
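Precision and recall share the same numerator and differ only in what they divide by, which a short sketch makes concrete. The function name, the toy labels, and the convention of returning 0 for an empty denominator are assumptions made for this example.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence, y_pred: Sequence, target_class) -> Tuple[float, float]:
    """Return (precision, recall) for target_class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == target_class and p == target_class)
    predicted = sum(1 for _, p in pairs if p == target_class)  # all predictions of the class
    actual = sum(1 for t, _ in pairs if t == target_class)     # all true members of the class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
dog_precision, dog_recall = precision_recall(y_true, y_pred, "dog")
cat_precision, cat_recall = precision_recall(y_true, y_pred, "cat")
```

Calling the function once per class gives the per-class view that both metrics are defined for.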
Recommender System (RS): Also known as a recommendation engine, an RS is a data science system designed to provide a set of similar entities to the ones described in a given data set based on the known values of the features of these entities. Each entity is represented as a data point in the RS dataset.
Regression: A very popular data science methodology under the predictive analytics umbrella. Regression aims to solve the problem of predicting the values of a continuous variable corresponding to a set of inputs based on pre-existing knowledge of similar data, available in the training set.
Resampling: The process of sampling repeatedly in order to ensure more stable results in a question or a model. Resampling is a popular methodology for sensitivity analysis.
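One widely used form of resampling is the bootstrap, where new samples of the same size are drawn with replacement from the original sample and a statistic is recomputed on each. The sketch below applies it to the mean; the function name, the observations, and the number of resamples are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence


def bootstrap_means(sample: Sequence[float], n_resamples: int = 1000, seed: int = 0) -> List[float]:
    """Means of n_resamples samples drawn with replacement from the original sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    size = len(sample)
    return [mean(rng.choices(sample, k=size)) for _ in range(n_resamples)]


observations = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
means = bootstrap_means(observations)
spread = stdev(means)  # a small spread suggests the sample mean is stable
```

The spread of the resampled means gives a sense of how much the result would change if the initial data were different.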
ROC Curve: A curve representing the trade-off between true positives and false positives for a binary classification problem, useful for evaluating the classifier used. The ROC curve is usually a zig-zag line depicting the true positive rate for each false positive rate value.
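The points of a ROC curve can be obtained by sweeping a decision threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below assumes binary labels (1 and 0) and one score per data point, where a higher score means "more likely class 1"; the function name and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted as class 1
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
```

Plotting these points in order gives the zig-zag line described above, starting at (0, 0) and ending at (1, 1).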
Sample: A limited portion of the data available, useful for building a model and (ideally) representative of the population it belongs to.
Sampling: The process of acquiring a sample of a population using a specialized technique. Sampling must be done properly to ensure that the resulting sample is representative of the population studied. Sampling needs to be random and unbiased.
Scala: A programming language that combines the functional and object-oriented paradigms, runs on the Java Virtual Machine, and is used in data science. The big data framework Spark is written in Scala.
Scientific Process: The process of forming a hypothesis, processing the available data, and reaching a conclusion in a rigorous and reproducible manner. Conclusions are never 100% certain. Every scientific field, including data science, applies the scientific process.
Sensitivity Analysis: The process of establishing the stability of a result or how prone a model’s performance is to change if the initial data is different. It involves several methods, such as resampling and “what if” questions.
Sentiment Analysis: A text analytics method that involves inferring the sentiment polarity of a piece of text using its words and some metadata that may be attached to it.
Signal: A piece of valuable information within a collection of data. Insights derived from the analysis of the data tend to reflect the various signals identified in the data.
Spark: A big data framework focusing on managing and processing data through a series of specialized modules. Spark does not handle storing data, only processing it.
SQL: Short for Structured Query Language, SQL is a basic programming language used for querying databases containing structured data. Although it was not designed for big data, many modern databases use query languages based on SQL.
Statistical Test: A test for establishing relationships between two samples based on statistical concepts. Each statistical test has a few underlying assumptions behind it.
Statistics: A sub-field of mathematics that focuses on data analysis using probability theory, a variety of distributions, and tests. Statistics involves a series of assumptions about the data involved.
There are two main types of statistics: descriptive and inferential. The former deals with describing the data at hand, while the latter with making predictions using statistical models.
Steganography: The process of hiding a file in another much larger file (usually a photo, an audio clip, or a video) using specialized software. The process does not change how the larger file looks or sounds. Steganography is a data security methodology.
Stochastic: Something that is probabilistic in nature (i.e. not deterministic). Stochastic processes are common in most artificial intelligence systems and other advanced machine learning systems.
Structured Data: Data that has a form that enables it to be used in all kinds of data analytics models. Structured data usually takes the form of a dataset.
Target Variable: The variable of a dataset that is the target of a predictive analytics system, such as a classifier or a regressor.
Text Analytics: The sub-field of data science that focuses on all text-related problems. It includes natural language processing (NLP), among other things.
Testing Set: The part of the dataset that is used for testing a predictive analytics model after it has been trained and before it is deployed. The testing set usually corresponds to a small portion of the original dataset.
Training Set: The part of the dataset that is used for training a predictive analytics model before it is tested and deployed. The training set usually corresponds to the largest portion of the original dataset.
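Splitting a dataset into a training set and a testing set can be sketched in a few lines. The function name, the 80/20 proportion, and the fixed seed are assumptions made for this example; the data is shuffled first so that both parts are drawn at random.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(data: Sequence, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the data and split it into (training set, testing set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


training_set, testing_set = train_test_split(range(100), test_fraction=0.2)
```

Every data point ends up in exactly one of the two parts, so the model is never tested on data it was trained on.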
Transfer Function: The function applied to the weighted sum of a neuron's inputs to produce its output in an artificial neural network. It is also known as an activation function.
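Two frequently used transfer functions are the sigmoid and the rectified linear unit (ReLU). The sketch below shows how a single neuron might apply one of them to the weighted sum of its inputs; the function names, weights, and inputs are hypothetical, and the sigmoid is not hardened against very large negative arguments.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)


def neuron_output(inputs: Sequence[float], weights: Sequence[float], bias: float,
                  transfer: Callable[[float], float]) -> float:
    """Weighted sum of the inputs plus bias, passed through the transfer function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return transfer(z)


out_sigmoid = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)  # weighted sum is 0
out_relu = neuron_output([1.0, 2.0], [0.5, -1.0], 0.0, relu)         # weighted sum is -1.5
```

The choice of transfer function determines the range of values a neuron can pass on to the next layer.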
Time-series Analysis: A data science methodology aiming to tackle dynamic data problems, where the values of a target variable change over time. In time-series analysis, the target variable is also used as an input in the model.
True Negative: In a binary classification problem, it is a data point of class 0, predicted as such. See confusion matrix for more context.
True Positive: In a binary classification problem, it is a data point of class 1, predicted as such. See confusion matrix for more context.
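The four outcomes of a binary classification problem (true positive, false positive, true negative, false negative) can be tallied directly from the true and predicted labels. The sketch below assumes the labels are coded as 1 and 0, as in the definitions above; the function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for labels coded as 1 and 0."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1  # class 1, predicted as such
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1  # class 0, predicted as class 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1  # class 0, predicted as such
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1  # class 1, predicted as class 0
    return counts


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```

These four counts are the cells of the confusion matrix, and metrics such as precision and recall can be derived from them.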
Unstructured Data: Data that lacks any structural frame (e.g. free-form text) or data from various sources. The majority of big data is unstructured data and requires significant processing before it is usable in a model.
Versatilist: A professional who is an expert in one skill, but has a variety of related skills, usually in a tech-related field, allowing him to perform several roles in an organization. Data scientists tend to have a versatilist mentality.
Version Control System (VCS): A programming tool aiming to keep various versions of your documents (usually programming scripts and data files) accessible and easy to maintain, allowing for variants of them to co-exist with the original ones. A VCS is well suited to collaboration among several people on the same files.