What Do You Mean by Big Data and Its Veracity?
Cloud computing provides an opportunity for organizations with limited internal resources to implement large-scale big data computing applications in a cost-effective manner. The fundamental challenges of big data computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms that can scale to search and process massive amounts of data.
The answer to these challenges is big data computing: a scalable, integrated hardware and software architecture designed for the parallel processing of big data computing applications.
What Is Big Data?
Big data can be defined as volumes of data available in varying degrees of complexity, generated at different velocities and varying degrees of ambiguity that cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions.
Data defined as big data includes weather, geospatial, and geographic information system (GIS) data; consumer-driven data from social media; enterprise-generated data from legal, sales, marketing, procurement, finance, and human resources departments; and device-generated data from sensor networks, nuclear plants, X-ray and scanning devices, and airplane engines.
The most interesting data for any organization to tap into today is social media data. The amount of data generated by consumers every minute provides extremely important insights into choices, opinions, influences, connections, brand loyalty, brand management, and much more. Social media sites not only provide consumer perspectives but also competitive positioning, trends, and access to communities formed by a common interest. Organizations today leverage the social media pages to personalize marketing of products and services to each customer.
Many additional applications are being developed and are slowly becoming a reality. These applications include using remote sensing to detect underground sources of energy, environmental monitoring, traffic monitoring, and regulation by automatic sensors mounted on vehicles and roads, remote monitoring of patients using special scanners and equipment, and tighter control and replenishment of inventories using radio-frequency identification (RFID) and other technologies.
All these developments will have associated with them a large volume of data. Social networks such as Twitter and Facebook have hundreds of millions of subscribers worldwide who generate new data with every message they send or post they make.
Every enterprise has massive amounts of emails that are generated by its employees, customers, and executives on a daily basis. These emails are all considered an asset of the corporation and need to be managed as such. After Enron and the collapse of many enterprise audits, the U.S. government mandated that all enterprises have clear life-cycle management of emails and that emails be available and auditable on a case-by-case basis.
Several examples come to mind, such as insider trading, intellectual property, and competitive analysis, that justify the governance and management of emails.
If companies can analyze petabytes of data (equivalent to 20 million four-drawer file cabinets filled with text files or 13.3 years of HDTV content) with acceptable performance to discern patterns and anomalies, businesses can begin to make sense of data in new ways. The table indicates the escalating scale of data.
Size of Data        Scale of Data
1,000 megabytes     1 gigabyte (GB)
1,000 gigabytes     1 terabyte (TB)
1,000 terabytes     1 petabyte (PB)
1,000 petabytes     1 exabyte (EB)
1,000 exabytes      1 zettabyte (ZB)
1,000 zettabytes    1 yottabyte (YB)
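The decimal scale in the table above can be illustrated with a small conversion sketch; the function name and formatting are arbitrary, but the factor-of-1,000 unit progression follows the table.

```python
# A minimal sketch: express a raw byte count in the largest decimal
# unit (factors of 1,000, per the table) whose value is at least 1.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_scale(num_bytes: float) -> str:
    """Express a byte count in the largest unit with a value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000
    return f"{value:g} {UNITS[-1]}"

print(human_scale(1_000_000_000))  # 1 GB
print(human_scale(1.5e15))         # 1.5 PB
```

Note that storage vendors use these decimal units, while operating systems often report binary units (factors of 1,024); the table and this sketch use the decimal convention.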
The list of features for handling data volume includes the following:
Nontraditional and unorthodox data processing techniques need to be innovated for processing this data type.
Metadata is essential for processing this data successfully.
Metrics and key performance indicators (KPIs) are keys to provide visualization.
Raw data do not need to be stored online for access.
Processed output needs to be integrated into an enterprise-level analytical ecosystem to provide better insights and visibility into the trends and outcomes of business exercises, including customer relationship management (CRM), optimization of inventory, clickstream analysis, and more.
The enterprise data warehouse (EDW) is needed for analytics and reporting.
The business models adopted by Amazon, Facebook, Yahoo!, and Google, which became the de facto business models for most web-based companies, operate on the fact that by tracking customer clicks and navigations on the website, you can deliver personalized browsing and shopping experiences. In this clickstream process, millions of clicks are gathered from users every second, amounting to large volumes of data.
This data can be processed, segmented, and modeled to study population behaviors based on time of day, geography, advertisement effectiveness, click behavior, and guided navigation response. The result sets of these models can be stored to create a better experience for the next set of clicks exhibiting similar behaviors. The velocity of data produced by user clicks on any website today is a prime example of big data velocity.
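The clickstream segmentation described above can be sketched as a windowed aggregation: incoming click events are bucketed into fixed time windows and counted per page, producing the kind of aggregate that behavioral models consume. The event fields (timestamp, user, page) are illustrative, not a real log schema.

```python
# A hypothetical clickstream-velocity sketch: bucket click events into
# fixed one-minute windows and count clicks per page within each window.
from collections import defaultdict

def window_counts(events, window_secs=60):
    """Return {(window_start, page): click_count} for a stream of events."""
    counts = defaultdict(int)
    for ts, user, page in events:
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, page)] += 1
    return dict(counts)

clicks = [(3, "u1", "/home"), (45, "u2", "/home"), (61, "u1", "/cart")]
print(window_counts(clicks))  # {(0, '/home'): 2, (60, '/cart'): 1}
```

A production stream processor would apply the same windowing idea continuously rather than over a finished list.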
Real-time data and streaming data are accumulated by the likes of Twitter and Facebook at very high velocity. Velocity is helpful in detecting trends among people who are tweeting a million tweets every 3 minutes. Processing of streaming data for analysis also involves the velocity dimension.
Similarly, high velocity is attributed to data associated with the typical speed of transactions on stock exchanges; this speed reaches billions of transactions per day on certain days. If these transactions must be processed to detect potential fraud or billions of call records on cell phones daily must be processed to detect malicious activity, we are dealing with the velocity dimension.
The most popular way to share pictures, music, and data today is via mobile devices. The sheer volume of data that is transmitted by mobile networks provides insights to the providers on the performance of their network, the amount of data processed at each tower, the time of day, the associated geographies, user demographics, location, latencies, and much more.
The velocity of data movement is unpredictable and can sometimes cause a network to crash. The study of this data movement has enabled mobile service providers to improve quality of service (QoS), and associating this data with social media inputs has enabled insights into competitive intelligence.
The list of features for handling data velocity includes the following:
The system must be elastic enough to handle data velocity along with volume.
The system must scale up and scale down as needed without increasing costs.
The system must be able to process data across the infrastructure in the least possible processing time.
System throughput should remain stable, independent of data velocity.
The system should be able to process data on a distributed platform.
Data comes in multiple formats, ranging from emails to tweets to social media and sensor data. There is no control over the input data format or the structure of the data. The processing complexity associated with this variety of formats lies in the availability of appropriate metadata for identifying what is contained in the actual data.
This is critical when we process images, audio, video, and large chunks of text. The absence of metadata or partial metadata means processing delays from the ingestion of data to producing the final metrics, and, more importantly, in integrating the results with the data warehouse.
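The role of metadata in handling variety can be sketched as a simple routing step: each record's format tag (an assumed metadata field, used here only for illustration) determines which processing path it takes, and records with missing metadata are parked for inspection rather than dropped.

```python
# A minimal illustration of the variety problem: metadata (a "type" tag)
# routes each record to a format-specific path; records with absent or
# unrecognized metadata go to an "unknown" bucket for later handling.
def route(records):
    routed = {"text": [], "image": [], "unknown": []}
    for rec in records:
        kind = rec.get("type")
        if kind in routed:
            routed[kind].append(rec)
        else:
            routed["unknown"].append(rec)  # absent or partial metadata
    return routed

batch = [{"type": "text", "body": "hi"}, {"payload": b"\x89PNG"}]
result = route(batch)
print(len(result["text"]), len(result["unknown"]))  # 1 1
```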
Sources of data in traditional applications were mainly transactions in the financial, insurance, travel, healthcare, and retail industries, and governmental and judicial processing. The types of sources have expanded dramatically and include Internet data (e.g., clickstream and social media), research data (e.g., surveys and industry reports), location data (e.g., mobile device data and geospatial data), images (e.g., surveillance, satellites, and medical scanning), emails, supply chain data (e.g., electronic data interchange [EDI] and vendor catalogs), signal data (e.g., sensors and RFID devices), and videos (YouTube receives hundreds of hours of video every minute).
Big data includes structured, semistructured, and unstructured data in different proportions based on context.
TABLE Industry Use Cases for Big Data

Product research            | Customer relationship management
Engineering analysis        | Store location and layout
Predictive maintenance      | Fraud detection and prevention
Process and quality metrics | Supply-chain optimization
Distribution optimization   | Dynamic pricing

Media and telecommunications | Financial services
Network optimization         | Algorithmic trading
Customer scoring             | Risk analysis
Churn prevention             | Fraud detection
Fraud prevention             | Portfolio analysis

Smart grid                   | Demand signaling
The list of features for handling data variety includes the following:
Distributed processing capabilities
Image processing capabilities
Graph processing capabilities
Video and audio processing capabilities
The veracity dimension of big data is a more recent addition, which emerged after the advent of the Internet. Veracity has two built-in features: the credibility of the source and the suitability of the data for its target audience.
It is closely related to trust; listing veracity as one of the dimensions of big data amounts to saying that data coming into the so-called big data applications have a variety of trustworthiness, and therefore before we accept the data for analytical or other applications, it must go through some degree of quality testing and credibility analysis. Many sources of data generate data that is uncertain, incomplete, and inaccurate, therefore making its veracity questionable.
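A veracity gate of the kind described above can be sketched as a score combining completeness (fraction of required fields present) with a trust rating for the source. The required field names, the trust table, and the equal weighting are all invented for illustration, not a standard formula.

```python
# A hedged sketch of a veracity check: score a record on completeness
# and source credibility, so low-scoring data can be quarantined before
# it reaches analytical applications. All names and weights are assumed.
REQUIRED = ("user", "timestamp", "text")
SOURCE_TRUST = {"verified_feed": 0.9, "public_scrape": 0.4}

def veracity_score(record, source):
    completeness = sum(f in record for f in REQUIRED) / len(REQUIRED)
    trust = SOURCE_TRUST.get(source, 0.2)  # unknown sources start low
    return 0.5 * completeness + 0.5 * trust

rec = {"user": "a", "timestamp": 1, "text": "ok"}
print(veracity_score(rec, "verified_feed"))  # 0.95
```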
Common Characteristics of Big Data Computing Systems
There are several important common characteristics of big data computing systems that distinguish them from other forms of computing.
1. The principle of co-location of the data and programs or algorithms to perform the computation: To achieve high performance in big data computing, it is important to minimize the movement of data. This principle—“move the code to the data”—that was designed into the data-parallel processing architecture implemented by Seisint in 2003 is extremely effective since program size is usually small in comparison to the large datasets processed by big data systems and results in much less network traffic since data can be read locally instead of across the network.
In direct contrast to other types of computing and supercomputing that utilize data stored in a separate repository or servers and transfer the data to the processing system for computation, big data computing uses distributed data and distributed file systems in which data is located across a cluster of processing nodes, and instead of moving the data, the program or algorithm is transferred to the nodes with the data that needs to be processed.
This characteristic allows processing algorithms to execute on the nodes where the data resides, reducing system overhead and increasing performance.
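The "move the code to the data" principle can be sketched as a tiny locality-aware scheduler: given a map of which node stores each data block, each task is dispatched to the node already holding its block. Node and block names are invented for illustration; a real distributed file system maintains this map itself.

```python
# A toy data-locality scheduler: assign each block's processing task to
# the node that already stores the block, falling back to any free node
# for blocks with no known location. The block map is an assumption.
BLOCK_LOCATIONS = {"block-1": "node-A", "block-2": "node-B"}

def schedule(blocks):
    """Assign each block's task to the node that already stores it."""
    return {blk: BLOCK_LOCATIONS.get(blk, "any-free-node") for blk in blocks}

plan = schedule(["block-1", "block-2", "block-9"])
print(plan)
```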
2. Programming model utilized: Big data computing systems utilize a machine- independent approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the distributed computing cluster.
The programming abstraction and language tools allow the processing to be expressed in terms of data flow and transformations incorporating new dataflow programming languages and shared libraries of common data manipulation algorithms such as sorting.
Conventional supercomputing and distributed computing systems typically utilize machine-dependent programming models that can require low-level programmer control of processing and node communications using conventional imperative programming languages and specialized software packages that add complexity to the parallel programming task and reduce programmer productivity. A machine-dependent programming model also requires significant tuning and is more susceptible to single points of failure.
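The machine-independent, dataflow style described in point 2 can be sketched in miniature: the job is expressed purely as high-level transformations (map, sort, group, reduce) with no mention of nodes or communication, which is exactly what lets a runtime distribute the stages transparently. This single-process word count stands in for the distributed version.

```python
# A single-process sketch of the dataflow programming style: a word
# count expressed as map -> sort (stand-in for shuffle) -> reduce, with
# no node-level or communication code.
from itertools import groupby

def word_count(lines):
    pairs = [(w, 1) for line in lines for w in line.split()]   # map
    pairs.sort(key=lambda kv: kv[0])                            # shuffle
    return {k: sum(v for _, v in grp)                           # reduce
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

print(word_count(["big data", "big systems"]))  # {'big': 2, 'data': 1, 'systems': 1}
```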
3. Focus on reliability and availability: Large-scale systems with hundreds or thousands of processing nodes are inherently more susceptible to hardware failures, communications errors, and software bugs. Big data computing systems are designed to be fault resilient. This includes redundant copies of all data files on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures, and selective recomputation of results.
A processing cluster configured for big data computing is typically able to continue operation with a reduced number of nodes following a node failure with the automatic and transparent recovery of incomplete processing.
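The redundancy and selective-recomputation behavior described in point 3 can be sketched with a replica map: each file lives on two nodes, and when a node fails, the recovery plan names a surviving replica holder for each affected file. File and node names are invented for illustration.

```python
# An illustrative fault-tolerance sketch: with every file replicated on
# two nodes, a node failure is handled by recomputing only the work that
# node held, using the surviving replica of each affected file.
REPLICAS = {"f1": ["node-A", "node-B"], "f2": ["node-B", "node-C"]}

def recovery_plan(failed_node):
    """For each file stored on the failed node, name a surviving replica."""
    plan = {}
    for fname, nodes in REPLICAS.items():
        if failed_node in nodes:
            survivors = [n for n in nodes if n != failed_node]
            plan[fname] = survivors[0] if survivors else None  # None: data lost
    return plan

print(recovery_plan("node-B"))  # {'f1': 'node-A', 'f2': 'node-C'}
```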
4. Inherent scalability of the underlying hardware and software architecture: Big data computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time-critical performance requirements, by simply adding processing nodes to a system configuration in order to achieve processing rates of billions of records per second (BORPS).
The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications, and distributed file system architecture. This scalability allows computing problems once considered to be intractable due to the amount of data required or amount of processing time required to now be feasible and affords opportunities for new breakthroughs in data analysis and information processing.
Big Data Appliances
Big data analytics applications combine the means for developing and implementing algorithms that must access, consume, and manage data. In essence, the framework relies on a technology ecosystem of components that must be combined in a variety of ways to address each application’s requirements, which can range from general information technology (IT) performance scalability to detailed performance improvement objectives associated with specific algorithmic demands.
For example, some algorithms expect massive amounts of data to be immediately available, necessitating large amounts of core memory. Other applications may need numerous iterative exchanges of data between different computing nodes, which would require high-speed networks.
The big data technology ecosystem stack may include the following:
1. Scalable storage systems that are used for capturing, manipulating, and analyzing massive datasets.
2. A computing platform, sometimes configured specifically for large-scale analytics, often composed of multiple (typically multicore) processing nodes connected via a high-speed network to memory and disk storage subsystems. These are often referred to as appliances.
3. A data management environment, whose configurations may range from a traditional database management system scaled to massive parallelism to databases configured with alternative distributions and layouts, to newer graph-based or other NoSQL data management schemes.
4. An application development framework to simplify the process of developing, executing, testing, and debugging new application code. This framework should include programming models, development tools, program execution and scheduling, and system configuration and management capabilities.
5. Methods of scalable analytics (including statistical and data mining models) that can be configured by the analysts and other business consumers to help improve the ability to design and build analytical and predictive models.
6. Management processes and tools that are necessary to ensure alignment with the enterprise analytics infrastructure and collaboration among developers, analysts, and other business users.
Tools and Techniques of Big Data
Current big data computing platforms use a “divide and conquer” parallel processing approach combining multiple processors and disks in large computing clusters connected using high-speed communications switches and networks that allow the data to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data.
We define a cluster as “a type of parallel and distributed system, which consists of a collection of interconnected stand-alone computers working together as a single integrated computing resource.”
This approach to parallel processing is often referred to as a “shared nothing” approach since each node consisting of a processor, local memory, and disk resources shares nothing with other nodes in the cluster. In parallel computing, this approach is considered suitable for data processing problems that are “embarrassingly parallel,” that is, where it is relatively easy to separate the problem into a number of parallel tasks and there is no dependency or communication required between the tasks other than overall management of the tasks.
These types of data processing problems are inherently adaptable to various forms of distributed computing, including clusters, data grids, and cloud computing.
Analytical environments are deployed in different architectural models. Even on parallel platforms, many databases are built on a shared everything approach in which the persistent storage and memory components are all shared by the different processing units.
Parallel architectures are classified by what shared resources each processor can directly access. One typically distinguishes shared memory, shared disk, and shared nothing architectures.
1. In a shared memory system, all processors have direct access to all memory via a shared bus. Typical examples are the common symmetric multiprocessor systems, where each processor core can access the complete memory via the shared memory bus. To preserve the abstraction, processor caches, buffering a subset of the data closer to the processor for fast access, have to be kept consistent with specialized protocols. Because disks are typically accessed via the memory, all processes also have access to all disks.
2. In a shared disk architecture, all processes have their own private memory, but all disks are shared. A cluster of computers connected to a storage area network (SAN) is representative of this architecture.
3. In a shared nothing architecture, each processor has its private memory and private disk. The data is distributed across all disks, and each processor is responsible only for the data on its own connected memory and disks. To operate on data that spans the different memories or disks, the processors have to explicitly send data to other processors. If a processor fails, data held by its memory and disks is unavailable. Therefore, the shared nothing architecture requires special considerations to prevent data loss.
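The shared nothing layout just described can be sketched as hash partitioning: rows are spread across nodes by hashing their keys, each "node" aggregates only its own partition, and merging the per-node partial results is the only cross-node step. The partitions here are plain lists standing in for nodes.

```python
# A toy shared-nothing sketch: hash-partition rows across n "nodes",
# let each node aggregate only its local partition, then merge the
# small partial results (the only cross-node communication).
def partition(rows, n_nodes):
    parts = [[] for _ in range(n_nodes)]
    for key, value in rows:
        parts[hash(key) % n_nodes].append((key, value))
    return parts

def local_sum(part):
    """Each node sums only the values it owns."""
    return sum(v for _, v in part)

rows = [("a", 1), ("b", 2), ("c", 3)]
parts = partition(rows, 2)
print(sum(local_sum(p) for p in parts))  # 6
```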
When scaling out the system, the two main bottlenecks are typically the bandwidth of the shared medium and the overhead of maintaining a consistent view of the shared data in the presence of cache hierarchies. For that reason, the shared nothing architecture is considered the most scalable one, because it has no shared medium and no shared data. While it is often argued that shared disk architectures have certain advantages for transaction processing, the shared nothing is the undisputed architecture of choice for analytical queries.
A shared-disk approach may have isolated processors, each with its own memory, but the persistent storage on disk is still shared across the system. These types of architectures are layered on top of symmetric multiprocessing (SMP) machines. While there may be applications that are suited to this approach, there are bottlenecks that exist because of the sharing, because all I/O and memory requests are transferred (and satisfied) over the same bus.
As more processors are added, synchronization and communication needs increase exponentially, and the bus becomes less able to handle the increased demand for bandwidth. Unless the need for bandwidth is satisfied, there will be limits to the degree of scalability.
In contrast, in a shared-nothing approach, each processor has its own dedicated disk storage. This approach, which maps nicely to a massively parallel processing (MPP) architecture, is not only more suitable to discrete allocation and distribution of the data, but also enables more effective parallelization, and consequently does not introduce the same kind of bus bottlenecks from which the SMP/shared-memory and shared-disk approaches suffer. Most big data appliances use a collection of computing resources, typically a combination of processing nodes and storage nodes.
Big Data System Architecture
A variety of system architectures have been implemented for big data and large-scale data analysis applications, including parallel and distributed relational database management systems that have been available to run on shared nothing clusters of processing nodes for more than two decades. These include database systems from Teradata, Netezza, Vertica, Oracle Exadata, and others that provide high-performance parallel database platforms. Although these systems can run parallel applications and queries expressed in SQL, they are typically not general-purpose processing platforms and usually run as a back end to a separate front-end application processing system.
Although this approach offers benefits when the data utilized is primarily structured in nature and fits easily into the constraints of a relational database, and often excels for transaction processing applications, most data growth is with data in the unstructured form and new processing paradigms with more flexible data models were needed.
Internet companies such as Google, Yahoo!, Microsoft, Facebook, and others required a new processing approach to effectively deal with the enormous amount of web data for applications such as search engines and social networking. In addition, many governments and business organizations were overwhelmed with data that could not be effectively processed, linked, and analyzed with traditional computing approaches.
Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop, used by Yahoo!, Facebook, and others.
Web 2.0 and Big Data
Web 2.0, which includes online social networks (OSN), is one significant source of Big Data. Another major contributor is the Internet of Things. The billions of devices connecting to the Internet generate petabytes of data. It is a well-known fact that businesses collect as much data as they can about consumers: their preferences, purchase transactions, opinions, individual characteristics, browsing habits, and so on.
Consumers themselves are generating substantial chunks of data in terms of reviews, ratings, direct feedback, video recordings, pictures and detailed documents of demos, troubleshooting, and tutorials to use the products and such, exploiting the expressiveness of the Web 2.0, thus contributing to the Big Data.
From the list of sources of data, it can be easily seen that it is relatively inexpensive to collect data. There are a number of other technology trends too that are fueling the Big Data phenomenon.
High Availability systems and storage, drastically declining hardware costs, massive parallelism in task execution, high-speed networks, new computing paradigms such as cloud computing, high performance computing, innovations in Analytics and Machine Learning algorithms, new ways of storing unstructured data, and ubiquitous access to computing devices such as smartphones and laptops are all contributing to the Big Data revolution.
Human beings are intelligent because their brains are able to collect inputs from various sources, connect them, and analyze them to look for patterns. Big Data and the algorithms associated with it help achieve the same using computer power. Fusion of data from disparate sources can yield surprising insights into the entities involved.
For instance, if there are plenty of instances of flu symptoms being reported on OSN from a particular geographical location and there is a surge in purchases of flu medication based on the credit card transactions in that area, it is quite likely that there is an onset of a flu outbreak. Given that Big Data makes no sense without the tools to collect, combine, and analyze data, some proponents even argue that Big Data is not really data, but a technology comprised of tools and techniques to extract value from huge sets of data.
Note Generating value from Big Data can be thought of as comprising two major functions: fusion, the coming together of data from various sources; and fission, analyzing that data.
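The flu example above is a simple case of fusion: two weak per-region signals (symptom mentions on social posts and medication purchases) are combined, and a possible outbreak is flagged only when both are elevated. The counts and thresholds below are invented for illustration.

```python
# A hypothetical data-fusion sketch: flag a region as a possible flu
# outbreak only when both social-post mentions and medication purchases
# exceed their (assumed) thresholds.
def flag_outbreaks(posts_by_region, purchases_by_region,
                   post_threshold=100, purchase_threshold=50):
    flagged = []
    for region in posts_by_region:
        if (posts_by_region[region] > post_threshold and
                purchases_by_region.get(region, 0) > purchase_threshold):
            flagged.append(region)
    return flagged

posts = {"north": 240, "south": 30}
purchases = {"north": 90, "south": 70}
print(flag_outbreaks(posts, purchases))  # ['north']
```

Neither signal alone would justify the flag; the insight comes from their combination, which is the point of fusion.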
There is a huge amount of data pertaining to the human body and its health. Genomic data science is an academic specialization that is gaining increasing popularity. It helps in studying disease mechanisms for better diagnosis and drug response. The algorithms used for analyzing Big Data are a game changer in genome research as well.
This category of data is so huge that the famed international science journal Nature carried a news item about how genome researchers are worried that the computing infrastructure may not cope with the increasing amount of data that their research generates.
Science has a lot to benefit from the developments in Big Data. Social scientists can leverage data from the OSN to identify both micro- and macro-level details, such as psychiatric conditions at an individual level or group dynamics at a macro level. The same data from OSN can also be used to detect medical emergencies and pandemics.
In the financial sector too, data from the stock markets, business news, and OSN can reveal valuable insights to help improve lending practices, set macroeconomic strategies, and avert a recession.
There are various other uses of Big Data applications in a wide variety of areas. Housing and real estate business; actuaries; and government departments such as national security, defense, education, disease control, law enforcement, and energy, which are all characterized by huge amounts of data are expected to benefit from the Big Data phenomenon.
Note Where there is humongous data and appropriate algorithms applied to the data, there is wealth, value, and prosperity.
Why Big Data
A common question that arises is this: “Why Big Data, why not just data?” For data to be useful, we need to be able to identify patterns and predict those patterns for future data that is yet to be known. A typical analogy is to predict the brand of rice in a bag based on a given sample. The rice in the bag is unknown to us. We are only given a sample from it and samples of known brands of rice. The known samples are called training data in the language of Machine Learning.
The sample of unknown rice is the test data. It is common sense that the larger the sample, the better the prediction of the brand. If we are given just two grains of rice of each brand in the training data, we may base our conclusion solely on the characteristics of those two grains, missing out on other characteristics.
In the Machine Learning parlance, this is called overfitting. If we have a bigger sample, we can recognize a number of features and a possible range of values for the features: in other words, the probability distributions of the values, and look for similar distributions in the data that is yet to be known. Hence the need for humongous data, Big Data, and not just data.
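The rice argument can be made concrete with a tiny sketch: estimating the frequency distribution of grain types from a two-grain sample versus a larger one. The "true" 70/30 mix is invented for illustration; the point is that the tiny sample concludes all grains are of one type, the overfitting failure described above.

```python
# A small sketch of why sample size matters: estimate a categorical
# distribution from a tiny sample vs. a larger one.
from collections import Counter

def distribution(sample):
    counts = Counter(sample)
    total = len(sample)
    return {k: counts[k] / total for k in counts}

true_mix = ["long"] * 7 + ["short"] * 3      # assumed true mix: 70/30
tiny = distribution(true_mix[:2])            # just two grains
large = distribution(true_mix)               # the full sample
print(tiny)   # {'long': 1.0}  -> concludes all grains are long
print(large)  # {'long': 0.7, 'short': 0.3}
```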
In fact, a number of algorithms that are popular with Big Data have been in existence for a long time. The Naïve Bayes technique, for instance, dates back to the 18th century, and the Support Vector Machine model was invented in the early 1960s. They gained prominence with the advent of the Big Data revolution for the reasons explained earlier.
An often-cited heuristic to differentiate “Big Data” from the conventional bytes of data is that the Big Data is too big to fit into traditional Relational Database Management Systems (RDBMS). With the ambitious plan of the Internet of Things to connect every entity of the world to everything else, the conventional RDBMS will not be able to handle the data upsurge.
In fact, Seagate predicts that the world will not be able to cope with the storage needs in a couple of years. According to them, it is "harder to manufacture capacity than to generate data." It will be interesting to see if and how the storage industry will meet the capacity demands of the Volume aspect of the Big Data phenomenon, which brings us to the V's of Big Data.
Note It takes substantial data to see statistical patterns, general enough for large populations to emerge, and meaningful hypotheses generated from the data automatically.
The V’s of Big Data
An Internet search for "President Trump" in double quotes returned about 45,600,000 results within just two weeks of his inauguration (January 20 – February 5). It is an indication of what has become known as the first four V's of Big Data: Volume, Velocity, Variety, and Veracity. The search results have all four ingredients. The corpus of 45M page results is the sheer Volume that has been indexed for the search. There could be many more on the topic.
The results exist in a wide variety of media: text, videos, images, and so on – some of it structured, some of it not – and on a wide variety of aspects of President Trump. How much of that information is true and how many of those results are about “President Trump” and not the presidential candidate Trump?
A quick scan shows that quite a few results date back to 2016, indicating that the articles are not really about “President Trump” but may be about “the presidential candidate Trump.” That brings us to the problem of the Veracity of Big Data. Veracity refers to the quality of data. How much can we rely on the Big Data?
Initially, the definition of Big Data included only the first three V’s. Visionary corporates like IBM later extended the definition to include veracity as an important aspect of Big Data.
There are more V’s that are increasingly being associated with Big Data. Elder Research came up with 42 V’s associated with Big Data and Data Science. While we will not go over all of them here, we will cover the important ones and introduce a new one, Valuation, to the mix. The next V is Variability.
The meaning of data can vary depending on the context. For instance, an error of 0.01 inch when measuring the height of a person is very different from the same error when measuring the distance between the two eye pupils in a face recognition algorithm. The 0.01" error datum has varied meanings in the two different contexts, even if both are used in the same application to recognize people.
This is often true of spoken data as well. Interpretation of speech greatly depends on the context. This notion is called the Variability of Big Data. Big Data Analytics, therefore, needs to be cognizant of the Variability of Big Data, when interpreting it.
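The 0.01-inch example above can be made concrete by computing relative rather than absolute error: the same 0.01-inch error is negligible against a person's height but sizeable against the distance between pupils. The measurement values are illustrative.

```python
# Variability illustrated: the same absolute error means different
# things in different contexts; relative error captures the difference.
def relative_error(error, measurement):
    return error / measurement

height_in = 70.0        # a person's height, inches (illustrative)
pupil_dist_in = 2.5     # interpupillary distance, inches (illustrative)
print(round(relative_error(0.01, height_in), 5))      # 0.00014
print(round(relative_error(0.01, pupil_dist_in), 5))  # 0.004
```

The second relative error is roughly thirty times the first, even though the absolute error datum is identical in both contexts.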
Variability presents a formidable challenge for deriving value from the Big Data. Data, by itself, is useless. We need to Derive Value from it by analyzing it and drawing conclusions or predicting outcomes. For executives to see the value of Big Data, it needs to be presented in a way they understand.
Visualization is the representation of Big Data using diagrams, charts, and projections, in ways that are more understandable, in order for its value to be realized. It is believed that 95% of the value from Big Data is derived from 5% of its attributes or characteristics. So, only 5% of the attributes are really viable.
Viability refers to the fact that not everything about Big Data is useful. There is a need to evaluate the viability of the available attributes and choose only those that help to get the value out.
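As a hedged sketch of such a viability check, the snippet below ranks synthetic attributes by their absolute correlation with an outcome and keeps only the strongest few (here, 2 out of 40, i.e., the 5% alluded to above). The dataset, the attribute count, and the cutoff are all invented for illustration.

```python
# Toy viability check: keep only the attributes most correlated with the
# outcome. All data here is synthetic; only attributes 0 and 1 carry signal.
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

random.seed(7)
n_rows, n_attrs = 200, 40
data = [[random.gauss(0, 1) for _ in range(n_attrs)] for _ in range(n_rows)]
# Only attributes 0 and 1 actually drive the outcome; the rest are noise.
target = [row[0] * 2.0 - row[1] + random.gauss(0, 0.1) for row in data]

scores = [(abs(pearson([r[j] for r in data], target)), j) for j in range(n_attrs)]
viable = [j for _, j in sorted(scores, reverse=True)[:2]]
print("Viable attributes:", sorted(viable))
```

With this synthetic setup, the two signal-carrying attributes are recovered, which is exactly the sense in which only a small fraction of attributes is "viable."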
The V’s enunciated so far missed out on a very important aspect of Big Data – extracting value from the data using algorithms. We call it Valuation, thereby introducing a new V, probably the most important of all. Without the process of valuation, the data is useless.
Data Mining, Machine Learning, and other methods from computational statistics comprise the toolset for Big Data valuation. The efficiency and effectiveness of this toolset are critical to the success of the Big Data phenomenon. Valuation is not to be confused with Value: the former is a process and the latter is its output. Valuation is the process of extracting value from Big Data.
Note Big Data is often associated with V’s: Volume, Velocity, Variety, Veracity, Value, Variability, Visualization, and Viability. We introduce a new V – Valuation – the process of extracting value from Big Data using algorithms.
The various V’s are all related. In the case of social media, it can generally be said that veracity is inversely proportional to velocity: the faster blogs and microblogs are posted, the less the thought and substantiation behind them. When videos were first posted for watching online, the volume of Big Data grew enormously and at great speed. Variety, therefore, contributed to the volume and velocity of the data generated.
Variety also contributes to the veracity of information. For instance, a social media post that includes a video or a picture is more likely to be true than one with just text. Volume is inversely related to veracity because more data points may mean less accuracy. In the case of information and data, noise increases with the signal.
In some cases, the noise actually increases faster than the signal, impacting the utility of the data. Falsely crying “Fire” in a movie hall spreads faster than the true opinion that the movie is not worth watching. The same applies to information on the Web when noise travels faster than the signal.
Hence, value is directly proportional to veracity and inversely related to volume and velocity. However, high-quality data does not automatically guarantee high value. We also need efficient algorithms for the valuation process.
From the characteristics described above in terms of V’s, it must be clear by now that Big Data is quite different from conventional data, which usually resided in databases and other structured frameworks. The major differences are that the former is mostly unstructured, raw, and real time.
The traditional tools of data warehousing, business process management, and business intelligence fall grossly short when it comes to handling the all-encompassing Big Data. A number of decades-old techniques, mostly discovered as part of Artificial Intelligence and Machine Learning, have now become relevant in the context of Big Data. These tools, techniques, and algorithms are able to deal with the unique characteristics of Big Data.
Note The V’s, which characterize Big Data, make it hard for the traditional ETL (Extract, Transform, Load) functions to scale. ETL methods can neither cope with the velocity of data generation nor deal with the veracity issues of the data. Hence there is a need for Big Data Analytics.
Veracity – The Fourth ‘V’
We have seen a number of uses of Big Data in the preceding sections. All these applications rely on the underlying quality of data. A model built on poor data can adversely impact its usefulness. Unfortunately, one of the impediments to the Big Data revolution is the quality of the data itself.
We mentioned earlier how Big Data helps in dealing with uncertainty. However, Big Data itself is characterized by uncertainty. Not all of Big Data is accurate enough to be relied upon with certainty. When the ground truth is not reliable, even the best-quality model built on top of it will not perform well. That is the problem of Veracity, the fourth ‘V’ of Big Data. Veracity is a crucial aspect of making sense of Big Data and getting Value out of it.
Value is often touted as the fifth ‘V’ of Big Data. Poor-quality data will result in poor-quality analysis and poor value, following the “Garbage in, garbage out” maxim. There have been significant failures because veracity issues were not properly handled. The Google Flu Trends (GFT)3 fiasco is one example.
Google’s algorithms missed quite a few aspects of veracity in drawing their conclusions, and the predictions were off by 140%. If the healthcare industry makes preparations based on such predictions, it is bound to incur substantial losses.
Note Veracity of Big Data can be defined as the underlying accuracy or lack thereof, of the data in consideration, specifically impacting the ability to derive actionable insights and value out of it.
There are many causes of poor data quality. Uncertainty in the real world, deliberate falsity, oversight of parts of the domain, missing values, human bias, imprecise measurements, processing errors, and hacking all contribute to lowering the quality.
Sensor data from the Internet of Things is often impacted by poor measurements and missing values. Of late, online social networks (OSNs) have been targets of increasing episodes of hacking. Uncertainty in the real world is unavoidable. Rather than ignore it, we need methods to model and deal with uncertainty. Machine Learning offers such methods.
For models and the methods using them to be effective, the dataset needs to be representative of the domain and present a complete picture of it. For instance, governments, particularly those of developing countries such as India, increasingly depend on microblogging websites such as Twitter to make their decisions.
While this may work in quite a few cases, the dataset is not a complete representation of the government’s target population because it excludes a substantial chunk of citizens who have no presence on social media for various reasons. Models built on unrepresentative datasets are bound to fail.
Note Choice of the dataset is critical to the veracity factor and success of a Big Data model. Datasets must be comprehensive representations of application domains; and devoid of biases, errors, and tampered data points.
Machine Learning is the new mortar of modernization. Machine Learning algorithms are now used in many new inventions. They can be used to address the problem of Veracity as well, not just in the case of uncertainty but in other circumstances too, such as when some data points are missing. We will see this in the subsequent blogs.
Truth is expensive – history has taught us that it costs money, marriages, and even lives. Establishing truth or limiting lies is expensive as well. In the software world too, it has been proved that Misinformation Containment is NP-Hard, implying that the problem is not likely to be solved by software running in polynomial time.
Computers will take an inordinately long time to solve a problem that has been proven to be NP-Hard, making the solution infeasible. NP-Hard problems are solved using approximate methods. Machine Learning offers algorithms that help in solving the problem of Misinformation Containment approximately.
Note Misinformation Containment (MC) is NP-Hard, unlikely to be solved accurately by algorithms completing in reasonable time.
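Since Misinformation Containment is NP-Hard, practical systems fall back on approximate heuristics. Below is a hedged, toy sketch of one such greedy heuristic: repeatedly block the node whose removal most reduces how many users a rumor source can still reach. The network graph, the budget, and the greedy criterion are all illustrative assumptions, not a method prescribed by the text.

```python
# Toy greedy approximation for containing a rumor spreading from a source
# node in a hypothetical follower graph. Real containment models are far
# richer (probabilistic diffusion, competing truths, etc.).
from collections import deque

def reachable(graph, source, blocked):
    """Breadth-first search: who can the rumor still reach?"""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def greedy_containment(graph, source, budget):
    """Greedily block `budget` nodes to minimize the residual spread."""
    blocked = set()
    for _ in range(budget):
        candidates = set(graph) - blocked - {source}
        best = min(candidates,
                   key=lambda v: len(reachable(graph, source, blocked | {v})))
        blocked.add(best)
    return blocked

# Source 0 spreads through hub 1 to most of the network.
graph = {0: [1, 2], 1: [3, 4, 5, 6], 2: [], 3: [], 4: [], 5: [], 6: []}
print(greedy_containment(graph, source=0, budget=1))  # blocks the hub, node 1
```

Greedy choices like this run in polynomial time and, for many diffusion models, come with provable approximation guarantees, which is why they are the standard workaround for NP-Hard containment.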
Microblogs are a significant chunk of the Big Data, which is substantially impacted by the problem of poor quality and unreliability. According to Business Insider, it is estimated that in the month of August 2014, on average, 661 million microblogs were posted in a single day on Twitter alone.
The amount of care and caution that went into that many tweets, posted at that speed, is anyone’s guess. Consumers, rather than businesses, mostly build the digital universe. This implies that most of the data in the digital universe is not sufficiently validated and does not come with the backing of an establishment the way enterprise data does.
A significant portion of the data supposed to be from businesses is not entirely true either. A simple Internet search for “super shuttle coupon” on February 17, 2017, returned, on the very first page, a result from http://www.couponokay.com/us/super-shuttle-coupon announcing 77% off, but clicking the link returns a vague page. We know shuttle companies cannot afford to give 77% off, but the Internet search still pulls up the URL for display in the results. It would have been so much better if the search engines quantified their belief in the results displayed and eliminated at least the blatantly false ones.
The readers will also probably be able to relate their own experiences with the problem of the veracity of enterprise-backed data and information. We have all been inconvenienced at one time or other by misleading reviews on Yelp, Amazon, and the websites of other stores. The issue actually impacts the businesses more than consumers.
If a product incorrectly receives many positive reviews, businesses like Amazon may stock more of that product; when the truth dawns on the customers, the businesses may not be able to sell as much of it, incurring losses.
Machine data, such as from the logs, clickstream, core dumps, and network traces, is usually quite accurate. It can be assumed that healthcare data has gone through the due diligence that is necessary to save lives, so it can be ranked relatively high in terms of accuracy.
Static Web content is associated with the reputation and accountability of a web domain, so it can be assumed to be reasonably accurate. Social media is still casual in nature, with open access and limited authentication schemes, so it suffers in terms of quality. Sensors operate in low energy, lossy, and noisy environments. Readings from sensors are therefore often inaccurate or incomplete.
However, sensor data can be crucial at times. Take, for instance, the case of the fatal accident involving the Tesla Model S car. According to reports in the press, the accident was caused because the data from the radar sensor did not align with the inference from the computer vision-based vehicle detection system.
Therefore, the crash-avoidance system did not engage, and a life was lost. While sensor readings can be improved by using better hardware, there is also a need for algorithms that accommodate the poor quality of data from sensors, so that disasters such as this car accident can be avoided.
The first scientific step to model complexity in the world has often been to express it in math. Math helps to improve the understanding of the matter. We, therefore, use math where possible to enhance our comprehension of the subject. Algorithms can then be derived from the math, where needed.
The Veracity of Web Information
The problem of the Veracity of Big Data can best be understood by one big source of Big Data – the Web. With billions of indexed pages and exabytes in size, Web Information constitutes a substantial part of the Big Data. It plays a crucial role in affecting people’s opinions, decisions, and behaviors in their day-to-day lives.
The Web probably provides the most powerful publication medium at the lowest cost. Anyone can become a journalist just with access to the Internet.
This is good because it brought a number of truths and a number of otherwise unnoticed people to the forefront. However, the same reasons that make the Web powerful also make it vulnerable to malicious activity. A substantial portion of the Web is infested with false information. The openness of the Web sometimes makes it difficult to even precisely identify the source of the information on it.
Electronic media has been used to mislead people in the past as well. On October 30, 1938, a 62-minute radio dramatization of H. G. Wells’s science fiction “The War of the Worlds,” with real-sounding reports, terrified many in New York, who believed that the city was under attack by aliens.
Information in the age of the Web travels even wider and faster. In microseconds, even remote corners of the world can be alerted. Information travels at almost the speed of light, and so do lies and the tendency to tell them. The unrivaled reach of, and reliance on, the Web is increasingly being used to perpetuate lies, owing to the anonymity and the impersonal interface that the Web provides.
We have seen a number of people rising to fame using social media. Many a tweet has become a hotly discussed news item in the recent past. The Web is unique in allowing a few people to influence, manipulate, and monitor the minds of many. Hashtags, likes, analytics, and other artifacts help in monitoring.
The feedback from these artifacts can then be used to fine-tune the influence and manipulate the people. The Web indeed provides an excellent tool for fostering the common good. But the same power of the Web can be used against the common good to favor a select few, who have learned the art of manipulation on the Web.
In this blog, we will examine the problem of the veracity of the Web information: its effects, causes, remedies, and identifying characteristics.
Note The Web is a substantial contributor to the Big Data and is particularly prone to veracity issues.
When an earthquake struck near Richmond, Virginia, in the United States, people living about 350 miles away in New York City read about it on the popular microblogging website Twitter 30 seconds before the earthquake was actually felt in their city. Governments use Twitter to locate their citizens in times of need to offer services. The Web prominently figures in multiple governments’ strategy for interacting with the people and managing perceptions.
This blog itself relies to some extent on the veracity of the Web’s information, as can be seen from the footnote citations. People the world over increasingly depend on the Web for their day-to-day activities such as finding directions to places on a map, choosing a restaurant based on the reviews online, or basing opinions on the trending topics by reviewing microblogs. All these activities and more can be adversely impacted by the inaccuracy of the content that is relied upon.
Lies, in general, have a substantial cost, and lies on the Web have that cost multiplied manifold. For instance, in January 2013,1 false tweets that a couple of companies were being investigated cost the respective companies 28% and 16% of their market capitalization – a huge loss to the investors. These are blatant attempts.
There are also subtle ways that the Web has been used to manipulate opinions and perceptions. Cognitive hacking refers to this manipulation of perception for ulterior motives. Cognitive hacking is a serious threat to the veracity of Web information because it causes an incorrect change in the behavior of users. On April 10, 2018, in his testimony to the US Congress, Facebook’s Mark Zuckerberg acknowledged not doing enough to prevent foreign interference in US elections. Such is the role Social Media and cognitive hacking play in today’s governance.
All information posted on the Web with ulterior motives, conflicting with the accepted values of humanity, can be considered as having compromised veracity. There are many website schemes that fit this description. The popular ones are in the stock exchange realm, where monetary stakes are high. The US Securities and Exchange Commission (SEC) has issued several alerts and information bulletins2 to detail the fraudulent schemes and protect the investors.
There are other subtle ways in which the integrity of the Web is compromised. In spamdexing, or “black hat search engine optimization,” Web search engines are tricked in a number of ways into showing incorrect results for search queries. These tricks include posting unrelated text that matches the background color, so that it is visible to the search engines but not to the users, and hiding the unrelated text in the HTML code itself using empty DIVs.
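A hedged sketch of detecting one of these tricks appears below: flagging inline styles that hide content outright, or white-on-white text. Real search engines use far more robust rendering-based analysis; the regular expressions and sample HTML here are illustrative assumptions only.

```python
# Toy detector for two spamdexing tricks: content hidden via inline CSS
# (display:none / visibility:hidden) and white text on a white background.
import re

HIDDEN_STYLE = re.compile(
    r'style\s*=\s*"[^"]*(display\s*:\s*none|visibility\s*:\s*hidden)[^"]*"',
    re.IGNORECASE,
)
WHITE_ON_WHITE = re.compile(
    r'color\s*:\s*#fff\b.*background(-color)?\s*:\s*#fff\b',
    re.IGNORECASE | re.DOTALL,
)

def looks_like_hidden_text(html):
    """Heuristic check for invisibly styled (spam) text."""
    return bool(HIDDEN_STYLE.search(html) or WHITE_ON_WHITE.search(html))

spam = '<div style="display:none">cheap flights casino loans</div>'
clean = '<p>Welcome to our homepage.</p>'
print(looks_like_hidden_text(spam), looks_like_hidden_text(clean))
```

Heuristics like this are easy to evade (e.g., near-white colors, off-screen positioning), which is part of why spamdexing remains an arms race.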
The problem manifests in a number of forms. Almost every useful artifact has a potential for misuse. For instance, quite a few websites such as LinkedIn and Twitter shorten long URLs in the user posts, for ease of use.
There are websites like Tinyurl.com that provide this service of shortening URLs for free. The shortened URL often has no indication of the target page, a feature that has been misused to make gullible users visit websites they would not have otherwise visited.
The shortened URL is sent in emails with misleading but attractive captions that tend to make the recipients click on it, landing them on obscure websites. The Pew Research Center estimated that 66% of the URLs shared on Twitter come from bots, which are software programs that run on the Internet to automatically perform tasks that a human being is normally expected to do.
Along these lines is the technique of cloaked URL, which appears genuine, but may either have hidden control characters or use domain forwarding to conceal the actual address of the website to which the URL opens up. Spoofing websites in these and other ways has been going on since almost the inception of the Web.
Miscreants create websites that look exactly like the original authentic websites and have a similar URL, but with misleading content and fake news. A more serious form of spoofing is phishing, where the users are misled into sharing their sensitive information such as passwords, social security numbers, and credit card details when they visit a fake website that looks almost identical to the corresponding genuine website.
Note The problem of fraud on the Web takes many forms. The problem is complex from a technical and sociological perspective.
Web 2.0 generated several petabytes of data from people all over the world, so much so that today, the digital universe is mostly built by consumers and ordinary users rather than businesses. Consumer-generated data has grown manifold from the time the 2.0 avatar of the Web came into existence.
It implies that most of the data is not validated and does not come with the backing of an establishment like that of the business data. The Web’s role as the single source of truth in many instances has been misused to serve hidden agendas. In spite of its anthropomorphic role, unlike human beings, the Web does not have a conscience. Still, there is often more reliance on the Web than on the spoken word.
As with many inventions, such as atomic energy, when the Web was envisioned, the euphoria of its anticipated appropriate use seems to have dominated the caution needed to prevent misuse. On one hand, the Web may indeed not have grown as it did if it had been restricted to permit only authenticated, genuine, and high-quality content. On the other hand, the Web’s permissiveness is causing a complete lack of control over the huge chunk of Big Data that the Web contributes.
The original purpose of creating the Web was apparently to encourage peaceful, mutually beneficial communication among the people the world over. However, the loose controls enabled not just people, but also swarms of masquerading bots, to automatically post misleading information on the Web for malicious motives.
When the author presented his work on determining truth in Web microblogs at a conference, a research student in the audience expressed the opinion that the Web is a casual medium for information exchange and should not be judged for truthfulness. Insisting on a truthful Web would make it less interesting and far less used, was her rationalization.
This reflects the perception of a substantial number of contributors of Web content. The casual attitude of users posting information is one of the major causes of the lack of veracity of Web information.
The gray line between permitting constructive content to be posted and restricting abusive information is hard to discern. Policing malicious online conduct is certainly not easy.
There is plenty of incentive for perpetrating fraud on the Web. As we shall see in a later blog, even the presidential elections of a country were impacted by information posted on the Web, some of it incorrect. The biggest impact of public opinion is probably on politicians.
Political agenda is probably the most significant cause of manipulation using the Web. The next motivating factor is money, as we saw in the case of stock price manipulation. Monetary rewards also accrue in the form of advertising on spoofed and fake news websites. Revenge and public shaming are also common causes, so much so that a majority of the states in the United States have passed laws against posting revenge porn online.
Cognitive hacking is particularly effective on the Web because the interaction lacks a number of components that help in gauging the veracity of the spoken word. For instance, if the speaker is in front, the genuineness of the words is reflected in body language, eye contact, tone, and the continuous feedback loop between the speaker and the listener.
As lives become increasingly individualistic, people depend more and more on the Web for information, missing some of the veracity components shown in the bidirectional arrow, which are present in a live conversation. There are many cases where websites have been used to cash in on this situation and mislead the public for ulterior motives, often illegally.
In an increasing trend, software is being written not only to do what is told but also to craft its own course of action from the training data and a preferred outcome. This trend also is contributing to the growing amount of fake data. For instance, speech synthesis software is now capable of impersonating people’s voices.
Lyrebird, a company which makes such software, allows one to record a minute from a person’s voice. Using the recording, the company’s software generates a unique key, something like the DNA of a person’s voice. Using this key, the software can then generate any speech with the person’s voice.
Stanford University’s Face2Face4 can take a person’s face and re-render a video of a different person to make it appear as if the video is that of this new person. The software replaces the original face with the obtained one and generates the facial expressions of the faked face, making it appear as if the person in the video is the new person.
Interestingly, projects like these are started to detect video manipulation but end up being used more for actually manipulating videos. Combined with the speech-synthesizing software described in the paragraph above, it is possible to make faked videos look genuine. It is probably not far off in time that fake videos will be generated entirely from scratch.
Generating fraudulent content is inexpensive as well, contributing to the veracity issues. There are multiple businesses, which generate fake news and indulge in fraudulent manipulations on the Web for an alarmingly low cost. A Chinese marketing company, Xiezuobang, charges just $15 for an 800-word fake news article.
There are a number of companies that offer similar fraudulent services to manipulate data on the Web. Buying followers, +1’s, and likes on social media; inappropriately boosting a website’s or video’s rank in search results; and fraudulently clicking on “pay-per-click” advertisements are not only possible but also affordable. In fact, the cost of faking on the Web is often far less than that of advertising genuine content.
Some malicious hacking, such as “Click Fraud,” which impacts clickstream data, an important component of Big Data, has been reined in by the use of technology. But newer, affordable mechanisms to dismantle the integrity of Big Data keep springing up from time to time.
Note It is much easier to mislead people on the Web than in a face-to-face communication.
The impact of fraud on the Web has been grievous and even fatal in some cases. There is extensive literature on the ramifications of the falsity on the Web. Hard to believe, but the leniency of the Web infrastructure has permitted websites that advertise criminal services such as murder and maiming.
Such services are often based on lies and fraudulent information. Even the most popular websites, such as Craigslist and Facebook, have been used to seek hitmen. In one instance, a post on Facebook read, “… this girl knocked off right now” for $500. Many untold stories abound about much worse scenarios from the “Dark Web,” which operates on the Internet, but discreetly, requiring special software, setup, and authorizations for access.
The effects of fraud on the Web vary with the schemes we discussed earlier. For instance, popular among the stock price manipulation schemes is the “Pump-and-Dump” pattern, where a stock’s price is first “pumped” by posting misleading, incorrect, and upbeat information about the company on the Web, and the miscreants’ stocks are then “dumped” into the market at an inflated value.
The hype affects many investors and the company in the long run. However, the impact is probably felt most when the Web is used for anti-democratic activities, manipulating opinions to the tune of changing the outcomes of countries’ presidential elections.
Similar to the website spoofing we discussed earlier, typosquatting, or URL hijacking, occurs when a typo in the URL of a reputed website leads to an entirely different website that displays annoying advertisements or maliciously impacts the user in some other way. Whitehouse.com even today does not seem to have anything to do with the home of the president of the United States but instead was used to advertise unrelated stuff.
At least now the website includes the fine print that they “are not affiliated or endorsed by U.S. Government,” which was not the case a year ago. Users who mean to type whitehouse.gov but instead type whitehouse.com land on this unexpected website, which claims to be celebrating its 20 years of existence.
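One simple way to flag candidate typosquats is to measure how close an unfamiliar domain is to a well-known one. The sketch below uses Levenshtein edit distance; the trusted-domain list and the distance threshold are hypothetical choices for illustration, not a production defense.

```python
# Toy typosquatting detector: a domain within a small edit distance of a
# trusted domain (but not identical to it) is suspicious.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

TRUSTED = ["whitehouse.gov", "google.com", "amazon.com"]  # illustrative list

def possible_typosquat(domain, threshold=2):
    return any(0 < edit_distance(domain, t) <= threshold for t in TRUSTED)

print(possible_typosquat("whitehouse.com"))  # True: two edits from whitehouse.gov
print(possible_typosquat("nytimes.com"))     # False: not near any trusted domain
```

Real defenses also consider homoglyphs (e.g., substituting similar-looking Unicode characters), which pure edit distance misses.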
There are also cases of cybersquatting when domains were acquired with bad intentions. For instance, the domain, madonna.com was acquired by a person unrelated to the famous pop singer, Madonna. The website was used to display porn instead of information about the singer. The impact in such cases ranges from a mere annoyance to serious loss of reputation.
Back in 1995, PETA, which stands for “People for the Ethical Treatment of Animals,” was shamed by peta.org, which incorrectly talked about “People Eating Tasty Animals.” We continue to hear about how reputed websites get hacked and defaced.
A group called “Lizard Squad” has defaced the home pages of a number of companies, including Lenovo and Google (in Vietnamese), in the past. At one time, the United Kingdom’s Vogue magazine’s website was defaced to show dinosaurs wearing hats. Quite a few political websites, such as that of the militant group Hamas in 2001, have been hacked to redirect to porn websites.
Even Internet companies that wield substantial control over the Web have not been immune to these attacks. It was mentioned earlier that Google’s Vietnamese home page was defaced by the “Lizard Squad.” In 2013 and again in 2014, millions of Yahoo’s user accounts were hacked, adversely impacting the company’s net worth.
In its “State of Website Security in 2016” report, Google has warned that the number of hacked websites has increased by 32%, which is quite alarming. The hackers are becoming more ingenious with time and deploying advanced technologies to perpetrate crime.
There is the case of hackers using 3D rendering on Facebook photos to try to trick facial-recognition authentication systems into granting entry. Mobile Virtual Reality is capable of generating, from a 2D picture of someone’s face, the 3D detail needed by such authentication systems. As we know, a breach of a system’s security has significant consequences.
Inaccuracy on the Web has a ripple effect into the applications, which rely on the veracity of the Web. The problem gets compounded as more and more things from the Internet of Things space start leveraging the Web. If the devices connecting to the Web are not secure enough, bots can be used to attack websites in various ways or even bring them down.
The cause and effect of the problem of the veracity of Big Data form a vicious cycle, one feeding into the other. An effective remedy needs to break this vicious cycle. Extinguishing the causes or diminishing the effects or both can help in containing the problem. Disincentivizing, punishing, and preventing the occurrence are some good strategies that various entities fighting Web-based fraud have used.
Given the repercussions and ramifications of fraud on the Web, several governments have made it a national priority to address the abuse. Governments like that of Russia maintain websites that track fake news. Interestingly, the European Union too reviews and posts disinformation,8 though specifically disinformation that is pro-Russian-government.
There are also a number of independent organizations and websites, like Snopes.com, that are constantly involved in debunking false information floating on the Web and the Internet in general. Social media providers too are actively engaged in curbing falsity on their sites.
Suspension of fake accounts, awareness campaigns, software enhancements to flag fraud in the newsfeeds, and allowing users to provide credibility ratings are some of the experiments that the Internet companies are trying. Governments have been passing laws against online crime from time to time.
A number of crimes highlighted in the previous sections are covered by various sections of the law of the land. For instance, the person posting false tweets about the two companies, causing their stock prices to go down substantially, was charged under Section 10(b) of the Securities Exchange Act of 1934 and Rule 10b-5.
The crime also caused SEC to issue an investor alert9 in November 2015. However, the law or the alerts haven’t caused the Web fraud in the securities market to cease. New cases are still being unearthed and prosecuted.
The FBI set up a form at its Internet Crime Complaint Center (IC3) for reporting online crimes and publishes an annual report on the topic. However, the author himself reported a grievous crime using the form but did not hear back from the FBI or even get a reference number for the report, suggesting that the legal remedies are inadequate.
Moreover, for much of the mischief played on the Web, the law either does not exist or is too lenient. It is imperative that the problem be addressed using techniques not just from the legal discipline but from other disciplines, particularly technology.
Technology itself has played an important role in resolving several legal mysteries, but it is far from getting the credibility it needs. A simple application of Bayes’ theorem played a major role in a number of legal cases, some of which are listed at https://sites.google.com/site/bayeslegal/legal-cases-relevant-to-Bayes.
As of this writing, the website lists 81 such cases. However, as multiple blogs and an Internet search for the string “and to express the probability of some event having happened in percentage terms” reveal, a British court of appeals banned the use of Bayes’ theorem. There is a need to instill confidence in technological solutions to the problem of veracity by strengthening the approaches.
Note A simple application of the Bayes’ theorem learned in high school math helped solve the mysteries of numerous legal cases.
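The high school Bayes' theorem alluded to in the note can be shown with a short worked example. The numbers below are hypothetical: a piece of evidence that is 99% sensitive and 95% specific, for a hypothesis with a 1% base rate.

```python
# Worked Bayes' theorem example with invented numbers.
def bayes_posterior(prior, sensitivity, specificity):
    # P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ~H) P(~H)]
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

posterior = bayes_posterior(prior=0.01, sensitivity=0.99, specificity=0.95)
print(f"P(hypothesis | evidence) = {posterior:.3f}")
# Despite the strong evidence, the posterior is only about 0.167, because
# the base rate is low. Ignoring the base rate is the classic trap in
# courtroom reasoning (the "prosecutor's fallacy").
```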
Privacy protections make prosecuting an online crime difficult, particularly when the crime occurs across national borders. Even if prosecution succeeds, the damage done is often irreparable, and the time and material lost are irretrievable. The culprits often do not have the resources to pay the damages, such as in the case of the loss to a company’s net worth owing to a few individuals’ “Pump-and-Dump” activity.
Hence, prevention is more important than punishment. Technology is far better at prevention than legal remedies. The subsequent blogs delve more into the technologies that can help in preventing online fraud.
Current technology also needs to be reoriented to address the veracity concerns of data. As an example, by prioritizing popularity over trustworthiness in ranking search results, search algorithms like PageRank ignore the problem. Ranking retrieved information by its trustworthiness is a much more difficult task than recursively computing the number of links to a page, which is what current search algorithms focus on.
Internet companies such as Facebook, Google, and Twitter have been coming up with ways to counter fake and malicious content. Google, in the fourth quarter of 2016, for instance, banned 200 companies in its ad network for their misleading content. Facebook, in the first 90 days of 2018, deleted 583 million fake accounts and 865.8 million posts in addition to blocking millions of attempts to create fake accounts every day.
The real world has been a source of many engineering solutions. Quite a few concepts in Computer Science, such as the Object-Oriented paradigm, are inspired by real-world phenomena. Veracity is a major problem in the real world too. How is it tackled there? How do law enforcement officers know if a person is telling the truth? It is with the experience of examining a number of people involved in a variety of cases that they are able to make out the truth.
A similar concept in the software world is Machine Learning. By examining patterns in a number of cases (training data), Machine Learning algorithms are able to come to conclusions. One of the conclusions can be whether the data is true or false. We will examine more details about how the Machine Learning techniques are used to solve the Big Data veracity problem.
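As a toy sketch of the idea (the headlines and labels below are invented), a naive Bayes text classifier can learn word patterns from a handful of labeled examples and then judge unseen text:

```python
from collections import Counter
import math

def train(examples):
    """examples: list of (text, label) pairs. Returns word counts and document totals per label."""
    counts = {"real": Counter(), "fake": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Score each label with log prior + smoothed log likelihoods; return the best."""
    vocab = set(counts["real"]) | set(counts["fake"])
    scores = {}
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("stock prices rose modestly amid earnings reports", "real"),
    ("central bank holds interest rates steady", "real"),
    ("miracle stock guaranteed to double your money overnight", "fake"),
    ("secret trick doubles your money guaranteed", "fake"),
]
counts, totals = train(training)
print(classify("guaranteed money overnight", counts, totals))
```

Real systems train on millions of labeled examples and far richer features, but the principle is the same: patterns learned from known cases are used to judge the veracity of new data.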
Biometric identification, two-factor authentication, strong passwords, and such basic security features may be able to prevent data breaches, hacking, and information theft. But as we saw, like thieves that are always one step ahead, hackers are much more ingenious in their ways.
Can we develop a Central Intelligence Agency (CIA)-like platform to detect the threats from hackers? The Threat Intelligence Platform (TIP) comes close to this. Just like the CIA, the TIP envisages gathering threat information from various sources, correlating and analyzing this information, and acting based on the analysis. The field is still emerging, and companies such as Lockheed Martin are working on robust solutions.
Note The real world has been a source of inspiration for many engineering solutions, including to the problem of the veracity of Big Data.
In spite of the vulnerability of the Web, a tiny segment of it reasonably maintains its integrity. Wikipedia pages are relied upon not only by individuals but also by search companies. In spite of allowing anonymity, barring a few incidents of conflicted and “revenge” editing, Wikipedia has immensely benefited from its contributors in maintaining the integrity and high standards of its content.
Unlike the real-world adage of too many cooks spoiling the broth, collaborative editing has surprisingly proven to be highly successful. It is yet to be seen whether collaborative editing is the silver-bullet remedy to the problem of veracity, but in general, what makes a website trustworthy? In the next section, we will examine the ingredients of websites that look relatively more reliable.
Characteristics of a Trusted Website
Trusted websites have certain characteristic features. A trusted website is analogous to a trusted person. We trust a person when he is objective, devoid of any hidden agenda or ulterior motive, represents an established company, speaks in clear and formal sentences, demonstrates expertise, is helpful, is recommended by trustworthy people, is easily accessible, appears formal and genuine in general, and keeps abreast of the latest developments, to name a few traits. The same goes for websites. We normally do not suspect weather websites because the information is objective.
There is hardly any motive to manipulate the weather information. The information is simple and easy to interpret. It is not hard to see if the information is real because the temperature or humidity can be readily felt and contrasted with what is posted on the webpage. Any predictions can also be historically correlated with real happenings. All weather websites worth their name keep themselves updated often, so as to stay in business.
Correlation of the information to established facts is also a good indication of veracity. For instance, if the webpage has a number of journal articles referenced, it is likely to be more trustworthy. The source of a website, such as the government or a reputed publisher also is a factor in determining the credibility of a website.
The tone, style, clarity, and formality of the language used are also good indicators of the veracity of the information. A number of advertisements appearing on the webpage, particularly the annoying ones popping up windows, blocking content, or forcing a mouse click, make the website less reliable. Some of these advertisements can even inject spyware or viruses.
Authentic websites tend to include contact information such as a phone number and an email address. Factors such as appearance, consistency, quality of language used, community involvement and support, intelligibility, succinctness, granularity, well-organized and layered structure, currency, and usefulness of the information also contribute to the veracity of the information presented on a website.
On the other hand, the absence of any individual names or contact information, misspellings, unprofessional look and layout, vague references, intrusive advertisements, and such are tell-tale signs of a fraudulent website.
Note A trusted website often exhibits many characteristics of a trusted person.
BASE (Basically Available, Soft State, Eventual Consistency)
BASE follows an optimistic approach, accepting stale data and approximate answers while favoring availability. Some ways to achieve this are supporting partial failures without total system failure, decoupling updates on different tables (i.e., relaxing consistency), and using idempotent operations that can be applied multiple times with the same result.
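Idempotency is easy to see in a tiny sketch: an update that sets an absolute value can be retried safely after a lost acknowledgment, while one that applies a delta cannot.

```python
account = {"balance": 100}

def set_balance(acct, value):
    # Idempotent: applying the same update twice leaves the same state.
    acct["balance"] = value

def add_amount(acct, delta):
    # Not idempotent: a duplicate delivery changes the result.
    acct["balance"] += delta

set_balance(account, 150)
set_balance(account, 150)   # a retried message does no harm
```

This is why BASE-style systems favor absolute-value writes (or deduplicated operations) over raw increments when messages may be redelivered.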
In this sense, BASE describes more a spectrum of architectural styles than a single model. The eventual state of consistency can be provided as a result of a read-repair, where any outdated data is refreshed with the latest version of the data as a result of the system detecting stale data during a read operation.
Another approach is that of weak consistency. In this case, the read operation will return the first value found, not checking for staleness. Any stale nodes discovered are simply marked for updating at some stage in the future. This is a performance-focused approach but has the associated risk that data retrieved may not be the most current. In the following sections, we will discuss several techniques for implementing services following the BASE principle.
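A minimal sketch of read-repair (a toy in-process model, not a real distributed store) might look like this:

```python
class Replica:
    def __init__(self):
        self.value, self.version = None, 0

class EventualStore:
    """Toy eventually consistent store that repairs stale replicas on read."""
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, value, replica=0):
        # A write lands on a single replica; the others become stale.
        r = self.replicas[replica]
        r.value, r.version = value, r.version + 1

    def read(self):
        # Return the newest version seen, refreshing stale replicas
        # in passing (read-repair).
        newest = max(self.replicas, key=lambda r: r.version)
        for r in self.replicas:
            if r.version < newest.version:
                r.value, r.version = newest.value, newest.version
        return newest.value

store = EventualStore()
store.write("profile updated")   # only replica 0 has the new value
latest = store.read()            # the read repairs the two stale replicas
```

A weak-consistency read would simply skip the repair loop and return whatever the first replica holds, trading freshness for speed.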
Conventional storage techniques may not be adequate for big data and, hence, for cloud applications. To scale storage systems to cloud scale, the basic technique is to partition and replicate the data over multiple independent storage systems. The word independent is emphasized since it is well known that databases can be partitioned into mutually dependent subdatabases that are automatically synchronized for reasons of performance and availability.
Partitioning and replication increase the overall throughput of the system since the total throughput of the combined system is the aggregate of the individual storage systems. To scale both the throughput and the maximum size of the data that can be stored beyond the limits of traditional database deployments, it is possible to partition the data and store each partition in its own database.
For scaling the throughput only, it is possible to use replication. Partitioning and replication also increase the storage capacity of a storage system by reducing the amount of data that needs to be stored in each partition. However, this creates synchronization and consistency problems and discussion of this aspect is out of scope for this blog.
The other technology for scaling storage is known as Not Only SQL (NoSQL). NoSQL was developed as a reaction to the perception that conventional databases, focused on the need to ensure data integrity for enterprise applications, were too rigid to scale to cloud levels. As an example, conventional databases enforce a schema on the data being stored, and changing the schema is not easy. However, changing the schema may be a necessity in a rapidly changing environment like the cloud.
NoSQL storage systems provide more flexibility and simplicity compared to relational databases. The disadvantage, however, is greater application complexity. NoSQL systems, for example, do not enforce a rigid schema.
The trade-off is that applications have to be written to deal with data records of varying formats (schema). BASE is the NoSQL operating premise, in the same way that traditional, transactionally focused databases use ACID: one moves from a world of certainty in terms of data consistency to a world where all we are promised is that all copies of the data will, at some point, be the same.
Partitioning and replication techniques used for scaling are as follows:
1. The first possible method is to store different tables in different databases (as in multidatabase systems, MDBS).
2. The second approach is to partition the data within a single table onto different databases. There are two natural ways to partition the data within a table: store different rows in different databases, or store different columns in different databases (the latter being more common for NoSQL databases).
As stated previously, one technique for partitioning the data to be stored is to store different tables in different databases, leading to the storage of the data in an MDBS.
To increase the throughput of transactions from the database, it is possible to have multiple copies of the database. A common replication method is master-slave replication. The master and slave databases are replicas of each other. All writes go to the master, and the master keeps the slaves in sync.
However, reads can be distributed to any database. Since this configuration distributes the reads among multiple databases, it is a good technology for read-intensive workloads.
For write-intensive workloads, it is possible to have multiple masters, but then ensuring consistency when multiple processes update different replicas simultaneously is a complex problem. Additionally, the time to write increases due to the necessity of writing to all masters, and the synchronization overhead between the masters rapidly becomes a limiting factor.
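A toy model of master-slave replication (in-process dictionaries standing in for databases) shows how writes funnel through the master while reads are spread over every replica:

```python
import itertools

class ReplicatedDB:
    """Toy master-slave replication: writes go to the master, which
    synchronously copies them to the slaves; reads round-robin over all nodes."""
    def __init__(self, n_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self._readers = itertools.cycle([self.master] + self.slaves)

    def write(self, key, value):
        self.master[key] = value
        for s in self.slaves:           # the master keeps the slaves in sync
            s[key] = value

    def read(self, key):
        return next(self._readers)[key] # distribute reads among replicas

db = ReplicatedDB()
db.write("customer:7", "active")
```

With three nodes serving reads, a read-intensive workload sees roughly three times the read throughput of a single database, while write cost grows with the number of replicas to keep in sync.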
Row Partitioning or Sharding
In cloud technology, sharding is used to refer to the technique of partitioning a table among multiple independent databases by row. However, partitioning of data by row in relational databases is not new and is referred to as horizontal partitioning in parallel database technology. The distinction between sharding and horizontal partitioning is that horizontal partitioning is done transparently to the application by the database, whereas sharding is explicit partitioning done by the application.
However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we first discuss different horizontal partitioning techniques. It can be seen that a good sharding technique depends on both the organization of the data and the type of queries expected.
The different techniques of sharding are as follows:
1. Round-robin partitioning:
The round-robin method distributes the rows in a round-robin fashion over different databases. In the example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on.
The advantage of round-robin partitioning is its simplicity. However, it also suffers from the disadvantage of losing associations (say) during a query, unless all databases are queried. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.
2. Hash partitioning:
In this method, the value of a selected attribute is hashed to find the database into which the tuple should be stored. If queries are frequently made on an attribute (say, Customer_Id), then associations can be preserved by using this attribute as the one that is hashed, so that records with the same value of this attribute are found in the same database.
3. Range partitioning:
This technique stores records with “similar” attribute values in the same database. For example, the range of Customer_Id could be partitioned among different databases. Again, if the attributes chosen for grouping are those on which queries are frequently made, record association is preserved and it is not necessary to merge results from different databases.
Range partitioning can be susceptible to load imbalance unless the partitioning is chosen carefully. A poor choice of partitions can produce an imbalance in the amount of data stored in the partitions (data skew) or in the execution of queries across partitions (execution skew). These problems are less likely in round-robin and hash partitioning since they tend to distribute the data uniformly over the partitions.
Thus, hash partitioning is particularly well suited to large-scale systems. Round-robin yields a uniform distribution of records but does not facilitate the restriction of operations to single partitions. While range partitioning does support this, it requires knowledge of the data distribution in order to properly adjust the ranges.
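The three strategies can be sketched as simple routing functions; the shard count and Customer_Id ranges below are illustrative:

```python
import hashlib

N_SHARDS = 4

def round_robin_shard(row_index):
    # Rows are dealt out in turn: 0, 1, 2, 3, 0, 1, ...
    return row_index % N_SHARDS

def hash_shard(customer_id):
    # Hash the chosen attribute; equal Customer_Ids always land on the same shard.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# Illustrative Customer_Id ranges, one per shard.
RANGES = [(0, 2500), (2500, 5000), (5000, 7500), (7500, 10000)]

def range_shard(customer_id):
    # "Similar" ids are grouped in the same database.
    for shard, (lo, hi) in enumerate(RANGES):
        if lo <= customer_id < hi:
            return shard
    raise ValueError("customer_id outside configured ranges")
```

Note how only `range_shard` needs knowledge of the data distribution (the `RANGES` table), which is exactly the tuning burden mentioned above.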
Row versus Column-Oriented Data Layouts
Most traditional database systems employ a row-oriented layout, in which all the values associated with a specific row are laid out consecutively in memory. That layout may work well for transaction processing applications that focus on updating specific records associated with a limited number of transactions (or transaction steps) at a time.
Analytic workloads, by contrast, are manifested as algorithmic scans performed using multiway joins; accessing whole rows at a time when only the values of a smaller set of columns are needed may flood the network with extraneous data that is not immediately needed and ultimately will increase the execution time.
Big data analytics applications scan, aggregate, and summarize over massive datasets. Analytical applications and queries will only need to access the data elements needed to satisfy join conditions. With row-oriented layouts, the entire record must be read in order to access the required attributes, with significantly more data read than is needed to satisfy the request.
Also, the row-oriented layout is often misaligned with the characteristics of the different types of memory systems (core, cache, disk, etc.), leading to increased access latencies. Subsequently, row-oriented data layouts will not enable the types of joins or aggregations typical of analytic queries to execute with the anticipated level of performance.
Hence, a number of appliances for big data use a database management system that uses an alternate, columnar layout for data that can help to reduce the negative performance impacts of data latency that plague databases with a row-oriented data layout. The values for each column can be stored separately, and because of this, for any query, the system is able to selectively access the specific column values requested to evaluate the join conditions.
Instead of requiring separate indexes to tune queries, the data values themselves within each column form the index. This speeds up data access while reducing the overall database footprint, while dramatically improving query performance. The simplicity of the columnar approach provides many benefits, especially for those seeking a high-performance environment to meet the growing needs of extremely large analytic datasets.
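The difference between the two layouts can be sketched in a few lines: aggregating one attribute touches every field of every record in the row layout, but only a single contiguous array in the columnar layout.

```python
# Row-oriented layout: each record's values are stored together.
rows = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 80.0},
    {"id": 3, "region": "east", "amount": 45.0},
]

# Column-oriented layout: each column's values are stored together.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 80.0, 45.0],
}

# The aggregation reads every field of every row in the row layout...
total_row = sum(r["amount"] for r in rows)

# ...but only the one needed column in the columnar layout.
total_col = sum(columns["amount"])
```

On disk, the columnar version also compresses better, since values of one type and similar range sit next to each other.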
NoSQL Data Management
NoSQL or “not only SQL” suggests environments that combine traditional SQL (or SQL-like query languages) with alternative means of querying and access. NoSQL data systems hold out the promise of greater flexibility in database management while reducing the dependence on more formal database administration.
NoSQL databases have more relaxed modeling constraints, which may benefit both the application developer and the end-user analysts when their interactive analyses are not throttled by the need to cast each query in terms of a relational table-based environment.
Different NoSQL frameworks are optimized for different types of analyses. For example, some are implemented as key-value stores, which align nicely with certain big data programming models, while another emerging model is a graph database, in which a graph abstraction is implemented to embed both semantics and connectivity within its structure.
In fact, the general concepts for NoSQL include schemaless modeling in which the semantics of the data is embedded within a flexible connectivity and storage model; this provides for automatic distribution of data and elasticity with respect to the use of computing, storage, and network bandwidth in ways that do not force specific binding of data to be persistently stored in particular physical locations.
NoSQL databases also provide for integrated data caching that helps reduce data access latency and speed performance. A relatively simple type of NoSQL data store is a key-value store, a schemaless model in which distinct character strings called keys are associated with values (or sets of values, or even more complex entity objects), not unlike a hash table data structure.
If you want to associate multiple values with a single key, you need to consider the representations of the objects and how they are associated with the key. For example, you may want to associate a list of attributes with a single key, which may suggest that the value stored with the key is yet another key-value store object itself.
Key-value stores are essentially very long, and presumably thin, tables (in that there are not many columns associated with each row). The table’s rows can be sorted by the key value to simplify finding the key during a query. Alternatively, the keys can be hashed using a hash function that maps the key to a particular location (sometimes called a “bucket”) in the table.
The representation can grow indefinitely, which makes it good for storing large amounts of data that can be accessed relatively quickly, and it allows massive amounts of indexed data values to be appended to the same key-value table, which can then be sharded or distributed across the storage nodes.
Under the right conditions, the table is distributed in a way that is aligned with the way the keys are organized, so that the hashing function that is used to determine where any specific key exists in the table can also be used to determine which node holds that key’s bucket (i.e., the portion of the table holding that key).
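A minimal sketch of such a key-value store, with one hash function both placing each key in a bucket and identifying the node that owns it (a toy in-process model):

```python
class KeyValueStore:
    """Toy distributed key-value store: hash(key) selects the node,
    and each node is itself a hash table of key-value pairs."""
    def __init__(self, n_nodes=4):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # The same hash used for placement also locates the key later.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

kv = KeyValueStore()
kv.put("user:1", {"name": "Asha"})
```

Because placement and lookup share one function, no central directory is needed; any client that knows the hash function and node count can route a request directly.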
NoSQL data management environments are engineered for the following two key criteria:
1. Fast accessibility, whether that means inserting data into the model or pulling it out via some query or access method.
2. Scalability for volume, so as to support the accumulation and management of massive amounts of data.
The different approaches are amenable to extensibility, scalability, and distribution, and these characteristics blend nicely with programming models (like MapReduce) that allow straightforward creation and execution of many parallel processing threads. Distributing a tabular data store or a key-value store allows many queries/accesses to be performed simultaneously, especially when the hashing of the keys maps to different data storage nodes.
Employing different data allocation strategies will allow the tables to grow indefinitely without requiring significant rebalancing. In other words, these data organizations are designed for high-performance computing of reporting and analysis.
The idea of running databases in memory was used by the business intelligence (BI) product company QlikView. In-memory processing allows massive quantities of data to be handled in main memory to provide immediate results from analysis and transactions. The data to be processed is ideally real-time data, or as close to real time as is technically possible.
Data in main memory (RAM) can be accessed 100,000 times faster than data on a hard disk; this can dramatically decrease access time to retrieve data and make it available for the purpose of reporting, analytics solutions, or other applications.
The medium a database uses to store data, that is, RAM, is divided into pages. In-memory databases save changed pages in savepoints, which are asynchronously written to persistent storage at regular intervals. Each committed transaction generates a log entry that is written to nonvolatile storage; this log is written synchronously.
In other words, a transaction does not return before the corresponding log entry has been written to persistent storage, in order to meet the durability requirement described earlier, thus ensuring that in-memory databases meet (and pass) the ACID test. After a power failure, the database pages are restored from the savepoints; the database logs are applied to restore the changes that were not captured in the savepoints. This ensures that the database can be restored in memory to exactly the same state as before the power failure.
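A toy model of this savepoint-plus-log scheme (ignoring real durability, which requires nonvolatile storage) might look like this:

```python
class InMemoryDB:
    """Toy model of in-memory persistence: a synchronous log entry per
    committed change, periodic savepoints, and recovery by log replay."""
    def __init__(self):
        self.pages = {}       # main memory ("RAM" pages)
        self.savepoint = {}   # persistent copy, written at intervals
        self.log = []         # persistent log, written synchronously

    def commit(self, key, value):
        self.log.append((key, value))  # log entry written before commit returns
        self.pages[key] = value

    def take_savepoint(self):
        self.savepoint = dict(self.pages)
        self.log = []                  # these changes are now in the savepoint

    def recover(self):
        self.pages = dict(self.savepoint)   # restore pages from the savepoint
        for key, value in self.log:         # replay changes made afterwards
            self.pages[key] = value

db = InMemoryDB()
db.commit("a", 1)
db.take_savepoint()
db.commit("b", 2)
db.pages = {}       # simulate a power failure wiping main memory
db.recover()        # both the savepointed and the logged change come back
```

The key ordering constraint is in `commit`: the log entry is persisted before the transaction is considered done, which is what makes the replay in `recover` complete.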
Developing Big Data Applications
For most big data appliances, the ability to achieve scalability to accommodate growing data volumes is predicated on multiprocessing—distributing the computation across the collection of computing nodes in ways that are aligned with the distribution of data across the storage nodes.
One of the key objectives of using a multiprocessing nodes environment is to speed application execution by breaking up large “chunks” of work into much smaller ones that can be farmed out to a pool of available processing nodes. In the best of all possible worlds, the datasets to be consumed and analyzed are also distributed across a pool of storage nodes.
As long as there are no dependencies forcing any one specific task to wait to begin until another specific one ends, these smaller tasks can be executed at the same time, that is, “task parallelism.” More than just scalability, it is the concept of “automated scalability” that has generated the present surge of interest in big data analytics (with the corresponding optimization of costs).
A good development framework will simplify the process of developing, executing, testing, and debugging new application code, and this framework should include the following:
1. A programming model and development tools
2. Facility for program loading, execution, and process and thread scheduling
3. System configuration and management tools
The context for all of these framework components is tightly coupled with the key characteristics of a big data application: algorithms that take advantage of running lots of tasks in parallel on many computing nodes to analyze lots of data distributed among many storage nodes. Typically, a big data platform will consist of a collection (or a “pool”) of processing nodes; optimal performance is achieved when all the processing nodes are kept busy, which means maintaining a healthy allocation of tasks to idle nodes within the pool.
Any big data application that is to be developed must map to this context, and that is where the programming model comes in. The programming model essentially describes two aspects of application execution within a parallel environment:
How is an application coded?
How does that code map to the parallel environment?
The MapReduce programming model is a combination of the familiar procedural/imperative approach used by Java or C++ programmers and what is effectively a functional language programming model, such as that used within languages like Lisp and APL. The similarity is based on MapReduce’s dependence on two basic operations that are applied to sets or lists of data value pairs:
1. Map, which describes the computation or analysis applied to a set of input key-value pairs to produce a set of intermediate key-value pairs.
2. Reduce, in which the set of values associated with the intermediate key-value pairs output by the Map operation are combined to provide the results.
A MapReduce application is envisioned as a series of basic operations applied in a sequence to small sets of many (millions, billions, or even more) data items. These data items are logically organized in a way that enables the MapReduce execution model to allocate tasks that can be executed in parallel.
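The classic word-count example sketches the model; the map, shuffle, and reduce steps below run sequentially, whereas a real MapReduce platform would execute them across many nodes:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate pairs by key before the reduce step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine the values associated with each intermediate key.
    return key, sum(values)

docs = ["big data is big", "data is data"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# counts == {"big": 2, "data": 3, "is": 2}
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, both phases parallelize naturally across nodes.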
The Aadhaar project, undertaken by the Unique Identification Authority of India (UIDAI), has the mission of identifying 1.2 billion citizens of India uniquely and reliably to build the largest biometric identity repository in the world (while eliminating duplication and fake identities) and to provide an online, anytime, anywhere, multifactor authentication service. This makes it possible to identify any individual and authenticate that identity at any time, from any place in India, in less than a second.
The UIDAI project is a Hadoop-based program that is well into production. At the time of this writing, over 700 million people have been enrolled and their identity information has been verified. The target is to reach a total of at least 1 billion enrolments during 2015. Currently, the enrolment rate is about 10 million people every 10 days, so the project is well positioned to meet that target.
In India, there is no social security card, and much of the population lacks a passport. Literacy rates are relatively low, and the population is scattered across hundreds of thousands of villages. Without adequately verifiable identification, it has been difficult for many citizens to set up a bank account or otherwise participate in a modern economy.
For India’s poorer citizens, this problem has even more dire consequences. The government has extensive programs to provide widespread relief for the poor—for example, through grain subsidies to those who are underfed and through government-sponsored work programs for the unemployed. Yet many who need help do not have access to benefit programs, in part because of the inability to verify who they are and whether they qualify for the programs.
In addition, there is a huge level of so-called “leakage”: government aid that disappears through apparent fraud. For example, it has been estimated that over 50% of funds intended to provide grain to the poor go missing and that fraudulent claims for “ghost workers” siphon off much of the aid intended to create work for the poor.
There are clearly immense benefits from a mechanism that uniquely identifies a person and ensures instant identity verification. The need to prove one’s identity only once will bring down transaction costs. A clear identity number can transform the delivery of social welfare programs by making them more inclusive of those communities now cut off from such benefits due to their lack of identification.
It also enables the government to shift from indirect to direct benefit delivery by directly reaching out to the intended beneficiaries. A single universal identity number is also useful in eliminating fraud and duplicate identities, since individuals can no longer represent themselves differently to different agencies. This results in significant savings to the state exchequer.
Aadhaar is in the process of creating the largest biometric database in the world, one that can be leveraged to authenticate identities for each citizen, even on site in rural villages. A wide range of mobile devices from cell phones to micro scanners can be used to enroll people and to authenticate their identities when a transaction is requested. People will be able to make payments at remote sites via micro-ATMs. Aadhaar ID authentication will be used to verify qualification for relief food deliveries and to provide pension payments for the elderly.
Implementation of this massive digital identification system is expected to save the equivalent of millions and perhaps billions of dollars each year by thwarting efforts at fraud. While the UIDAI project will have broad benefits for the Indian society as a whole, the greatest impact will be for the poorest people.
All application components are built using open source components and open standards. Aadhaar software currently runs across two data centers within India managed by UIDAI, handles 1 million enrolments a day, and at peak performs about 600 trillion biometric matches a day.
The current system already has about 4 PB (4,000 terabytes) of raw data and continues to grow as new enrolments come in. Aadhaar Authentication service is built to handle 100 million authentications a day across both the data centers in an active-active fashion and is benchmarked to provide sub-second response time. Central to Aadhaar system is its biometric subsystem that performs de-duplication and authentication in an accurate way.
Application modules are built on a common technology platform that contains frameworks for persistence, security, messaging, etc. The platform standardizes on a technology stack based on open standards, using open source where prudent. A list of extensively used open source technology stacks follows:
Spring Framework—Application container for all components and runtime
Spring Batch—Runtime for various batch jobs
Spring Security—For all application security needs
Mule ESB—Runtime for loosely coupled SEDA stages of enrolment flow
Hadoop Stack—HDFS, Hive, HBase, Pig, and Zookeeper
Pentaho—Open source BI Solution
Quartz—Scheduling of batch jobs
MongoDB—NoSQL Document Database
MySQL—RDBMS for storing relational data
Apache Solr—Index for full-text search
Apache Tomcat—Web container
Several other open source libraries for random number generation, hashing, advanced data structures, HTML UI components, and others.
Big Data Technologies
Relational databases and data warehouses are necessary to run corporate businesses. They were designed for specific functionality in which they excel. Hadoop was not designed to replace them. Hadoop adds additional capabilities and features that together with relational databases and data warehouses are defining the next evolution of enterprise data platforms.
Previously, it was believed that you could have any two of speed, reliability, and low price, but not all three at the same time. Hadoop changes that: with Hadoop you can have speed, reliability, and an optimal price (commodity hardware) at the same time.
Hadoop is a software solution in which all the components are designed from the ground up to be an extremely parallel, high-performance platform that can store large volumes of information cost-effectively. It handles very large ingestion rates; easily works with structured, semistructured, and unstructured data; eliminates the business data latency problem; is extremely low cost in relation to traditional systems; has a very low entry cost point; and is linearly scalable in cost-effective increments.
A Hadoop cluster can be purchased with a small capitalization expenditure (CAPEX) and run under a greatly reduced operational expenditure (OPEX) when compared to proprietary vertically scalable platforms.
Hadoop greatly reduces the cost of managing terabytes to petabytes of data and provides a distributed processing framework that finishes tasks as fast as it can by utilizing commodity hardware. It essentially is a Swiss army knife for big data projects because of the vast array of data access tools it exposes to various users: analysts, data scientists, and many more. Batch processing in Hadoop is old news but still important.
The advent of interactive and real-time data processing is now a reality and critical for organizations. Fast data that allows decisions in near-real time can be just as important as big data capabilities.
Functional Programming Paradigm
The functional programming paradigm treats computation as the evaluation of mathematical functions with zero (or minimal) maintenance of state or data updates. As opposed to procedural programming in languages such as C or Java, it emphasizes that the application is written entirely as functions that do not save any state. Such functions are called pure functions.
This is the first similarity with the MapReduce abstraction. All input and output values are passed as parameters, and the map and reduce functions are not expected to save state. However, values can be input from and output to a file system or a database to ensure persistence of the computed data. Programs written using pure functions eliminate side effects.
So, the output of a pure function depends solely on the inputs provided to it. Calling a pure function twice with the same value for an argument will give the same result both times. Lisp is one such popular functional programming language in which two powerful recursion schemes—called map and reduce—enable powerful decomposition and reuse of code.
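These recursion schemes are available in most modern languages; as a minimal illustration (in Python rather than Lisp), the following sketch composes pure functions with the built-in `map` and `functools.reduce`:

```python
from functools import reduce

# A pure function: its result depends only on its argument,
# and it modifies no external state.
def square(x):
    return x * x

values = [1, 2, 3, 4]

# map applies the pure function independently to every element...
squares = list(map(square, values))

# ...and reduce folds the intermediate values into a single result.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares, total)  # [1, 4, 9, 16] 30
```

Because `square` is pure, calling it twice with the same argument always yields the same value, which is exactly what allows the applications of `map` to be reordered or parallelized.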
Parallel Architectures and Computing Models
MapReduce provides a parallel execution platform for data parallel applications. The following two sections describe core concepts involved in understanding such systems.
A variant of SIMD (single instruction, multiple data) is called the single program, multiple data (SPMD) model, in which the same program executes on multiple computer processes. While SIMD can achieve the same result as SPMD, SIMD systems typically execute in lockstep with a central controlling authority for program execution. As can be seen, when multiple instances of the map function are executed in parallel, they work on different data streams using the same map function.
In essence, though the underlying hardware can be a MIMD machine (a computer cluster), the MapReduce platform follows an SPMD model to reduce programming effort. Of course, while this holds for simple use cases, a complex application may involve multiple phases—each of which is solved with MapReduce—in which case the platform will be a combination of SPMD and MIMD.
Data Parallelism versus Task Parallelism
Data parallelism is a way of performing parallel execution of an application on multiple processors. It focuses on distributing data across different nodes in the parallel execution environment and enabling simultaneous sub-computations on these distributed data across the different compute nodes.
This is typically achieved in SIMD mode (single instruction, multiple data mode) and can have either a single controller controlling the parallel data operations or multiple threads working in the same way on the individual compute nodes (SPMD). In contrast, task parallelism focuses on distributing parallel execution threads across parallel computing nodes.
These threads may execute the same or different code and exchange messages through shared memory or explicit communication messages, as per the parallel algorithm. In the most general case, each of the threads of a task-parallel system can be performing completely different tasks while coordinating to solve a specific problem. In the most simplistic case, all threads can be executing the same program and, based on their node IDs, differentiating to perform any variation in task responsibility.
Most common task-parallel algorithms follow the master–worker model, where there is a single master and multiple workers. The master distributes the computation to different workers based on scheduling rules and other task-allocation strategies. MapReduce falls under the category of data-parallel SPMD architectures.
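The master–worker model can be sketched in Python, with a thread pool standing in for the workers; the chunking rule and the worker function below are illustrative assumptions, not part of any particular framework:

```python
from concurrent.futures import ThreadPoolExecutor

# Each worker performs the same computation on its own chunk of work.
def worker(chunk):
    return sum(chunk)

data = list(range(100))
# The master splits the job into fixed-size chunks...
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# ...schedules one task per chunk on the worker pool...
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(worker, chunks))

# ...and combines the partial results.
result = sum(partials)
print(result)  # 4950
```

A real master would also apply scheduling rules (load, locality, retries); this sketch shows only the distribute-and-collect shape of the pattern.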
Owing to the use of functional programming paradigm, the individual mapper processes that process the split data are not aware of (or dependent on) the results of other mapper processes. Also, since the order of execution of the mapper function does not matter, one can reorder or parallelize the execution.
Thus, this inherent parallelism enables the mapper function to scale and execute on multiple nodes in parallel. Along the same lines, the reduce functions also run in parallel, and each instance works on a different output key. All the values are processed independently, again facilitating implicit data parallelism. The extent of parallel execution is determined by the number of map and reduce tasks configured at the time of job submission.
The MapReduce architecture and programming model pioneered by Google is an example of a modern systems architecture designed for processing and analyzing large data sets and is being used successfully by Google in many applications to process massive amounts of raw web data.
The MapReduce system runs on top of the Google File System, within which data are loaded, they are partitioned into chunks, and each chunk is replicated. Data processing is co-located with data storage: When a file needs to be processed, the job scheduler consults a storage metadata service to get the host node for each chunk and then schedules a mapping process on that node, so that data locality is exploited efficiently.
The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a key-value pair associated with the input data to generate a set of intermediate key-value pairs and a reduce function that merges all intermediate values associated with the same intermediate key.
Users define a map and a reduce function as follows:
1. The map function processes a (key, value) pair and returns a list of intermediate (key, value) pairs:
map (in_key, in_value) → list(out_key, intermediate_value).
2. The reduce function merges all intermediate values having the same intermediate key:
reduce (out_key, list(intermediate_value)) → list(out_value).
The former processes an input key-value pair, producing a set of intermediate pairs. The latter is in charge of combining all the intermediate values related to a particular key, outputting a set of merged output values (usually just one). MapReduce is often explained by illustrating a possible solution to the problem of counting the number of occurrences of each word in a large collection of documents. The following pseudocode shows the functions that need to be implemented.
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
The map function emits in output each word together with an associated count of occurrences (in this simple example, just one). The reduce function provides the required result by summing all the counts emitted for a specific word. MapReduce implementations (e.g., Google App Engine and Hadoop) then automatically parallelize and execute the program on a large cluster of commodity machines.
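To make the pseudocode concrete, here is a single-process Python sketch of the same word count; the grouping step stands in for the shuffle that a real MapReduce runtime performs across machines, and the sample documents are invented:

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Emit (word, 1) for every word in the document.
    return [(w, 1) for w in doc_contents.split()]

def reduce_fn(word, counts):
    # Sum all the counts emitted for this word.
    return (word, sum(counts))

documents = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}

# Map phase: run the map function over every document.
intermediate = []
for name, contents in documents.items():
    intermediate.extend(map_fn(name, contents))

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: merge the values for each key.
result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result)  # e.g. "the" occurs 3 times across both documents
```

In a real deployment the map calls, the grouping, and the reduce calls would each be spread over many machines; the data flow, however, is exactly this.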
The runtime system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing required inter-machine communication.
The programming model for MapReduce architecture is a simple abstraction where the computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs. In the Map phase, the input data is partitioned into input splits and assigned to Map tasks associated with processing nodes in the cluster.
The Map task typically executes on the same node that contains its assigned partition of data in the cluster. These Map tasks perform user-specified computations on each input key-value pair from the partition of input data assigned to the task and generate a set of intermediate results for each key.
The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the reduce tasks, and distributes this data as needed to nodes where the Reduce tasks will execute. All Map tasks must complete prior to the shuffle and sort and reduce phases. The number of Reduce tasks does not need to be the same as the number of
Map tasks. The Reduce tasks perform additional user-specified operations on the intermediate data, possibly merging values associated with a key to a smaller set of values to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.
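The division of intermediate data into regions for the Reduce tasks is typically done with a hash partitioner; the sketch below uses CRC32 as a stand-in for the implementation-specific default hash, and the keys are invented:

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash partitioner: every occurrence of a key is sent
    # to the same reduce task, so all of its values meet in one place.
    return zlib.crc32(key.encode()) % num_reducers

NUM_REDUCERS = 3
intermediate = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]

# Divide the intermediate data into one region per reduce task.
regions = {r: [] for r in range(NUM_REDUCERS)}
for key, value in intermediate:
    regions[partition(key, NUM_REDUCERS)].append((key, value))
```

Because the partition function depends only on the key, both `("apple", 1)` pairs land in the same region regardless of which Map task emitted them.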
The MapReduce programs can be used to compute derived data from documents, such as inverted indexes, and the processing is automatically parallelized by the system that executes on large clusters of commodity type machines, highly scalable to thousands of machines.
Since the system automatically takes care of details such as partitioning the input data, scheduling and executing tasks across a processing cluster and managing the communications between nodes, programmers with no experience in parallel programming can easily use a large distributed processing environment.
Google File System
Google File System (GFS) is the storage infrastructure that supports the execution of distributed applications in Google’s computing cloud. GFS was designed to be a high-performance, scalable distributed file system for very large data files and data-intensive applications, providing fault tolerance and running on clusters of commodity hardware. GFS is oriented to very large files, dividing and storing them in fixed-size chunks of 64 MB by default that are managed by nodes in the cluster called chunk servers.
Each GFS cluster consists of a single master node that acts as a name server and multiple nodes that act as chunk servers, each a commodity Linux-based machine running a user-level server process. Chunks are stored in plain Linux files that are extended only as needed and replicated on multiple nodes to provide high availability and improve performance.
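A client-side chunk lookup can be sketched as follows; the chunk table is a hypothetical fragment of master metadata, and only the 64 MB default chunk size comes from the text:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS default

# Hypothetical master-side metadata: file name -> ordered chunk handles.
chunk_table = {"/logs/web.log": ["c0", "c1", "c2", "c3"]}

def locate(file_name, byte_offset):
    # The client computes the chunk index from the fixed chunk size and
    # asks the master only for that chunk's handle; the data itself is
    # then read directly from a chunk server, not through the master.
    index = byte_offset // CHUNK_SIZE
    return chunk_table[file_name][index]

print(locate("/logs/web.log", 130 * 1024 * 1024))  # offset falls in chunk c2
```

Keeping the master out of the data path in this way is what lets a single master coordinate a very large cluster without becoming a bandwidth bottleneck.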
GFS has been designed with the following assumptions:
The system is built on top of commodity hardware that often fails.
The system stores a modest number of large files; multi-GB files are common and should be treated efficiently, and small files must be supported, but there is no need to optimize for that.
The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
The workloads have also many large sequential writes that append data to files.
High sustained bandwidth is more important than low latency.
The architecture of the file system is organized into a single master, which contains the metadata of the entire file system, and a collection of chunk servers, which provide storage space. From a logical point of view, the system is composed of a collection of software daemons that implement either the master server or the chunk server.
A file is a collection of chunks for which the size can be configured at the file system level. Chunks are replicated on multiple nodes to tolerate failures. Clients look up the master server and identify the specific chunk of a file they want to access. Once the chunk is identified, the interaction happens between the client and the chunk server.
Applications interact through the file system with a specific interface supporting the usual operations for file creation, deletion, read, and write. The interface also supports snapshots and records append operations that are frequently performed by applications.
GFS is conceived considering the fact that failures in a large distributed infrastructure are common rather than a rarity; therefore, specific attention is given to the implementation of a highly available, lightweight, and fault-tolerant infrastructure. The potential single point of failure of the single-master architecture has been addressed by giving the possibility of replicating the master node on any other node belonging to the infrastructure.
Big-table
Big-table is the distributed storage system designed to scale up to petabytes of data across thousands of servers. Big-table provides storage support for several Google applications that expose different types of workload: from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
Big-table’s key design goals are wide applicability, scalability, high performance, and high availability. To achieve these goals, Big-table organizes the data stored in tables, the rows of which are distributed over the distributed file system supporting the middleware, which is the Google File System.
From a logical perspective, a table can be defined as a multidimensional sorted map indexed by a key that is represented by a string of arbitrary length. A table is organized into rows and columns; columns can be grouped in a column family, which allows specific optimization for better access control, the storage and the indexing of data.
A simple data access model constitutes the interface for client applications, which can address data at the granularity level of a single column of a row. Moreover, each column value is stored in multiple versions that can be automatically time-stamped by Big-table or by client applications.
Data is stored in Big-table as a sparse, distributed, persistent multidimensional sorted map, which is indexed by a row key, column key, and a timestamp. Rows in a Big-table are maintained in order by the row key, and row ranges become the unit of distribution and load balancing called a tablet. Each cell of data in a Big-table can contain multiple instances indexed by the timestamp. Big-table uses GFS to store both data and log files.
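The sorted, multidimensional map can be modeled as an in-memory dictionary keyed by (row key, column key, timestamp); the row and column names below are illustrative only:

```python
# Toy model of Big-table's data model: a sparse map indexed by
# (row key, column key, timestamp).
table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # Each cell may hold multiple timestamped versions; return the newest.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.example/index", "contents:html", "<html>v1</html>", 1)
put("com.example/index", "contents:html", "<html>v2</html>", 2)
print(get_latest("com.example/index", "contents:html"))  # newest version
```

The real system also keeps rows sorted by key so that contiguous row ranges (tablets) can be split and moved between servers; that distribution machinery is omitted here.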
The API for Big-table is flexible, providing data management functions, such as creating and deleting tables, and data manipulation functions by the row key, including operations to read, write, and modify data.
Index information for Big-tables utilizes tablet information stored in structures similar to a B+Tree. MapReduce applications can be used with Big-table to process and transform data. Google has implemented many large-scale applications (e.g., Google Earth) that utilize Big-table for storage.
In 2006, after struggling with the same “big data” challenges related to indexing massive amounts of information for its search engine, and after watching the progress of the Nutch project, Yahoo! hired Doug Cutting and decided to adopt Hadoop as its distributed framework for solving its search engine challenges.
Yahoo! spun out the storage and processing parts of Nutch to form Hadoop as an open source Apache project, and the Nutch web crawler remained its own separate project. Shortly thereafter, Yahoo! began rolling out Hadoop as a means to power analytics for various production applications. The platform was so effective that Yahoo! merged its search and advertising into one unit to better leverage Hadoop technology.
In the past 10 years, Hadoop has evolved from its search engine-related origins to one of the most popular general-purpose computing platforms for solving big data challenges. It is rapidly becoming the foundation for the next generation of data-based applications. It is predicted that Hadoop will be driving a big data market that should hit more than $23 billion by 2016.
Since the launch of the first Hadoop-centered company, Cloudera, in 2008, dozens of Hadoop-based start-ups have attracted hundreds of millions of dollars in venture capital investment. Simply put, organizations have found that Hadoop offers a proven approach to big data analytics.
Apache Hadoop has revolutionized data management and processing. Hadoop’s technical capabilities have made it possible for organizations across a range of industries to solve problems that were previously impractical. These capabilities include the following:
1. Scalable processing of massive amounts of data on commodity hardware
2. Flexibility for data processing, regardless of the format and structure (or lack of structure) of the data
Enterprises are using Hadoop for resolving business problems:
Content optimization and engagement—Companies are focusing on optimizing content for rendering on different devices supporting different content formats. Many media companies require that a large amount of content be processed in different formats. Also, content engagement models must be mapped for feedback and enhancements.
Enhancing fraud detection for banks and credit card companies—Companies are utilizing Hadoop to detect transaction fraud. By providing analytics on large clusters of commodity hardware, banks are using Hadoop, applying analytic models to a full set of transactions for their clients, and providing near-real-time fraud-in-progress detection.
Shopping pattern analysis for retail product placement—Businesses in the retail industry are using Hadoop to determine products most appropriate to sell in a particular store based on the store’s location and the shopping patterns of the population around it.
Network analytics and mediation—Real-time analytics on a large amount of data generated in the form of usage transaction data, network performance data, cell-site information, device-level data, and other forms of back-office data is allowing companies to reduce operational expenses and enhance the user experience on networks.
Traffic pattern recognition for urban development—Urban development often relies on traffic patterns to determine requirements for road network expansion. By monitoring traffic during different times of the day and discovering patterns, urban developers can determine traffic bottlenecks, which allows them to decide whether additional streets or street lanes are required to avoid traffic congestion during peak hours.
Social media marketing analysis—Companies are currently using Hadoop for brand management, marketing campaigns, and brand protection. By monitoring, collecting, and aggregating data from various Internet sources, such as blogs, boards, news feeds, tweets, and social media, companies are using Hadoop to extract and aggregate information about their products, services, and competitors, discovering patterns and revealing upcoming trends important for understanding their business.
Components of Hadoop Ecosystem
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes the following components:
Hadoop Common: The common utilities that support the other Hadoop subprojects
Avro: A data serialization system that provides dynamic integration with scripting languages
Cassandra: A scalable multi-master database with no single point of failure
Chukwa: A data collection system for managing large distributed systems
HBase: A scalable, distributed database that supports structured data storage for large tables
HDFS: A distributed file system that provides high-throughput access to application data
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
MapReduce: A software framework for distributed processing of large data sets on compute clusters
Mahout: A scalable machine learning and data mining library
Pig: A high-level data-flow language and execution framework for parallel computation
ZooKeeper: A high-performance coordination service for distributed applications
Like the Google implementation, the Hadoop MapReduce architecture creates fixed-size input splits from the input data and assigns the splits to Map tasks. The local output from the Map tasks is copied to Reduce nodes, where it is sorted and merged for processing by Reduce tasks that produce the final output.
The Hadoop MapReduce architecture is functionally similar to the Google implementation except that the base programming language for Hadoop is Java instead of C++. The implementation is intended to execute on clusters of commodity processors utilizing Linux as the operating system environment, but can also be run on a single system as a learning environment.
Hadoop clusters also utilize the “shared nothing” distributed processing paradigm linking individual systems with the local processor, memory, and disk resources using high-speed communications switching capabilities typically in rack-mounted configurations.
The flexibility of Hadoop configurations allows small clusters to be created for testing and development using desktop systems or any system running Unix/Linux providing a JVM environment; however, production clusters typically use homogeneous rack-mounted processors in a data center environment.
Principles and Patterns Underlying the Hadoop Ecosystem
Hadoop is different from previous distributed approaches in the following ways:
Data is distributed in advance.
Data is replicated throughout a cluster of computers for reliability and availability.
Data processing tries to occur where the data is stored, thus eliminating bandwidth bottlenecks.
In addition, Hadoop provides a simple programming approach that abstracts the complexity evident in previously distributed implementations. As a result, Hadoop provides a powerful mechanism for data analytics, which consists of the following:
1. The vast amount of storage—Hadoop enables applications to work with thousands of computers and petabytes of data. Over the past decade, computer professionals have realized that low-cost “commodity” systems can be used together for high-performance computing applications that once could be handled only by supercomputers.
Hundreds of “small” computers may be configured in a cluster to obtain aggregate computing power that can exceed by far that of a single supercomputer at a cheaper price. Hadoop can leverage clusters in excess of thousands of machines, providing huge storage and processing power at a price that an enterprise can afford.
2. Distributed processing with fast data access—Hadoop clusters provide the capability to efficiently store vast amounts of data while providing fast data access. Prior to Hadoop, parallel computation applications experienced difficulty distributing execution between machines that were available on the cluster.
This was because the cluster execution model creates demand for shared data storage with very high I/O performance. Hadoop moves execution toward the data. Moving the applications to the data alleviates many of the high-performance challenges. In addition, Hadoop applications are typically organized in a way that they process data sequentially. This avoids random data access (disk seek operations), further decreasing I/O load.
3. Reliability, failover, and scalability—In the past, implementers of parallel applications struggled to deal with the issue of reliability when it came to moving to a cluster of machines. Although the reliability of an individual machine is fairly high, the probability of failure grows as the size of the cluster grows. It is not uncommon to have daily failures in a large (thousands of machines) cluster.
Because of the way that Hadoop was designed and implemented, a failure (or set of failures) will not create inconsistent results. Hadoop detects failures and retries execution (by utilizing different nodes). Moreover, the scalability support built into Hadoop’s implementation allows for seamlessly bringing additional (repaired) servers into a cluster, and leveraging them for both data storage and execution.
Storage and Processing Strategies
A computation task has four characteristic dimensions as follows:
Computation—transforming information to produce new information
Database Access—access to reference information needed by the computation
Database Storage—long-term storage of information (needed for later access)
Networking—delivering questions and answers
The ratios among these quantities and their relative costs are pivotal: it is fine to send a gigabyte over the network if it saves years of computation, but it is not economical to send a kilobyte question if the answer could be computed locally in a second. Consequently, on-demand computing is only economical for very CPU-intensive applications (100,000 instructions per byte, or a CPU-day per gigabyte of network traffic); for most other applications, preprovisioned computing is likely to be more economical.
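The figures quoted can be checked with back-of-the-envelope arithmetic; the instruction rate below is an illustrative assumption for a commodity CPU, not a quoted figure:

```python
# Check that 100,000 instructions per byte works out to roughly
# "a CPU-day per gigabyte", assuming ~10^9 instructions per second.
INSTRUCTIONS_PER_BYTE = 100_000
BYTES_PER_GB = 10**9
INSTRUCTIONS_PER_SECOND = 10**9  # illustrative commodity-CPU rate

instructions_per_gb = INSTRUCTIONS_PER_BYTE * BYTES_PER_GB
cpu_seconds = instructions_per_gb / INSTRUCTIONS_PER_SECOND
cpu_days = cpu_seconds / 86_400  # seconds in a day

print(f"{cpu_days:.2f} CPU-days per GB")  # about one CPU-day
```

The two thresholds in the text are therefore the same rule of thumb stated in different units.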
Within the Hadoop ecosystem, the primary components responsible for storage and processing are HDFS and MapReduce, respectively. A typical Hadoop workflow consists of loading data into HDFS and processing it using the MapReduce API or one of several tools that rely on MapReduce as an execution framework; both components are implementations of the concepts behind Google’s own GFS and MapReduce technologies. Both HDFS and MapReduce exhibit common architectural principles:
1. Both are designed to run on clusters of commodity servers (i.e., low to medium specification servers)
2. Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities
3. Both scale their capacity by adding more servers (scale-out) as opposed to the previous models of using larger hardware (scale-up)
4. Both have mechanisms to detect and work around failures
5. Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Characteristics of HDFS file system are as follows:
HDFS stores files in blocks that are typically at least 64 or 128 MB in size, much larger than the 4–32 KB seen in most conventional file systems.
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones.
HDFS is optimized for workloads that are generally write-once and read many.
HDFS uses replication for handling disk failures rather than physical redundancy in disk arrays or similar strategies. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the making of another copy within the cluster.
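The NameNode's replication check can be sketched as follows; the block report and node names are hypothetical:

```python
REPLICATION_FACTOR = 3

# Hypothetical block report: block id -> nodes currently holding a replica.
block_locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2"],            # under-replicated after node failures
    "blk_3": ["node1", "node3"],
}

def under_replicated(locations, target=REPLICATION_FACTOR):
    # Compare each block's live replica count against the desired factor
    # and report the shortfall, so new copies can be scheduled.
    return {blk: target - len(nodes)
            for blk, nodes in locations.items()
            if len(nodes) < target}

print(under_replicated(block_locations))  # blocks needing extra copies
```

In the real system this check is driven by periodic heartbeats and block reports from the DataNodes rather than a static table.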
Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementations of these are often referred to as mappers and reducers. A typical MapReduce application comprises a number of mappers and reducers, and it is not unusual for several of them to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result data set. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function. MapReduce works best on semistructured or unstructured data.
Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule data on the hosts where data resides as much as possible, thus minimizing network traffic and maximizing performance.
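The locality optimization amounts to preferring a free host that already holds a replica of a task's input block; a sketch, with hypothetical block locations and host names:

```python
# Hypothetical cluster state: which hosts hold each block, and which
# hosts currently have a free task slot.
block_hosts = {"blk_1": ["nodeA", "nodeB"], "blk_2": ["nodeC"]}
free_hosts = ["nodeB", "nodeC", "nodeD"]

def schedule(block):
    # Prefer a free host that already stores a replica of the block
    # (a "local" read); otherwise fall back to any free host and pay
    # the network transfer cost (a "remote" read).
    for host in block_hosts.get(block, []):
        if host in free_hosts:
            return host, "local"
    return free_hosts[0], "remote"

print(schedule("blk_1"))  # ('nodeB', 'local')
print(schedule("blk_9"))  # no replica known -> remote read
```

A production scheduler also weighs rack locality, load, and speculative execution; this sketch shows only the core preference for moving computation to the data.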
Hadoop 2 alias YARN
The introduction of Hadoop 2 has had a very different effect on the two primary components of Hadoop, namely HDFS storage and MapReduce processing.
The HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes.
HDFS deploys a master-slave model; this model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo! However, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively unusable. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services, such as MapReduce, they also become unavailable even if they are still running without problems.
The slave nodes (called DataNodes) hold the actual file system data. In particular, each host running a DataNode will typically have one or more disks onto which files containing the data for each HDFS block are written. The DataNode itself has no understanding of the overall file system; its role is to store, serve, and ensure the integrity of the data for which it is responsible.
The master node (called the NameNode) is responsible for knowing which of the DataNodes holds which block and how these blocks are structured to form the file system. When a client looks at the file system and wishes to retrieve a file, it is via a request to the NameNode that the list of required blocks is retrieved.
The NameNode stores the file system metadata to a persistent file on its local file system. If the NameNode host crashes in a way that this data is not recoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. On this account, in Hadoop 1, the best practice was to have the NameNode synchronously write its file system metadata to both local disks and at least one remote network volume (typically via NFS).
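The Hadoop 1 practice of synchronously writing the metadata image to several volumes can be sketched like this; temporary directories stand in for the local disks and the NFS mount, and the file name is illustrative:

```python
import os
import tempfile

def write_metadata(image_bytes, directories):
    # Write the same file-system image to every configured directory
    # (local disks plus a remote volume in the Hadoop 1 best practice),
    # so that losing one volume does not lose the block mapping.
    paths = []
    for d in directories:
        path = os.path.join(d, "fsimage")
        with open(path, "wb") as f:
            f.write(image_bytes)
            f.flush()
            os.fsync(f.fileno())  # force the write to stable storage
        paths.append(path)
    return paths

dirs = [tempfile.mkdtemp() for _ in range(2)]  # stand-ins for two volumes
paths = write_metadata(b"block-map-v1", dirs)
```

The `fsync` call is the "synchronous" part: the write is not considered complete until every copy is durable.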
In Hadoop 2, NameNode High Availability (HA) is one of the major features that provides both a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical file system metadata on top of this mechanism.
MapReduce in Hadoop 1 not only provided the framework for processing but also managed the allocation of this computation across the cluster; it not only directed data to and between the specific map and reduce tasks but also determined where each task would run, and managed the full job life cycle, monitoring the health of each task and node, rescheduling if any of them failed, and so on.
From the user’s perspective, the actual scale of the data and cluster for computation is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which to interact with the system. MapReduce’s batch processing model was ill-suited to problem domains where faster response times were required.
For instance, Hive, which provides a SQL-like interface onto HDFS data, actually converts the statements into MapReduce jobs that are then typically executed in batch-processing mode.
In Hadoop 2, there is a focus on enabling different processing models on the Hadoop platform. The role of job scheduling and resource management is separated from that of executing the actual application and is implemented by YARN. YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. Both semantically and practically, the MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, but effectively, MapReduce has become a hosted application on the YARN framework.
The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN.
The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and it has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High-Performance Computing (HPC) world.
Since 2008, a number of Hadoop distributions have been offered by different companies, each excelling in one area or another. Selection of an appropriate distribution should be based on various criteria such as the following:
1. Performance: The ability of the cluster to ingest input data and emit output data at a quick rate becomes very important for low-latency analytics. This input-output cost forms an integral part of the data processing workflow. Scaling up hardware is one way to achieve low latency independent of the Hadoop distribution; however, this approach is expensive and quickly reaches saturation. Architecturally, low I/O latency can be achieved in different ways:
a. Some distributions reduce the number of intermediate data-staging layers between the data source or the data sink and the Hadoop cluster.
b. Some distributions provide streaming writes into the Hadoop cluster in an attempt to reduce intermediate staging layers. Operators used for filtering, compressing, and lightweight data processing can be plugged into the streaming layer to preprocess the data before it flows into storage.
c. Some vendors optimize their distributions for particular hardware, increasing job performance per node. The Apache Hadoop distribution is written in Java, a language that runs in its own virtual machine. Though this increases application portability, it comes with overheads such as an extra layer of indirection during execution by means of byte-code interpretation and background garbage collection; it is not as fast as an application compiled directly for the target hardware.
d. Some distributions optimize features such as compression and decompression for certain hardware types.
2. Scaling: Over time, data outgrows the physical capacity of the compute and storage resources provisioned by an organization. Scaling can be done vertically or horizontally: vertical scaling (scaling up) is tightly bound to hardware advancements and is expensive, and lack of elasticity is another downside. Horizontal scaling (scaling out) is the preferred mode of scaling compute and storage.
Scaling costs will depend on the existing architecture and how it complements and complies with the Hadoop distribution that is being evaluated. Scaling out should be limited to the addition of more nodes and disks to the cluster network, with minimal configuration changes. However, distributions might impose different degrees of difficulty, both in terms of effort and cost on scaling a Hadoop cluster. Scaling out might mean heavy administrative and deployment costs, rewriting a lot of the application’s code, or a combination of both.
3. Reliability: A distribution focusing on reliability provides high availability of its components out of the box. Any distributed system is subject to partial failures. Failures can stem from hardware, software, or network issues, and the mean time between failures is smaller when running on commodity hardware. Dealing with these failures without disrupting services or compromising data integrity is the primary goal of any highly available and consistent system.
Distributions that reduce manual tasks for cluster administrators are more reliable; human intervention is directly correlated with higher error rates. Failovers are critical periods for systems, as they operate with lower degrees of redundancy; any error during these periods can be disastrous for the application. The lower the recovery time from failure, the better the availability of the system; hence, automated failover handling usually means the system can recover and resume in a short period of time.
Eliminating single points of failure (SPOFs) ensures availability; the usual means of doing so is to increase the redundancy of components. Apache Hadoop 1 had a single NameNode; consequently, any failure of the NameNode’s hardware meant the entire cluster became unusable. Hadoop 2 introduced hot-standby NameNodes that can take over in the event of NameNode failure.
The integrity of data also needs to be maintained during normal operations and when failures are encountered. It can be ensured by measures such as data checksums for error detection and possible recovery, data replication, data mirroring, and snapshots. Replication, coupled with rack-aware smart placement of data, ensures data availability and failure handling. Mirroring helps recovery from site failures through asynchronous replication across sites. Snapshotting aids disaster recovery and also facilitates offline access to data without disrupting production.
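The checksum mechanism mentioned above can be illustrated with a small sketch. The helper names here are hypothetical; HDFS computes CRC32 checksums over fixed-size chunks of each block, and the same principle applies:

```python
import zlib

CHUNK = 512  # bytes per checksum; HDFS uses a similar fixed chunk size

def checksums(data: bytes) -> list:
    """Compute a CRC32 checksum for each fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, expected: list) -> bool:
    """Re-compute checksums on read and compare; a mismatch signals corruption."""
    return checksums(data) == expected

original = b"some block contents" * 100
stored_sums = checksums(original)     # persisted alongside the data

assert verify(original, stored_sums)          # intact data passes
corrupted = b"X" + original[1:]               # flip the first byte
assert not verify(corrupted, stored_sums)     # corruption is detected
```

When a checksum mismatch is detected on one replica, the system can discard that copy and re-replicate from a healthy one, which is how checksums and replication work together.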
4. Manageability: The capabilities of Hadoop management tools are key differentiators when choosing an appropriate distribution for an enterprise. Management tools need to provide centralized cluster administration, resource management, configuration management, and user management. Job scheduling, automatic software upgrades, user quotas, and centralized troubleshooting are other desirable features. Deploying and managing the Apache Hadoop open source distribution requires an internal understanding of the source code and configuration.
There are a number of distributions of Hadoop available to prospective users.
Different computational frameworks and resource management solutions are available in Hadoop. The foundation of Hadoop is the two core frameworks YARN and HDFS. These two frameworks deal with storage and processing:
YARN, which stands for Yet Another Resource Negotiator, is the foundation for distributed processing in Hadoop. YARN can be considered a distributed data operating system because it is responsible for controlling the allocation of computing resources across a Hadoop platform. YARN greatly increases the scale-out capability of Hadoop by allowing applications with different runtime characteristics to run with a single resource management model for the distributed cluster.
Without a single resource management model, it was not possible to control how applications with different runtime characteristics used hardware resources, which often required separate clusters for Hadoop, NoSQL, Spark, and so on. Users did not want large batch processes to take away resources from other processes that need to return data in near real time. YARN allows different frameworks, such as Hadoop MapReduce, NoSQL, and Spark, to run under a single resource management model.
Hadoop Distributed File System (HDFS) is a distributed file system that spreads the blocks of a file across multiple disks on separate servers in the Hadoop cluster. Data is usually distributed in 64 MB or 128 MB blocks (the size is configurable), and each block has three replicas by default for high availability.
Spreading the blocks across the disks enables HDFS to leverage the combined throughput and input/output operations per second (IOPS) that all the disks can generate. HDFS can work with different types of storage, but local disks are recommended for performance; just a bunch of disks (JBOD) minimizes the overhead that can occur with RAID, SAN, and NFS systems.
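As a back-of-the-envelope illustration of the block and replica arithmetic described above (assuming the common 128 MB block size and the default replication factor of 3):

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MB, a common HDFS block size
REPLICATION = 3              # HDFS default replication factor

def hdfs_footprint(file_bytes: int):
    """Return (number of blocks, total bytes stored including replicas)."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    return blocks, file_bytes * REPLICATION

# A 1 GB file splits into 8 blocks of 128 MB, stored three times over.
blocks, stored = hdfs_footprint(1024**3)
assert blocks == 8
assert stored == 3 * 1024**3
```

The threefold storage cost is the price paid for availability: any single disk or server can fail without data loss.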
Cloudera Distribution of Hadoop (CDH)
Cloudera was formed in March 2009 with the primary objective of providing Apache Hadoop software, support, services, and training for enterprise-class deployment of Hadoop and its ecosystem components. The software suite is branded as the Cloudera Distribution of Hadoop (CDH). The company is one of the Apache Software Foundation sponsors and pushes most enhancements it makes during support and servicing of Hadoop deployments upstream, back into Apache Hadoop.
CDH is currently in its fifth major version and is considered a mature Hadoop distribution. The paid version of CDH comes with proprietary management software, Cloudera Manager.
MapR
MapR was also founded in 2009, with a mission to bring enterprise-grade Hadoop to market. The distribution contains significant proprietary code when compared to Apache Hadoop, and there are a handful of components for which MapR guarantees compatibility with existing Apache Hadoop projects. The key proprietary component is the replacement of HDFS with a POSIX-compatible file system accessible over NFS. Another key feature is the capability of taking snapshots.
MapR comes with its own management console. The different grades of the product are named M3, M5, and M7: M3 is a free version without high availability, M5 is the standard commercial distribution, and M7 is a paid version with a rewritten HBase API.
Hortonworks Data Platform (HDP)
The Yahoo! Hadoop team spun off to form Hortonworks in June 2011, a company with objectives similar to Cloudera’s. Their distribution is branded as the Hortonworks Data Platform (HDP). The HDP suite’s Hadoop and other software are completely free, with paid support and training. Hortonworks also pushes enhancements upstream, back to Apache Hadoop.
HDP is in its second major version currently and is considered the rising star in Hadoop distributions. It comes with a free and open source management software called Ambari.
Pivotal HD
Greenplum is a marquee parallel data store from EMC. EMC integrated Greenplum with Hadoop, giving way to an advanced Hadoop distribution called Pivotal HD. This move alleviated the need to import and export data between stores such as Greenplum and HDFS, bringing down both costs and latency.
The HAWQ technology provided by Pivotal HD allows efficient and low-latency query execution on data stored in HDFS; it has been reported to give up to a 100-fold improvement on certain MapReduce workloads when compared to Apache Hadoop. HAWQ also provides SQL processing in Hadoop, increasing its popularity among users who are familiar with SQL.
Storage and Processing Strategies
Characteristics of Big Data Storage Methods
1. Data is distributed across several nodes: Because big data refers to much larger data sets, the data is distributed across several nodes, each typically a commodity machine with a storage capacity of 2–4 TB. Even with smaller data sets this distribution is beneficial because
Each data block is replicated across more than one node (the default Hadoop replication factor is 3) rendering the overall system resilient to failure.
Because of the parallel processing model, several nodes participate in the data processing, resulting in a 5–10 times improvement in performance.
2. Applications (along with dependent libraries) are moved to the processing nodes loaded with data: Unlike the conventional 3-tier architecture where the data is transported to the centralized application tier for processing, big data cannot handle this network overhead. Moving terabytes of data to the application tier will saturate the networks and introduce considerable inefficiencies, possibly leading to system failure.
3. Data is processed local to the nodes: Since data is distributed across several nodes and applications are moved to these nodes, data is processed locally, or as close to the data as possible. Because the result of each node-local task is small compared with the volume of raw data processed across all of these tasks, the consequent network overhead of assimilating the results is manageable.
4. Sequential reads preferred over random reads: Unlike conventional relational database management systems (RDBMSs), which are predominantly random-access oriented, big data systems prefer sequential access, to the benefit of the overall system’s throughput.
Characteristics of Big Data Processing Methods
1. Massively parallel processing (MPP) database systems: MPP systems employ some application-specific, predefined form of splitting or partitioning of data across the nodes, based on the values contained in a column or a set of columns, for instance, sales for different countries. Such an approach is beneficial only when the manner in which the data is distributed is consistent with the manner in which it is accessed for processing; it is therefore not well suited to ad hoc queries.
To address this limitation, it is common for such systems to store the data multiple times, split by different criteria. Depending on the query, the appropriate data set is accessed.
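The partitioning idea can be sketched in a few lines (toy data and hypothetical helper names; a real MPP system performs this assignment during data loading):

```python
from collections import defaultdict

# Toy sales records: (country, amount)
sales = [("US", 120), ("IN", 80), ("US", 40), ("DE", 95), ("IN", 60)]

def partition_by_column(records, num_nodes, key=lambda r: r[0]):
    """Assign each record to a node by hashing the partitioning column."""
    nodes = defaultdict(list)
    for rec in records:
        nodes[hash(key(rec)) % num_nodes].append(rec)
    return nodes

nodes = partition_by_column(sales, num_nodes=4)
# All rows for a given country land on the same node, so a query grouped
# by country runs entirely node-locally; a query grouped by a different
# column would have to touch every node, the ad hoc limitation noted above.
assert sum(len(v) for v in nodes.values()) == len(sales)
```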
2. In-memory database systems: An in-memory database is like an MPP database with a SQL interface. From an operational perspective, in-memory database systems are identical to MPP systems in which each node has a significant amount of memory and most of the data is preloaded into this memory; SAP HANA operates on this principle. One of the major disadvantages of commercial implementations of in-memory databases is considerable hardware and software lock-in; consequently, they also tend to be expensive.
3. Bulk-synchronous parallel (BSP) systems: A BSP system is composed of a set of processes (analogous to map processes) that synchronize on a barrier, send data to the master node, and exchange relevant information. Once an iteration is completed, the master node signals each processing node to resume the next iteration. Synchronizing on a barrier is a commonly used concept in parallel programming.
It is used when many threads are responsible for performing their own tasks but need to agree on a checkpoint before proceeding. This pattern is needed when all threads need to have completed a task up to a certain point before the decision is made to proceed or abort with respect to the rest of the computation (in parallel or in sequence).
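The barrier pattern can be demonstrated with Python's `threading.Barrier` (a single-process toy illustration of the synchronization idea, not a distributed BSP implementation):

```python
import threading

NUM_WORKERS = 4
ITERATIONS = 3
barrier = threading.Barrier(NUM_WORKERS)
results = []  # (superstep, worker_id) pairs, in completion order

def worker(worker_id: int):
    for step in range(ITERATIONS):
        # ... this worker's share of the superstep's computation ...
        results.append((step, worker_id))
        # No thread proceeds to the next superstep until all have arrived.
        barrier.wait()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every worker finished superstep 0 before any worker began superstep 1.
assert all(step == 0 for step, _ in results[:NUM_WORKERS])
assert len(results) == NUM_WORKERS * ITERATIONS
```

In a real BSP system such as Apache Giraph, the workers are processes on different machines and the barrier is coordinated by the master node, but the guarantee is the same.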
4. MapReduce systems: A MapReduce system needs the user to define a mapping process and a reduction process. When Hadoop is used to implement MapReduce, the data is typically distributed in 64–128 MB blocks and each block is replicated twice (a replication factor of 3 is the default in Hadoop). In the example of computing sales for the year 2014, ordered by country, the entire sales data would be loaded into HDFS as blocks (64–128 MB in size).
When the MapReduce process is launched, the system first transfers all the application libraries (comprising the user-defined map and reduce processes) to each node. Each node schedules a map task that sweeps the blocks comprising the sales data file. Each Mapper (on the respective node) reads the records of its block and selects the records for the year 2014.
For each such record, the Mapper outputs a key-value pair: the key is the country and the value is the sales number from the given record.
Finally, a configurable number of Reducers will receive the key-value pairs from each of the Mappers. Keys will be assigned to specific Reducers to ensure that a given key is received by one and only one Reducer. Each Reducer will then add up the sales value number for all the key-value pairs received. The data format received by the Reducer is key (country), and a list of values for that key (sales records for the year 2014). The output is written back to the HDFS.
The client will then sort the result by country after reading it from HDFS. This last step can instead be delegated to the Reducer, because each Reducer receives its assigned keys in sorted order; in this example, however, we would need to restrict the number of Reducers to one to achieve this. Because communication between Mappers and Reducers causes network I/O, it can lead to bottlenecks.
A MapReduce job terminates at the end of its processing cycle.
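The worked example above can be sketched in miniature. This is single-process Python standing in for the distributed Mappers, shuffle, and Reducers; the record layout and helper names are illustrative assumptions:

```python
from collections import defaultdict

# Toy sales records: (year, country, amount)
sales = [(2014, "US", 100), (2013, "US", 50), (2014, "IN", 70),
         (2014, "US", 30), (2014, "DE", 20), (2012, "DE", 90)]

def mapper(record):
    """Emit (country, amount) pairs for records from 2014; skip the rest."""
    year, country, amount = record
    if year == 2014:
        yield country, amount

def shuffle(mapped):
    """Group values by key so each key reaches exactly one reducer."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the sales amounts for one country."""
    return key, sum(values)

mapped = [kv for record in sales for kv in mapper(record)]
results = dict(reducer(k, v) for k, v in shuffle(mapped).items())
assert results == {"US": 130, "IN": 70, "DE": 20}
```

In Hadoop, `mapper` and `reducer` would run as distributed tasks on the nodes holding the blocks, and the shuffle step is what generates the network I/O between them.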
NoSQL databases have been classified into four subcategories:
Column family stores: An extension of the key-value architecture with columns and column families; the overall goal is to process distributed data over a pool of infrastructure; for example, HBase and Cassandra.
Key-value pairs: This model is implemented using a hash table in which a unique key points to a particular item of data, creating a key-value pair; for example, Voldemort.
Document databases: This class of databases is modeled after Lotus Notes and is similar to key-value stores (K-V stores). The data is stored as a document represented in JSON or XML formats. The biggest design feature is the flexibility to nest multiple levels of key-value pairs; for example, Riak and CouchDB.
Graph databases: Based on graph theory, this class of databases supports scalability across a cluster of machines and the representation of extremely complex sets of relationships; for example, Neo4j.
Column-Oriented Stores or Databases
Hadoop HBase is the distributed database that supports the storage needs of the Hadoop distributed programming platform. HBase is designed by taking inspiration from Google Bigtable; its main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware. The internal architecture and logic model of HBase is very similar to Google Bigtable, and the entire system is backed by HDFS, which mimics the structure and services of GFS.
Key-Value Stores (K-V Stores) or Databases
Apache Cassandra is a distributed object store for managing large amounts of structured data spread across many commodity servers. The system is designed to avoid a single point of failure and to offer a highly reliable service. Cassandra was initially developed by Facebook; it later became an Apache project.
Facebook in the initial years had used a leading commercial database solution for their internal architecture in conjunction with some Hadoop. Eventually, the tsunami of users led the company to start thinking in terms of unlimited scalability and focus on availability and distribution.
The nature of the data and its producers and consumers did not mandate consistency, but it needed unlimited availability and scalable performance. The team at Facebook built an architecture that combines the data model approach of Bigtable with the infrastructure approach of Dynamo, along with scalability and performance capabilities, and named it Cassandra.
Cassandra is often referred to as a hybrid architecture, since it combines the column-oriented data model from Bigtable with infrastructure patterns from Dynamo such as eventual consistency, gossip protocols, and a master-master way of serving both read and write requests; it also integrates with Hadoop MapReduce jobs. Cassandra supports a full replication model, in keeping with NoSQL architectures.
The Cassandra team had a few design goals to meet, considering that at the time of first development, deployment was primarily being done at Facebook. The goals included
Tunable trade-offs between consistency, durability, and latency
Low cost of ownership
Amazon Dynamo is the distributed key-value store that supports the management of information for several of the business services offered by Amazon Inc. The main goal of Dynamo is to provide an incrementally scalable and highly available storage system. This goal helps in achieving reliability at a massive scale, where thousands of servers and network components build an infrastructure serving 10 million requests per day. Dynamo provides a simplified interface based on get/put semantics, where objects are stored and retrieved with a unique identifier (key).
The main goal of achieving an extremely reliable infrastructure has imposed some constraints on the properties of these systems. For example, ACID properties have been sacrificed in favor of a more available and efficient infrastructure. This creates what is called an eventually consistent model (i.e., in the long term, all users will see the same data).
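The get/put interface is simple enough to sketch in a few lines. This is an in-process stand-in with hypothetical names; Dynamo's real value lies in replicating and versioning these operations across thousands of nodes:

```python
class KeyValueStore:
    """A minimal in-process sketch of the get/put interface described above."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value) -> None:
        """Store an object under its unique identifier."""
        self._data[key] = value

    def get(self, key: str):
        """Retrieve the object for a key, or None if absent."""
        return self._data.get(key)

store = KeyValueStore()
store.put("cart:alice", ["book", "lamp"])
assert store.get("cart:alice") == ["book", "lamp"]
assert store.get("cart:bob") is None   # unknown keys simply return nothing
```

Notice that the interface gives no way to query by value or join across keys; that restriction is exactly what lets the system shard and replicate freely.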
Document-Oriented Stores or Databases
Document-oriented databases, or document databases, can be defined as a schemaless and flexible model of storing data as documents rather than relational structures. A document contains all the data it needs to answer specific query questions. Benefits of this model include
Ability to store dynamic data in unstructured, semi-structured, or structured formats.
Ability to create persisted views from a base document and store the same for analysis.
Ability to store and process large data sets.
The design features of document-oriented databases include
Schema-free—there is no restriction on the structure and format of how the data needs to be stored. This flexibility allows an evolving system to add more data and allows the existing data to be retained in the current structure.
Document store—objects can be serialized and stored in a document, and there is no relational integrity to enforce and follow.
Ease of creation and maintenance—a simple creation of the document allows complex objects to be created once and there is minimal maintenance once the document is created.
No relationship enforcement—documents are independent of each other and there is no foreign key relationship to worry about when executing queries. The effects of concurrency and performance issues related to the same are not a bother here.
Open formats—documents are described using JSON, XML, or some derivative, making the process standard and clean from the start.
Built-in versioning—documents can get large and messy with versions. To avoid conflicts and keep processing efficiencies, versioning is implemented by most solutions available today.
Document databases express data as files in JSON or XML formats. This allows the same document to be parsed in multiple contexts and the results scraped and added to the next iteration of the database data.
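A hypothetical JSON document illustrating the multi-level key-value nesting described above (field names are invented for illustration):

```python
import json

# A self-contained document: everything needed to answer queries about
# this blog post lives in one nested structure, with no joins required.
doc_text = """
{
  "type": "blog_post",
  "title": "Intro to NoSQL",
  "author": {"name": "A. Writer", "handle": "awriter"},
  "tags": ["nosql", "hadoop"],
  "comments": [
    {"user": "bob", "text": "Nice overview"},
    {"user": "eve", "text": "More on graphs please"}
  ]
}
"""

doc = json.loads(doc_text)
assert doc["author"]["name"] == "A. Writer"
assert len(doc["comments"]) == 2

# Schema-free: adding a new field requires no migration of other documents.
doc["revision"] = 2
```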
From an infrastructure point of view, document database systems such as CouchDB and MongoDB support data replication and high availability. CouchDB ensures ACID properties on data; MongoDB supports sharding, the ability to distribute the contents of a collection among different nodes.
Graph Stores or Databases
Social media and the emergence of Facebook, LinkedIn, and Twitter have accelerated the emergence of the most complex NoSQL database, the graph database. The graph database is oriented toward modeling and deploying data that is graph-like by construct. For example, to represent a person and their friends in a social network, we can either write code to convert the social graph into key-value pairs in Dynamo or Cassandra, or simply convert it into a node-edge model in a graph database, where managing the relationship representation is much simpler.
A graph database represents each object as a node and each relationship as an edge: a person is a node, a household is a node, and the relationship between them is an edge. As with the classic ER model for an RDBMS, we need to create an attribute model for a graph database. We can start by taking the highest level in a hierarchy as a root node (similar to an entity) and connect each attribute as its sub-node.
To represent different levels of the hierarchy, we can add a subcategory or sub-reference and create another list of attributes at that level. This creates a natural traversal model, like tree traversal, which is similar to traversing a graph. Depending on the cyclic properties of the graph, we can have a balanced or skewed model. Some of the most evolved graph databases include Neo4j, InfiniteGraph, GraphDB, and AllegroGraph.
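The node-edge model and its traversal can be sketched with a plain adjacency structure. In a real graph database such as Neo4j this would be a declarative query; the names and relationships here are hypothetical:

```python
from collections import deque

# Nodes are people; edges are "friend" relationships (undirected).
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice"},
    "dave":  {"bob"},
}

def friends_within(graph, start, hops):
    """Breadth-first traversal: everyone reachable within `hops` edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

assert friends_within(graph, "alice", 1) == {"bob", "carol"}
assert friends_within(graph, "alice", 2) == {"bob", "carol", "dave"}
```

Answering the same "friends of friends" question in a key-value or relational model requires self-joins or repeated lookups, which is why relationship-heavy data favors the graph model.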
Comparison of NoSQL Databases
1. Column-based databases allow for rapid location and return of data for one particular attribute. They are potentially very slow at writing, however, since data may need to be shuffled around to allow a new data item to be inserted. As a rough guide, then, traditional transaction-oriented workloads will probably fare better in an RDBMS.
Column-based databases will probably thrive in areas where speed of access to non-volatile data is important, for example, in some decision support applications. One only needs to review marketing material from commercial contenders, like Ingres VectorWise, to see that business analytics is seen as the key market and speed of data access the main product differentiator.
2. If you do not need large and complex data structures and can always access your data using a known key, then key-value stores have a performance advantage over most RDBMSs. Oracle has a feature within its RDBMS that allows you to define a table as an index-organized table (IOT), which works in a similar way.
However, you still have the overhead of consistency checking, and these IOTs are often just a small part of a larger schema. RDBMSs have a reputation for poor scaling in distributed systems, and this is where key-value stores can have a distinct advantage.
3. Document-centric databases are good where the data is difficult to structure; web pages and blog entries are two oft-quoted examples. Unlike RDBMSs, which impose structure by their very nature, document-centric databases allow free-form data to be stored. The onus is then on the data retriever to make sense of the data that is stored.
This blog presented an overview of the technologies on which big data computing depends: Hadoop MapReduce and NoSQL. It discussed the functional programming paradigm followed by the Google MapReduce algorithm and its reference implementation, and in its later part it provided a brief overview of the various types of NoSQL databases, including column-oriented stores, key-value stores, document-oriented databases, and graph stores.