Big Data Analytics (Best Tutorial 2019)


What is Big Data Analytics?

The big data phenomenon has rapidly become pervasive across the spectrum of industries and sectors. It typically describes the incredibly large volume of data that is collected, stored, and managed. This large volume of data is analyzed to gain insight and make informed decisions.


To this end, big data analytics is emerging as a subdiscipline of the field of business analytics involving the application of unique architectures and platforms, technologies, unique programming languages, and open-source tools.


The key underlying principle is the utilization of distributed processing to address the large volume and simultaneous complexity and real-time nature of the analytics.


Very large datasets have existed for decades—the key difference is the emergence of the collection and storage of unstructured data primarily from social media, etc.


The data gathered from unconventional and nontraditional sources (blogs, online chats, email, social media, tweets, sensors, pictures, and audio and video multimedia, captured via web forms, mobile devices, scanners, etc.) holds the potential of offering different types of analytics, such as descriptive, predictive, and prescriptive.


From a comparative perspective, big data did exist in the 1960s, 1970s, 1980s, and 1990s, but it was mostly structured data (e.g., numerical/quantitative) in flat files and relational databases.


With the emergence of the Internet and the rapid proliferation of web applications and technologies, there has been an exponential increase in the accumulation of unstructured data as well. This has led to an escalating and pressing opportunity to analyze this data for decision-making purposes.


For example, it is well known that Amazon, the online retailer, utilizes big data analytics to apply predictive and prescriptive analytics to forecast what products a customer is likely to purchase.


All of the visits, searches, personal data, orders, etc., are analyzed using complex analytics algorithms.


Likewise, from a social media perspective, Facebook executes analytics on the data collected via its users’ accounts. Google is another historical example of a company that analyzes a whole breadth and depth of data collected by tracking searches and their results.


Examples can be found not only in Internet-based companies, but also in industries such as banking, insurance, healthcare, and others, and in science and engineering. Recognizing that big data analytics is here to stay, we next discuss the primary characteristics.




Like big data itself, big data analytics is described by three primary characteristics: volume, velocity, and variety. First, there is no doubt data will continue to be created and collected, leading to an incredible volume of data. Second, this data is being accumulated at a rapid pace, and in real time.


This is indicative of velocity. Third, gone are the days of data being collected in standard quantitative formats and stored in spreadsheets or relational databases. Increasingly, the data is in multimedia format and unstructured. This is the variety characteristic.


Considering volume, velocity, and variety, the analytics techniques have also evolved to accommodate these characteristics to scale up to the complex and sophisticated analytics needed.


Some practitioners and researchers have introduced a fourth characteristic: veracity. The implication of this is data assurance: that the data, the analytics, and the outcomes are error-free and credible.


Simultaneously, the architectures and platforms, algorithms, methodologies, and tools have also scaled up in granularity and performance to match the demands of big data.


For example, big data analytics is executed in distributed processing across several servers (nodes) to utilize the paradigm of parallel computing and a divide and process approach.
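The divide and process idea can be sketched in a few lines of Python. This is only an illustration of the paradigm, not Hadoop itself; the worker pool stands in for the cluster of servers/nodes, and the function names are invented for the example.

```python
# Illustrative sketch of "divide and process": partition the data,
# let each worker (a stand-in for a server/node) compute a partial
# result, then integrate the partial results into the final answer.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "node" works only on its own partition of the data.
    return sum(chunk)

def divide_and_process(data, n_workers=4):
    # Partition the data set across the workers.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(process_chunk, chunks)
    # Integrate the partial results for the final result.
    return sum(partials)
```

In a real cluster the partitions would reside on different machines and the integration step would happen over the network, but the shape of the computation is the same.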


It is evident that the analytics tools for structured and unstructured big data are very different from traditional business intelligence (BI) tools.


The architectures and tools for big data analytics have to necessarily be of industrial strength. Likewise, the models and techniques such as data mining and statistical approaches, algorithms, visualization techniques, etc., have to be mindful of the characteristics of big data analytics.


For example, the National Oceanic and Atmospheric Administration (NOAA) uses big data analytics to assist with climate, ecosystem, and environmental research; weather forecasting and pattern analysis; and commercial translational applications.


NASA engages big data analytics for aeronautical and other types of research. Pharmaceutical companies are using big data analytics for drug discovery, analysis of clinical trial data, side effects and reactions, etc.


Banking companies are utilizing big data analytics for investments, loans, customer demographics, etc. Insurance companies, healthcare providers, and media companies are among the other industries applying big data analytics.


The 4Vs are a starting point for the discussion about big data analytics. Other issues include the range of architectures and platforms, the dominance of the open-source paradigm in the availability of tools, the challenge of developing methodologies, and the need for user-friendly interfaces.


While the overall cost of the hardware and software is declining, these issues have to be addressed to harness and maximize the potential of big data analytics. We next delve into the architectures, platforms, and tools.




The conceptual framework for a big data analytics project is similar to that for a traditional business intelligence or analytics project. The key difference lies in how the processing is executed.


In a regular analytics project, the analysis can be performed with a business intelligence tool installed on a stand-alone system such as a desktop or laptop. Since big data is, by definition, large, the processing is broken down and executed across multiple nodes.


While the concepts of distributed processing are not new and have existed for decades, their use in analyzing very large datasets is relatively new as companies start to tap into their data repositories to gain insight to make informed decisions.


Additionally, the availability of open-source platforms such as Hadoop/MapReduce on the cloud has further encouraged the application of big data analytics in various domains.


Third, while the algorithms and models are similar, the user interfaces are entirely different at this time. Classical business analytics tools have become very user-friendly and transparent.


On the other hand, big data analytics tools are extremely complex, programming intensive, and need the application of a variety of skills. The data can be from internal and external sources, often in multiple formats, residing at multiple locations in numerous legacy and other applications.


All this data has to be pooled together for analytics purposes. The data is still in a raw state and needs to be transformed. Here, several options are available.


A service-oriented architectural approach combined with web services (middleware) is one possibility. The data continues to be in the same state, and services are used to call, retrieve, and process the data.


On the other hand, data warehousing is another approach wherein all the data from the different sources are aggregated and made ready for processing. However, the data is unavailable in real time.


Via the steps of extract, transform, and load (ETL), the data from diverse sources is cleansed and made ready. Depending on whether the data is structured or unstructured, several data formats can be input to the Hadoop/MapReduce platform.
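As a toy illustration of the ETL steps just described, the following Python sketch extracts records from a raw CSV string, cleanses and transforms them, and loads them into a stand-in warehouse. The field names and the cleansing rule (dropping incomplete records) are invented for the example.

```python
# Toy extract-transform-load (ETL) pipeline over an in-memory CSV.
import csv
import io

RAW = "name, age\n Alice ,30\nBob,\nCarol,45\n"  # invented raw source data

def extract(text):
    # Extract: parse the raw source into rows.
    return list(csv.reader(io.StringIO(text)))

def transform(rows):
    # Transform: trim whitespace, type-convert, and cleanse bad records.
    header = [h.strip() for h in rows[0]]
    clean = []
    for row in rows[1:]:
        rec = {k: v.strip() for k, v in zip(header, row)}
        if not rec["age"]:      # cleanse: drop incomplete records
            continue
        rec["age"] = int(rec["age"])
        clean.append(rec)
    return clean

def load(records, target):
    # Load: append into the target (a stand-in for HDFS or a warehouse).
    target.extend(records)
    return target

warehouse = load(transform(extract(RAW)), [])
```

Real ETL tooling adds scheduling, incremental loads, and error handling, but the three-step shape is the same.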


In the next stage of the conceptual framework, several decisions are made regarding the data input approach, distributed design, tool selection, and analytic models.


Finally, there are four typical applications of big data analytics: queries, reports, online analytic processing (OLAP), and data mining.


Visualization is an overarching theme across the four applications. A wide variety of techniques and technologies have been developed and adapted to aggregate, manipulate, analyze, and visualize big data.


These techniques and technologies draw from several fields, including statistics, computer science, applied mathematics, and economics.




The most significant platform for big data analytics is Apache Hadoop, an open-source distributed data processing platform initially developed for routine functions such as aggregating web search indexes.


It belongs to the class of NoSQL technologies (others include CouchDB and MongoDB) that have evolved to aggregate data in unique ways.


Hadoop has the potential to process extremely large amounts of data by mainly allocating partitioned data sets to numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.


It can serve in the twin roles of data organizer and analytics tool. Hadoop offers a great deal of potential in enabling enterprises to harness the data that was, until now, difficult to manage and analyze.


Specifically, Hadoop makes it possible to process extremely large volumes of data with varying structures (or no structure at all).


However, Hadoop can be complex to install, configure, and administer; individuals with Hadoop skills are not yet readily available; and many organizations are not ready to embrace Hadoop completely.


It is generally accepted that there are two important modules in Hadoop:


1. The Hadoop Distributed File System (HDFS). This facilitates the underlying storage for the Hadoop cluster. When data for the analytics arrives in the cluster, HDFS breaks it into smaller parts and redistributes the parts among the different servers (nodes) engaged in the cluster.


Only a small chunk of the entire data set resides on each server/node, and it is conceivable each chunk is duplicated on other servers/nodes.


2. MapReduce.

Since the Hadoop platform stores the complete data set in small pieces across a connected set of servers/nodes in a distributed fashion, the analytics tasks can be distributed across the servers/nodes too.


Results from the individual pieces of processing are aggregated or pooled together for an integrated solution. MapReduce provides the interface for the distribution of the subtasks and then the gathering of the outputs. MapReduce is discussed further below.
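The map-distribute-reduce flow described above can be mimicked in a short Python sketch. This is a single-process imitation of the paradigm only (the real Hadoop APIs are Java-based); the function names are invented, and the classic word-count task serves as the subtask.

```python
# Word count in the MapReduce style: map emits (key, value) pairs,
# a shuffle groups the values by key, and reduce pools each group
# into the integrated result.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group the emitted values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group into the final count.
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    pairs = []
    for doc in documents:   # each document could live on a different node
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))
```

Each document's map output could be produced on a different server/node; the shuffle and reduce steps then pool the partial outputs into the integrated solution.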


A major advantage of parallel/distributed processing is graceful degradation or capability to cope with possible failures. Therefore, HDFS and MapReduce are configured to continue to execute in the event of a failure.


HDFS, for example, monitors the servers/nodes and storage devices continually. If a problem is detected, it automatically reroutes and restores data onto an alternative server/node. In other words, it is configured and designed to continue processing in light of a failure.


In addition, replication adds a level of redundancy and backup. Similarly, when tasks are executed, MapReduce tracks the processing of each server/node. If it detects any anomalies such as reduced speed, going into a hiatus, or reaching a dead end, the task is transferred to another server/node that holds the duplicate data.
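The failover behavior can be illustrated with a small sketch: if a task fails on one server/node, it is re-run on another node that holds a duplicate of the same data chunk. The node layout here is invented, and real MapReduce failure detection is far more involved (heartbeats, speculative execution, etc.).

```python
# Toy sketch of failover: try the task on each node holding a replica
# of the data chunk; if a node fails, fall back to the next replica.
def run_with_failover(task, replicas):
    """replicas: list of (node_name, data_chunk) pairs, tried in order."""
    for node, chunk in replicas:
        try:
            return node, task(chunk)
        except Exception:
            continue  # this node failed; transfer the task to a replica
    raise RuntimeError("all replicas failed")
```

Because HDFS duplicates each chunk on several nodes, there is usually a healthy replica for the task to fall back on.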


Overall, the synergy between HDFS and MapReduce in the cloud environment facilitates industrial strength, scalable, reliable, and fault-tolerant support for both the storage and analytics.


It is reported that Yahoo! was an early user of Hadoop. Its key objective was to gain insight from the large amounts of data stored across its numerous and disparate servers. The integration of the data and the application of big data analytics was mission critical. Hadoop appeared to be the perfect platform for such an endeavor.


Presently, Yahoo! is apparently one of the largest users of Hadoop and has deployed it on thousands of servers/nodes. The Yahoo! Hadoop cluster apparently holds huge log files of user click data, advertisements, and lists of all Yahoo! published content.


From a big data analytics perspective, Hadoop is used for a number of tasks, including correlation and cluster analysis to find patterns in the unstructured data sets.


Some of the more notable Hadoop-related application development-oriented initiatives include Apache Avro (for data serialization), Cassandra and HBase (databases), Chukwa (a monitoring system specifically designed with large distributed systems in view), Hive (provides ad hoc Structured Query Language (SQL)-like queries for data aggregation and summarization),


Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data flow language and execution framework for parallel computation), Zookeeper (provides coordination services for distributed applications), and others. The key ones are described below.




MapReduce, as discussed above, is a programming framework developed by Google that supports the underlying Hadoop platform to process the big data sets residing on distributed servers (nodes) in order to produce the aggregated results.


The map step of an algorithm distributes the broken-up tasks (e.g., calculations) to the various locations in the distributed file system, and the reduce step consolidates the individual results computed at the individual nodes of the file system.


In summary, the data mining algorithm performs computations at the server/node level and, simultaneously, across the overall distributed system, aggregating the individual outputs.


It is important to note that the primary Hadoop MapReduce application programming interfaces (APIs) are mainly called from Java. This requires skilled programmers, and advanced skills are needed for development and maintenance.


In order to abstract some of the complexity of the Hadoop programming framework, several application development languages have emerged that run on top of Hadoop. Three popular ones are Pig, Hive, and Jaql. These are briefly described below.


Pig and PigLatin


Pig was originally developed at Yahoo!. The Pig programming language is configured to assimilate all types of data (structured, unstructured, etc.). It comprises two key modules: the language itself, called PigLatin, and the runtime environment in which the PigLatin code is executed.


According to Zikopoulos et al., the initial step in a Pig program is to load the data to be analyzed into HDFS.


This is followed by a series of manipulations wherein the data is converted into a series of mapper and reducer tasks in the background. Last, the program dumps the data to the screen or stores the outputs at another location.


The key advantage of Pig is that it enables the programmers utilizing Hadoop to focus more on the big data analytics and less on developing the mapper and reducer code.




While Pig is robust and relatively easy to use, it still has a learning curve; programmers need time to become proficient.


To address this issue, Facebook has developed a runtime Hadoop support architecture that leverages SQL with the Hadoop platform.


This architecture is called Hive; it permits SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements. However, HQL is limited in the commands it recognizes.


Ultimately, HQL statements are decomposed by the Hive Service into MapReduce tasks and executed across a Hadoop cluster of servers/nodes. Also, since Hive depends on Hadoop and MapReduce executions, queries may have processing lag times of up to several minutes.


This implies Hive may not be suitable for big data analytics applications that need rapid response times, typical of relational databases. Lastly, Hive is a read-based programming artifact; it is therefore not appropriate for transactions that engage in a large volume of write instructions.
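To make the decomposition concrete, the sketch below shows how an HQL-style aggregate such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` maps onto a map step and a reduce step. This is a Python imitation of what the Hive Service generates behind the scenes; the table data and column name are invented for the example.

```python
# A GROUP BY ... COUNT(*) expressed as explicit map and reduce steps.
from collections import defaultdict

ROWS = [{"dept": "sales"}, {"dept": "it"}, {"dept": "sales"}]  # invented table

def map_step(rows, column):
    # Map: emit (group key, 1) for every row.
    return [(row[column], 1) for row in rows]

def reduce_step(pairs):
    # Reduce: sum the emitted counts per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)
```

Hive's value is that the programmer writes only the SQL-like statement; the framework derives steps like these and schedules them across the cluster.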



Jaql’s primary role is that of a query language for JavaScript Object Notation (JSON). However, its capability goes beyond JSON; it facilitates the analysis of both structured and nontraditional data.


Pointedly, Jaql enables the functions of select, join, group, and filter of the data that resides in HDFS. In this regard, it is analogous to a hybrid of Pig and Hive.


Jaql is a functional, declarative query language that is designed to process large data sets. To facilitate parallel processing, Jaql converts high-level queries into low-level queries consisting of MapReduce tasks.



Zookeeper is yet another open-source Apache project that allows a centralized infrastructure with various services; this provides for synchronization across a cluster of servers. Zookeeper maintains common objects required in large cluster situations (like a library).


Examples of these typical objects include configuration information, hierarchical naming space, and others. Big data analytics applications can utilize these services to coordinate parallel processing across big clusters.


This necessitates a centralized management of the entire cluster in the context of such things as name services, group services, synchronization services, configuration management, and others.


Furthermore, several other open-source projects that utilize Hadoop clusters require these types of cross-cluster services.


The availability of these in a Zookeeper infrastructure implies that projects can embed these services without duplicating them or constructing them all over again. A final note: interfacing with Zookeeper happens via Java or C interfaces presently.




HBase is a column-oriented database management system that sits on top of HDFS. In contrast to traditional relational database systems, HBase does not support a structured query language such as SQL. The applications in HBase are developed in Java, much like other MapReduce applications.


In addition, HBase does support application development in Avro, REST, or Thrift. HBase is built on similar master/slave concepts: HDFS has a NameNode (master) and slave nodes, and MapReduce comprises a JobTracker (master) and TaskTracker slave nodes.


A master node manages the cluster in HBase, and regional servers store parts of the table and execute the tasks on the big data.



Cassandra, an Apache project, is also a distributed database system. It is a top-level project designed to handle big data distributed across many commodity servers.


Also, it provides reliable service with no single point of failure. It is also a NoSQL system. Facebook originally developed it to support its inbox search. The Cassandra database system can store 2 million columns in a single row.


Similar to Yahoo!’s needs, Facebook wanted to use the Google BigTable architecture that could provide a column-and-row database structure; this could be distributed across a number of nodes.


But BigTable faced a major limitation—its use of a master node approach made the entire application depend on one node for all read-write coordination—the antithesis of parallel processing.


Cassandra was built on a distributed architecture named Dynamo, designed by Amazon engineers.


Amazon used it to track what its millions of online customers were entering into their shopping carts. Dynamo gave Cassandra an advantage over BigTable because Dynamo is not dependent on any one master node.


Any node can accept data for the whole system, as well as answer queries. Data is replicated on multiple hosts, creating stability and eliminating the single point of failure.
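This masterless design can be sketched as follows: each key is hashed onto a ring of nodes and written to several of them, so any replica can answer a read and no single node is a point of failure. The ring layout, hash choice, and replication factor here are invented for illustration; Cassandra's actual partitioning and consistency machinery is far richer.

```python
# Toy masterless replication: hash each key to a position on a ring of
# nodes and replicate it onto the next REPLICATION nodes.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # invented ring
REPLICATION = 2

def replica_nodes(key, nodes=NODES, n=REPLICATION):
    # Deterministically pick n consecutive nodes on the ring for this key.
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]

def write(store, key, value):
    # Replicate the value onto every chosen node; any node could have
    # accepted the write on the system's behalf.
    for node in replica_nodes(key):
        store.setdefault(node, {})[key] = value

def read(store, key):
    # Any replica holding the key can answer the query.
    for node in replica_nodes(key):
        if key in store.get(node, {}):
            return store[node][key]
    return None
```

Because each key lives on multiple hosts, the loss of one node neither stalls writes nor loses data, which is the stability property the text describes.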



Many tasks may be tethered together to meet the requirements of a complex analytics application in MapReduce. The open-source project Oozie streamlines, to an extent, the workflow and coordination among the tasks.


Its functionality permits programmers to define their own jobs and the relationships between those jobs. It will then automatically schedule the execution of the various jobs once the dependency criteria have been met.
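The scheduling idea can be sketched as a small dependency resolver: each job names its prerequisite jobs, and a job runs only once its prerequisites have completed. The job names are invented, and real Oozie workflows are defined in XML with many more features (forks, joins, time-based triggers).

```python
# Toy workflow scheduler: order jobs so each runs only after the jobs
# it depends on have completed.
def schedule(jobs):
    """jobs: dict mapping job name -> list of prerequisite job names."""
    done, order = set(), []
    while len(done) < len(jobs):
        ready = [j for j, deps in jobs.items()
                 if j not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cyclic dependencies")
        for job in sorted(ready):   # deterministic order for the sketch
            order.append(job)
            done.add(job)
    return order
```

Jobs that become ready in the same pass could, in principle, be launched in parallel on the cluster.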



Lucene is yet another widely used open-source Apache project predominantly used for text analytics/searches; it is incorporated into several open-source projects.


Lucene precedes Hadoop and has been a top-level Apache project since 2005. Its scope includes full-text indexing and library search for use within a Java application.



Avro, also an Apache project, facilitates data serialization services. The data definition schema is also included in the data file. This makes it possible for an analytics application to access the data in the future, since the schema is stored along with it.


Versioning and version control are also added features of use in Avro. Schemas for prior data are available, making schema modifications possible.
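The core Avro idea, shipping the schema together with the data, can be illustrated with a toy serializer. Real Avro uses a compact binary encoding and a rich schema language; JSON stands in for both here, and the function names are invented.

```python
# Toy self-describing serialization: the schema travels with the data,
# so a future reader can interpret the records without outside knowledge.
import json

def serialize(schema, records):
    return json.dumps({"schema": schema, "data": records})

def deserialize(blob):
    obj = json.loads(blob)
    return obj["schema"], obj["data"]
```

Keeping old schemas alongside old files is what makes the schema evolution described above workable.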



Mahout is yet another Apache project whose goal is to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop platform.


Mahout is still an ongoing project, evolving to include additional algorithms. The core widely used algorithms for classification, clustering, and collaborative filtering are implemented using the map/reduce paradigm.



Streams delivers a robust analytics platform for analyzing data in real time. In contrast to BigInsights, Streams applies the analytics techniques to data in motion.


But like BigInsights, Streams is appropriate not only for structured data but also for nearly all other types of data—the nontraditional semistructured or unstructured data coming from sensors, voice, text, video, financial, and many other high-volume sources.


Overall, in summary, there are numerous vendors, including AWS, Cloudera, Hortonworks, and MapR Technologies, among others, who distribute open-source Hadoop platforms. Numerous proprietary options are also available, such as IBM’s BigInsights.


Further, many of these are offered as cloud versions, which makes them more widely available. Cassandra, HBase, and MongoDB, as described above, are widely used for the database component. In the next section, we offer an applied big data analytics methodology to develop and implement a big data project in a company.




While several different methodologies are being developed in this rapidly emerging discipline, here a practical hands-on methodology is outlined. The table shows the main stages of such a methodology. In stage 1, the


Outline of Big Data Analytics Methodology


Stage 1 Concept design

  • Establish the need for a big data analytics project
  • Define problem statement
  • Why is project important and significant?


Stage 2 Proposal

  • Abstract—summarize the proposal
  • Introduction
  • What is the problem being addressed?
  • Why is it important and interesting?
  • Why big data analytics approach?
  • Background material
  • Problem domain discussion
  • Prior projects and research


Stage 3 Methodology

  • Hypothesis development
  • Data sources and collection
  • Variable selection (independent and dependent variables)
  • ETL and data transformation
  • Platform/tool selection
  • Analytic techniques
  • Expected results and conclusions
  • Policy implications
  • Scope and limitations
  • Future research
  • Implementation
  • Develop conceptual architecture
    −− Show and describe the components
    −− Show and describe the big data analytics platform/tools
  • Execute steps in the methodology
  • Import data
  • Perform various big data analytics using various techniques and algorithms (e.g., word count, association, classification, clustering, etc.)
  • Gain insight from outputs
  • Draw conclusion
  • Derive policy implications
  • Make informed decisions


Stage 4 Presentation, walkthrough, and evaluation


interdisciplinary big data analytics team develops a concept design. This is the first cut at briefly establishing the need for such a project since there are trade-offs in terms of cheaper options, risk, problem-solution alignment, etc.


Additionally, a problem statement is followed by a description of project importance and significance. Once the concept design is approved in principle, one proceeds to stage 2, which is the proposal development stage.


Here, more details are filled in. Taking the concept design as input, an abstract highlighting the overall methodology and implementation process is outlined.


This is followed by an introduction to the big data analytics domain: What is the problem being addressed? Why is it important and interesting to the organization?


It is also necessary to make the case for a big data analytics approach. Since the complexity and cost are much higher than those of traditional analytics approaches, it is important to justify its use.


Also, the project team should provide background information on the problem domain and prior projects and research done in this domain.


Both the concept design and the proposal are evaluated in terms of the 4Cs:

  • Completeness: Is the concept design complete?
  • Correctness: Is the design technically sound? Is correct terminology used?
  • Consistency: Is the proposal cohesive, or does it appear choppy? Is there flow and continuity?
  • Communicability: Is the proposal formatted professionally? Does the report communicate the design in easily understood language?


Next, in stage 3, the steps in the methodology are fleshed out and implemented. The problem statement is broken down into a series of hypotheses. Please note these are not rigorous, as in the case of statistical approaches. Rather, they are developed to help guide the big data analytics process.


Simultaneously, the independent and dependent variables are identified. In terms of analytics itself, it does not make a major difference to classify the variables. However, it helps identify causal relationships or correlations. The data is collected (longitudinal data, if necessary), described, and transformed to make it ready for analytics.


A very important step at this point is platform/tool evaluation and selection. For example, several options, as indicated previously, such as AWS Hadoop, Cloudera, IBM BigInsights, etc., are available.


A major criterion is whether the platform is available on a desktop or on the cloud. The next step is to apply the various big data analytics techniques to the data. These are not different from routine analytics; they are simply scaled up to large datasets.
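As an example of such a scaled-up technique, the sketch below implements a bare-bones one-dimensional k-means clustering. In a big data setting the assignment step would be distributed across nodes (e.g., via Mahout on Hadoop), but the algorithm itself is unchanged. All names and data here are illustrative.

```python
# Minimal k-means sketch: repeatedly assign points to the nearest
# centroid, then move each centroid to the mean of its cluster.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            # Assignment step: each point joins its nearest centroid.
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: recompute each centroid (keep it if its cluster
        # is empty).
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)
```

In the distributed variant, the assignment step runs as map tasks over data partitions and the centroid update as a reduce step, which is exactly the scaling-up the text describes.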


Through a series of iterations and what-if analyses, insight is gained from the big data analytics. From this insight, informed decisions can be made and policy shaped. In the final steps, conclusions are offered, scope and limitations are identified, and the policy implications are discussed.


In stage 4, the project and its findings are presented to the stakeholders for action. Additionally, the big data analytics project is validated using the following criteria:


  • Robustness of analyses, queries, reports, and visualization


  • Variety of insight
  • Substantiveness of the research question
  • Demonstration of big data analytics application
  • Some degree of integration among components
  • Sophistication and complexity of analysis


The implementation is a staged approach, with feedback loops built in at each stage to minimize the risk of failure. The users should be involved in the implementation.


It is also an iterative process, especially in the analytics step, wherein the analyst performs what-if analyses. The next section briefly discusses some of the key challenges in big data analytics.




For one, a big data analytics platform must support, at a minimum, the key functions necessary for processing the data.


The criteria for platform evaluation may include availability, continuity, ease of use, scalability, ability to manipulate at different levels of granularity, privacy and security enablement, and quality assurance.


Additionally, while most currently available platforms are open source, the typical advantages and limitations of open-source platforms apply.


They have to be shrink-wrapped, made user-friendly, and transparent for big data analytics to take off. Real-time big data analytics is a key requirement in many industries, such as retail, banking, healthcare, and others.


The lag between when data is collected and when it is processed has to be addressed. The dynamic availability of the numerous analytics algorithms, models, and methods in a pull-down type of menu is also necessary for large-scale adoption.


In-memory processing, such as in SAP’s HANA, can be extended to the Hadoop/MapReduce framework.


 The various options of local processing (e.g., a network, desktop/laptop), cloud computing, software as a service (SaaS), and service-oriented architecture (SOA) web services delivery mechanisms have to be explored further.


The key managerial issues of ownership, governance, and standards have to be addressed as well.


Interleaved into these are the issues of continuous data acquisition and data cleansing. In the future, ontology and other design issues have to be discussed. Furthermore, an appliance-driven approach (e.g., access via mobile computing and wireless devices) has to be investigated.


We next discuss big data analytics in a particular industry, namely, healthcare and the practice of medicine.




The healthcare industry has great potential in the application of big data analytics. From evidence-based to personalized medicine, from outcomes to a reduction in medical errors, the pervasive impact of big data analytics in healthcare can be felt across the spectrum of healthcare delivery.


Two broad categories of applications are envisaged: big data analytics in the business and delivery side (e.g., improved quality at lower costs) and in the practice of medicine (aid in diagnosis and treatment).


The healthcare industry has all the necessary ingredients and qualities for the application of big data analytics—data intensive, critical decision support, outcomes-based, improved delivery of quality health care at reduced costs (in this regard, the transformational role of health information technology such as big data analytics applications is recognized), and so on.


However, one must keep in mind the historical challenges of the lack of user acceptance, lack of interoperability, and the need for compliance regarding privacy and security. Nevertheless, the promise and potential of big data analytics in healthcare cannot be overstated.


In terms of examples of big data applications, it is reported that the Department of Veterans Affairs (VA) in the United States has successfully demonstrated several healthcare information technologies (HIT) and remote patient monitoring programs.


The VA health system generally outperforms the private sector in following recommended processes for patient care, adhering to clinical guidelines, and achieving greater rates of evidence-based drug therapy.


These achievements are largely possible because of the VA’s performance-based accountability framework and disease management practices enabled by electronic medical records (EMRs) and HIT.


Another example is how California-based integrated managed care consortium Kaiser Permanente connected clinical and cost data early on, thus providing the crucial data set that led to the discovery of Vioxx’s adverse drug effects and the subsequent withdrawal of the drug from the market.


Yet another example is the National Institute for Health and Clinical Excellence, part of the UK’s National Health Service (NHS), which has pioneered the use of large clinical data sets to investigate the clinical and cost effectiveness of new drugs and expensive existing treatments.


The agency issues appropriate guidelines on such costs for the NHS and often negotiates prices and market access conditions with pharmaceutical and medical products (PMP) industries.


Further, the Italian Medicines Agency collects and analyzes clinical data on the experience of expensive new drugs as part of a national cost-effectiveness program.


The agency can impose conditional reimbursement status on new drugs and can then reevaluate prices and market access conditions in light of the results of its clinical data studies.




In this section, we describe our ongoing prototype research project in the use of the Hadoop/MapReduce framework on the AWS for the analysis of unstructured cancer blog data.


Health organizations and individuals such as patients are using blog content for several purposes. Health and medical blogs are rich in unstructured data for insight and informed decision making.


While current applications such as web crawlers and blog analysis tools are good at generating statistics about the number of blogs, top 10 sites, etc., they are neither computationally advanced nor scalable enough to help with the analysis and extraction of insight.


First, the blog data are growing exponentially (volume); second, blogs are posted in real time, so any analysis can become outdated very quickly (velocity); and third, there is a wide variety of content in the blogs (variety).


Fourth, the blogs themselves are distributed and scattered all over the Internet. Therefore, blogs in particular and social media in general are great candidates for the application of big data analytics.


To reiterate, there has been an exponential increase in the number of blogs in the healthcare area, as patients find them useful in disease management and developing support groups.


Alternatively, healthcare providers such as physicians have started to use blogs to communicate and discuss medical information.


Examples of useful information include alternative medicine and treatment, health condition management, diagnosis–treatment information, and support group resources.


This rapid proliferation of health- and medical-related blogs has resulted in huge amounts of unstructured yet potentially valuable information being made available for analysis and use.


Statistics indicate that health-related bloggers are very consistent at posting to their blogs.


The analysis and interpretation of health-related blogs are not trivial tasks. Unlike many of the blogs in various corporate domains, health blogs are far more complex and unstructured.


The postings reflect two important facets of the bloggers and visitors, ranging from individual patient care and disease management (fine granularity) to generalized medicine (e.g., public health).


Hadoop/MapReduce defines a framework for implementing systems for the analysis of unstructured data. In contrast to structured information, whose meaning is expressed by the structure or the format of the data, the meaning of unstructured information cannot be so inferred.


Examples of data that carry unstructured information include natural language text and data from audio or video sources. More specifically, an audio stream has a well-defined syntax and semantics for rendering the stream on an audio device, but its music score is not directly represented.


Hadoop/ MapReduce is sufficiently advanced and sophisticated computationally to aid in the analysis and understanding of the content of health-related blogs.


At the individual level (document-level analysis) one can perform analysis and gain insight into the patient in longitudinal studies. At the group level (collection-level analysis) one can gain insight into the patterns of the groups (network behavior, e.g., assessing the influence within the social group).


For example, these analyses can focus on a particular disease group, the community of participants in an HMO or hospital setting, or even the global community of patients (ethnic stratification).


The results of these analyses can be generalized. While the blogs enable the formation of social networks of patients and providers, the uniqueness of the health/medical terminology comingled with the subjective vocabulary of the patient compounds the challenge of interpretation.


At a more general level, while blogs have emerged as a contemporary mode of communication within a social network context, hardly any research or insight exists into the content analysis of blogs. The blog world is characterized by a lack of particular rules on format, posting practice, and the structure of the content itself.


Questions arise: How do we make sense of the aggregate content? How does one interpret and generalize?


In health blogs in particular, what patterns of diagnosis, treatment, management, and support might emerge from a meta-analysis of a large pool of blog postings? How can the content be classified? What natural clusters can be formed around the topics?


What associations and correlations exist between key topics? The overall goal, then, is to enhance the quality of health by reducing errors and assisting in clinical decision making.


Additionally, one can reduce the cost of healthcare delivery through the use of these types of advanced health information technology. Therefore, the objectives of our project include the following:


  • 1. To use Hadoop/MapReduce to perform analytics on a set of cancer blog postings from Yahoo!
  • 2. To develop a parsing algorithm and association, classification, and clustering technique for the analysis of cancer blogs
  • 3. To develop a vocabulary and taxonomy of keywords (based on existing medical nomenclature)
  • 4. To build a prototype interface
  • 5. To contribute to social media analysis in the semantic web by generalizing the models from cancer blogs


The following levels of development are envisaged: at the first level, patterns of symptoms and management (diagnosis/treatment); at the second level, insight into disease management at the individual/group levels;


and at the third level, clinical decision support (e.g., a generalization of patterns, syntactic to semantic) for informed decision making. Typically, the unstructured information in blogs comprises the following:


  • Blog topic (posting): What issue or question does the blogger (and comments) discuss?
  • Disease and treatment (not limited to): What cancer type and treatment (and other issues) are identified and discussed?
  • Other information: What other related topics are discussed? What links are provided?
What Can We Learn from Blog Postings?


Unstructured information related to blog postings (bloggers), including responses/comments, can provide insight into diseases (cancer), treatment (e.g., alternative medicine, therapy), support links, etc.


  • 1. What are the most common issues patients have (bloggers/responses)?
  • 2. What are the cancer types (conditions) most discussed? Why?
  • 3. What therapies and treatments are being discussed? What medical and nonmedical information is provided?
  • 4. Which blogs and bloggers are doing a good job of providing relevant and correct information?
  • 5. What are the major motivations for the postings (comments)? Is it classified by roles, such as provider (physician) or patient?
  • 6. What are the emerging trends in disease (symptoms), treatment, and therapy (e.g., alternative medicine), support systems, and information sources (links, clinical trials)?


What Are the Phases and Milestones?

This project envisions the use of Hadoop/MapReduce on the AWS to facilitate distributed processing and partitioning of the problem-solving process across the nodes for manageability.


Additionally, supporting plug-ins are used to develop an application tool to analyze health-related blogs. The project is scoped to content analysis of the domain of cancer blogs at Yahoo!.


Phase 1 involved the collection of blog postings from Yahoo! into a Derby application. 


Phase 2 consisted of the development and configuration of the architecture—keywords, associations, correlations, clusters, and taxonomy.


Phase 3 entailed the analysis and integration of extracted information in the cancer blogs—preliminary results of initial analysis (e.g., patterns that are identified).


Phase 4 involved the development of taxonomy.


Phase 5 proposes to test the mining model and develop the user interface for deployment. We propose to develop a comprehensive text mining system that integrates several mining techniques, including association and clustering, to effectively organize the blog information and provide decision support in terms of search by keywords.
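At the heart of the proposed analysis is a MapReduce-style keyword count over blog text. The following is a minimal plain-Python sketch of the map and reduce phases; the sample postings and function names are hypothetical, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict

# Hypothetical sample postings; real input would be cancer blog text.
POSTINGS = [
    "chemotherapy side effects and fatigue",
    "radiation therapy after chemotherapy",
    "support group for fatigue management",
]

def map_phase(posting):
    """Mapper: emit a (keyword, 1) pair for every word in a posting."""
    for word in posting.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each keyword."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# The shuffle step is implicit here: all mapper output feeds one reducer.
pairs = [pair for posting in POSTINGS for pair in map_phase(posting)]
counts = reduce_phase(pairs)
```

In a real deployment, keyword counts like these would feed the association, classification, and clustering steps named in the objectives.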



Big data analytics is transforming the way companies are using sophisticated information technologies to gain insight from their data repositories to make informed decisions.


This data-driven approach is unprecedented, as the data collected via the web and social media is escalating by the second. In the future, we’ll see the rapid, widespread implementation and use of big data analytics across the organization and the industry. In the process, the several challenges highlighted above need to be addressed.


As it becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies will garner attention.


Big data analytics and applications are at a nascent stage of development, but the rapid advances in platforms and tools can accelerate their maturing process.


Big Data NoSQL Databases


A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.


NoSQL databases are often highly optimized key-value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput.


NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as “not only SQL” to emphasize that they may, in fact, allow SQL-like query languages to be used.


The term NoSQL is more of an approach or way to address data management versus being a rigid definition.


Different types of NoSQL databases exist, and they often share certain characteristics but are optimized for specific types of data, which then require different capabilities and features. NoSQL may mean not only SQL, or it may mean “No” SQL.


When using Apache Hive to run SQL against NoSQL databases, those queries are converted to MapReduce jobs and run as batch operations to process large volumes of data in parallel.


A “No” SQL database may use APIs or JavaScript to access data versus traditional SQL. APIs can also be used to access the data in NoSQL to process interactive and real-time queries.


When accessing big data, there is a demand for databases that are nonrelational and that are designed to work with semi-structured, unstructured, and structured data.


A nonrelational database provides a flexible data model, layout, and format. Predefined schemas are not required for NoSQL databases.


NoSQL databases have traditionally been nonrelational databases that are distributed and designed to scale to very large sizes. NoSQL databases are designed to address the challenges of big data.


NoSQL databases are very scalable, have high availability, and provide a high level of parallelization for processing large volumes of data quickly. NoSQL databases have different characteristics and features.


For example, they can be key-value based, column-based, document-based, or graph-based. NoSQL data stores may be optimized for key or value, columnar, document-oriented, XML, graph, and object data structures.


Each NoSQL database can emphasize different areas of the CAP theorem (Brewer theorem).


The CAP theorem effectively states that a distributed database can provide consistency, availability, and partition tolerance (CAP), but not all three at the same time.


In other words, the CAP theorem states that a database can excel in only two of the following areas: consistency (all DataNodes observe the same data at the same time), availability (every request for data will get a response of success or failure), and partition tolerance (the data platform will continue to run even if parts of the system are not available).


Choosing a NoSQL database requires choosing what the CAP priorities are. If the data are distributed and replicated, and nodes or networks between the nodes go down, a decision must be made between data consistency (having the current version of the data) and availability.


NoSQL databases can address this with an eventual consistency model (e.g., Cassandra). Eventual consistency means a DataNode is always available to handle data requests, but at a given point in time the data may be inconsistent, though still relatively accurate.


NoSQL databases choose their priority for CAP tolerance. For instance,

  • HBase and Accumulo are consistent and partition tolerant.
  • CouchDB is partition tolerant and available.
  • Neo4j is available and consistent.
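The availability-versus-consistency tradeoff above can be sketched as a toy pair of replicas with last-write-wins reconciliation. All class and function names here are illustrative, not any particular database's API.

```python
class Replica:
    """A toy replica storing key -> (value, timestamp)."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        self.data[key] = (value, ts)

    def read(self, key):
        value, _ = self.data.get(key, (None, None))
        return value

def anti_entropy(a, b):
    """Last-write-wins reconciliation between two replicas."""
    for key in set(a.data) | set(b.data):
        va = a.data.get(key, (None, -1))
        vb = b.data.get(key, (None, -1))
        newest = va if va[1] >= vb[1] else vb
        a.data[key] = b.data[key] = newest

r1, r2 = Replica(), Replica()
r1.write("status", "stable", ts=1)
r2.write("status", "stable", ts=1)
r1.write("status", "improved", ts=2)   # partition: update reaches r1 only
stale = r2.read("status")              # still "stable": available, inconsistent
anti_entropy(r1, r2)                   # partition heals, replicas converge
fresh = r2.read("status")              # now "improved"
```

During the partition both replicas stay available for reads, which is exactly the availability-over-consistency choice the theorem describes.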


Choosing the correct NoSQL database requires an understanding of factors such as the requirements, SLAs (latency), initial size and growth projections, costs, feature or functionality, and concurrency.


It’s also important to understand that NoSQL databases have their own defaults and may have the capability of changing their priorities for CAP. It is also important to understand the roadmap of these NoSQL databases because they are evolving quickly.


NoSQL systems have been divided into four major categories, namely, Column Databases, Key-Value Databases, Document Databases, and Graph Databases.


1. Column Databases: These systems partition a table by column into column families, where each column family is stored in its own files. They also allow versioning of data values.


Column-oriented data stores are used when key-value data stores reach their limits because you want to store a very large number of records with a very large amount of information that goes beyond the simple nature of the key-value store.


The main benefit of using columnar databases is that you can quickly access a large amount of data.


A row in an RDBMS is a continuous disk entry and multiple rows are stored in different disk locations, which makes them more difficult to access; in contrast, in columnar databases, all cells that are part of a column are stored contiguously.


As an example, consider performing a lookup for all blog titles in an RDBMS. For tackling millions of records common on the web, it might be costly in terms of disk entries, whereas in columnar databases, such a search would represent only one access.


Such databases are very handy for retrieving large amounts of data from a specific column family, but the tradeoff is that they lack flexibility. The best-known columnar databases are Google Cloud Bigtable and the open-source Apache HBase and Cassandra.


One of the other benefits of columnar databases is ease of scaling because data are stored in columns; these columns are highly scalable in terms of the amount of information they can store. This is why they are used mainly for keeping nonvolatile, long-living information and in scaling use cases.
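The row-versus-column layout difference behind the "one access for all blog titles" claim can be illustrated with a minimal sketch; the table contents here are made up.

```python
# Row layout: each record is stored together, as in an RDBMS.
rows = [
    {"id": 1, "title": "Living with Cancer", "author": "pat"},
    {"id": 2, "title": "Treatment Diary",    "author": "sam"},
]

# Columnar layout: the same table, with each column stored contiguously.
columns = {
    "id":     [1, 2],
    "title":  ["Living with Cancer", "Treatment Diary"],
    "author": ["pat", "sam"],
}

# "All blog titles" must touch every row in the row layout ...
titles_from_rows = [row["title"] for row in rows]
# ... but is a single contiguous read in the column layout.
titles_from_columns = columns["title"]
```

On disk, the row version scatters titles across record locations, while the columnar version keeps them adjacent, which is why the scan is cheaper.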


2. Key-Value Databases: These systems have a simple data model based on fast access by the key to the value associated with the key; the value can be a record or an object or a document or even have a more complex data structure.


Key-value databases are the easiest NoSQL databases to understand. These data stores basically act like a dictionary and work by matching a key to a value.


They are often used for high-performance use cases in which basic information needs to be stored—for example, when session information may need to be written and retrieved very quickly.


These data stores really perform well and are efficient for this kind of use case; they are also usually highly scalable.


Key-value data stores can also be used in a queuing context to ensure that data would not be lost, such as in logging architecture or search engine indexing architecture use cases.


Redis and Riak KV are the most famous key-value data stores; Redis is more widely used and has an in-memory K-V store with optional persistence.


Redis is often used in web applications to store session-related data, for example in Node.js or PHP web applications; it can serve thousands of session retrievals per second without degrading performance.
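A minimal sketch of this session-store pattern, assuming a Redis-like setex/get interface implemented here as a plain in-memory Python class rather than the actual redis client:

```python
import time

class SessionStore:
    """Toy in-memory key-value store with per-key expiry, Redis-style."""
    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        """Store a value that expires ttl_seconds from now."""
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._data[key]   # lazy expiry on read
            return None
        return value

store = SessionStore()
store.setex("session:42", 3600, {"user": "pat", "role": "patient"})
session = store.get("session:42")
```

The key name and session fields are hypothetical; the point is the dictionary-like match of a key to a value plus a time-to-live.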


3. Document Databases: These systems store data in the form of documents using well-known formats such as JavaScript Object Notation (JSON). Documents are accessible via their document id, but can also be accessed rapidly using other indexes.


Columnar databases are not the best for structuring data that contain deeper nesting structures—that is where document-oriented data stores come into play.


Data are indeed stored into key-value pairs, but these are all compressed into what is called a document. This document relies on a structure or encoding such as XML, but most of the time, it relies on JavaScript Object Notation (JSON).


Document-oriented databases are used whenever there is a need to nest information. For instance, an account in your application may have the following information:


Basic information: first name, last name, birthday, profile picture, URL, creation date, and so on


Additional information: address, authentication method (password, Facebook, etc.), interests, and so on.

NoSQL document stores are often used in web applications because representing an object with nested objects is fairly easy; moreover, integration with front-end JavaScript technology is seamless because both technologies work with JSON.


Although document databases are more useful structurally and for representing data, they also have their downside: they basically need to retrieve the whole document even when reading a single field, and this can dramatically affect performance. Famous NoSQL document databases are MongoDB, Couchbase, and Apache CouchDB.
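A minimal sketch of such a nested account document, stored and read back as one JSON value; all field names are hypothetical, and the dict lookups stand in for a document database's query API.

```python
import json

# A hypothetical account document with nested sub-documents.
account = {
    "_id": "user-1001",
    "basic": {
        "first_name": "Ana",
        "last_name": "Lopez",
        "birthday": "1980-05-14",
        "created": "2019-01-02",
    },
    "additional": {
        "address": {"city": "Austin", "country": "US"},
        "auth": ["password", "facebook"],
        "interests": ["oncology blogs", "support groups"],
    },
}

# The whole document is serialized and stored as one JSON value ...
stored = json.dumps(account)
# ... so reading one field still means loading the full document.
loaded = json.loads(stored)
city = loaded["additional"]["address"]["city"]
```

Note how fetching just the city required deserializing the entire document, which is the performance downside mentioned above.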


4. Graph Databases: Data are represented as graphs, and related nodes can be found by traversing the edges using path expressions. Graph databases are really different from the other types of databases discussed above.


They use a different paradigm to represent the data—a tree-like structure with nodes and edges that are connected to each other through paths called relations.


Those databases rose to prominence with social networks, for instance, to represent a user's friend network, their friends' relationships, and so on. With the other types of data stores, it is possible to store a friend's relationship to a user in a document, but it can then become really complex to track friends-of-friends relationships;


it is better to use a graph database, create nodes for each friend, connect them through relationships, and browse the graph depending on the need and scope of the query.


The most famous graph database is Neo4j; it is used for use cases that involve complex relationship information, such as connections between entities and other entities related to them, but it is also used in classification use cases.


Characteristics of NoSQL Systems

This subsection discusses the characteristics of NoSQL systems.

NoSQL Characteristics Related to Distributed Systems and Distributed Databases


1. Availability, Replication, and Eventual Consistency: Many applications that use NoSQL systems require continuous system availability. To accomplish this, data are replicated over two or more nodes in a transparent manner, so that if one node fails, the data are still available on other nodes.


Replication improves data availability and can also improve read performance because read requests can often be serviced from any of the replicated data nodes.


However, writes become more cumbersome because an update must be applied to every copy of the replicated data items; this can slow down write performance considerably if serializable consistency is required.


 Many NoSQL applications do not require serializable consistency, so more relaxed forms of consistency known as eventual consistency are used.


2. Replication: Two major replication models are used in NoSQL systems: master-slave and master-master replication:


a. The master-slave replication requires one copy to be the master copy; all write operations must be applied to the master copy and then propagated to the slave copies, usually using eventual consistency, that is, the slave copies will eventually be the same as the master copy. For read, the master-slave paradigm can be configured in various ways.


One configuration requires all reads to also be at the master copy, so this would be similar to the primary site or primary copy methods of distributed concurrency control, with similar advantages and disadvantages.


Another configuration would allow reads at the slave copies but would not guarantee that the values are the latest writes since writes to the slave nodes can be done after they are applied to the master copy.


b. The master-master replication allows reads and writes at any of the replicas but may not guarantee that reads at nodes that store different copies see the same values.


Different users may write the same data item concurrently at different nodes of the system, so the values of the item will be temporarily inconsistent.


A reconciliation method to resolve conflicting write operations of the same data item at different nodes must be implemented as part of the master-master replication scheme.
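The master-slave model can be sketched as follows: writes go to the master copy and propagate asynchronously, so a read from a slave copy may briefly return stale data. All names here are illustrative.

```python
class Node:
    """A toy replica node holding a simple key-value map."""
    def __init__(self):
        self.data = {}

master = Node()
slaves = [Node(), Node()]
pending = []   # writes applied to the master but not yet propagated

def write(key, value):
    """All writes go to the master copy first."""
    master.data[key] = value
    pending.append((key, value))

def propagate():
    """Asynchronously push pending writes to every slave copy."""
    while pending:
        key, value = pending.pop(0)
        for slave in slaves:
            slave.data[key] = value

write("dose", "50mg")
stale = slaves[0].data.get("dose")   # None: not yet propagated
propagate()                          # slaves eventually match the master
fresh = slaves[0].data.get("dose")   # "50mg"
```

Configuring reads to go only to the master would avoid the stale read at the cost of concentrating load, which mirrors the tradeoff described above.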


3. Sharding of Files:


In many NoSQL applications, files (or collections of data objects) can have many millions of records (or documents or objects), and these records can be accessed concurrently by thousands of users. So, it is not practical to store the whole file in one node.


Sharding or horizontal partitioning of the file records is often employed in NoSQL systems. This serves to distribute the load of accessing the file records to multiple nodes.


The combination of sharding the file records and replicating the shards works in tandem to improve load balancing as well as data availability.


4. High-Performance Data Access:

In many NoSQL applications, it is necessary to find individual records or objects (data items) from among the millions of data records or objects in a file.


The majority of accesses to an object will be by providing the key value rather than by using complex query conditions. The object key is similar to the concept of object id. To achieve this, most systems use one of two techniques: hashing or range partitioning on object keys:


a. Hashing: a hash function h(K) is applied to the key K, and the location of the object with key K is determined by the value of h(K).


b. Range partitioning: the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Ki_min ≤ K ≤ Ki_max.


In applications that require range queries, where multiple objects within a range of key values are retrieved, range partitioning is preferred.


Other indexes can also be used to locate objects based on attribute conditions different from the key K.
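Both placement techniques can be sketched in a few lines; the node count and key ranges below are made up for illustration.

```python
import hashlib

NODES = 4

def hash_partition(key):
    """h(K) mod N decides which node stores the object with key K."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

# Range partitioning: node i holds keys whose first letter falls in
# the range Ki_min <= K <= Ki_max.
RANGES = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]

def range_partition(key):
    first = key[0].lower()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    raise ValueError("key outside all ranges")

# A range query over adjacent keys maps onto one contiguous partition ...
node_a = range_partition("alpha")
node_b = range_partition("beta")
# ... whereas hashing scatters adjacent keys across nodes.
hashed = hash_partition("alpha")
```

This is why range partitioning is preferred for range queries: "alpha" and "beta" land on the same node, while their hash values generally do not.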


5. Scalability: There are two kinds of scalability in distributed systems: scale-up and scale-out. Scale-up refers to expanding the storage and computing power of existing nodes, whereas scale-out expands the distributed system by adding more nodes for data storage and processing.


NoSQL systems generally employ scale-out scalability as the volume of data grows, and they do so while the system is operational, so techniques for distributing the existing data among new nodes without interrupting system operation are necessary.


NoSQL Characteristics Related to Data Models and Query Languages

Data Models

1. Not Requiring a Schema:

The flexibility of not requiring a schema is achieved in many NoSQL systems by allowing semistructured, self-describing data. The users can specify a partial schema in some systems to improve storage efficiency, but it is not required to have a schema in most of the NoSQL systems.


As there may not be a schema to specify constraints, any constraints on the data would have to be programmed in the application programs that access the data items.


There are various languages for describing semistructured data, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML). JSON is used in several NoSQL systems, but other methods for describing semi-structured data can also be used.


2. Less Powerful Query Languages:

Many applications that use NoSQL systems may not require a powerful query language such as SQL, because search (read) queries in these systems often locate single objects in a single file based on their object keys.


NoSQL systems typically provide a set of functions and operations as a programming application programming interface (API), so reading and writing the data objects are accomplished by calling the appropriate operations by the programmer.


In many cases, the operations are called CRUD operations, to create, read, update, and delete. In other cases, they are known as SCRUD because of an added Search (or Find) operation.


Some NoSQL systems also provide a high-level query language, but it may not have the full power of SQL; only a subset of SQL querying capabilities would be provided.


In particular, many NoSQL systems do not provide join operations as part of the query language itself; the joins need to be implemented in the application programs.


3. Versioning: Some NoSQL systems provide storage of multiple versions of the data items, with the timestamps of when each data version was created.


Column Databases


A category of NoSQL systems is known as column-based or wide-column systems. The Google distributed storage system for big data, known as BigTable, is a well-known example of this class of NoSQL systems, and it is used in many Google applications that require large amounts of data storage, such as Gmail.


BigTable uses the Google File System (GFS) for data storage and distribution. An open-source system known as Apache HBase is somewhat similar to Google BigTable, but it typically uses the Hadoop Distributed File System (HDFS) for data storage. HBase can also use Amazon's Simple Storage Service (known as S3) for data storage.


A column, which has a name and a value, is a basic unit of storage in a column family database. A set of columns makes up a row: rows can have identical columns, or they can have different columns.


A set of related columns can be grouped together in collections called column families. Column family databases are designed for rows with a varying number of columns; column families can support even millions of columns.


Column families are organized into groups of data items that are frequently used together. Column families for a single row may or may not be near each other when stored on a disk, but columns within a column family are stored together in persistent storage, making it more likely that reading a single block can satisfy a query.


For a column database, the keyspace is the top-level data structure, in that all other data structures are contained within it; typically, there is one keyspace for every application.


Column families are stored in a keyspace. Each row in a column family is uniquely identified by a row key, thus making a column family analogous to a table in a relational database.


However, data in relational database tables is not necessarily maintained in a predefined order, and rows in relational tables are not versioned the way they are in column family databases.


Tables in relational databases have a relatively fixed structure, and the relational database management system can take advantage of that structure when optimizing the layout of data on drives when storing data for write operations and when retrieving data for reading operations.


Columns in a relational database table are not as dynamic as in column family databases: adding a column in a relational database requires changing its schema definition, whereas adding a column in a column family database merely requires referencing it from a client application, for example, by inserting a value under a new column name. Unlike a relational database table, the set of columns in a column family can vary from one row to another. Column databases do not require a fixed schema; developers can add columns as needed.


Furthermore, in relational databases, data about a single object can be spread across multiple tables. For example, a supplier might have a name, address, and contact information in one table; a list of past purchase orders in another table; and a payment history in yet another.


To reference data from all three tables at once requires a join operation between all these tables. Columnar databases are typically denormalized or structured so that all relevant information about an object exists in a single row.


Query languages for column family databases may look similar to SQL. The query language can support:

  • SQL-like terms such as INSERT, UPDATE, DELETE, and SELECT
  • Column family-specific operations such as CREATE COLUMNFAMILY
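The flexible, per-row column set of a column family can be modeled as nested dictionaries; the keyspace, column family, and column names below are all hypothetical.

```python
# Keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "blog_app": {
        "posts": {
            "row-1": {"title": "My Diagnosis", "tags": "cancer,hope"},
            "row-2": {"title": "Day 30", "mood": "upbeat"},  # different columns
        }
    }
}

def insert_column(cf, row_key, column, value):
    """Adding a column needs no schema change: just reference a new name."""
    cf.setdefault(row_key, {})[column] = value

posts = keyspace["blog_app"]["posts"]
insert_column(posts, "row-1", "comments", 12)   # new column, on row-1 only
```

Note that row-1 and row-2 hold different column sets, and the new "comments" column appears only on the row that received it, with no schema definition anywhere.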




Cassandra is a distributed, open-source database that handles petabytes of data distributed across many commodity servers.


Cassandra's robust support, high data availability, and lack of a single point of failure are made possible by data clusters that span multiple data centers.


The system offers masterless, peer-to-peer asynchronous replication; peer-to-peer clustering (every node plays the same role) helps achieve low latency, high throughput, and no single point of failure.


In the data model, rows are organized into tables; the first component of a table's primary key serves as the partition key. Cassandra does not support joins and subqueries but encourages denormalization to achieve high data availability.


The Cassandra data model favors parallel data operations, which yields high throughput and low latency, and adopts a data replication strategy to ensure high availability for writes as well.


Unlike in relational data models, in Cassandra, the data are duplicated on multiple peer nodes, enabling reliability and fault tolerance.


In Cassandra, whenever a write operation occurs, the data are stored in an in-memory structure called a memtable. The same data are also appended to the commit log on the hard disk, which provides configurable durability.


Every write on a Cassandra node is also received by the commit log, and these durable writes are permanent and survive even a hardware failure; keeping durable writes set to true is advisable to avoid the risk of losing data.


Though customizable, the right amount of memory for the memtable is dynamically allocated by the Cassandra system, and a configurable threshold is set.


During a write operation, when the contents exceed the memtable's limit, the data, including indexes, are queued to be flushed to Sorted String Tables (SSTables) on disk through sequential I/O.
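The write path just described (commit log append, memtable update, threshold-triggered flush to an SSTable) can be sketched as follows; the threshold and data are toy values, not Cassandra defaults.

```python
FLUSH_THRESHOLD = 3   # flush the memtable after this many entries (toy value)

commit_log = []       # durable, append-only record of every write
memtable = {}         # in-memory structure holding recent writes
sstables = []         # immutable on-disk tables produced by flushes

def write(key, value, ts):
    commit_log.append((key, value, ts))     # 1. append to the commit log
    memtable[key] = (value, ts)             # 2. update the memtable
    if len(memtable) >= FLUSH_THRESHOLD:    # 3. flush when the limit is hit
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

write("a", "1", ts=1)
write("b", "2", ts=2)
write("c", "3", ts=3)   # this third write triggers a flush to an SSTable
```

Sorting the entries at flush time reflects the "sorted" in Sorted String Table; the commit log keeps every write even after the memtable is cleared.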


The Cassandra model comes equipped with a SQL-like language called Cassandra Query Language (CQL) to query data.


Cassandra Features


The basic unit of storage in Cassandra is a column. A Cassandra column consists of a name-value pair, where the name also behaves as the key; each of these key-value pairs is a single column and is always stored with a timestamp value.


The timestamp is used to expire data, resolve write conflicts, deal with stale data, and do other things. Once the column data are no longer used, space can be reclaimed later during a compaction phase.


Cassandra puts the standard and super column families into keyspaces. A keyspace is similar to a database in RDBMS where all column families related to the application are stored.


1. Consistency: Upon receipt of a write, the data are first recorded in a commit log and then written to an in-memory structure known as a memtable. A write operation is considered successful once it is written to the commit log and the memtable.


Writes are batched in memory and periodically written out to structures known as SSTables. SSTables are not written to again after they are flushed; if there are changes to the data, a new SSTable is written. Unused SSTables are reclaimed by compaction.


a. Using Consistency Level ONE (read operations): If we have a consistency setting of ONE as the default for all read operations, then upon a read request, Cassandra returns the data from the first replica, even if the data are stale.


If the data are stale, subsequent reads will get the latest (newest) data; this process is known as read repair. The low consistency level is good to use when staleness of data is acceptable or when read performance requirements are very high.


Write operations: Cassandra writes to one node's commit log and returns a response to the client. A consistency level of ONE is good to use when very high write performance is required and the occasional loss of writes is acceptable (which may happen if the node goes down before the write is replicated to other nodes).


b. Using Consistency Level QUORUM: Using the QUORUM consistency setting for both read and write operations ensures that the majority of the nodes respond to the read and that the column with the newest timestamp is returned to the client, while the replicas that do not have the newest data are repaired via the read repair operation.


During write operations, the QUORUM consistency setting means that the write has to propagate to the majority of the nodes before it is considered successful and the client is notified.


c. Using Consistency Level ALL: Using ALL as the consistency level means that all nodes have to respond to reads or writes, which makes the cluster intolerant of faults: when even one node is down, the write or read is blocked and reported as a failure. It is essential to tune the consistency levels as application requirements change.


2. Transactions: A write is atomic at the row level, which means inserting or updating columns for a given row key will be treated as a single write and will either succeed or fail.


Writes are first written to commit logs and memtables and are considered successful only when the write to the commit log and memtable succeeds. If a node goes down, the commit log is used to apply changes to the node, just like the redo log in Oracle.


3. Availability: Cassandra is, by design, highly available, since there is no master in the cluster and every node is a peer. The availability of a cluster can be increased by reducing the consistency level of the requests.


Availability is governed by the (R + W) > N formula, where W is the minimum number of nodes where the write must be successfully written, R is the minimum number of nodes that must respond successfully to a read, and N is the number of nodes participating in the replication of data. You can tune the availability by changing the R and W values for a fixed value of N.
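The rule can be checked with a few lines of code; the (R, W, N) combinations below are illustrative:

```python
# Sketch of the (R + W) > N rule: with N replicas, read quorum R, and
# write quorum W, the inequality guarantees that any read set and any
# write set overlap in at least one node, so a read always reaches a
# node that saw the latest successful write.

def is_strongly_consistent(r, w, n):
    return r + w > n

# QUORUM reads and writes on 3 replicas: majority sets must overlap
assert is_strongly_consistent(2, 2, 3)
# ONE/ONE on 3 replicas: a read may miss the node that took the write
assert not is_strongly_consistent(1, 1, 3)
```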


4. Query: Since Cassandra does not have a rich query language, when designing the data model it is advisable to optimize the columns and column families for reading the data.


As data are inserted in the column families, data in each row are sorted by column names. If we have a column that is retrieved much more often than other columns, it is better performance-wise to use that value for the row key instead.


5. Scaling: Scaling an existing Cassandra cluster is a matter of adding more nodes. Since no single node is a master, adding nodes to the cluster improves its capacity to support more writes and reads.


This type of horizontal scaling allows you to have maximum uptime as the cluster keeps serving requests from the clients while new nodes are being added to the cluster.


Google BigTable


BigTable, primarily used by Google for various important projects, is envisioned as a distributed storage system, designed to manage petabytes of data, distributed across several thousand commodity servers allowing for even further horizontal scaling.


At the hardware level, these thousand commodity servers are grouped under the distributed clusters, which, in turn, are connected through the central switch.


Each cluster is constituted by various racks and a rack can, in turn, consist of several commodity computing machines that communicate to each other through the rack switches.


In fact, it is these rack switches that are connected to the central switch to enable inter-rack or inter-cluster communication.


BigTable efficiently deals with latency and data size issues. It is a compressed, proprietary, high-performance storage system, underpinned by the Google File System (GFS) for log and data file storage and by the Google SSTable (Sorted String Table) format that stores the BigTable data internally.


SSTable offers a persistent, immutable ordered map with keys and values as byte strings, with the key being the primary handle for searching. An SSTable is a collection of blocks of a typical (though configurable) size of 64 KB.
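The role of the block structure in a key lookup can be sketched as follows; the block index and keys are invented, and real SSTables also carry block data, compression, and an index trailer:

```python
# Illustrative sketch of how an SSTable-style block index narrows a key
# lookup: the index records the first key of each fixed-size block, and
# a binary search over those keys selects the single block to read.
import bisect

# first key stored in each (hypothetical) 64 KB block, in sorted order
block_index = ["apple", "grape", "mango", "peach"]

def block_for(key):
    # rightmost block whose first key is <= the search key
    pos = bisect.bisect_right(block_index, key) - 1
    return max(pos, 0)

assert block_for("banana") == 0   # falls between "apple" and "grape"
assert block_for("melon") == 2    # falls between "mango" and "peach"
```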


Unlike a relational database, it is a sparse, multi-dimensional sorted map, indexed by a row key, a column key, and a timestamp. The map stores each value as an uninterpreted array of bytes.


In the BigTable data model, there is no support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions across row keys. As the BigTable data model is not relational in nature, it does not support join operations, and there is no SQL-type language support either.


This may pose challenges to users who largely rely on SQL-like languages for data manipulation. In the BigTable data model, at the fundamental hardware level, inter-rack communication is less efficient than intra-rack communication, that is, communication within a rack through the rack switches.


In a write operation, a valid transaction is logged in the tablet log. Commits are grouped (group commit) to improve throughput. At the commit moment of the transaction, the data are inserted into the memory table (memtable).


During the write operation, the size of the memtable continues to grow, and once it reaches the threshold, the current memtable freezes and a new memtable is generated. The former is converted into an SSTable and finally written into GFS.


In a read operation, the SSTables and the memtable form an efficient merged view and the valid read operation is performed on it.


HBase Data Model and Versioning

Data Model

1. HBase Data Model: The data model in HBase organizes data using the concepts of namespaces, tables, column families, column qualifiers, columns, rows, and data cells.


A column is identified by a combination of (column family: column qualifier). Data are stored in a self-describing form by associating columns with data values, where data values are strings.


HBase also stores multiple versions of a data item, with a timestamp associated with each version, so versions and timestamps are also part of the HBase data model (this is similar to the concept of attribute versioning in temporal databases).


As with other NoSQL systems, unique keys are associated with stored data items for fast access, but the keys identify cells in the storage system. Because the focus is on high performance when storing huge amounts of data, the data model includes some storage-related concepts.


a. Tables and rows:

Data in HBase is stored in tables, and each table has a table name. Data in a table are stored as self-describing rows.


Each row has a unique row key; row keys are strings that must be lexicographically orderable, that is, characters that do not have a lexicographic order in the character set cannot be used as part of a row key.


b. Column families, column qualifiers, and columns:

A table is associated with one or more column families. Each column family will have a name, and the column families associated with a table must be specified when the table is created and cannot be changed later.


While creating a table, the table name is followed by the names of the column families associated with the table.

creating a table:

create 'EMPLOYEE', 'Name', 'Address', 'Details'


When the data are loaded into a table, each column family can be associated with many column qualifiers, but the column qualifiers are not specified as part of creating a table.


So the column qualifiers make the model a self-describing data model because the qualifiers can be dynamically specified as new rows are created and inserted into the table.


A column is specified by a combination of ColumnFamily: ColumnQualifier. Basically, column families are a way of grouping together related columns (attributes in relational terminology) for storage purposes, except that the column qualifier names are not specified during table creation.


Rather, they are specified when the data are created and stored in rows, so the data are self-describing since any column qualifier name can be used in a new row of data.

put 'EMPLOYEE', 'row1', 'Name:Fname', 'Amitabh'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Bacchan'
put 'EMPLOYEE', 'row1', 'Name:Nickname', 'Amit'
put 'EMPLOYEE', 'row1', 'Details:Job', 'Engineer'
put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'
put 'EMPLOYEE', 'row2', 'Name:Fname', 'Hema'
put 'EMPLOYEE', 'row2', 'Name:Lname', 'Malini'
put 'EMPLOYEE', 'row2', 'Name:Mname', 'S'
put 'EMPLOYEE', 'row2', 'Details:Job', 'DBA'
put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'Prakash Padukone'
put 'EMPLOYEE', 'row3', 'Name:Fname', 'Prakash'
put 'EMPLOYEE', 'row3', 'Name:Minit', 'M'
put 'EMPLOYEE', 'row3', 'Name:Lname', 'Padukone'
put 'EMPLOYEE', 'row3', 'Name:Suffix', 'Mr.'
put 'EMPLOYEE', 'row3', 'Details:Job', 'CEO'
put 'EMPLOYEE', 'row3', 'Details:Salary', '3,100,000'

However, it is important that the application programmers know which column qualifiers belong to each column family, even though they have the flexibility to create new column qualifiers on the fly when new data rows are created.


c. Versions and timestamps: HBase can keep several versions of a data item, along with the timestamp associated with each version. The timestamp is a long integer number that represents the system time when the version was created, so newer versions have larger timestamp values.


HBase uses midnight “January 1, 1970 UTC” as timestamp value zero, and uses a long integer that measures the number of milliseconds since that time as the system timestamp value (this is similar to the value returned by the Java utility java.util.Date.getTime() and is also used in MongoDB).


It is also possible for the user to define the timestamp value explicitly in a date format rather than using the system-generated timestamp.


d. Cells: A cell holds a basic data item in HBase. The key (address) of a cell is specified by a combination of (table, row, column family, column qualifier, timestamp).


If the timestamp is left out, the latest version of the item is retrieved unless a default number of versions is specified, say the last three versions.


The default number of versions to be retrieved, as well as the default number of versions that the system needs to keep, are parameters that can be specified during table creation.
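A toy sketch of cell addressing and default-version retrieval; this mimics the model described above, not the actual HBase API, and the sample data are invented:

```python
# Minimal sketch: each cell is keyed by (row, column family, column
# qualifier) and holds several timestamped versions; a read without a
# timestamp returns the newest version.

cells = {}  # (row, cf, qualifier) -> {timestamp: value}

def put(row, cf, qualifier, value, ts):
    cells.setdefault((row, cf, qualifier), {})[ts] = value

def get(row, cf, qualifier, ts=None):
    versions = cells[(row, cf, qualifier)]
    if ts is None:
        ts = max(versions)          # latest version wins by default
    return versions[ts]

put("row1", "Name", "Fname", "Amitabh", ts=100)
put("row1", "Name", "Fname", "Amit", ts=200)    # newer version
latest = get("row1", "Name", "Fname")           # newest version
old = get("row1", "Name", "Fname", ts=100)      # explicit older version
```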


e. Namespaces: A namespace is a collection of tables. A namespace basically specifies a collection of one or more tables that are typically used together by user applications, and it corresponds to a database that contains a collection of tables in relational terminology.


HBase CRUD Operations

HBase has low-level CRUD (create, read, update, delete) operations, as in many of the NoSQL systems. The formats of some of the basic CRUD operations in HBase are illustrated below.


Creating a table: create <tablename>, <column family>, <column family>, …

Inserting data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>

Reading data (all rows in a table): scan <tablename>

Retrieving data (one item): get <tablename>, <rowid>

  • HBase provides only low-level CRUD operations; it is the responsibility of the application programs to implement more complex operations, such as the join operation between rows in different tables.
  • The create operation creates a new table and specifies one or more column families associated with that table, but it does not specify the column qualifiers, as we discussed earlier.
  • The put operation is used for inserting new data or new versions of existing data items.
  • The get operation retrieves the data associated with a single row in a table, and the scan operation retrieves all the rows.


HBase Storage and Distributed System Concepts

Each HBase table is divided into a number of regions, where each region will hold a range of the row keys in the table; this is why the row keys must be lexicographically ordered.


Each region will have a number of stores, where each column family is assigned to one store within the region.


Regions are assigned to region servers (storage nodes) for storage. A master server (master node) is responsible for monitoring the region servers and for splitting a table into regions and assigning regions to region servers.


HBase is built on top of both HDFS and Zookeeper.


HBase uses the Apache Zookeeper open source system for services related to managing the naming, distribution, and synchronization of the HBase data on the distributed HBase server nodes, as well as for coordination and replication services.


Zookeeper can itself have several replicas on several nodes for availability, and it keeps the data it needs in main memory to speed access to the master servers and region servers.


HBase also uses Apache Hadoop Distributed File System (HDFS) for distributed file services.


Key-Value Databases

A key-value data store is a more complex variation on the array data structure. The main characteristic of key-value stores is the fact that every value (data item) must be associated with a unique key, and that retrieving the value by supplying the key must be very fast.


Key-value pair databases are modeled on two components: keys and values; keys are identifiers associated with corresponding values. A namespace is a collection of identifiers; keys must be unique within a namespace.


If a namespace corresponds to an entire database, all keys in the database must be unique. Some key-value databases provide for different namespaces within a database by setting up separate data structures or buckets for separate collections of identifiers.


Values can be as simple as a string, such as a name, number, or more complex values, such as images or binary objects, and so on. Because key-value databases allow virtually any data type in values, it is up to the programmer to determine the range of valid values and enforce those choices as required.
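A minimal sketch of a bucketed key-value store illustrating these points; the class, bucket names, and data are invented for illustration:

```python
# Hedged sketch of a key-value store with namespaces (buckets): keys
# must be unique within a bucket, and values are opaque to the store,
# so the application is responsible for validating them.

class KeyValueStore:
    def __init__(self):
        self.buckets = {}

    def put(self, bucket, key, value):
        self.buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        return self.buckets[bucket].get(key)

store = KeyValueStore()
store.put("users", "u1", {"name": "Hema"})   # structured value
store.put("sessions", "u1", "token-abc")     # same key, other bucket
```

Because buckets are separate namespaces, the same key "u1" maps to different values without conflict.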


Key-value databases share three essential features as follows:


Simplicity: Key-value databases work with a simple data model; it allows the addition of attributes when needed.


Speed: With a simple associative array data structure and design features to optimize performance, key-value databases can deliver high-throughput, data-intensive operations.


Scalability: It is the capability to add or remove servers from a cluster of servers as needed to accommodate the load on the system. Key-value pair databases are the simplest form of NoSQL databases.



Riak is an open source key-value NoSQL data model developed by Basho Technologies. The model is considered a fine implementation of Amazon's Dynamo principles; it distributes data across nodes by employing consistent hashing and organizes the key-value pairs into buckets, which are simply namespaces.


Riak provides supported client libraries for several programming languages, such as Erlang, Java, PHP, Python, Ruby, and C/C++, that help in setting or retrieving the value of a key, the most prominent operations a user performs with Riak.


The model has received wide acceptance in companies such as AT&T, AOL, and Ask.com.


Similar to other efficient NoSQL data models, Riak also offers fault-tolerant availability and by default replicates the key-value store at three places across the nodes of the cluster.


In unfavorable circumstances such as a hardware failure, a node outage will not completely shut down write operations; rather, Riak's master-less peer-to-peer architecture allows a neighboring node (beyond the default three) to respond to the write operation at that moment, and the data can be read back later.


Riak Features

1. Consistency: In distributed key-value store implementations such as Riak, an eventually consistent model of consistency is implemented.


Since the value may have already been replicated to other nodes, Riak has two ways of resolving update conflicts: either the newest write wins and the older writes lose, or both (all) values are returned, allowing the client to resolve the conflict.


2. Transactions: Riak uses the concept of quorum, implemented by supplying the W value (the write quorum) during the write API call. Consider a Riak cluster with a replication factor of 5 and suppose we supply a W value of 3.


When writing, the write is reported as successful only when it is written and reported as a success on at least three of the nodes.


This gives Riak write tolerance; in our example, with N equal to 5 and a W value of 3, the cluster can tolerate N − W = 2 nodes being down for write operations, though we would still have lost some data on those nodes for reading.


3. Query: Key-value stores can be queried only by the key.


4. Scaling: Many key-value stores scale by using sharding: the value of the key determines the node on which the key is stored.


Amazon Dynamo

Amazon runs a worldwide e-commerce platform that serves tens of millions of customers at peak times using tens of thousands of servers located in many data centers around the world.


Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust; there are strict operational requirements on Amazon’s platform in terms of performance, reliability, and efficiency, and to support Amazon’s continuous growth, the platform needs to be highly scalable.


The Dynamo system is a highly available and scalable distributed key-value-based data store built for supporting Amazon’s internal applications. Dynamo is used to manage the state of services that have very high-reliability requirements and need tight control over the tradeoffs among availability, consistency, cost-effectiveness, and performance.


There are many services on Amazon’s platform that need only primary-key access to a data store. 


The customary use of relational databases would lead to inefficiencies and limit the ability to scale and provide high availability. Dynamo provides a simple primary-key-only interface to meet the requirements of these applications.


The query model of the Dynamo system relies on simple read and write operations to a data item that is uniquely identified by a key. The state is stored as binary objects (blobs) identified by unique keys. No operations span multiple data items.


Dynamo’s partitioning scheme relies on a variant of consistent hashing mechanisms to distribute the load across multiple storage hosts. In this mechanism, the output range of a hash function is treated as a fixed circular space or ring (i.e., the largest hash value wraps around to the smallest hash value).


Each node in the system is assigned a random value within this space, which represents its position on the ring.


Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position.


In the Dynamo system, each data item is replicated at N hosts, where N is a parameter configured per instance. Each key K is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the (N − 1) clockwise successor nodes in the ring.
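The ring placement and preference list can be sketched as follows; the node names, the key, and the use of MD5 as the ring hash are assumptions for illustration, not Amazon's actual implementation:

```python
# Sketch of Dynamo-style placement: hash node names onto a ring, hash a
# key onto the same ring, walk clockwise to the first node (the
# coordinator), then take the next N-1 distinct nodes as replicas.
import bisect
import hashlib

def ring_pos(name, ring_size=2**16):
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % ring_size

nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
ring = sorted((ring_pos(n), n) for n in nodes)   # positions on the ring

def preference_list(key, n_replicas=3):
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, ring_pos(key))
    # walk clockwise, wrapping around the ring
    return [ring[(start + i) % len(ring)][1] for i in range(n_replicas)]

replicas = preference_list("cart:12345")   # coordinator + 2 successors
```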


This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor. For example, in a ring with nodes A, B, C, and D and N = 3, node B replicates the key K at nodes C and D in addition to storing it locally, and node D will store the keys that fall in the ranges (A, B), (B, C), and (C, D).


The list of nodes that are responsible for storing a particular key is called the preference list. The system is designed so that every node in the system can determine which nodes should be in this list for any particular key.


Thus, each node becomes responsible for the region of the ring between it and its predecessor node. The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected.


DynamoDB Data Model


The basic data model in DynamoDB uses the concepts of tables, items, and attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items.


Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued or multivalued. So basically, a table will hold a collection of items, and each item is a self-describing record (or object).


DynamoDB also allows the user to specify the items in JSON format, and the system will convert them to the internal storage format of DynamoDB.


When a table is created, it is required to specify a table name and a primary key; the primary key will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB key-value store. The primary key attribute must exist in every item in the table.


The primary key can be one of the following two types:


1. A single attribute:

The DynamoDB system will use this attribute to build a hash index on the items in the table. This is called a hash type primary key. The items are not ordered in storage on the value of the hash attribute.


2. A pair of attributes:

This is called a hash and range type primary key. The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and because there will be multiple items with the same value of A, the B values will be used for ordering the records with the same A value. A table with this type of key can have additional secondary indexes defined on its attributes.


For example, if we want to store multiple versions of some type of item in a table, we could use ItemID as the hash attribute and Date or Timestamp as the range attribute in a hash and range type primary key.
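A hash and range type key of this sort can be sketched in a few lines; the item IDs, timestamps, and payloads are invented, and this toy stands in for the model, not the DynamoDB API:

```python
# Illustrative sketch of a hash-and-range primary key: items with the
# same hash attribute (ItemID) are grouped together and ordered by the
# range attribute (Timestamp), so "all versions of an item" is a cheap
# query.

table = {}  # hash key -> {range key: item}

def put_item(item_id, timestamp, payload):
    table.setdefault(item_id, {})[timestamp] = payload

def query(item_id):
    versions = table.get(item_id, {})
    return [versions[ts] for ts in sorted(versions)]  # ordered by range key

put_item("doc-1", 300, "v3")
put_item("doc-1", 100, "v1")
put_item("doc-1", 200, "v2")
history = query("doc-1")   # versions back in timestamp order
```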


Document Databases


Document-based or document-oriented NoSQL systems typically store data as collections of similar documents.


The individual documents somewhat resemble “complex objects” or XML documents, but a major difference between document-based systems and object, object-relational, and XML systems is that there is no requirement to specify a schema; rather, the documents are specified as self-describing data.


Although the documents in a collection should be similar, they can have different data elements (attributes), and new documents can have new data elements that do not exist in any of the current documents in the collection.


The system basically extracts the data element names from the self-describing documents in the collection, and the user can request that the system create indexes on some of the data elements.


Documents can be specified in various formats, such as XML (Kale 2015). A popular language to specify documents in NoSQL systems is JavaScript Object Notation (JSON).


JSON is a lightweight computer data interchange format, easy for humans to read and write and, at the same time, easy for machines to parse and generate.


Another key feature of JSON is that it is completely language independent. Programming in JSON does not raise any challenge for programmers experienced with the C family of languages.


Though XML is a widely adopted data exchange format, there are several reasons for preferring JSON to XML:


Data entities exchanged with JSON are typed, while XML data are typeless (some of the built-in data types available in JSON are string, number, array, and Boolean); XML data, on the other hand, are all strings.


JSON is lighter and faster than XML as an on-the-wire data format.


JSON integrates natively with JavaScript, a very popular programming language used to develop applications on the client side; consequently, JSON objects can be directly interpreted in JavaScript code, while XML data need to be parsed and assigned to variables through the tedious usage of Document Object Model APIs (DOM APIs).


JSON is built on two structures:


1.A collection of name/value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array.


2. An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

Document databases use a key-value approach to storing data but with important differences from key-value databases:


A document database stores values as semistructured documents, typically in a standard format such as JavaScript Object Notation (JSON) or Extensible Markup Language (XML). Instead of storing each attribute of an entity with a separate key, document databases store multiple attributes in a single document.
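A self-describing document of this kind, combining a nested object and an ordered list, can be sketched with Python's standard json module; the field names and values are invented:

```python
# The two JSON building blocks in miniature: an object (name/value
# pairs) and an ordered array, round-tripped through the json module.
import json

doc = {
    "name": {"Fname": "Hema", "Lname": "Malini"},   # nested object
    "roles": ["DBA", "Architect"],                  # ordered list of values
    "active": True,                                 # typed value (Boolean)
}

text = json.dumps(doc)         # serialize to the wire format
restored = json.loads(text)    # parse back into native structures
```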


Unlike relational databases, document databases do not necessitate defining a fixed schema before adding data to the database—adding a document to the database automatically creates the underlying data structures needed to support the document.


The lack of a fixed schema gives developers more flexibility with document databases than they have with relational databases. Documents can also have embedded documents and lists of multiple values within a document, which effectively eliminates the need for joining documents the way you join tables in a relational database.


Partitioning, which refers to splitting a document database and distributing different parts of the database to different servers, is of two types:


Vertical partitioning: This is the process of improving database performance by separating columns of a relational table into multiple separate tables, or documents into multiple separate collections; this technique is particularly useful when some columns or documents are accessed more frequently than others.


Horizontal partitioning: This is the process of improving database performance by dividing a database by rows in a relational database, or by documents in a document database, with the separate parts, or shards, being stored on separate servers. Sharding is done based upon a range of values, a hash function, or a list of values.
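A minimal sketch of range-based horizontal partitioning; the key ranges and shard names are invented for illustration:

```python
# Hedged sketch: each shard owns a contiguous range of key values, and
# a router sends a document to the shard whose range covers its key.

shard_ranges = [
    ("a", "h", "shard-1"),   # keys starting a..g
    ("h", "p", "shard-2"),   # keys starting h..o
    ("p", "{", "shard-3"),   # keys starting p..z ("{" follows "z" in ASCII)
]

def shard_for(key):
    for low, high, shard in shard_ranges:
        if low <= key < high:
            return shard
    raise KeyError(key)

assert shard_for("bacchan") == "shard-1"
assert shard_for("malini") == "shard-2"
assert shard_for("padukone") == "shard-3"
```

Hash- or list-based sharding follows the same routing idea, only with a hash function or an explicit value list in place of the ranges.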


Document databases provide application programming interfaces (APIs) or query languages that enable you to retrieve documents based on attribute values. They offer support for querying structures with multiple attributes, like relational databases, but offer more flexibility with regard to variation in the attributes used by each document.


Document databases are probably the most popular type of NoSQL database.



Cluster Of Unreliable Commodity Hardware (CouchDB) is an open source, web-oriented NoSQL database that uses JSON to store the data and relies on JavaScript as its query language.


It uses HTTP for data access and benefits from multi-master replication. CouchDB does not use tables at all for data storage or relationship management; rather, the database is a collection of independent documents, each maintaining its own self-contained schema and data.


In CouchDB, multi-version concurrency control helps to avoid locks during a write operation; document metadata holds the revision information, and any offline data updates are merged while stale ones are deleted. CouchDB is equipped with a built-in web administration interface called Futon.



MongoDB (a term extracted from the word humongous) is an open source, document-oriented NoSQL database. It enables master-slave replication:


The master performs reads and writes.

A slave copies the data received from the master, performs read operations, and backs up the data. Slaves do not participate in write operations but may elect an alternate master in case the current master fails.


Unlike traditional relational databases, MongoDB uses a binary format of JSON-like documents and dynamic schemas. MongoDB enables flexible schemas, and common fields of various documents in a collection can hold disparate types of data.


The query system of MongoDB can return particular fields, and queries can encompass search by fields, range queries, regular expression search, and so on, and may also include user-defined complex JavaScript functions.


The latest release of the data model is equipped with more advanced concepts such as sharding that address the scaling issues, and the replica sets that assist in automatic failover and recovery.


Sharding can be understood as a process that addresses the demand of the data growth by storing data records across multiple machines without any downtime.


Each shard (server) functions as an independent database and all the shards collectively form a single logical database. Each config server possesses the metadata about the cluster such as information about the shard and data chunks on the shard.


For protection and safety, there exist multiple config servers, such that if one of these servers goes down, another takes charge without any interruption in shard functioning.


Mongod is the key database process and represents one shard. In a replica set, one of the mongod processes serves as the master, and if it goes down, another is delegated to carry out the master's responsibilities. Mongos intermediates as a routing process between the client and the sharded database.


MongoDB Features


1. Consistency: Consistency in the MongoDB database is configured by using the replica sets and choosing to wait for the writes to be replicated to all the slaves or a given number of slaves. Every write can specify the number of servers the write has to be propagated to before it returns as successful.


2. Transactions: In a traditional RDBMS, transactions mean modifying the database with insert, update, or delete commands over different tables and then deciding to keep the changes or not by using commit or rollback.


These constructs are generally not available in NoSQL solutions—a write either succeeds or fails. Transactions at the single-document level are known as “atomic” transactions.


By default, all writes are reported as successful. Finer control over the write can be achieved by using the WriteConcern parameter. We ensure that the data are written to more than one node before the write is reported successful by using WriteConcern.REPLICAS_SAFE.


Different levels of WriteConcern let you choose the safety level during writes; for example, when writing log entries, you can use the lowest level of safety, WriteConcern.NONE.


3. Availability: MongoDB implements replication, providing high availability using replica sets. In a replica set, there are two or more nodes participating in an asynchronous master-slave replication.


The replica-set nodes elect the master, or primary, among themselves. All the nodes have equal voting rights by default, but some nodes can be favored for being closer to the other servers, for having more RAM, and so on; users can affect this by assigning a priority (a number between 0 and 1,000) to a node.


All requests go to the master node, and the data are replicated to the slave nodes. If the master node goes down, the remaining nodes in the replica set vote among themselves to elect a new master; all future requests are routed to the new master, and the slave nodes start getting data from the new master.


When the node that had failed comes back online, it joins in as a slave and catches up with the rest of the nodes by pulling all the data it needs to get current.


4. Query: One of the good features of document databases, as compared to key-value stores, is that we can query the data inside the document without having to retrieve the whole document by its key and then introspect the document. This feature brings these databases closer to the RDBMS query model.


MongoDB has a query language that is expressed via JSON and has constructs such as $query for the where clause, $orderby for sorting the data, and $explain to show the execution plan of the query. There are many more constructs such as these that can be combined to create a MongoDB query.



5. Scaling: Scaling implies adding nodes or changing data storage without simply migrating the database to a bigger box. Scaling for heavy-read loads can be achieved by adding more read slaves so that all the reads can be directed to the slaves.


Given a heavy-read application, with our three-node replica-set cluster, we can add more read capacity to the cluster as the read load increases just by adding more slave nodes to the replica set and executing reads with the slaveOk flag set.


When a new node is added, it will sync up with the existing nodes, join the replica set as a secondary node, and start serving read requests. An advantage of this setup is that we do not have to restart any other nodes, and there is no downtime for the application either.
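In the shell, adding read capacity to a replica set might look like this (the host name is hypothetical; setReadPref is the modern equivalent of the slaveOk flag mentioned above):

```javascript
// Run against the current primary: add a new secondary to the replica set
rs.add("mongo-node4.example.com:27017")

// Allow this connection's reads to be served by secondary nodes
db.getMongo().setReadPref("secondaryPreferred")
```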


MongoDB Data Model

MongoDB documents are stored in Binary JSON (BSON) format, which is a variation of JSON with some additional data types and is more efficient for storage than JSON. Individual documents are stored in a collection. The operation createCollection is used to create each collection.


For example, the following command can be used to create a collection called project to hold PROJECT objects from the CLIENT database.

db.createCollection("project", {capped : true, size : 1500000, max : 600})


This second command will create another document collection called worker to hold information about the EMPLOYEEs who work on each project.

db.createCollection("worker", {capped : true, size : 5242880, max : 2000})


Each document in a collection has a unique ObjectId field, called _id, which is automatically indexed in the collection unless the user explicitly requests no index for the _id field. The value of the ObjectId:


1. Can be specified by the user: User-generated ObjectIds can have any value specified by the user as long as it uniquely identifies the document; these Ids are thus similar to primary keys in relational systems.


2. Can be system generated if the user does not specify an _id field for a particular document. System-generated ObjectIds have a specific format, which combines the timestamp when the object is created (4 bytes, in an internal MongoDB format), the machine id (3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value.
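Because the leading 4 bytes are a timestamp, a document's creation time can be recovered from a system-generated ObjectId. A minimal sketch in JavaScript (the helper function is our own, not part of MongoDB):

```javascript
// Extract the creation time embedded in a system-generated ObjectId.
// The first 4 bytes (8 hex characters) hold the creation timestamp
// as seconds since the Unix epoch.
function objectIdTimestamp(objectIdHex) {
  if (!/^[0-9a-fA-F]{24}$/.test(objectIdHex)) {
    throw new Error("An ObjectId is 12 bytes, i.e., 24 hex characters");
  }
  const seconds = parseInt(objectIdHex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// An ObjectId starting with 5d000000 was created in June 2019
console.log(objectIdTimestamp("5d0000000000000000000000").toISOString());
// → 2019-06-11T19:24:48.000Z
```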


A collection does not have a schema. The structure of the data fields in documents is chosen based on how documents will be accessed and used, and the user can choose a normalized design (similar to normalized relational tuples) or a denormalized design (similar to XML documents or complex objects).


Interdocument references can be specified by storing in one document the ObjectId or ObjectIds of other related documents.


The _id values are user-defined, and the documents whose _id starts with P (for the project) will be stored in the “project” collection, whereas those whose _id starts with W (for the worker) will be stored in the “worker” collection.


The workers’ information is embedded in the project document; so there is no need for the “worker” collection:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune",
  Workers: [
    { Ename: "Amitabh Bacchan", Hours: 32.5 },
    { Ename: "Priyanka Chopra", Hours: 20.0 }
  ]
}

This is known as the denormalized pattern, which is similar to creating a complex object or an XML document.

Another option is where worker references are embedded in the project document, but the worker documents themselves are stored in a separate “worker” collection:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune",
  WorkerIds: ["W1", "W2"]
}

{ _id: "W1",
  Ename: "Amitabh Bacchan", Hours: 32.5
}

{ _id: "W2",
  Ename: "Priyanka Chopra", Hours: 20.0
}


A third option is to use a normalized design, similar to First Normal Form (1NF) relations:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune"
}

{ _id: "W1",
  Ename: "Amitabh Bacchan", ProjectId: "P1",
  Hours: 32.5
}

{ _id: "W2",
  Ename: "Priyanka Chopra", ProjectId: "P1",
  Hours: 20.0
}



The choice of which design option to use depends on how the data will be accessed.


MongoDB CRUD Operations

MongoDB has several CRUD operations, where CRUD stands for create, read, update, delete. Documents can be created and inserted into their collections using the insert operation, whose format is

db.<collection_name>.insert(<document(s)>)
The parameters of the insert operation can include either a single document or an array of documents, as shown below:

db.project.insert( { _id: "P1", Pname: "ProductL", Plocation: "Pune" } )

db.worker.insert( [ { _id: "W1", Ename: "Amitabh Bacchan", ProjectId: "P1", Hours: 32.5 },
                    { _id: "W2", Ename: "Priyanka Chopra", ProjectId: "P1", Hours: 20.0 } ] )

The delete operation is called remove, and the format is

db.<collection_name>.remove(<condition>)
The documents to be removed from the collection are specified by a Boolean condition on some of the fields in the collection documents.


There is also an update operation, which has a condition to select certain documents, and a $set clause to specify the update. It is also possible to use the update operation to replace an existing document with another one but keep the same ObjectId.
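A hypothetical update against the project collection defined earlier might look like:

```javascript
// Select project P1 and modify only its location field
db.project.update(
  { _id: "P1" },                      // condition selecting the document(s)
  { $set: { Plocation: "Mumbai" } }   // $set clause specifying the update
)
```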


For read queries, the main command is called find, and the format is db.<collection_name>.find(<condition>)


General Boolean conditions can be specified as <condition>, and the documents in the collection that return true are selected for the query result.
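For example, using the worker collection created earlier, a condition can combine several fields and comparison operators:

```javascript
// Find workers on project P1 who logged more than 25 hours
db.worker.find( { ProjectId: "P1", Hours: { $gt: 25 } } )

// An empty condition returns every document in the collection
db.worker.find( {} )
```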


MongoDB Distributed Systems Characteristics

Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern for specifying transactions on multiple documents. Since MongoDB is a distributed system, the two-phase commit method is used to ensure atomicity and consistency of multi-document transactions.


1. MongoDB Replication:

The concept of “replica set” is used in MongoDB to create multiple copies of the same data set on different nodes in the distributed system, and it uses a variation of the master-slave approach for replication.


All write operations must be applied to the primary copy and then propagated to the secondaries. For read operations, the user can choose the particular read preference for their application. The default read preference processes all reads at the primary copy, so all read and write operations are performed at the primary node.


In this case, secondary copies mainly ensure that the system continues operating if the primary fails, and MongoDB can ensure that every read request gets the latest document value.


To increase read performance, it is possible to set the read preference so that read requests can be processed at any replica (primary or secondary); however, a read at a secondary is not guaranteed to get the latest version of a document because there can be a delay in propagating writes from the primary to the secondaries.


2. Sharding in MongoDB:

When a collection holds a very large number of documents or requires a large storage space, storing all the documents in one node can lead to performance problems, particularly if there are many user operations accessing the documents concurrently using various CRUD operations.


Sharding or horizontal partitioning of the documents in the collection divides the documents into disjoint partitions known as shards.


This allows the system to add more nodes as needed, through a process known as horizontal scaling (or scaling out) of the distributed system, and to store the shards of the collection on different nodes to achieve load balancing.


Each node will process only those operations pertaining to the documents in the shard stored at that node. Also, each shard will contain fewer documents than if the entire collection were stored at one node, thus further improving performance.


The user must specify a particular document field to be used as the basis for partitioning the documents into shards. The partitioning field—known as the shard key in MongoDB— must have two characteristics:


  • It must exist in every document in the collection.
  • It must have an index.


The ObjectId can be used as the partitioning field, but any other field possessing these two characteristics can also be used as the basis for sharding.


The values of the shard key are divided into chunks, and the documents are partitioned into shards based on those chunks, through either


1. Hash partitioning: Hash partitioning applies a hash function h(K) to each shard key K, and the partitioning of keys into chunks is based on the hash values. If most searches retrieve one document at a time, hash partitioning may be preferable because it randomizes the distribution of shard key values into chunks.


2. Range partitioning: In general, if range queries are commonly applied to a collection (for example, retrieving all documents whose shard key value is between 200 and 400), then range partitioning is preferred because each range query will typically be submitted to a single node that contains all the required documents in one shard.
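The contrast between the two strategies can be sketched in plain JavaScript (a toy model, not MongoDB's actual chunking logic):

```javascript
// Range partitioning: contiguous shard-key values fall into the same chunk,
// so a range query usually targets a single shard.
function rangeChunk(key, chunkWidth) {
  return Math.floor(key / chunkWidth);
}

// Hash partitioning: a hash of the key picks the chunk, scattering adjacent
// keys and spreading single-document lookups evenly across shards.
function hashChunk(key, numChunks) {
  const h = (key * 2654435761) % 2 ** 32; // simple multiplicative hash
  return h % numChunks;
}

console.log(rangeChunk(205, 100), rangeChunk(215, 100)); // 2 2  (neighbors together)
console.log(hashChunk(205, 4), hashChunk(215, 4));       // 1 3  (neighbors scattered)
```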


When sharding is used, MongoDB queries are submitted to a module called the query router, which keeps track of which nodes contain which shards based on the particular partitioning method used on the shard keys. The query (CRUD operation) will be routed to the nodes that contain the shards that hold the documents that the query is requesting.


If the system cannot determine which shards hold the required documents, the query will be submitted to all the nodes that hold shards of the collection.


Sharding and replication are used together; sharding focuses on improving performance via load balancing and horizontal scalability, whereas replication focuses on ensuring system availability when certain nodes fail in the distributed system.


Graph Databases


A graph database uses structures called nodes and relationships (also called edges, links, or arcs). A node, or vertex, is an object that has an identifier and a set of attributes.


A relationship is a link between two nodes that contains attributes about that relation. For instance, a node could be a city, and a relationship between cities could be used to store information about the distance and travel time between cities.


Much like nodes, relationships have properties; the weight of a relationship represents some value about the relationship. A common problem encountered when working with graphs is to find the least-weighted path between two vertices.


The weight can represent the cost of using the edge, the time required to traverse the edge, or some other metric that you are trying to minimize.
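As a sketch of how such a least-weighted path could be computed, here is a minimal Dijkstra implementation in JavaScript (the city names and weights are invented for illustration):

```javascript
// Minimal Dijkstra sketch: least total weight between two vertices in a
// weighted, directed graph, using a plain linear scan as the priority queue.
function shortestPathCost(graph, source, target) {
  const dist = { [source]: 0 };
  const unvisited = new Set(Object.keys(graph));
  while (unvisited.size > 0) {
    // Pick the unvisited vertex with the smallest known distance
    let u = null;
    for (const v of unvisited) {
      if (dist[v] !== undefined && (u === null || dist[v] < dist[u])) u = v;
    }
    if (u === null || u === target) break; // unreachable, or target settled
    unvisited.delete(u);
    // Relax every outgoing edge of u
    for (const [v, w] of Object.entries(graph[u])) {
      const alt = dist[u] + w;
      if (dist[v] === undefined || alt < dist[v]) dist[v] = alt;
    }
  }
  return dist[target];
}

// Edge weights as travel times between cities (illustrative values)
const g = {
  Pune:   { Mumbai: 3, Nashik: 4 },
  Mumbai: { Nashik: 3, Delhi: 20 },
  Nashik: { Delhi: 18 },
  Delhi:  {},
};
console.log(shortestPathCost(g, "Pune", "Delhi")); // 22, via Nashik
```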


Both nodes and relationships can have complex structures. There are two types of relationships: directed and undirected. Directed edges have a direction.


A path through a graph is a set of vertices along with the edges between those vertices; paths are important because they capture information about how vertices in a graph are related. If edges are directed, the path is a directed path. If the graph is undirected, the paths in it are undirected paths.


Graph databases are designed to model adjacency between objects; every node in the database contains pointers to adjacent objects in the database. This allows for fast operations that require following paths through a graph. Graph databases allow for more efficient querying when paths through graphs are involved.


Many application areas are efficiently modeled as graphs and, in those cases, a graph database may streamline application development and minimize the amount of code you would have to write. Graph databases are the most specialized of the four types of NoSQL databases.




OrientDB is an open-source graph-document NoSQL database. It primarily extends the graph data model but also combines features of the document data model to a certain extent.


At the data level, it is document based in nature for schema-less content, whereas it is graph oriented for traversing relationships; it therefore fully supports schema-less, schema-full, or schema-mixed data.


The database is completely distributed in nature and can be spanned across several servers. It supports state-of-the-art multimaster replication in a distributed system.


It is a fully ACID-compliant data model and also offers role-based security profiles to users. The database engine is lightweight, written in Java, and hence portable and platform independent; it can run on Windows, Linux, etc.


One of the salient features of OrientDB is its fast indexing system for lookups and insertions, which is based on the MVRB-Tree algorithm, derived from the Red-Black Tree and the B+ Tree.


OrientDB relies on SQL for basic operations and uses some graph operator extensions to avoid SQL joins in order to deal with relationships in data; graph traversal language is used as the query processing language and can loosely be termed as OrientDB’s SQL.


OrientDB has out-of-the-box support for web technologies such as HTTP, REST, and JSON without any external intermediaries.



Neo4j is considered the world's leading graph data model and is a prominent member of the NoSQL family. It is an open-source, robust, disk-based graph data model that fully supports ACID transactions and is implemented in Java.


The graph nature of Neo4j makes it agile and fast in comparison to relational databases for a similar set of operations; it is reported to outperform them by more than 1,000x for several traversal-heavy real-time scenarios.


Similar to an ordinary property (simple key-value pairs) graph, the Neo4j graph data model consists of nodes and edges where every node represents an entity, and an edge between two nodes corresponds to the relationship between those attached entities.


As data have become increasingly interconnected, most applications have to deal with highly associated data, forming a network (or graph);


social networking sites are the obvious examples of such applications. Unlike relational database models, which require upfront schemas that restrict the absorption of the agile and ad hoc data, Neo4j is a schema-less data model that works on a bottom-up approach that allows easy expansion of the database to enable ad hoc and dynamic data.


Neo4j has its own declarative and expressive query language, called Cypher, which has pattern-matching capabilities over nodes and relationships when querying and updating data.


Cypher is extremely useful when it comes to creating, updating, or removal of nodes, properties, and relationships in the graph.


Simplicity is one of the salient features of Cypher: it has evolved as a humane query language, designed not only for developers but also for novice users who write ad hoc queries.


Neo4j Features


1. Consistency: Since graph databases operate on connected nodes, most graph database solutions usually do not support distributing the nodes on different servers. Within a single server, data are always consistent, especially in Neo4j, which is fully ACID compliant.


When running Neo4j in a cluster, a write to the master is eventually synchronized to the slaves, while slaves are always available for reads. Writes to slaves are allowed and are immediately synchronized to the master; other slaves are not synchronized immediately, however, and must wait for the data to propagate from the master.


2. Transactions: Though Neo4j is ACID compliant, the way of managing transactions differs from the standard way of doing transactions in an RDBMS.


3. Availability: Neo4j achieves high availability by providing for replicated slaves. These slaves can also handle writes: when they are written to, they synchronize the write to the current master, and the write is committed first at the master and then at the slave. Other slaves will eventually get the update.


4. Query: Neo4j allows you to query the graph for properties of the nodes, traverse the graph, or navigate the nodes’ relationships using language bindings. Properties of a node can be indexed using the indexing service.


Similarly, properties of relationships or edges can be indexed, so a node or edge can be found by the value. Indexes should be queried to find the starting node to begin a traversal. Neo4j uses Lucene as its indexing service.


Graph databases are really powerful when you want to traverse the graphs at any depth and specify a starting node for the traversal. This is especially useful when you are trying to find nodes that are related to the starting node at more than one level down.


As the depth of the graph increases, it makes more sense to traverse the relationships by using a Traverser where you can specify that you are looking for INCOMING, OUTGOING, or BOTH types of relationships.


You can also make the traverse go top-down or sideways on the graph by using order values of BREADTH_FIRST or DEPTH_FIRST.
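The difference between the two orders can be sketched with a generic traversal function in JavaScript (an illustration of the idea, not the Neo4j Traverser API itself):

```javascript
// Generic graph traversal: BREADTH_FIRST takes nodes from the front of the
// frontier (a queue, level by level); DEPTH_FIRST takes from the back
// (a stack, diving down one branch before backtracking).
function traverse(adj, start, order) {
  const visited = new Set([start]);
  const frontier = [start];
  const result = [];
  while (frontier.length > 0) {
    const node = order === "BREADTH_FIRST" ? frontier.shift() : frontier.pop();
    result.push(node);
    for (const next of adj[node] || []) {
      if (!visited.has(next)) { visited.add(next); frontier.push(next); }
    }
  }
  return result;
}

// A small directed graph as an adjacency list
const adj = { A: ["B", "C"], B: ["D"], C: ["E"], D: [], E: [] };
console.log(traverse(adj, "A", "BREADTH_FIRST")); // visits level by level
console.log(traverse(adj, "A", "DEPTH_FIRST"));   // dives down a branch first
```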


5. Scaling: With graph databases, sharding is difficult, as graph databases are not aggregate oriented but relationship oriented.


Since any given node can be related to any other node, storing related nodes on the same server is better for graph traversal.


Since traversing a graph when the nodes are on different machines is not good for performance, graph database scaling can be achieved by using some common techniques:


We can add enough RAM to the server so that the working set of nodes and relationships is held entirely in memory. This technique is helpful only if the dataset that we are working with fits in a realistic amount of RAM.


We can improve the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master.


This pattern of writing once and reading from many servers is useful when the dataset is large enough to not fit in a single machine’s RAM but small enough to be replicated across multiple machines.


Slaves can also contribute to availability and read scaling, as they can be configured to never become a master, remaining always read-only.


When the dataset size makes replication impractical, we can shard the data from the application side using domain-specific knowledge.


Neo4j Data Model

The data model in Neo4j organizes data using the concepts of nodes and relationships. Both nodes and relationships can have properties that store the data items associated with nodes and relationships. Nodes can have labels; a node can have zero, one, or several labels.


The nodes that have the same label are grouped into a collection that identifies a subset of the nodes in the database graph for querying purposes.


Relationships are directed; each relationship has a start node and an end node as well as a relationship type, which serves a role similar to that of a node label by identifying similar relationships that have the same relationship type.


Properties can be specified via a map pattern, which is made of one or more “name:value” pairs enclosed in curly brackets;

for example, {Lname: 'Bacchan', Fname: 'Amitabh', Minit: 'B'}.


There are various ways in which nodes and relationships can be created—for example, by calling appropriate Neo4j operations from various Neo4j APIs. We will just show the high-level syntax for creating nodes and relationships;


to do so, we will use the Neo4j CREATE command, which is part of the high-level declarative query language Cypher. Neo4j has many options and variations for creating nodes and relationships using various scripting interfaces, but a full discussion is outside the scope of our presentation.
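A small, hypothetical example of that CREATE syntax in Cypher (the labels, properties, and relationship type are our own choices, echoing the earlier EMPLOYEE and PROJECT examples):

```cypher
// Create an EMPLOYEE node, a PROJECT node, and a WORKS_ON relationship
// whose Hours property stores data about the relationship itself
CREATE (e:EMPLOYEE {Empid: 1, Fname: 'Amitabh', Lname: 'Bacchan'})
CREATE (p:PROJECT {Pno: 1, Pname: 'ProductL', Plocation: 'Pune'})
CREATE (e)-[:WORKS_ON {Hours: 32.5}]->(p)
```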


1. Indexing and node identifiers: When a node is created, the Neo4j system creates an internal, unique, system-defined identifier for each node. To retrieve individual nodes using other properties of the nodes efficiently, the user can create indexes for the collection of nodes that have a particular label.


Typically, one or more of the properties of the nodes in that collection can be indexed. For example, Empid can be used to index nodes with the EMPLOYEE label, Dno to index the nodes with the DEPARTMENT label, and Pno to index the nodes with the PROJECT label.


2. Optional schema: A schema is optional in Neo4j. Graphs can be created and used without a schema, but in Neo4j version 2.0, a few schema-related functions were added.


The main features related to schema creation involve creating indexes and constraints based on the labels and properties. For example, it is possible to create the equivalent of a key constraint on a property of a label, so all nodes in the collection of nodes associated with the label must have unique values for that property.
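In Cypher (Neo4j 2.x-era syntax), an index and a uniqueness constraint on a label's property might be declared as follows; the labels and properties echo the earlier examples:

```cypher
// Index the Empid property of all nodes labeled EMPLOYEE
CREATE INDEX ON :EMPLOYEE(Empid)

// Equivalent of a key constraint: Dno must be unique among DEPARTMENT nodes
CREATE CONSTRAINT ON (d:DEPARTMENT) ASSERT d.Dno IS UNIQUE
```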



This blog introduces NoSQL database vendor products, covering the characteristics and examples of the four types of NoSQL databases: column, key-value, document, and graph databases.


The blog provides a snapshot overview of column databases (Cassandra, Google BigTable, and HBase), key-value databases (Riak, Amazon Dynamo), document databases (CouchDB, MongoDB), and graph databases (OrientDB, Neo4j).