10 Types of Databases (2019)


This blog explains 10 types of NoSQL databases used in 2019, grouped into four categories. It introduces the characteristics and examples of NoSQL databases, namely, column, key-value, document, and graph databases.


The blog provides a snapshot overview of key-value databases (Riak, Amazon Dynamo), graph databases (OrientDB, Neo4j), column databases (Cassandra, Google BigTable, and HBase) and document databases (CouchDB, MongoDB).


Big Data NoSQL Databases

A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.


NoSQL databases are often highly optimized key-value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput.


NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as “not only SQL” to emphasize that they may, in fact, allow SQL-like query languages to be used.


The term NoSQL is more of an approach or way to address data management versus being a rigid definition.

NoSQL databases

Different types of NoSQL databases exist, and they often share certain characteristics but are optimized for specific types of data, which then require different capabilities and features. NoSQL may mean not only SQL, or it may mean “No” SQL.


When using Apache Hive to run SQL over NoSQL data stores, those queries are converted to MapReduce jobs and run as batch operations to process large volumes of data in parallel.


A “No” SQL database may use APIs or JavaScript to access data versus traditional SQL. APIs can also be used to access the data in NoSQL to process interactive and real-time queries.


When accessing big data, there is a demand for databases that are nonrelational and that are designed to work with semi-structured, unstructured, and structured data.


A nonrelational database provides a flexible data model, layout, and format. Predefined schemas are not required for NoSQL databases.


NoSQL databases have traditionally been nonrelational databases that are distributed and designed to scale to very large sizes. NoSQL databases are designed to address the challenges of big data.


NoSQL databases are very scalable, have high availability, and provide a high level of parallelization for processing large volumes of data quickly. NoSQL databases have different characteristics and features.


For example, they can be key-value based, column-based, document-based, or graph-based. NoSQL data stores may be optimized for key or value, columnar, document-oriented, XML, graph, and object data structures.


Each NoSQL database can emphasize different areas of the CAP theorem (Brewer's theorem).


The CAP theorem effectively states that a distributed database can provide consistency, availability, and partition tolerance (CAP), but not all three at the same time.


In other words, the CAP theorem states that a database can excel in only two of the following areas: consistency (all nodes observe the same data at the same time), availability (every request for data gets a response, success or failure), and partition tolerance (the data platform continues to run even if parts of the system are unreachable).


Choosing a NoSQL database requires choosing what the CAP priorities are. If the data are distributed and replicated, and nodes or networks between the nodes go down, a decision must be made between data consistency (having the current version of the data) and availability.


NoSQL databases can address this with an eventual consistency model (as in Cassandra). Eventual consistency means a node is always available to handle data requests, but at any point in time the data it returns may be stale; replicas converge to the latest value over time.


NoSQL databases choose their priority for CAP tolerance. For instance,

  • HBase and Accumulo are consistent and partition tolerant.
  • CouchDB is partition tolerant and available.
  • Neo4j is available and consistent.


Choosing the correct NoSQL database requires an understanding of factors such as the requirements, SLAs (latency), initial size and growth projections, costs, features and functionality, and concurrency.


It’s also important to understand that NoSQL databases have their own defaults and may have the capability of changing their priorities for CAP. It is also important to understand the roadmap of these NoSQL databases because they are evolving quickly.


NoSQL systems have been divided into four major categories, namely, Column Databases, Key-Value Databases, Document Databases, and Graph Databases.

Column Databases

1. Column Databases: These systems partition a table by column into column families, where each column family is stored in its own files. They also allow versioning of data values.


Column-oriented data stores are used when key-value data stores reach their limits because you want to store a very large number of records with a very large amount of information that goes beyond the simple nature of the key-value store.


The main benefit of using columnar databases is that you can quickly access a large amount of data.


A row in an RDBMS is a continuous disk entry and multiple rows are stored in different disk locations, which makes them more difficult to access; in contrast, in columnar databases, all cells that are part of a column are stored contiguously.


As an example, consider performing a lookup for all blog titles in an RDBMS. For tackling millions of records common on the web, it might be costly in terms of disk entries, whereas in columnar databases, such a search would represent only one access.


Such databases are very handy for retrieving large amounts of data from a specific column family, but the tradeoff is that they lack flexibility. The best-known columnar systems are Google BigTable and, in the open source world, Apache HBase and Cassandra.


One of the other benefits of columnar databases is ease of scaling because data are stored in columns; these columns are highly scalable in terms of the amount of information they can store. This is why they are used mainly for keeping nonvolatile, long-living information and in scaling use cases.


2. Key-Value Databases: These systems have a simple data model based on fast access by the key to the value associated with the key; the value can be a record or an object or a document or even have a more complex data structure.

Key-Value Databases

Key-value databases are the easiest NoSQL databases to understand. These data stores basically act like a dictionary and work by matching a key to a value.


They are often used for high-performance use cases in which basic information needs to be stored—for example, when session information may need to be written and retrieved very quickly.


These data stores really perform well and are efficient for this kind of use case; they are also usually highly scalable.


Key-value data stores can also be used in a queuing context to ensure that data would not be lost, such as in logging architecture or search engine indexing architecture use cases.


Redis and Riak KV are the most famous key-value data stores; Redis, the more widely used of the two, is an in-memory key-value store with optional persistence.


Redis is often used in web applications, such as Node.js or PHP applications, to store session-related data; it can serve thousands of session retrievals per second without degrading performance.
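The dictionary-like behavior described above can be sketched with a minimal in-memory session store with per-key expiry, loosely modeled on how Redis is used for sessions (the class and method names are illustrative, not a real Redis API):

```python
import time

class SessionStore:
    """Toy in-memory key-value store with optional per-key TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return default
        return value

store = SessionStore()
store.set("session:42", {"user": "alice"}, ttl=60)
print(store.get("session:42"))
```

A real deployment would of course use Redis itself (with its `SET`/`GET`/`EXPIRE` commands) rather than a process-local dictionary, which cannot be shared across web servers.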


3. Document Databases: These systems store data in the form of documents using well-known formats such as JavaScript Object Notation (JSON). Documents are accessible via their document id, but can also be accessed rapidly using other indexes.


Columnar databases are not the best for structuring data that contain deeper nesting structures—that is where document-oriented data stores come into play.


Data are indeed stored into key-value pairs, but these are all compressed into what is called a document. This document relies on a structure or encoding such as XML, but most of the time, it relies on JavaScript Object Notation (JSON).


Document-oriented databases are used whenever there is a need to nest information. For instance, an account in your application may have the following information:


Basic information: first name, last name, birthday, profile picture, URL, creation date, and so on


Additional information: address, authentication method (password, Facebook, etc.), interests, and so on

NoSQL document stores are often used in web applications because representing an object with nested objects is fairly easy; moreover, integrating with front-end JavaScript technology is seamless because both technologies work with JSON.


Although document databases are structurally more expressive for representing data, they also have a downside: they generally need to retrieve the whole document even when reading a single field, which can dramatically affect performance. Famous NoSQL document databases are MongoDB, Couchbase, and Apache CouchDB.
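The nested account example above can be made concrete as a JSON document (all field names here are illustrative, not a prescribed schema):

```python
import json

# A hypothetical "account" document with nested structure.
account = {
    "_id": "account-1001",
    "basic": {
        "first_name": "Jane",
        "last_name": "Doe",
        "created": "2019-05-01",
    },
    "additional": {
        "address": {"city": "Pune", "country": "India"},
        "auth_methods": ["password", "facebook"],
        "interests": ["databases", "hiking"],
    },
}

# Document stores typically serialize to and from JSON.
doc = json.dumps(account)
restored = json.loads(doc)
print(restored["additional"]["address"]["city"])
```

Everything about the account travels together in one document, which is exactly why reading a single field still costs a whole-document fetch in many document stores.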


4. Graph Databases:

Data are represented as graphs, and related nodes can be found by traversing the edges using path expressions. Graph databases are really different from other types of databases discussed above.


They use a different paradigm to represent the data: a structure of nodes and edges that are connected to each other through paths called relationships.


These databases rose to prominence with social networks, for instance, to represent a user’s friend network, their friends’ relationships, and so on. With the other types of data stores, it is possible to store a friend relationship to a user in a document, but it quickly becomes complex to model the relationships among those friends;


It is better to use a graph database and create nodes for each friend, connect them through relationships, and browse the graph depending on the need and scope of the query.


The most famous graph database is Neo4j. It is used for use cases involving complex relationship information, such as connections between entities and the other entities related to them, and it is also used in classification use cases.
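The friend-network traversal described above can be sketched with a plain adjacency list and a breadth-first walk (the graph data and function name are illustrative; a graph database like Neo4j would express this as a path query instead):

```python
from collections import deque

# Hypothetical friend graph as an adjacency list: node -> set of neighbors.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
}

def friends_within(graph, start, depth):
    """Breadth-first traversal: all nodes reachable within `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                result.add(neighbor)
                frontier.append((neighbor, d + 1))
    return result

print(friends_within(friends, "alice", 2))  # friends and friends-of-friends
```

In Cypher, Neo4j's query language, the equivalent would be a variable-length path match rather than hand-written traversal code.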


Characteristics of NoSQL Systems

NoSQL Characteristics Related to Distributed Systems and Distributed Databases


1. Availability, Replication, and Eventual Consistency: Many applications that use NoSQL systems require continuous system availability. To accomplish this, data are replicated over two or more nodes in a transparent manner, so that if one node fails, the data are still available on other nodes.


Replication improves data availability and can also improve read performance because read requests can often be serviced from any of the replicated data nodes.


However, writes become more expensive because an update must be applied to every copy of the replicated data item; this can slow down write performance if serializable consistency is required.


 Many NoSQL applications do not require serializable consistency, so more relaxed forms of consistency known as eventual consistency are used.


2. Replication: Two major replication models are used in NoSQL systems: master-slave and master-master replication:


a. Master-slave replication requires one copy to be the master copy; all write operations must be applied to the master copy and then propagated to the slave copies, usually with eventual consistency, that is, the slave copies will eventually become identical to the master copy. For reads, the master-slave paradigm can be configured in various ways.


One configuration requires all reads to also be at the master copy, so this would be similar to the primary site or primary copy methods of distributed concurrency control, with similar advantages and disadvantages.


Another configuration would allow reads at the slave copies but would not guarantee that the values are the latest writes since writes to the slave nodes can be done after they are applied to the master copy.


b. The master-master replication allows reads and writes at any of the replicas but may not guarantee that reads at nodes that store different copies see the same values.


Different users may write the same data item concurrently at different nodes of the system, so the values of the item will be temporarily inconsistent.

A reconciliation method to resolve conflicting write operations of the same data item at different nodes must be implemented as part of the master-master replication scheme.
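One common (if lossy) reconciliation strategy is last-write-wins, which can be sketched as follows (the replica layout and function name are illustrative; real systems may use vector clocks and application-level merge instead of wall-clock timestamps):

```python
def reconcile(replica_a, replica_b):
    """Last-write-wins merge of two replicas.

    Each replica maps key -> (value, timestamp); for each key, the copy
    with the larger timestamp wins. A production system also needs
    tie-breaking rules and clock-skew handling.
    """
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

a = {"cart": (["book"], 100)}
b = {"cart": (["book", "pen"], 105)}
print(reconcile(a, b))
```

Last-write-wins silently discards the older concurrent update, which is why systems such as Riak also offer sibling values that the application merges itself.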


3. Sharding of Files:


In many NoSQL applications, files (or collections of data objects) can have many millions of records (or documents or objects), and these records can be accessed concurrently by thousands of users. So, it is not practical to store the whole file in one node.


Sharding or horizontal partitioning of the file records is often employed in NoSQL systems. This serves to distribute the load of accessing the file records to multiple nodes.


The combination of sharding the file records and replicating the shards works in tandem to improve load balancing as well as data availability.


4. High-Performance Data Access:

In many NoSQL applications, it is necessary to find individual records or objects (data items) from among the millions of data records or objects in a file.


The majority of accesses to an object will be by providing the key value rather than by using complex query conditions. The object key is similar to the concept of object id. To achieve this, most systems use one of two techniques: hashing or range partitioning on object keys:


a. Hashing: a hash function h(K) is applied to the key K, and the location of the object with key K is determined by the value of h(K).


b. Range partitioning: the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Ki_min ≤ K ≤ Ki_max.


In applications that require range queries, where multiple objects within a range of key values are retrieved, range partitioning is preferred.
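Both placement techniques can be sketched in a few lines (the node count and range boundaries are illustrative assumptions, not taken from any particular system):

```python
import hashlib

NODES = 4  # number of storage nodes (illustrative)

def hash_partition(key):
    """Hashing: the node is determined by h(K) mod the number of nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NODES

def range_partition(key, boundaries=("g", "n", "t")):
    """Range partitioning: node i holds keys below boundaries[i];
    the last node holds everything from the final boundary upward."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)

print(hash_partition("user:42"), range_partition("melon"))
```

Notice that with range partitioning, the keys "melon" and "mango" land on the same node, which is what makes range scans cheap; hashing scatters adjacent keys across nodes, which balances load but defeats range queries.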


Other indexes can also be used to locate objects based on attribute conditions different from the key K.


5. Scalability: There are two kinds of scalability in distributed systems: scale-up and scale-out. Scale-up refers to expanding the storage and computing power of existing nodes, whereas scale-out refers to adding more nodes to the system.


In NoSQL systems, scale-out scalability is generally used: the distributed system is expanded by adding more nodes for data storage and processing as the volume of data grows. Because scaling out happens while the system is operational, techniques for distributing the existing data among the new nodes without interrupting system operation are necessary.


NoSQL Characteristics Related to Data Models and Query Languages

Data Models

1. Not Requiring a Schema:

The flexibility of not requiring a schema is achieved in many NoSQL systems by allowing semistructured, self-describing data. The users can specify a partial schema in some systems to improve storage efficiency, but it is not required to have a schema in most of the NoSQL systems.


As there may not be a schema to specify constraints, any constraints on the data would have to be programmed in the application programs that access the data items.


There are various languages for describing semistructured data, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML). JSON is used in several NoSQL systems, but other methods for describing semi-structured data can also be used.


2. Less Powerful Query Languages:

Many applications that use NoSQL systems may not require a powerful query language such as SQL, because search (read) queries in these systems often locate single objects in a single file based on their object keys.


NoSQL systems typically provide a set of functions and operations as an application programming interface (API), so reading and writing data objects are accomplished by the programmer calling the appropriate operations.


In many cases, the operations are called CRUD operations, to create, read, update, and delete. In other cases, they are known as SCRUD because of an added Search (or Find) operation.
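A minimal SCRUD-style API might look like the following sketch (the class and method names are illustrative, not a real system's interface):

```python
class TinyStore:
    """Toy object store exposing the SCRUD operations named above."""

    def __init__(self):
        self._objects = {}

    def create(self, key, obj):
        if key in self._objects:
            raise KeyError(f"{key} already exists")
        self._objects[key] = obj

    def read(self, key):
        return self._objects.get(key)

    def update(self, key, obj):
        if key not in self._objects:
            raise KeyError(f"{key} not found")
        self._objects[key] = obj

    def delete(self, key):
        self._objects.pop(key, None)

    def search(self, predicate):
        """The extra S in SCRUD: find objects matching a condition."""
        return {k: v for k, v in self._objects.items() if predicate(v)}

db = TinyStore()
db.create("u1", {"name": "alice", "age": 30})
db.create("u2", {"name": "bob", "age": 25})
print(db.search(lambda o: o["age"] > 26))
```

Real NoSQL APIs differ mainly in transport (HTTP, binary protocols) and in how rich the search predicate may be, but the operation set is essentially this.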


Some NoSQL systems also provide a high-level query language, but it may not have the full power of SQL; only a subset of SQL querying capabilities would be provided.


In particular, many NoSQL systems do not provide join operations as part of the query language itself; the joins need to be implemented in the application programs.


Versioning: Some NoSQL systems provide storage of multiple versions of the data items, with the timestamps of when the data version was created.


Column Databases


A category of NoSQL systems is known as column-based or wide-column systems. The Google distributed storage system for big data, known as BigTable, is a well-known example of this class of NoSQL systems, and it is used in many Google applications that require large amounts of data storage, such as Gmail.


BigTable uses the Google File System (GFS) for data storage and distribution. An open source system known as Apache HBase is somewhat similar to Google BigTable, but it typically uses the Hadoop Distributed File System (HDFS) for data storage. HBase can also use Amazon’s Simple Storage Service (S3) for data storage.


A column, which has a name and a value, is a basic unit of storage in a column family database. A set of columns makes up a row: rows can have identical columns, or they can have different columns.


A set of related columns can be grouped together in collections called column families. Column family databases are designed for rows with a varying number of columns; column families can support even millions of columns.


Column families are organized into groups of data items that are frequently used together. Column families for a single row may or may not be near each other when stored on a disk, but columns within a column family are stored together in persistent storage, making it more likely that reading a single block can satisfy a query.


For a column database, a keyspace is the top-level data structure, in the sense that all other data structures are contained within it; typically, there is one keyspace for every application.


Column families are stored in a keyspace. Each row in a column family is uniquely identified by a row key, thus making a column family analogous to a table in a relational database.


However, data in relational database tables are not necessarily maintained in a predefined order, and rows in relational tables are not versioned the way they are in column family databases.


Tables in relational databases have a relatively fixed structure, and the relational database management system can take advantage of that structure when optimizing the layout of data on drives when storing data for write operations and when retrieving data for reading operations.


Columns in a relational database table are not as dynamic as in column family databases: adding a column in a relational database requires changing its schema definition, whereas adding a column in a column family database merely requires referencing it from a client application, for example, by inserting a value under a new column name. Unlike a relational database table, the set of columns in a column family can vary from one row to another. Column databases do not require a fixed schema; developers can add columns as needed.
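That flexibility can be illustrated with a toy column-family layout, a nested dictionary in which rows of the same family hold different columns (the keyspace, family, and column names are all illustrative):

```python
# Toy layout: keyspace -> column family -> row key -> {column: value}.
# Rows in the same column family may have different sets of columns.
keyspace = {
    "suppliers": {  # column family
        "row1": {"name:first": "Asha", "contact:city": "Mumbai"},
        "row2": {"name:first": "Ravi", "orders:last_po": "PO-17",
                 "payment:status": "paid"},
    }
}

def put(cf, row_key, column, value):
    """Adding a column needs no schema change, just a new column name."""
    cf.setdefault(row_key, {})[column] = value

put(keyspace["suppliers"], "row1", "payment:status", "pending")
print(sorted(keyspace["suppliers"]["row1"]))
```

Note how row1 and row2 carry different columns, and how the put call introduced payment:status on row1 without any schema declaration, which is the essence of the schemaless column-family model.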


Similarly, in relational databases, data about an object can be stored in multiple tables. For example, a supplier might have a name, address, and contact information in one table; a list of past purchase orders in another table; and a payment history in another.


To reference data from all three tables at once requires a join operation between all these tables. Columnar databases are typically denormalized or structured so that all relevant information about an object exists in a single row.


Query languages for column family databases may look similar to SQL. The query language can support

  • SQL-like terms such as INSERT, UPDATE, DELETE, and SELECT
  • Column family-specific operations such as CREATE COLUMNFAMILY




Cassandra

Cassandra is a distributed, open source database that handles petabytes of data distributed across several commodity servers.


Cassandra’s robust support, high data availability, and lack of a single point of failure are ensured by the fact that its data clusters can span multiple data centers.


The system offers masterless, peer-to-peer asynchronous replication; peer-to-peer clustering (every node plays the same role) helps achieve low latency, high throughput, and no single point of failure.


In the data model, rows are partitioned into tables; the first column of the table forms the primary key, also known as the partition key. Cassandra does not support joins and subqueries but encourages denormalization to achieve high data availability.


The Cassandra data model favors parallel data operations, which results in high throughput and low latency, and adopts a data replication strategy to ensure high availability for writes as well.


Unlike in relational data models, in Cassandra, the data are duplicated on multiple peer nodes, enabling reliability and fault tolerance.


In Cassandra, whenever a write operation occurs, the data are stored in the form of a structure in the memory called memtable. The same data are also appended to the commit log on the hard disk that provides configurable durability.


Every write on a Cassandra node is also received by the commit log, and these durable writes are permanent and survive even a hardware failure; keeping durable writes set to true is advisable to avoid the risk of losing data.


Though customizable, the amount of memory for the memtable is dynamically allocated by the Cassandra system, and a configurable threshold is set.


During the writing operation, when the contents exceed the limit of memtable, the data, including indexes, are queued, ready to be flushed to Sorted String Tables (SSTables) on disk through sequential I/O.


The Cassandra model comes equipped with a SQL-like language called Cassandra Query Language (CQL) to query data.


Cassandra Features


The basic unit of storage in Cassandra is a column. A Cassandra column consists of a name-value pair, where the name also behaves as the key; each of these key-value pairs is a single column and is always stored with a timestamp value.


The timestamp is used to expire data, resolve write conflicts, deal with stale data, and do other things. Once the column data are no longer used, space can be reclaimed later during a compaction phase.


Cassandra puts the standard and super column families into keyspaces. A keyspace is similar to a database in RDBMS where all column families related to the application are stored.


1. Consistency: Upon receipt of a write, the data are first recorded in a commit log and then written to an in-memory structure known as a memtable. A write operation is considered successful once it is written to both the commit log and the memtable.


Writes are batched in memory and periodically written out to structures known as SSTables. SSTables are not written to again after they are flushed; if there are changes to the data, a new SSTable is written. Unused SSTables are reclaimed by compaction.
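The commit-log/memtable/SSTable write path described above can be sketched as a tiny log-structured store (a simplification under stated assumptions: the class name, the size-based flush trigger, and the linear SSTable scan are all illustrative, not Cassandra internals):

```python
class TinyLSM:
    """Sketch of the commit log -> memtable -> SSTable write path."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []      # list of immutable, sorted key-value lists
        self.commit_log = []    # append-only durability log
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durable append first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable, sorted SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Merged view: check the memtable first, then newer SSTables first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(4):
    db.write(f"k{i}", i)
print(len(db.sstables), db.read("k0"))
```

Real systems add bloom filters and block indexes so a read does not scan SSTables linearly, and compaction merges the immutable tables in the background.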


a. Using Consistency Level ONE: For read operations with a default consistency setting of ONE, upon a read request Cassandra returns the data from the first replica, even if the data are stale.


If the data are stale, subsequent reads will get the latest (newest) data; this process is known as read repair. The low consistency level is a good choice when staleness of data is acceptable or when very high read performance is required.


For write operations, Cassandra writes to one node’s commit log and returns a response to the client. A consistency level of ONE is a good choice when the occasional loss of a write is acceptable (which may happen if the node goes down before the write is replicated to other nodes) or when very high write performance is required.


b. Using Consistency Level QUORUM: Using the QUORUM consistency setting for both read and write operations ensures that the majority of the nodes respond to a read and that the column with the newest timestamp is returned to the client, while replicas that do not have the newest data are repaired via read repair.


During write operations, the QUORUM consistency setting means that the write has to propagate to the majority of the nodes before it is considered successful and the client is notified.


c. Using Consistency Level ALL: Using ALL as the consistency level means that all nodes must respond to reads or writes, which makes the cluster intolerant to faults: even when one node is down, the write or read is blocked and reported as a failure. It is essential to tune consistency levels as application requirements change.


2. Transactions: A write is atomic at the row level, which means inserting or updating columns for a given row key is treated as a single write and will either succeed or fail.


Writes are first written to commit logs and memtables and are considered successful only when the write to the commit log and memtable succeeds. If a node goes down, the commit log is used to apply changes to the node, just like the redo log in Oracle.


3. Availability: Cassandra is, by design, highly available, since there is no master in the cluster and every node is a peer. The availability of a cluster can be increased by reducing the consistency level of the requests.


Availability is governed by the (R + W) > N formula, where W is the minimum number of nodes where the write must be successfully written, R is the minimum number of nodes that must respond successfully to a read, and N is the number of nodes participating in the replication of data. You can tune the availability by changing the R and W values for a fixed value of N.
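The quorum-overlap condition can be checked directly (the function name is illustrative):

```python
def is_strongly_consistent(r, w, n):
    """(R + W) > N guarantees that read and write quorums overlap,
    so every read contacts at least one replica holding the latest write."""
    return r + w > n

# N = 3 replicas: QUORUM reads and writes (2 each) overlap; ONE/ONE does not.
print(is_strongly_consistent(2, 2, 3), is_strongly_consistent(1, 1, 3))
```

Choosing R = 1, W = N favors fast reads at the cost of slow writes; R = N, W = 1 does the opposite; QUORUM on both balances the two while keeping R + W > N.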


4. Query: Because Cassandra does not have a rich query language, when designing the data model it is advisable to optimize the columns and column families for reading the data.


As data are inserted in the column families, data in each row are sorted by column names. If a column is retrieved much more often than the others, it is better, performance-wise, to use that value as the row key instead.


5. Scaling: Scaling an existing Cassandra cluster is a matter of adding more nodes. Since no single node is a master, adding nodes to the cluster improves its capacity to support more writes and reads.


This type of horizontal scaling allows you to have maximum uptime as the cluster keeps serving requests from the clients while new nodes are being added to the cluster.


Google BigTable


BigTable, primarily used by Google for various important projects, is envisioned as a distributed storage system, designed to manage petabytes of data, distributed across several thousand commodity servers allowing for even further horizontal scaling.


At the hardware level, these thousand commodity servers are grouped under the distributed clusters, which, in turn, are connected through the central switch.


Each cluster consists of various racks, and a rack, in turn, can contain several commodity computing machines that communicate with each other through the rack switches.


In fact, these are the switches that are connected to the central switch to help in inter-rack or intercluster communications.


BigTable efficiently deals with latency and data size issues. It is a compressed, proprietary, high-performance storage system, underpinned by the Google File System (GFS) for log and data file storage and by Google SSTable (Sorted String Table), which stores the BigTable data internally.


SSTable offers a persistent, immutable ordered map with keys and values as byte strings, with the key being the primary means of lookup. An SSTable is a collection of blocks with a typical (though configurable) size of 64 KB.


Unlike a relational database, it is a sparse, multidimensional sorted map, indexed by a row key, a column key, and a timestamp. The map stores each value as an uninterpreted array of bytes.


In the BigTable data model, there is no support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions across row keys. As the BigTable data model is not relational in nature, it does not support join operations, and there is no SQL-type language support either.


This may pose challenges to users who rely largely on SQL-like languages for data manipulation. At the fundamental hardware level of the BigTable model, inter-rack communication is less efficient than intra-rack communication, that is, communication within a rack through the rack switches.


In a write operation, a valid transaction is logged in the tablet log; commits are grouped (group commit) to improve throughput. At the moment the transaction commits, the data are inserted into the memory table (memtable).


During the write operation, the size of the memtable continues to grow, and once it reaches the threshold, the current memtable freezes and a new memtable is generated. The former is converted into an SSTable and finally written into GFS.


In a read operation, the SSTables and the memtable form an efficient merged view and the valid read operation is performed on it.


HBase Data Model and Versioning

Data Model

1.HBase Data Model: The data model in HBase organizes data using the concepts of namespaces, tables, column families, column qualifiers, columns, rows, and data cells.


A column is identified by a combination of (column family: column qualifier). Data are stored in a self-describing form by associating columns with data values, where data values are strings.


HBase also stores multiple versions of a data item, with a timestamp associated with each version, so versions and timestamps are also part of the HBase data model (this is similar to the concept of attribute versioning in temporal databases).


As with other NoSQL systems, unique keys are associated with stored data items for fast access, but the keys identify cells in the storage system. Because the focus is on high performance when storing huge amounts of data, the data model includes some storage-related concepts.


a. Tables and rows:

Data in HBase is stored in tables, and each table has a table name. Data in a table are stored as self-describing rows.


Each row has a unique row key. Row keys are strings that must be lexicographically orderable; that is, characters that do not have a lexicographic order in the character set cannot be used as part of a row key.


b. Column families, column qualifiers, and columns:

A table is associated with one or more column families. Each column family will have a name, and the column families associated with a table must be specified when the table is created and cannot be changed later.


While creating a table, the table name is followed by the names of the column families associated with the table.

For example, creating a table in the HBase shell:

create 'EMPLOYEE', 'Name', 'Address', 'Details'


When the data are loaded into a table, each column family can be associated with many column qualifiers, but the column qualifiers are not specified as part of creating a table.


The column qualifiers thus make this a self-describing data model, because the qualifiers can be dynamically specified as new rows are created and inserted into the table.


A column is specified by a combination of ColumnFamily: ColumnQualifier. Basically, column families are a way of grouping together related columns (attributes in relational terminology) for storage purposes, except that the column qualifier names are not specified during table creation.


Rather, they are specified when the data are created and stored in rows, so the data are self-describing since any column qualifier name can be used in a new row of data.

put 'EMPLOYEE', 'row1', 'Name:Fname', 'Amitabh'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Bacchan'
put 'EMPLOYEE', 'row1', 'Name:Nickname', 'Amit'
put 'EMPLOYEE', 'row1', 'Details:Job', 'Engineer'
put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'
put 'EMPLOYEE', 'row2', 'Name:Fname', 'Hema'
put 'EMPLOYEE', 'row2', 'Name:Lname', 'Malini'
put 'EMPLOYEE', 'row2', 'Name:Mname', 'S'
put 'EMPLOYEE', 'row2', 'Details:Job', 'DBA'
put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'Prakash Padukone'
put 'EMPLOYEE', 'row3', 'Name:Fname', 'Prakash'
put 'EMPLOYEE', 'row3', 'Name:Minit', 'M'
put 'EMPLOYEE', 'row3', 'Name:Lname', 'Padukone'
put 'EMPLOYEE', 'row3', 'Name:Suffix', 'Mr.'
put 'EMPLOYEE', 'row3', 'Details:Job', 'CEO'
put 'EMPLOYEE', 'row3', 'Details:Salary', '3,100,000'

However, it is important that the application programmers know which column qualifiers belong to each column family, even though they have the flexibility to create new column qualifiers on the fly when new data rows are created.


c. Versions and timestamps: HBase can keep several versions of a data item, along with the timestamp associated with each version. The timestamp is a long integer number that represents the system time when the version was created, so newer versions have larger timestamp values.


HBase uses midnight “January 1, 1970, UTC” as timestamp value zero, and uses a long integer that measures the number of milliseconds since that time as the system timestamp value (this is similar to the value returned by the Java utility java.util.Date.getTime() and is also used in MongoDB).


It is also possible for the user to define the timestamp value explicitly in a date format rather than using the system-generated timestamp.


d. Cells: A cell holds a basic data item in HBase. The key (address) of a cell is specified by a combination of (table, row, column family, column qualifier, timestamp).


If the timestamp is left out, the latest version of the item is retrieved unless a default number of versions is specified, say the last three versions.


The default number of versions to be retrieved, as well as the default number of versions that the system needs to keep, are parameters that can be specified during table creation.
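To make the cell addressing and versioning concrete, here is a minimal Python sketch (an illustrative in-memory model, not HBase's actual storage format; the `put`/`get` helpers are hypothetical):

```python
import time

# cells are keyed by (row key, "family:qualifier"); each cell keeps a list of
# timestamped versions, mirroring the (table, row, family, qualifier, timestamp)
# cell address described above (the table level is omitted for brevity)
cells = {}

def put(row, column, value, ts=None):
    # HBase-style timestamp: milliseconds since January 1, 1970, UTC
    ts = ts if ts is not None else int(time.time() * 1000)
    cells.setdefault((row, column), []).append((ts, value))

def get(row, column, versions=1):
    # newest versions first; leaving versions=1 retrieves only the latest
    history = sorted(cells.get((row, column), []), reverse=True)
    return history[:versions]

put("row1", "Name:Fname", "Amitabh", ts=1)
put("row1", "Name:Fname", "Amit", ts=2)
print(get("row1", "Name:Fname"))              # latest version only
print(get("row1", "Name:Fname", versions=2))  # last two versions
```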


e. Namespaces: A namespace is a collection of tables. A namespace basically specifies a collection of one or more tables that are typically used together by user applications, and it corresponds to a database that contains a collection of tables in relational terminology.




HBase CRUD Operations

HBase has low-level CRUD (create, read, update, delete) operations, as in many of the NoSQL systems. The formats of some of the basic CRUD operations in HBase are illustrated below.


Creating a table: create <tablename>, <column family>, <column family>, …
Inserting data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>
Reading all data in a table: scan <tablename>
Retrieving one item: get <tablename>, <rowid>


HBase provides only these low-level CRUD operations. It is the responsibility of the application programs to implement more complex operations, such as the join operation between rows in different tables.


  • The create operation creates a new table and specifies one or more column families associated with that table, but it does not specify the column qualifiers, as discussed earlier.


  • The put operation is used for inserting new data or new versions of existing data items.
  • The get operation is for retrieving the data associated with a single row in a table, and the scan operation retrieves all the rows.


HBase Storage and Distributed System Concepts

Each HBase table is divided into a number of regions, where each region will hold a range of the row keys in the table; this is why the row keys must be lexicographically ordered.


Each region will have a number of stores, where each column family is assigned to one store within the region.


Regions are assigned to region servers (storage nodes) for storage. A master server (master node) is responsible for monitoring the region servers and for splitting a table into regions and assigning regions to region servers.


HBase is built on top of both HDFS and Zookeeper.


HBase uses the Apache Zookeeper open source system for services related to managing the naming, distribution, and synchronization of the HBase data on the distributed HBase server nodes, as well as for coordination and replication services.


Zookeeper can itself have several replicas on several nodes for availability, and it keeps the data it needs in main memory to speed access to the master servers and region servers.


HBase also uses Apache Hadoop Distributed File System (HDFS) for distributed file services.


Key-Value Databases

A key-value data store is a more complex variation on the array data structure. The main characteristic of key-value stores is the fact that every value (data item) must be associated with a unique key, and that retrieving the value by supplying the key must be very fast.


Key-value pair databases are modeled on two components: keys and values; keys are identifiers associated with corresponding values. A namespace is a collection of identifiers; keys must be unique within a namespace.


If a namespace corresponds to an entire database, all keys in the database must be unique. Some key-value databases provide for different namespaces within a database by setting up separate data structures or buckets for separate collections of identifiers.


Values can be as simple as a string, such as a name, number, or more complex values, such as images or binary objects, and so on. Because key-value databases allow virtually any data type in values, it is up to the programmer to determine the range of valid values and enforce those choices as required.


Key-value databases share three essential features as follows:


Simplicity: Key-value databases work with a simple data model that allows the addition of attributes when needed.


Speed: With a simple associative array data structure and design features to optimize performance, key-value databases can deliver high-throughput, data-intensive operations.


Scalability: It is the capability to add or remove servers from a cluster of servers as needed to accommodate the load on the system. Key-value pair databases are the simplest form of NoSQL databases.



Riak is an open source key-value NoSQL database developed by Basho Technologies. It is considered a faithful implementation of Amazon's Dynamo principles and distributes data across nodes by applying consistent hashing, grouping key-value pairs into buckets, which are simply namespaces.


Riak provides supported client libraries for several programming languages, such as Erlang, Java, PHP, Python, Ruby, and C/C++, that help in setting or retrieving the value of a key, the most common operations a user performs with Riak.


The model has received wide acceptance in companies such as AT&T, AOL, and Ask.com.


Similar to other efficient NoSQL data models, Riak provides fault-tolerant availability and by default replicates each key-value pair at three places across the nodes of the cluster.


In unfavorable circumstances such as hardware failure, a node outage will not completely shut down write operations; rather, Riak's masterless peer-to-peer architecture allows a neighboring node (beyond the default three) to accept the write at that moment, and the data can be read back later.


Riak Features

1. Consistency: In distributed key-value store implementations such as Riak, an eventually consistent model of consistency is implemented.


Since the value may have already been replicated to other nodes, Riak has two ways of resolving update conflicts: either the newest write wins and the older writes lose, or both (all) values are returned, allowing the client to resolve the conflict.
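The two resolution strategies can be sketched as follows (a simplified model using plain timestamps; Riak actually uses vector clocks to detect concurrent updates):

```python
def last_write_wins(conflicting):
    # conflicting: list of (timestamp, value) pairs that replicas disagree on;
    # the newest write survives and the older writes lose
    return max(conflicting)[1]

def return_siblings(conflicting):
    # all conflicting values are returned so the client can resolve the conflict
    return [value for _, value in conflicting]

versions = [(100, "cart: [book]"), (120, "cart: [book, pen]")]
print(last_write_wins(versions))   # only the newest value
print(return_siblings(versions))   # both values, for client-side merging
```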


2. Transactions: Riak uses the concept of a quorum, implemented via the W value (the number of nodes that must acknowledge a write) supplied during the write API call. Consider a Riak cluster with a replication factor N of 5, and suppose we supply a W value of 3.


When writing, the write is reported as successful only when it is written and reported as a success on at least three of the nodes.


This gives Riak write tolerance: in our example, with N equal to 5 and a W value of 3, the cluster can tolerate N − W = 2 nodes being down for write operations, though we would still have lost some data on those nodes for reading.


3. Query: Key-value stores can query by the key.


4. Scaling: Many key-value stores scale by using sharding: the value of the key determines the node on which the key is stored.
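The quorum arithmetic described under Transactions above can be sketched in a few lines (a deliberately simplified model; a real cluster also involves read quorums and hinted handoff):

```python
def write_tolerance(n, w):
    # with replication factor n and write quorum w, up to n - w replicas
    # may be down and writes can still succeed
    return n - w

def write_succeeds(n, w, nodes_up):
    # the write is reported successful only once at least w of the
    # n replicas have acknowledged it
    return min(n, nodes_up) >= w

N, W = 5, 3
print(write_tolerance(N, W))    # 2 nodes may be down for writes
print(write_succeeds(N, W, 3))  # quorum met with exactly W replicas up
print(write_succeeds(N, W, 2))  # quorum cannot be met
```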


Amazon Dynamo

Amazon runs a worldwide e-commerce platform that serves tens of millions of customers at peak times using tens of thousands of servers located in many data centers around the world.


Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust; there are strict operational requirements on Amazon’s platform in terms of performance, reliability, and efficiency, and to support Amazon’s continuous growth, the platform needs to be highly scalable.


The Dynamo system is a highly available and scalable distributed key-value-based data store built for supporting Amazon’s internal applications. Dynamo is used to manage the state of services that have very high-reliability requirements and need tight control over the tradeoffs among availability, consistency, cost-effectiveness, and performance.


There are many services on Amazon’s platform that need only primary-key access to a data store. 


The customary use of relational databases would lead to inefficiencies and limit the ability to scale and provide high availability. Dynamo provides a simple primary-key-only interface to meet the requirements of these applications.


The query model of the Dynamo system relies on simple read and write operations to a data item that is uniquely identified by a key. The state is stored as binary objects (blobs) identified by unique keys. No operations span multiple data items.


Dynamo’s partitioning scheme relies on a variant of consistent hashing mechanisms to distribute the load across multiple storage hosts. In this mechanism, the output range of a hash function is treated as a fixed circular space or ring (i.e., the largest hash value wraps around to the smallest hash value).


Each node in the system is assigned a random value within this space, which represents its position on the ring.


Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position.


In the Dynamo system, each data item is replicated at N hosts, where N is a parameter configured per instance. Each key K is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the (N − 1) clockwise successor nodes in the ring.


This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor. For example, suppose the ring has nodes A, B, C, and D in clockwise order, with N = 3: node B replicates the key K at nodes C and D in addition to storing it locally, and node D will store the keys that fall in the ranges (A, B), (B, C), and (C, D).


The list of nodes that are responsible for storing a particular key is called the preference list. The system is designed so that every node in the system can determine which nodes should be in this list for any particular key.


Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors, and other nodes remain unaffected.
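The ring walk and preference list described above can be sketched as follows (assuming a simple MD5-based ring with one position per node; Dynamo's production system uses virtual nodes and differs in detail):

```python
import bisect
import hashlib

RING_NODES = ["A", "B", "C", "D"]  # illustrative node names

def ring_position(name):
    # hash output treated as a position on a fixed circular space
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

positions = sorted((ring_position(n), n) for n in RING_NODES)

def preference_list(key, n_replicas=3):
    # walk clockwise from the key's position: the first node found is the
    # coordinator, and the next n_replicas - 1 successors hold replicas
    pos = ring_position(key)
    idx = bisect.bisect_right([p for p, _ in positions], pos) % len(positions)
    return [positions[(idx + i) % len(positions)][1] for i in range(n_replicas)]

print(preference_list("user:42"))  # coordinator plus two clockwise successors
```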


DynamoDB Data Model


The basic data model in DynamoDB uses the concepts of tables, items, and attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items.


Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued or multivalued. So basically, a table will hold a collection of items, and each item is a self-describing record (or object).


DynamoDB also allows the user to specify the items in JSON format, and the system will convert them to the internal storage format of DynamoDB.


When a table is created, it is required to specify a table name and a primary key; the primary key will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB key-value store. The primary key attribute must exist in every item in the table.


The primary key can be one of the following two types:


1. A single attribute:

The DynamoDB system will use this attribute to build a hash index on the items in the table. This is called a hash type primary key. The items are not ordered in storage by the value of the hash attribute.


2. A pair of attributes:

This is called a hash and range type of primary key. The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and because there will be multiple items with the same value of A, the B values will be used for ordering the records with the same A value. A table with this type of key can have additional secondary indexes defined on its attributes.


For example, if we want to store multiple versions of some type of item in a table, we could use ItemID as the hash attribute and Date or Timestamp as the range attribute in a hash and range type primary key.
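A minimal in-memory sketch of this hash-and-range addressing (an illustrative model, not DynamoDB's API; the helper names are hypothetical):

```python
# items with the same hash attribute are grouped together and kept ordered
# by the range attribute, so point lookups hash on the first attribute and
# range scans sort on the second
table = {}  # hash key -> {range key -> item}

def put_item(hash_key, range_key, item):
    table.setdefault(hash_key, {})[range_key] = item

def query(hash_key, range_start, range_end):
    partition = table.get(hash_key, {})
    return [partition[r] for r in sorted(partition) if range_start <= r <= range_end]

# e.g. multiple versions of an item: ItemID as hash, Date as range
put_item("item1", "2019-01-01", {"v": 1})
put_item("item1", "2019-03-01", {"v": 2})
put_item("item1", "2019-06-01", {"v": 3})
print(query("item1", "2019-01-01", "2019-04-01"))  # the first two versions
```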


Document Databases


Document-based or document-oriented NoSQL systems typically store data as collections of similar documents.


The individual documents somewhat resemble “complex objects” or XML documents, but a major difference between document-based systems on the one hand and object, object-relational, and XML systems on the other is that there is no requirement to specify a schema; rather, the documents are specified as self-describing data.


Although the documents in a collection should be similar, they can have different data elements (attributes), and new documents can have new data elements that do not exist in any of the current documents in the collection.


The system basically extracts the data element names from the self-describing documents in the collection, and the user can request that the system create indexes on some of the data elements.


Documents can be specified in various formats, such as XML. A popular language to specify documents in NoSQL systems is JavaScript Object Notation (JSON).


JSON is a lightweight computer data interchange format that is easy for humans to read and write and, at the same time, easy for machines to parse and generate.


Another key feature of JSON is that it is completely language independent. Programming in JSON does not raise any challenge for programmers experienced with the C family of languages.


Though XML is a widely adopted data exchange format, there are several reasons for preferring JSON to XML:


Data entities exchanged with JSON are typed, while XML data are typeless: some of the built-in data types available in JSON are string, number, array, and Boolean, whereas XML data are all strings.


JSON is lighter and faster than XML as an on-the-wire data format.


JSON integrates natively with JavaScript, a very popular programming language used to develop applications on the client side; consequently, JSON objects can be directly interpreted in JavaScript code, while XML data need to be parsed and assigned to variables through the tedious usage of Document Object Model APIs (DOM APIs).


JSON is built on two structures:

1. A collection of name/value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array.


2. An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.


Document databases use a key-value approach to storing data but with important differences from key-value databases:


A document database stores values as semistructured documents, typically in a standard format such as JavaScript Object Notation (JSON) or Extensible Markup Language (XML). Instead of storing each attribute of an entity with a separate key, document databases store multiple attributes in a single document.


Unlike relational databases, document databases do not necessitate defining a fixed schema before adding data to the database—adding a document to the database automatically creates the underlying data structures needed to support the document.


The lack of a fixed schema gives developers more flexibility with document databases than they have with relational databases. Documents can also have embedded documents and lists of multiple values within a document, which effectively eliminates the need for joining documents the way you join tables in a relational database.


Partitioning, which refers to splitting a document database and distributing different parts of the database to different servers, is of two types:


Vertical partitioning: It is the process of improving database performance by separating columns of a relational table into multiple separate tables, or documents into multiple separate collections; this technique is particularly useful when some columns or documents are accessed more frequently than others.


Horizontal partitioning: It is the process of improving database performance by dividing a database by rows in a relational database, or by documents in a document database, with the separate parts, or shards, being stored on separate servers. Sharding is done based on a range of values, a hash function, or a list of values.
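The two horizontal-sharding strategies can be sketched as follows (the shard names, range boundaries, and hash choice are illustrative assumptions):

```python
# range-based sharding sends a record to the shard whose key range contains it;
# hash-based sharding spreads records by a hash of the key
RANGE_BOUNDARIES = [("shard0", 0, 199), ("shard1", 200, 399), ("shard2", 400, 10**9)]
N_SHARDS = 3

def range_shard(key):
    for shard, lo, hi in RANGE_BOUNDARIES:
        if lo <= key <= hi:
            return shard

def hash_shard(key):
    return f"shard{hash(key) % N_SHARDS}"

print(range_shard(250))  # falls in the 200-399 range
print(hash_shard(250))   # some shard chosen by the hash function
```

Range sharding keeps nearby keys together (good for range queries); hash sharding balances load more evenly.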


Document databases provide application programming interfaces (APIs) or query languages that enable you to retrieve documents based on attribute values. They offer support for querying structures with multiple attributes, like relational databases, but offer more flexibility with regard to variation in the attributes used by each document.


Document databases are probably the most popular type of NoSQL database.



CouchDB, whose name stands for Cluster Of Unreliable Commodity Hardware, is an open source, web-oriented NoSQL database that uses JSON to store data and relies on JavaScript as its query language.


It uses HTTP for data access and supports multi-master replication. CouchDB does not use tables at all for data storage or relationship management; rather, the database is a collection of independent documents, each maintaining its own self-contained schema and data.


In CouchDB, multi-version concurrency control helps avoid locks during write operations; document metadata holds the revision information, offline data updates are merged, and stale revisions are deleted. CouchDB is equipped with a built-in web administration interface called Futon.



MongoDB (the name is extracted from the word humongous) is an open source, document-oriented NoSQL database. It enables master-slave replication:


Master performs reads and writes.

A slave copies the data received from the master, serves read operations, and backs up the data. Slaves do not participate in write operations but may elect an alternate master if the current master fails.


Unlike the traditional relational databases, MongoDB uses a binary format of JSON-like documents and dynamic schemas. MongoDB enables flexible schemas and common fields of various documents in a collection can have disparate types of data.


The query system of MongoDB can return particular fields, and queries can encompass search by fields, range queries, regular expression search, and so on, and may also include user-defined complex JavaScript functions.


The latest release is equipped with more advanced features such as sharding, which addresses scaling, and replica sets, which assist in automatic failover and recovery.


Sharding can be understood as a process that addresses the demands of data growth by storing data records across multiple machines without any downtime.


Each shard (server) functions as an independent database and all the shards collectively form a single logical database. Each config server possesses the metadata about the cluster such as information about the shard and data chunks on the shard.


For protection and safety, there exist multiple config servers, such that if one of these servers goes down, another takes charge without any interruption in shard functioning.


Mongod is the key database process and represents one shard. In a replica set, one of the mongod processes serves as the master, and if it goes down, another is delegated to carry out the master's responsibilities. Mongos intermediates as a routing process between the client and the sharded database.


MongoDB Features

1. Consistency: Consistency in MongoDB is configured by using replica sets and choosing to wait for writes to be replicated to all the slaves or to a given number of slaves. Every write can specify the number of servers it must be propagated to before it returns as successful.


2. Transactions: In a traditional RDBMS, transactions mean modifying the database with insert, update, or delete commands over different tables and then deciding whether to keep the changes by using commit or rollback.


These constructs are generally not available in NoSQL solutions—a write either succeeds or fails. Transactions at the single-document level are known as “atomic” transactions.


By default, all writes are reported as successful. Finer control over a write can be achieved by using the WriteConcern parameter; for example, we can ensure that an order is written to more than one node before it is reported successful by using WriteConcern.REPLICAS_SAFE.


Different levels of WriteConcern let you choose the safety level during writes; for example, when writing log entries, you can use the lowest level of safety, WriteConcern.NONE.


3. Availability: MongoDB implements replication, providing high availability using replica sets. In a replica set, there are two or more nodes participating in asynchronous master-slave replication.


The replica-set nodes elect the master, or primary, among themselves. Assuming all the nodes have equal voting rights, some nodes can be favored for being closer to the other servers, for having more RAM, and so on; users can affect this by assigning a priority—a number between 0 and 1,000—to a node.


All requests go to the master node, and the data are replicated to the slave nodes. If the master node goes down, the remaining nodes in the replica set vote among themselves to elect a new master; all future requests are routed to the new master, and the slave nodes start getting data from the new master.


When the node that had failed comes back online, it joins in as a slave and catches up with the rest of the nodes by pulling all the data it needs to get current.


4. Query: One of the good features of document databases, as compared to key-value stores, is that we can query the data inside the document without having to retrieve the whole document by its key and then introspect the document. This feature brings these databases closer to the RDBMS query model.


MongoDB has a query language expressed via JSON, with constructs such as $query for the where clause, $orderby for sorting the data, and $explain to show the execution plan of the query. Many more such constructs can be combined to create a MongoDB query.



Scaling implies adding nodes or changing data storage without simply migrating the database to a bigger box. Scaling for heavy-read loads can be achieved by adding more read slaves so that all reads can be directed to the slaves.


Given a heavy-read application with our three-node replica-set cluster, we can add more read capacity as the read load increases just by adding more slave nodes to the replica set and executing reads with the slaveOk flag.


When a new node is added, it will sync up with the existing nodes, join the replica set as a secondary node, and start serving read requests. An advantage of this setup is that we do not have to restart any other nodes, and there is no downtime for the application either.


MongoDB Data Model

MongoDB documents are stored in Binary JSON (BSON) format, which is a variation of JSON with some additional data types and is more efficient for storage than JSON. Individual documents are stored in a collection. The operation createCollection is used to create each collection.


Each document in a collection has a unique ObjectId, stored in the _id field, which:

1. Can be specified by the user: User-generated ObjectIds can have any value specified by the user, as long as it uniquely identifies the document, so these Ids are similar to primary keys in relational systems.


2. Can be system generated if the user does not specify an _id field for a particular document. System-generated ObjectIds have a specific format, which combines the timestamp when the object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value.


A collection does not have a schema. The structure of the data fields in documents is chosen based on how documents will be accessed and used, and the user can choose a normalized design (similar to normalized relational tuples) or a denormalized design (similar to XML documents or complex objects).


Interdocument references can be specified by storing in one document the ObjectId or ObjectIds of other related documents.


The _id values are user-defined, and the documents whose _id starts with P (for the project) will be stored in the “project” collection, whereas those whose _id starts with W (for the worker) will be stored in the “worker” collection.


The workers’ information is embedded in the project document; so there is no need for the “worker” collection:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune",
  Workers: [
    { Ename: "Amitabh Bacchan", Hours: 32.5 },
    { Ename: "Priyanka Chopra", Hours: 20.0 }
  ]
}

This is known as the denormalized pattern, which is similar to creating a complex object or an XML document.

Another option is where worker references are embedded in the project document, but the worker documents themselves are stored in a separate “worker” collection:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune",
  WorkerIds: ["W1", "W2"]
}

{ _id: "W1",
  Ename: "Amitabh Bacchan", Hours: 32.5
}

{ _id: "W2",
  Ename: "Priyanka Chopra", Hours: 20.0
}


A third option is to use a normalized design, similar to First Normal Form (1NF) relations:


{ _id: "P1",
  Pname: "ProductL", Plocation: "Pune"
}

{ _id: "W1",
  Ename: "Amitabh Bacchan", ProjectId: "P1", Hours: 32.5
}

{ _id: "W2",
  Ename: "Priyanka Chopra", ProjectId: "P1", Hours: 20.0
}



The choice of which design option to use depends on how the data will be accessed.


MongoDB CRUD Operations

MongoDB has several CRUD operations, where CRUD stands for create, read, update, delete. Documents can be created and inserted into their collections using the insert operation, whose format is db.<collection_name>.insert(<document(s)>)



The parameters of the insert operation can include either a single document or an array of documents, as shown below:

db.project.insert( { _id: "P1", Pname: "ProductL", Plocation: "Pune" } )
db.worker.insert( [ { _id: "W1", Ename: "Amitabh Bacchan", ProjectId: "P1", Hours: 32.5 },
                    { _id: "W2", Ename: "Priyanka Chopra", ProjectId: "P1", Hours: 20.0 } ] )

The delete operation is called remove, and the format is db.<collection_name>.remove(<condition>)



The documents to be removed from the collection are specified by a Boolean condition on some of the fields in the collection documents.


There is also an update operation, which has a condition to select certain documents, and a $set clause to specify the update. It is also possible to use the update operation to replace an existing document with another one but keep the same ObjectId.


For read queries, the main command is called find, and the format is db.<collection_name>.find(<condition>)


General Boolean conditions can be specified as <condition>, and the documents in the collection that return true are selected for the query result.
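Conceptually, find filters the collection with a Boolean condition; here is a pure-Python model of that behavior (not MongoDB itself; the sample documents mirror the worker collection above):

```python
# a collection is modeled as a list of self-describing documents (dicts)
worker = [
    {"_id": "W1", "Ename": "Amitabh Bacchan", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Priyanka Chopra", "ProjectId": "P1", "Hours": 20.0},
]

def find(collection, condition):
    # each document is tested against the condition; those that evaluate
    # true are selected for the query result
    return [doc for doc in collection if condition(doc)]

result = find(worker, lambda d: d["Hours"] > 25)
print([d["_id"] for d in result])  # only W1 works more than 25 hours
```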


MongoDB Distributed Systems Characteristics

Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern for specifying transactions on multiple documents. Since MongoDB is a distributed system, the two-phase commit method is used to ensure atomicity and consistency of multi-document transactions.


1. MongoDB Replication:

The concept of “replica set” is used in MongoDB to create multiple copies of the same data set on different nodes in the distributed system, and it uses a variation of the master-slave approach for replication.


All write operations must be applied to the primary copy and then propagated to the secondaries. For read operations, the user can choose a particular read preference for their application. The default read preference processes all reads at the primary copy, so all read and write operations are performed at the primary node.


In this case, secondary copies are mainly to make sure that the system continues operation if the primary fails, and MongoDB can ensure that every read request gets the latest document value.


To increase read performance, it is possible to set the read preference so that read requests can be processed at any replica (primary or secondary); however, a read at a secondary is not guaranteed to get the latest version of a document because there can be a delay in propagating writes from the primary to the secondaries.


2. Sharding in MongoDB:

When a collection holds a very large number of documents or requires large storage space, storing all the documents in one node can lead to performance problems, particularly if there are many user operations accessing the documents concurrently using various CRUD operations.


Sharding or horizontal partitioning of the documents in the collection divides the documents into disjoint partitions known as shards.


This allows the system to add more nodes as needed, by a process known as horizontal scaling or scaling out of the distributed system, and to store the shards of the collection on different nodes to achieve load balancing.


Each node will process only those operations pertaining to the documents in the shard stored at that node. Also, each shard will contain fewer documents than if the entire collection were stored at one node, thus further improving performance.


The user must specify a particular document field to be used as the basis for partitioning the documents into shards. The partitioning field, known as the shard key in MongoDB, must have two characteristics:


  • It must exist in every document in the collection.
  • It must have an index.


The ObjectId can be used as the partitioning field, but any other field possessing these two characteristics can also be used as the basis for sharding.


The values of the shard key are divided into chunks, and the documents are partitioned based on the chunks of the shard key through either of two methods:


1. Hash partitioning: Hash partitioning applies a hash function h(K) to each shard key K, and the partitioning of keys into chunks is based on the hash values. If most searches retrieve one document at a time, hash partitioning may be preferable because it randomizes the distribution of shard key values into chunks.


2. Range partitioning: In general, if range queries are commonly applied to a collection (for example, retrieving all documents whose shard key value is between 200 and 400), then range partitioning is preferred because each range query will typically be submitted to a single node that contains all the required documents in one shard.
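The difference between the two partitioning schemes can be sketched in a few lines. This is illustrative only, not MongoDB's internal implementation; the chunk count and range boundaries are made up:

```python
# Sketch of the two shard-key partitioning schemes.
import hashlib

NUM_CHUNKS = 4

def hash_chunk(shard_key):
    # Hash partitioning: a hash of the key decides the chunk, which
    # spreads adjacent key values across different chunks.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

def range_chunk(shard_key, boundaries=(100, 200, 300)):
    # Range partitioning: chunk boundaries split the key space, so
    # documents with nearby key values land in the same chunk.
    for i, upper in enumerate(boundaries):
        if shard_key < upper:
            return i
    return len(boundaries)

# Keys 201 and 202 are adjacent: range partitioning co-locates them
# in chunk 2, which is why a range query [200, 400) hits few shards...
assert range_chunk(201) == range_chunk(202) == 2
# ...while hash partitioning usually scatters adjacent keys.
print(hash_chunk(201), hash_chunk(202))
```

This is why hash partitioning suits single-document lookups (uniform load) while range partitioning suits range queries (locality).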


When sharding is used, MongoDB queries are submitted to a module called the query router, which keeps track of which nodes contain which shards based on the particular partitioning method used on the shard keys. The query (CRUD operation) will be routed to the nodes that contain the shards that hold the documents that the query is requesting.


If the system cannot determine which shards hold the required documents, the query will be submitted to all the nodes that hold shards of the collection.
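A minimal sketch of this routing decision follows. The shard layout, `shard_for` mapping, and document contents are hypothetical; a real query router consults chunk metadata rather than a modulo rule:

```python
# Toy query router: if the filter contains the shard key, route to one
# shard; otherwise scatter the query to every shard.
SHARDS = {
    0: [{"_id": 2, "region": "west"}],   # shard 0 holds even _ids
    1: [{"_id": 1, "region": "east"}],   # shard 1 holds odd _ids
}

def shard_for(key):
    return key % len(SHARDS)  # stand-in for the chunk-to-shard mapping

def route(query, shard_key="_id"):
    if shard_key in query:
        targets = [shard_for(query[shard_key])]  # targeted query: one shard
    else:
        targets = list(SHARDS)                   # scatter-gather: all shards
    results = []
    for s in targets:
        results += [d for d in SHARDS[s]
                    if all(d.get(k) == v for k, v in query.items())]
    return targets, results

targets, docs = route({"_id": 2})
print(targets)   # only one shard consulted
targets, docs = route({"region": "east"})
print(targets)   # shard key absent: every shard consulted
```

Queries that include the shard key stay cheap regardless of cluster size; queries that omit it cost a round trip to every shard.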


Sharding and replication are used together; sharding focuses on improving performance via load balancing and horizontal scalability, whereas replication focuses on ensuring system availability when certain nodes fail in the distributed system.


Graph Databases


A graph database uses structures called nodes and relationships (also called edges, links, or arcs). A node, or vertex, is an object that has an identifier and a set of attributes.


A relationship is a link between two nodes that contains attributes about that relation. For instance, a node could be a city, and a relationship between cities could be used to store information about the distance and travel time between cities.


Much like nodes, relationships have properties; the weight of the relationship represents some value about the relationship. A common problem encountered when working with graphs is to find the least-weighted path between two vertices.


The weight can represent the cost of using the edge, the time required to traverse the edge or some other metric that you are trying to minimize.
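The least-weighted-path problem above can be solved with Dijkstra's algorithm. The sketch below uses a hypothetical road network, with edge weights standing in for travel times in minutes:

```python
# Minimal Dijkstra sketch: find the least-weighted path between two vertices.
import heapq

def shortest_path(graph, start, goal):
    # graph: {node: [(neighbour, weight), ...]}
    queue = [(0, start, [start])]   # (accumulated cost, node, path so far)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in seen:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical cities: direct A->C takes 90 minutes, but A->B->C takes 50.
roads = {
    "A": [("B", 30), ("C", 90)],
    "B": [("C", 20)],
    "C": [],
}
print(shortest_path(roads, "A", "C"))  # (50, ['A', 'B', 'C'])
```

The algorithm always expands the cheapest frontier node first, which is exactly the pointer-chasing access pattern graph databases are built to make fast.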


Both nodes and relationships can have complex structures. There are two types of relationships: directed and undirected. Directed edges have an orientation from one node to the other; undirected edges do not.


A path through a graph is a set of vertices along with the edges between those vertices; paths are important because they capture information about how vertices in a graph are related. If edges are directed, the path is a directed path. If the graph is undirected, the paths in it are undirected paths.


Graph databases are designed to model adjacency between objects; every node in the database contains pointers to adjacent objects in the database. This allows for fast operations that require following paths through a graph. Graph databases allow for more efficient querying when paths through graphs are involved.


Many application areas are efficiently modeled as graphs and, in those cases, a graph database may streamline application development and minimize the amount of code you would have to write. Graph databases are the most specialized of the four types of NoSQL databases.




OrientDB is an open source NoSQL database with a hybrid graph-document data model. It largely extends the graph data model but combines features of both the document and graph data models to a certain extent.


At the data level, it is document based, storing schema-less content, whereas relationships are traversed as a graph; it therefore fully supports schema-less, schema-full, and schema-mixed data.


The database is fully distributed and can span several servers, and it supports state-of-the-art multi-master replication.


It is fully ACID compliant and also offers role-based security profiles to users. The database engine is lightweight and written in Java; it is therefore portable and platform independent and can run on Windows, Linux, and other systems.


One of the salient features of OrientDB is its fast indexing system for lookups and insertions, which is based on the MVRB-Tree algorithm, derived from the Red-Black Tree and the B+ Tree.


OrientDB relies on SQL for basic operations and adds graph-operator extensions that avoid SQL joins when dealing with relationships in the data; this graph traversal language is used for query processing and can loosely be termed OrientDB's SQL dialect.

OrientDB has out-of-the-box support for web protocols such as HTTP, REST, and JSON without any external intermediaries.



Neo4j is widely considered the world's leading graph database and a prominent member of the NoSQL family. It is an open source, robust, disk-based graph data model that fully supports ACID transactions and is implemented in Java.


The graph nature of Neo4j makes it more agile and faster than relational databases for a comparable set of operations; for several important real-time scenarios, it can outperform them by a factor of more than 1,000.


Like an ordinary property graph (where properties are simple key-value pairs), the Neo4j graph data model consists of nodes and edges: every node represents an entity, and an edge between two nodes corresponds to the relationship between those entities.


As relationships have become an important aspect of data today, most applications have to deal with highly associated data, forming a network (or graph);


social networking sites are obvious examples of such applications. Unlike relational database models, which require upfront schemas that restrict the ingestion of agile and ad hoc data, Neo4j is a schema-less data model that works bottom-up, allowing the database to expand easily to accommodate ad hoc and dynamic data.


Neo4j has its own declarative and expressive query language, called Cypher, which offers pattern matching over nodes and relationships during querying and updating.


Cypher is extremely useful when it comes to creating, updating, or removal of nodes, properties, and relationships in the graph.


Simplicity is one of the salient features of Cypher: it has evolved as a humane query language designed not only for developers but also for novice users writing ad hoc queries.


Neo4j Features


1. Consistency: Since graph databases operate on connected nodes, most graph database solutions do not support distributing the nodes across different servers. Within a single server, data are always consistent, especially in Neo4j, which is fully ACID compliant.


When running Neo4j in a cluster, a write to the master is eventually synchronized to the slaves, while slaves are always available for reading. Writes to slaves are allowed and are immediately synchronized to the master; other slaves are not synchronized immediately and must wait for the data to propagate from the master.


2. Transactions: Though Neo4j is ACID compliant, its way of managing transactions differs from the standard way of doing transactions in an RDBMS.


3. Availability: Neo4j achieves high availability by providing for replicated slaves. These slaves can also handle writes: when they are written to, they synchronize the write to the current master, and the write is committed first at the master and then at the slave. Other slaves will eventually get the update.


4. Query: Neo4j allows you to query the graph for properties of the nodes, traverse the graph, or navigate the nodes’ relationships using language bindings. Properties of a node can be indexed using the indexing service.


Similarly, properties of relationships or edges can be indexed, so a node or edge can be found by the value. Indexes should be queried to find the starting node to begin traversal. Neo4j uses Lucene as its indexing service.


Graph databases are really powerful when you want to traverse the graphs at any depth and specify a starting node for the traversal. This is especially useful when you are trying to find nodes that are related to the starting node at more than one level down.


As the depth of the graph increases, it makes more sense to traverse the relationships by using a Traverser where you can specify that you are looking for INCOMING, OUTGOING, or BOTH types of relationships.


You can also make the traverse go top-down or sideways on the graph by using order values of BREADTH_FIRST or DEPTH_FIRST.
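The traversal options above can be sketched in a few lines. This is a conceptual imitation of a Traverser, not Neo4j's actual API; the edge list is made up:

```python
# Sketch of a Traverser-style graph walk: direction (OUTGOING, INCOMING,
# or BOTH) and order (BREADTH_FIRST or DEPTH_FIRST) control the visit.
from collections import deque

EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("e", "a")]

def neighbours(node, direction):
    out = [t for s, t in EDGES if s == node]   # edges leaving the node
    inc = [s for s, t in EDGES if t == node]   # edges arriving at the node
    return {"OUTGOING": out, "INCOMING": inc, "BOTH": out + inc}[direction]

def traverse(start, direction="OUTGOING", order="BREADTH_FIRST"):
    frontier = deque([start])
    visited = []
    while frontier:
        # BREADTH_FIRST takes from the front (queue); DEPTH_FIRST from
        # the back (stack), which walks top-down instead of sideways.
        node = frontier.popleft() if order == "BREADTH_FIRST" else frontier.pop()
        if node in visited:
            continue
        visited.append(node)
        frontier.extend(neighbours(node, direction))
    return visited

print(traverse("a"))                          # breadth-first over OUTGOING edges
print(traverse("a", direction="INCOMING"))    # only nodes that point at "a"
```

Switching the direction or order changes which nodes are reached and in what sequence, without changing the graph itself.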


5. Scaling: With graph databases, sharding is difficult because graph databases are not aggregate oriented but relationship oriented. Since any given node can be related to any other node, storing related nodes on the same server is better for graph traversal.


Since traversing a graph when the nodes are on different machines is not good for performance, graph database scaling can be achieved by using some common techniques:


We can add enough RAM to the server so that the working set of nodes and relationships is held entirely in memory. This technique helps only if the dataset we are working with fits in a realistic amount of RAM.


We can improve the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master.


This pattern of writing once and reading from many servers is useful when the dataset is large enough to not fit in a single machine’s RAM but small enough to be replicated across multiple machines.


Slaves can also contribute to availability and read scaling, as they can be configured to never become a master, remaining always read-only. When the dataset size makes replication impractical, we can shard the data from the application side using domain-specific knowledge.


Neo4j Data Model

The data model in Neo4j organizes data using the concepts of nodes and relationships. Both nodes and relationships can have properties that store the data items associated with nodes and relationships. Nodes can have labels; a node can have zero, one, or several labels.


The nodes that have the same label are grouped into a collection that identifies a subset of the nodes in the database graph for querying purposes.


Relationships are directed; each relationship has a start node and an end node as well as a relationship type, which serves a role similar to that of a node label by identifying similar relationships that have the same relationship type.


Properties can be specified via a map pattern, which is made of one or more “name:value” pairs enclosed in curly brackets;

for example {Lname: ‘Bacchan’, Fname : ‘Amitabh’, Minit : ‘B’}.


There are various ways in which nodes and relationships can be created—for example, by calling appropriate Neo4j operations from various Neo4j APIs. We will just show the high-level syntax for creating nodes and relationships;


to do so, we will use the Neo4j CREATE command, which is part of the high-level declarative query language Cypher. Neo4j has many options and variations for creating nodes and relationships using various scripting interfaces, but a full discussion is outside the scope of our presentation.
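As a sketch, nodes and a relationship could be created with Cypher's CREATE command as follows. The labels, property names, and relationship type here are illustrative, following the examples in this section:

```cypher
// Two labelled nodes with map-pattern properties, plus a directed,
// typed relationship between them.
CREATE (e:EMPLOYEE {Lname: 'Bacchan', Fname: 'Amitabh', Minit: 'B'})
CREATE (d:DEPARTMENT {Dname: 'Research', Dno: 5})
CREATE (e)-[:WORKS_FOR]->(d)
```

Note how the map pattern in curly brackets supplies the properties, the label follows the node variable after a colon, and the arrow gives the relationship its direction.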


1. Indexing and node identifiers: When a node is created, the Neo4j system creates an internal, unique, system-defined identifier for it. To retrieve individual nodes efficiently using other properties, the user can create indexes on the collection of nodes that have a particular label.


Typically, one or more of the properties of the nodes in that collection can be indexed. For example, Empid can be used to index nodes with the EMPLOYEE label, Dno to index the nodes with the DEPARTMENT label, and Pno to index the nodes with the PROJECT label.


2. Optional schema: A schema is optional in Neo4j. Graphs can be created and used without a schema, but in Neo4j version 2.0, a few schema-related functions were added.


The main features related to schema creation involve creating indexes and constraints based on the labels and properties. For example, it is possible to create the equivalent of a key constraint on a property of a label, so all nodes in the collection of nodes associated with the label must have unique values for that property.
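A sketch of these two schema features in Cypher, using the EMPLOYEE label and Empid property from the indexing example above (syntax as of Neo4j 2.x; later versions use a different form):

```cypher
// Index on a property of a label, to speed up lookups by Empid
CREATE INDEX ON :EMPLOYEE(Empid)

// Uniqueness constraint: the equivalent of a key constraint
CREATE CONSTRAINT ON (e:EMPLOYEE) ASSERT e.Empid IS UNIQUE
```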


NoSQL databases for big data


NoSQL is the generic name used to refer to non-relational databases and stands for Not only SQL. Why is there a need for a non-relational model that does not use SQL?


The short answer is that the non-relational model allows us to continually add new data. The non-relational model has some features that are necessary for the management of big data, namely scalability, availability, and performance.


With a relational database, you cannot keep scaling vertically without loss of function, whereas with NoSQL you scale horizontally, which enables performance to be maintained.


Before describing the NoSQL distributed database infrastructure and why it is suitable for big data, we need to consider the CAP Theorem.


CAP Theorem

In 2000, Eric Brewer, a professor of computer science at the University of California Berkeley, presented the CAP (consistency, availability, and partition tolerance) Theorem.


Within the context of a distributed database system, consistency refers to the requirement that all copies of data should be the same across nodes.


So, for example, Block A in DataNode 1 should be the same as Block A in DataNode 2. Availability requires that if a node fails, other nodes still function—if DataNode 1 fails, then DataNode 2 must still operate.


Data, and hence DataNodes, are distributed across physically separate servers and communication between these machines will sometimes fail. When this occurs it is called a network partition. Partition tolerance requires that the system continues to operate even if this happens.


In essence, what the CAP Theorem states is that for any distributed computer system, where the data is shared, only two of these three criteria can be met. There are therefore three possibilities; the system must be: consistent and available, consistent and partition tolerant, or partition tolerant and available.


Notice that since in an RDBMS the network is not partitioned, only consistency and availability are of concern, and the RDBMS model meets both of these criteria.


In NoSQL, since we necessarily have partitioning, we have to choose between consistency and availability.


By sacrificing availability, we are able to wait until the consistency is achieved. If we choose instead to sacrifice consistency it follows that sometimes the data will differ from server to server.


The somewhat contrived acronym BASE (Basically Available, Soft state, Eventually consistent) is used as a convenient way of describing this situation. BASE appears to have been chosen in deliberate contrast to the ACID properties of relational databases.


‘Soft’ in this context refers to the flexibility in the consistency requirement. The aim is not to abandon any one of these criteria but to find a way of optimizing all three, essentially a compromise.


The architecture of NoSQL databases


The name NoSQL derives from the fact that SQL cannot be used to query these databases.


So, for example, joins like those used in relational databases are not possible. There are four main types of non-relational or NoSQL database: key-value, column-based, document, and graph, all useful for storing large amounts of structured and semi-structured data.


The simplest is the key-value database, which consists of an identifier (the key) and the data associated with that key (the value). Notice that the 'value' can contain multiple items of data.
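At its core, the key-value model can be sketched as nothing more than a lookup table. The keys and values below are made up for illustration:

```python
# A key-value store at its simplest: an identifier maps to an opaque
# value, and that value may itself hold multiple items of data.
store = {}
store["user:42"] = {"name": "Ada", "visits": 3}  # value with several fields
store["session:9"] = "token-abc"                 # value can be a single item

# Retrieval is by key only: the store does not query inside the value.
print(store["user:42"])
```

The store never interprets the value, which is precisely what makes key-value databases so easy to partition and scale horizontally.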


Currently, an approach called NewSQL is finding a niche. By combining the performance of NoSQL databases and the ACID properties of the relational model, the aim of this latest technology is to solve the scalability problems associated with the relational model, making it more usable for big data.


Cloud storage

Cloud storage

Like so many modern computing terms, 'the Cloud' sounds friendly, comforting, inviting, and familiar, but it is, as mentioned earlier, just a way of referring to a network of interconnected servers housed in data centers across the world. These data centers provide a hub for storing big data.


Through the Internet we share the use of these remote servers, provided (on payment of a fee) by various companies, to store and manage our files, to run apps, and so on. As long as your computer or other device has the requisite software to access the Cloud, you can view your files from anywhere and give permission for others to do so.


You can also use software that ‘resides’ in the Cloud rather than on your computer. So it’s not just a matter of accessing the Internet but also of having the means to store and process information—hence the term ‘Cloud computing’.


Our individual Cloud storage needs are not that big, but scaled up the amount of information stored is massive.


Amazon is the biggest provider of Cloud services but the amount of data managed by them is a commercial secret. We can get some idea of their importance in Cloud computing by looking at an incident that occurred in February 2017 when Amazon Web Services’ Cloud storage system, S3, suffered a major outage (i.e. service was lost).


This lasted for approximately five hours and resulted in the loss of connection to many websites and services, including Netflix, Expedia, and the US Securities and Exchange Commission.


Amazon later reported human error as the cause, stating that one of its employees had inadvertently taken servers offline. Rebooting these large systems took longer than expected but was eventually completed successfully.


