Early Big Data with NoSQL

In this chapter, I provide an overview of the datastore technologies available in a Big Data project context. I then focus on Couchbase and ElasticSearch, showing how they can be used and how they differ. The first section gives you a better understanding of the different flavors of existing technologies within the NoSQL landscape.

NoSQL Landscape

Relational databases were, for a long time, almost the only choice of developers and database administrators for traditional three-tier applications. This was the case for many reasons having to do with the data modeling methodology, the querying language used to interact with the data, and the robustness of those technologies, which allowed consistent data stores to be deployed in service of complex applications. Then needs evolved to the point where those data stores could no longer be the answer to every data-store problem. That's how the term NoSQL arose: it offered a new approach to those problems by breaking away from the standardized, schema-oriented SQL paradigm. NoSQL technologies are schemaless and highly scalable, and some of them are also highly distributed and high-performance. Most of the time, they complement an architecture built on an existing RDBMS by playing, for example, the role of cache, search engine, unstructured store, or volatile information store. They are divided into four main categories:

1. Key/value data stores
2. Column data stores
3. Document-oriented data stores
4. Graph data stores

Now let's dive into the different categories and then choose the most appropriate one for our use case.

Key/Value

The first and easiest NoSQL data stores to understand are key/value data stores. These data stores basically act like a dictionary: they work by matching a key to a value.
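To make the dictionary analogy concrete, here is a minimal Python sketch of the key/value contract, with lazy expiry of the kind session caches rely on. It is a toy model, not tied to any particular product, and the class and method names are invented for the illustration:

```python
# Toy in-memory key/value store: an opaque key maps to a value,
# optionally with a time-to-live, as session data typically has.
import time

class KVStore:
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.time() > expires_at:
            del self._data[key]  # lazily expire, like a session timing out
            return default
        return value

store = KVStore()
store.set("session:42", {"user": "jane"}, ttl=30)
print(store.get("session:42"))  # {'user': 'jane'}
```

The whole interface is just set and get by key, which is exactly why such stores can be made so fast and so scalable.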
They are often used for high-performance use cases in which basic information needs to be stored, for example when session information must be written and retrieved very quickly. These data stores perform well and are efficient for this kind of use case; they are also usually highly scalable.

Key/value data stores can also be used in a queuing context to ensure that data won't be lost, such as in logging or search engine indexing architectures. Redis and Riak KV are the best-known key/value data stores; Redis is more widely used, offering an in-memory K/V store with optional persistence. Redis is often used in web applications, such as Node.js or PHP applications, to store session-related data; it can serve thousands of session retrievals per second without degrading performance. Another typical use case is the queuing use case that I describe later in this book: Redis is positioned between Logstash and ElasticSearch so that streamed log data is not lost before it is indexed in ElasticSearch for querying.

Column

Column-oriented data stores are used when key/value data stores reach their limits: when you want to store a very large number of records carrying an amount of information that goes beyond the simple nature of the key/value store. Column data store technologies might be difficult to understand for people coming from the RDBMS world, but actually, they are quite simple: whereas data is stored in rows in an RDBMS, it is stored in columns in column data stores. The main benefit of using columnar databases is that you can quickly access a large amount of data. In an RDBMS, a row is a continuous disk entry, and multiple rows are stored in different disk locations, which makes them more difficult to access; in columnar databases, all the cells that are part of a column are stored continuously.
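The row versus column layout can be sketched in a few lines of Python. This is a toy model of the two representations, not how any specific engine lays out bytes on disk:

```python
# Row layout: one record per entry; reading one field touches every record.
rows = [
    {"id": 1, "title": "NoSQL intro", "body": "..."},
    {"id": 2, "title": "Couchbase views", "body": "..."},
    {"id": 3, "title": "Shards explained", "body": "..."},
]

# Column layout: each field stored contiguously; reading all titles
# is a single sequential scan of one column.
columns = {
    "id": [1, 2, 3],
    "title": ["NoSQL intro", "Couchbase views", "Shards explained"],
    "body": ["...", "...", "..."],
}

# Fetching all titles:
titles_from_rows = [r["title"] for r in rows]  # walks every full record
titles_from_columns = columns["title"]         # one contiguous read
assert titles_from_rows == titles_from_columns
```

Both layouts hold the same data; the columnar one simply makes "give me this one field for every record" a single contiguous access.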
As an example, consider looking up all blog titles in an RDBMS; it might be costly in terms of disk entries, specifically if we are talking about millions of records, whereas in a columnar database such a lookup represents only one access. Such databases are indeed very handy for retrieving large amounts of data from a specific column family, but the tradeoff is that they lack flexibility. The best-known columnar design is Google's Bigtable; the most widely used open source implementations are Apache HBase and Cassandra. Another benefit of columnar databases is ease of scaling: because data is stored in columns, these columns are highly scalable in terms of the amount of information they can store. This is why they are mainly used for keeping nonvolatile, long-living information and in scaling use cases.

Document

Columnar databases are not the best fit for data that contains deeper nested structures; that's where document-oriented data stores come into play. Data is still stored as key/value pairs, but these are all grouped into what is called a document. The document relies on a structured encoding such as XML, but most of the time it relies on JSON (JavaScript Object Notation). Although document-oriented databases are more useful structurally and for representing data, they also have a downside, specifically when it comes to interacting with the data: they generally need to fetch the whole document, for example when reading a single field, and this can dramatically affect performance. You are apt to use document-oriented databases when you need to nest information. For instance, think of how you would represent an account in your application.
It would have the following:

• Basic information: first name, last name, birthday, profile picture, URL, creation date, and so on
• Complex information: address, authentication method (password, Facebook, etc.), interests, and so on

That's also why NoSQL document-oriented stores are so often used in web applications: representing an object with nested objects is easy, and integrating with front-end JavaScript technology is seamless because both sides work with JSON.

The most used technologies today are MongoDB, Couchbase, and Apache CouchDB. They are easy to install and start, are well documented, and are scalable, but above all, they are the most obvious choices for starting a modern web application. Couchbase is one of the technologies we are going to use in our architecture, specifically because of the way we can store, organize, and query data with it. I chose Couchbase mainly based on a performance benchmark showing that its latency stays lower at high operation throughputs than MongoDB's. It is also worth mentioning that Couchbase is the combination of CouchDB and Memcached, and today, from a support perspective, it makes more sense to use Couchbase.

Graph

Graph databases are really different from the other types of databases. They use a different paradigm to represent the data: a tree-like structure of nodes and edges connected to each other through paths called relations. These databases rose with social networks, for example to represent a user's friends network, their friends' relationships, and so on.
With the other types of data stores, it is possible to store a friend relationship in a document, but it can quickly become really complex to represent whole networks of friendships; it's better to use a graph database, create a node for each friend, connect them through relationships, and browse the graph depending on the need and scope of the query. The most famous graph database is Neo4j, and as mentioned before, it is used for use cases involving complex relationship information, such as connections between entities and the other entities related to them; it is also used in classification use cases. Figure 2-1 shows how three entities would be connected within a graph database.

Figure 2-1. Graph database example

The diagram's two account nodes, Jane and John, connect to each other through edges that define their relationship; they have known each other since a given date. Another node, a group, connects to the two account nodes, showing that Jane and John have been part of the soccer group since a given date.

NoSQL in Our Use Case

In the context of our use case, we first need a document-oriented NoSQL technology that structures the data contained in our relational database into JSON documents. As mentioned earlier, traditional RDBMSs store data in multiple tables linked by relationships, which makes it harder and less efficient to retrieve the description of a whole object. Let's take the example of an account, which can be split into the tables shown in Figure 2-2.

Figure 2-2. Account tables

If you want to retrieve all account information, you basically need two joins across the three tables. Now consider this: I need to do that for every user, every time they connect, and these connections happen for different business-logic reasons throughout my application. In the end, you just want a "view" of the account itself.
What if we could get the whole account view just by passing the account identifier to a method of our application API that returns the following JSON document?

{
  "id": "account_identifier",
  "email": "account_email",
  "firstname": "account_firstname",
  "lastname": "account_lastname",
  "birthdate": "account_birthdate",
  "authentication": [
    {
      "token": "authentication_token_1",
      "source": "authentication_source_1",
      "created": "12-12-12"
    },
    {
      "token": "authentication_token_2",
      "source": "authentication_source_2",
      "created": "12-12-12"
    }
  ],
  "address": [
    {
      "street": "address_street_1",
      "city": "address_city_1",
      "zip": "address_zip_1",
      "country": "address_country_1",
      "created": "12-12-12"
    }
  ]
}

The benefit is obvious: by keeping a fresh JSON representation of our application entities, we get better and faster data access. Furthermore, we can generalize this approach to all read requests, so that all reads are served by the NoSQL data store, leaving the modification requests (create, update, delete) to the RDBMS. But we must implement logic that transactionally propagates any change in the RDBMS to the NoSQL data store, and that rebuilds the object from the relational database if it is not found in the cache. You may wonder why we keep the RDBMS at all, when creating documents in a NoSQL data store is so efficient and scalable. The reason is that replacing it is not the goal of our application: we don't want a Big Bang effect. Let's assume that the RDBMS was already there and that we want to integrate a NoSQL data store because of the RDBMS's lack of flexibility. We want to leverage the best of both technologies, specifically the data consistency of the RDBMS and the scalability of the NoSQL side. Besides, this is just one simple query example; we want to go further by, for example, making full-text searches on any field of our document. Indeed, how do we do this with a relational database?
There is indexing, that's true, but would we index all table columns? In practice, that's not feasible; but it is something you can easily do with NoSQL technologies such as ElasticSearch. Before we dive into such a NoSQL caching system, we need to go through how to use a Couchbase document-oriented database, and then review the limitations that will drive us to switch to ElasticSearch. We will see that our scalable architecture first relies on Couchbase but, because of some important Couchbase limitations, we'll complement the architecture with ElasticSearch before making a definitive shift to it.

Introducing Couchbase

Couchbase is an open source, document-oriented database with a flexible data model; it is performant, scalable, and suitable for applications that, like the one in our use case, need to turn relational data into structured JSON documents. Most NoSQL technologies have similar architectures, so we'll first see how the Couchbase architecture is organized and introduce Couchbase naming conventions, then go deeper into how data is stored and queried in Couchbase, and finally talk about cross-datacenter replication.

Architecture

Couchbase is based on a true shared-nothing architecture, which means there is no single point of contention: every node in the cluster is self-sufficient and independent. That's how distributed technologies work; nodes don't share memory or disk storage. In Couchbase, documents are stored as JSON or binary, are replicated over the cluster, and are organized into units called buckets. A bucket can be scaled, depending on storage and access needs, by setting the RAM available for caching and the number of replicas kept for resiliency. Under the hood, a bucket is split into smaller units called vBuckets, which are actually data partitions.
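The key-to-partition mapping can be sketched as a hash of the document key modulo the number of partitions. Couchbase defaults to 1,024 vBuckets and uses a CRC32-based hash; the Python below is a simplification of the real algorithm, and the round-robin cluster map is invented for the illustration:

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase default

def vbucket_for_key(key: str) -> int:
    """Map a document key to a vBucket (simplified sketch)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

# The cluster map then tells the smart client which server owns each
# vBucket; here a toy map spreads vBuckets round-robin over three nodes.
cluster_map = {vb: f"node-{vb % 3}" for vb in range(NUM_VBUCKETS)}

vb = vbucket_for_key("account::42")
print(vb, cluster_map[vb])
```

Because the hash is deterministic, any client holding the cluster map can compute, without asking the cluster, which node to contact for a given key.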
Couchbase uses a cluster map to map each partition to the server it belongs to. A Couchbase server replicates a bucket up to three times within a cluster, and each Couchbase server manages a subset of the active and replica vBuckets. That's how resiliency works in Couchbase: every time a document is indexed it is replicated, and if a node within the cluster goes down, the cluster promotes a replica partition to active to ensure continuous service. Only one copy of the data is active, with zero or more replicas in the cluster, as Figure 2-3 illustrates.

Figure 2-3. Couchbase active document and replicas

From a client point of view, if one of the provided smart clients is used (Java, C, C++, Ruby, etc.), the client is kept up to date with the cluster map; that's how clients can route application requests to the appropriate server, the one that holds the document. In terms of interaction, there is an important point to remember: operations on documents are asynchronous by default. This means that when you update a document, for example, Couchbase does not immediately update it on disk. It actually goes through the process shown in Figure 2-4.

Figure 2-4. Couchbase data flow

As Figure 2-4 shows, the smart client connects to a Couchbase server instance and first asynchronously writes the document into the managed cache. The client gets a response immediately and is not blocked until the end of the data flow, although this behavior can be changed at the client level to make the client wait for the write to be finished. The document is then put into the intra-cluster replication queue, so it is replicated across the cluster; after that, it is put into the disk-storage write queue to be persisted on the node's disk.
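That write path can be mimicked with a toy simulation in which plain Python structures stand in for the managed cache, the replication queue, and the disk write queue. All names here are invented for the illustration; the point is only the ordering of the steps:

```python
from collections import deque

cache = {}                   # managed cache: the write lands here first
replication_queue = deque()  # intra-cluster replication
disk_write_queue = deque()   # persistence
replicas = {}
disk = {}

def client_write(key, doc):
    """The client is acknowledged as soon as the doc is cached (async default)."""
    cache[key] = doc
    replication_queue.append((key, doc))
    disk_write_queue.append((key, doc))
    return "ok"  # acknowledged before replication and persistence happen

def drain():
    """Background workers drain the queues later, off the client's path."""
    while replication_queue:
        key, doc = replication_queue.popleft()
        replicas[key] = doc
    while disk_write_queue:
        key, doc = disk_write_queue.popleft()
        disk[key] = doc

client_write("account::1", {"firstname": "Jane"})
assert "account::1" in cache and "account::1" not in disk  # ack'd, not yet persisted
drain()
assert disk["account::1"] == {"firstname": "Jane"}
```

The gap between the acknowledgment and the drain is exactly the window a client opts out of when it asks to wait for persistence.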
If multiple clusters are deployed, the Cross Data Center Replication (XDCR) feature can be used to propagate the changes to the other clusters, located in different data centers.

Couchbase has its own way of querying data: you can fetch a document by a simple document ID, but the real power of Couchbase lies in the view feature. In Couchbase, there is a second-level index called the design document, which is created within a bucket. A bucket can contain multiple types of documents; for example, in a simple e-commerce application, a bucket could contain the following:

• Account
• Product
• Cart
• Orders
• Bills

Couchbase separates them logically through design documents. A bucket can contain multiple design documents, and each design document can contain multiple views. A view is a function that indexes the documents contained in the bucket in a user-defined way. The function is, precisely, a user-defined map/reduce function that maps documents across the cluster and outputs key/value pairs, which are then stored in the index for later retrieval. Let's go back to our e-commerce website example and index all orders so we can fetch them by account identifier. The map/reduce function would be as follows:

function (doc, meta) {
  if (doc.order_account_id) {
    emit(doc.order_account_id, null);
  }
}

The if statement makes the function consider only documents that contain the order_account_id field and then indexes this identifier. Any client can therefore query the data in Couchbase based on this identifier.

Cluster Manager and Administration Console

Cluster management is handled by a specific node within the cluster, the orchestrator node. If one of the nodes fails at any time, the orchestrator handles the failover by notifying all other nodes within the cluster and locating the replica partitions of the failing node to promote them to active status. Figure 2-5 describes the failover process.

Figure 2-5.
Couchbase failover

If the orchestrator node fails, all nodes detect it through the heartbeat watchdog, a cluster component that runs on every cluster node. Once the failure is detected, a new orchestrator is elected from among the remaining nodes.

All cluster-related features are exposed through APIs that can be used to manage Couchbase, but the good news is that an administration console is shipped out of the box. The Couchbase console is a secure console that lets you manage and monitor your cluster; the available actions include setting up your server, creating buckets, browsing and updating documents, implementing new views, and monitoring vBuckets and the disk write queue. Figure 2-6 shows the Couchbase console home page, with an overview of the RAM used by existing buckets, the disk used by data, and the buckets' activity.

Figure 2-6. Couchbase console home

You can perform cluster management in the Server Nodes tab, which lets you configure failover and replication to prevent data loss. Figure 2-7 shows a single-node installation that, as the warning mentions, is not safe for failover.

Figure 2-7. Couchbase server nodes

At any time, you can add a new Couchbase server by clicking the Add Server button; when you do, data starts replicating across nodes to enable failover. By clicking the server IP, you can access fine-grained monitoring data on each aspect of the bucket, as shown in Figure 2-8.

Figure 2-8. Couchbase bucket monitoring

This figure shows a data bucket called DevUser that contains the user-related JSON documents. As explained earlier, indexing a new document is part of a complex data flow under the hood. The metrics shown in the monitoring console are essential when you are dealing with a large amount of data that generates a high indexing throughput.
For example, the disk queue statistics can reveal bottlenecks when data is being written to disk. In Figure 2-9, we can see that the drain rate, the number of items written to disk from the disk write queue, is alternately flat on the active side while the replicas are written, and that the average age of the active items grows during that flat period. An alarming behavior would have been to see the average age of the active items keep growing, which would mean that the writing process was too slow compared to the number of active items pushed into the disk write queue.

Figure 2-9. Couchbase bucket disk queue

Managing Documents

You can manage all documents from the administration console through the bucket view, which lets you browse buckets, design documents, and views. Documents are stored in a bucket in Couchbase, and they can be accessed in the Data Buckets tab of the administration console, as shown in Figure 2-10.

Figure 2-10. Couchbase console bucket view

As in the server view, the console gives statistics on the bucket, such as RAM and storage size, as well as the number of operations per second. But the real benefit of this view is that you can browse documents and retrieve them by ID, as shown in Figure 2-11.

Figure 2-11. Couchbase document by ID

It's also in this view that you create design documents and views to index documents for later retrieval, as shown in Figure 2-12.

Figure 2-12. Couchbase console view implementation

In Figure 2-12, I have implemented a view that retrieves documents based on the company name. The administration console is a handy way to manage documents; in real life, you can start implementing your design document in the administration console and then create a backup to industrialize its deployment.
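Conceptually, building a view index amounts to running a map function over every document in the bucket and collecting the emitted key/value pairs. Couchbase itself runs JavaScript map functions distributed across the cluster; the Python below is only a toy re-creation of the idea, with invented document IDs:

```python
# Toy re-creation of a Couchbase-style view index.
bucket = {
    "order::1": {"order_account_id": "account::1", "total": 40},
    "order::2": {"order_account_id": "account::2", "total": 15},
    "order::3": {"order_account_id": "account::1", "total": 99},
    "product::1": {"name": "laptop"},  # no order_account_id: skipped by the map
}

def map_orders_by_account(doc, meta, emit):
    # Mirrors the JavaScript map function: emit only for order documents.
    if "order_account_id" in doc:
        emit(doc["order_account_id"], None)

def build_view(bucket, map_fn):
    index = {}
    for doc_id, doc in bucket.items():
        def emit(key, value, doc_id=doc_id):
            index.setdefault(key, []).append((doc_id, value))
        map_fn(doc, {"id": doc_id}, emit)
    return index

view = build_view(bucket, map_orders_by_account)
print(sorted(view["account::1"]))  # [('order::1', None), ('order::3', None)]
```

Querying the view by key then becomes a simple index lookup, which is why a client can fetch all orders for an account without scanning the bucket.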
All design documents are stored in a JSON file with a simple structure that describes all the views, similar to what Listing 2-1 shows (the emitted fields here follow the view names and may differ in your own design document).

Listing 2-1. Design Document JSON Example

...
"doc": {
  "json": {
    "views": {
      "by_Id":      { "map": "function (doc, meta) {\n emit(meta.id, doc);\n}" },
      "by_email":   { "map": "function (doc, meta) {\n emit(doc.email, doc);\n}" },
      "by_name":    { "map": "function (doc, meta) {\n emit(doc.firstname, null);\n}" },
      "by_company": { "map": "function (doc, meta) {\n emit(doc.company, null);\n}" },
      "by_form":    { "map": "function (doc, meta) {\n emit(doc.form, null);\n}" }
    }
  }
}
...

As you have seen, you can manage documents through the administration console, but keep in mind that in an industrialized architecture, most of the work is done through scripts that use the Couchbase API.

Introducing ElasticSearch

You have seen an example of a NoSQL database with Couchbase; ElasticSearch is also a NoSQL technology, but it's totally different from Couchbase. It's a distributed datastore provided by the company named Elastic (at the time I'm writing this book, ElasticSearch is at version 2.1).

Architecture

ElasticSearch is a NoSQL technology that allows you to store, search, and analyze data. It's an indexing/search engine built on top of Apache Lucene, an open source full-text search engine written in Java. From the start, ElasticSearch was made to be distributed and to scale out, which means that in addition to scaling ElasticSearch vertically by adding more resources to a node, you can simply scale it horizontally by adding more nodes on the fly, increasing the high availability of your cluster as well as its resiliency. In the case of a node failure, because data is replicated over the cluster, data is served by another node.

ElasticSearch is a schemaless engine; data is stored as JSON and is partitioned into what are called shards. A shard is actually a Lucene index and is the smallest unit of scale in ElasticSearch.
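The number of shards matters because a document's home shard is derived from a routing formula of the form shard = hash(routing_value) % number_of_primary_shards. ElasticSearch's actual hash function differs from the CRC32 stand-in below, but the consequence is the same and can be sketched in Python:

```python
import zlib

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Simplified stand-in for ElasticSearch's routing hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(1000)]

# With 2 primary shards, every document has a fixed, computable home...
with_two = {d: shard_for(d, 2) for d in doc_ids}

# ...and changing the shard count moves most documents to a new home,
# which is why shards cannot simply be added on the fly without re-indexing.
with_three = {d: shard_for(d, 3) for d in doc_ids}
moved = sum(1 for d in doc_ids if with_two[d] != with_three[d])
print(f"{moved} of {len(doc_ids)} documents would change shard")
```

This is the arithmetic behind the sizing advice that follows: the primary shard count is baked into every document's location.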
Shards are organized into indexes, with which an application can perform read and write interactions. In the end, an index is just a logical namespace in ElasticSearch that regroups a collection of shards, and when a request comes in, ElasticSearch routes it to the appropriate shard, as Figure 2-13 summarizes.

Figure 2-13. ElasticSearch index and shards

Two types of shards live in ElasticSearch: primary shards and replica shards. When you start an ElasticSearch node, you can begin with only one primary shard, which might be enough; but what if the read/index request throughput increases over time? If that happens, the one primary shard might no longer be enough, and you will need another shard. You can't add shards on the fly and expect ElasticSearch to scale; it would have to re-index all the data into a bigger index with the two primary shards. So, as you can see, from the beginning of a project based on ElasticSearch, it's important to have a decent estimate of how many primary shards you need in the cluster. Adding more shards to a single node may not increase the capacity of your cluster, because you are limited by the node's hardware capacity. To increase cluster capacity, you have to add more nodes that also hold primary shards, as shown in Figure 2-14.

Figure 2-14. ElasticSearch primary shard allocation

The good thing is that ElasticSearch automatically copies the shard over the network onto the new node for you, as described in Figure 2-14. But what if you want to be sure that you won't lose data? That's where replica shards come into play. Replica shards exist primarily for failover: when a primary shard dies, a replica is promoted to primary to ensure continuity in the cluster. Replica shards carry the same load as primary shards at index time; once a document is indexed in the primary shard, it is also indexed in the replica shards.
That's why adding more replicas to the cluster won't increase index performance; but if we add extra hardware, it can dramatically increase search performance. In a three-node cluster with two primary shards and two replica shards, we would have the repartition shown in Figure 2-15.

Figure 2-15. ElasticSearch primary and replica shards

In addition to increasing performance, this layout also helps with balancing requests and improves overall cluster throughput. The last thing to discuss in terms of pure ElasticSearch architecture is nodes. Indices are spread across ElasticSearch nodes, and there are actually three types of nodes, as shown in Figure 2-16.

Figure 2-16. ElasticSearch cluster topology

Here are descriptions of the three types of nodes:

• Master nodes: These nodes are lightweight and responsible for cluster management. They don't hold any data and don't serve index or search requests; they are dedicated to ensuring cluster stability and have a low workload. It's recommended that you have three dedicated master nodes so that if one fails, redundancy is guaranteed.
• Data nodes: These nodes hold the data and serve index and search requests.
• Client nodes: These ensure load balancing in some processing steps and can take on part of the workload that a data node would otherwise perform, such as in a search request, where scattering the request over the nodes and gathering the output for the response can be really heavy at runtime.

Now that you understand the architecture of ElasticSearch, let's play with the search API and run some queries.

Monitoring ElasticSearch

Elastic provides a plug-in called Marvel for ElasticSearch that aims to monitor an ElasticSearch cluster. This plug-in is part of Elastic's commercial offering, but you can use it for free in development mode.
You can download Marvel from Elastic's website; the installation process is quite simple. Marvel relies on Kibana, Elastic's visualization console, and comes with a set of visualizations that let an operator see precisely what is happening in the cluster. Figure 2-17 shows the overview dashboard that you see when launching Marvel.

Figure 2-17. ElasticSearch Marvel console

Marvel provides information about nodes, indices, and shards; about the CPU used; about the memory used by the JVM; about the indexing rate; and about the search rate. Marvel even goes down to the Lucene level by providing information about flushes and merges. You can, for example, have a live view of shard allocation on the cluster, as shown in Figure 2-18.

Figure 2-18. Marvel's shard allocation view

To give you an idea of the amount of information Marvel can provide about your cluster, Figure 2-19 shows a subset of what you get in the Node Statistics dashboard.

Figure 2-19. Marvel Node Statistics dashboard

As you can see, the dashboard is organized into several rows; in fact, there are more than 20 rows that you just can't see in this screenshot. Each row contains one or more visualizations about the row's topic. In Figure 2-19, you can see that no GET requests are sent to indices; that's why the line chart is flat at 0. In development mode, these statistics help you size your cluster by, for example, starting with a single-node cluster and observing its behavior under your specific workload. Then, in production, you can follow the life of your cluster without losing any information about it.

Search with ElasticSearch

Marvel comes with a feature called Sense, a query editor/helper for ElasticSearch.
The real power of Sense is its ability to autocomplete your query, which helps a lot when you are not familiar with all the ElasticSearch APIs, as shown in Figure 2-20.

Figure 2-20. Marvel Sense completion feature

You can also export your query to cURL, for example so you can use it in a script, as shown in Figure 2-21.

Figure 2-21. Marvel Sense copy-as-cURL feature

In this case, the query would give a cURL command such as the one shown in Listing 2-3.

Listing 2-3. Example of cURL Command Generated by Sense

curl -XGET "" -d'
{
  "query": {
    "match_all": {}
  }
}'
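The exported command is just an HTTP request with a JSON body, so any client can issue it. As a sketch, the Python below only builds and serializes the match_all body; the endpoint is left out, matching the listing, and a real call would target your cluster's _search endpoint:

```python
import json

# Body of the match_all query that Sense exported as cURL above.
query = {
    "query": {
        "match_all": {}
    }
}

body = json.dumps(query)
print(body)  # {"query": {"match_all": {}}}
```

Building query bodies as plain dictionaries like this is a common pattern when scripting searches instead of typing them into Sense.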
