How Big data can be used

what is big data and how does it work, what is big data architecture,what is big data and why is it important, how big data creates false confidence pdf
LexiWills Profile Pic
LexiWills,United Kingdom,Professional
Published Date:31-07-2017
Your Website URL(Optional)
CHAPTER 1 Introduction Before we start looking into all the moving parts of HBase, let us pause to think about why there was a need to come up with yet another storage architecture. Relational database management systems (RDBMSes) have been around since the early 1970s, and have helped countless companies and organizations to implement their solution to given problems. And they are equally helpful today. There are many use cases for which the relational model makes perfect sense. Yet there also seem to be specific problems that do not fit this model very well. The Dawn of Big Data We live in an era in which we are all connected over the Internet and expect to find results instantaneously, whether the question concerns the best turkey recipe or what to buy mom for her birthday. We also expect the results to be useful and tailored to our needs. Because of this, companies have become focused on delivering more targeted infor- mation, such as recommendations or online ads, and their ability to do so directly † influences their success as a business. Systems like Hadoop now enable them to gather and process petabytes of data, and the need to collect even more data continues to increase with, for example, the development of new machine learning algorithms. Where previously companies had the liberty to ignore certain data sources because there was no cost-effective way to store all that information, they now are likely to lose out to the competition. There is an increasing need to store and analyze every data point they generate. The results then feed directly back into their e-commerce platforms and may generate even more data. See, for example, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone” ( ugur/fits_all.pdf) by Michael Stonebraker and Uğur Çetintemel. † Information can be found on the project’s website. Please also see the excellent Hadoop: The Definitive Guide (Second Edition) by Tom White (O’Reilly) for everything you want to know about Hadoop. 1In the past, the only option to retain all the collected data was to prune it to, for example, retain the last N days. While this is a viable approach in the short term, it lacks the opportunities that having all the data, which may have been collected for months or years, offers: you can build mathematical models that span the entire time range, or amend an algorithm to perform better and rerun it with all the previous data. ‡ Dr. Ralph Kimball, for example, states that Data assets are a major component of the balance sheet, replacing traditional physical assets of the 20th century and that there is a Widespread recognition of the value of data even beyond traditional enterprise boundaries Google and Amazon are prominent examples of companies that realized the value of data and started developing solutions to fit their needs. For instance, in a series of technical publications, Google described a scalable storage and processing system based on commodity hardware. These ideas were then implemented outside of Google as part of the open source Hadoop project: HDFS and MapReduce. Hadoop excels at storing data of arbitrary, semi-, or even unstructured formats, since it lets you decide how to interpret the data at analysis time, allowing you to change the way you classify the data at any time: once you have updated the algorithms, you simply run the analysis again. Hadoop also complements existing database systems of almost any kind. It offers a limitless pool into which one can sink data and still pull out what is needed when the time is right. It is optimized for large file storage and batch-oriented, streaming access. This makes analysis easy and fast, but users also need access to the final data, not in batch mode but using random access—this is akin to a full table scan versus using indexes in a database system. We are used to querying databases when it comes to random access for structured data. RDBMSes are the most prominent, but there are also quite a few specialized variations and implementations, like object-oriented databases. Most RDBMSes strive to imple- § ment Codd’s 12 rules, which forces them to comply to very rigid requirements. The architecture used underneath is well researched and has not changed significantly in quite some time. The recent advent of different approaches, like column-oriented or massively parallel processing (MPP) databases, has shown that we can rethink the tech- ‡ The quotes are from a presentation titled “Rethinking EDW in the Era of Expansive Information Management” by Dr. Ralph Kimball, of the Kimball Group, available at campaigns/rethink_edw_kimball.pdf. It discusses the changing needs of an evolving enterprise data warehouse market. § Edgar F. Codd defined 13 rules (numbered from 0 to 12), which define what is required from a database management system (DBMS) to be considered relational. While HBase does fulfill the more generic rules, it fails on others, most importantly, on rule 5: the comprehensive data sublanguage rule, defining the support for at least one relational language. See Codd’s 12 rules on Wikipedia. 2 Chapter 1: Introductionnology to fit specific workloads, but most solutions still implement all or the majority of Codd’s 12 rules in an attempt to not break with tradition. Column-Oriented Databases Column-oriented databases save their data grouped by columns. Subsequent column values are stored contiguously on disk. This differs from the usual row-oriented approach of traditional databases, which store entire rows contiguously—see Figure 1-1 for a visualization of the different physical layouts. The reason to store values on a per-column basis instead is based on the assumption that, for specific queries, not all of the values are needed. This is often the case in analytical databases in particular, and therefore they are good candidates for this dif- ferent storage schema. Reduced I/O is one of the primary reasons for this new layout, but it offers additional advantages playing into the same category: since the values of one column are often very similar in nature or even vary only slightly between logical rows, they are often much better suited for compression than the heterogeneous values of a row-oriented record structure; most compression algorithms only look at a finite window. Specialized algorithms—for example, delta and/or prefix compression—selected based on the type of the column (i.e., on the data stored) can yield huge improvements in compression ratios. Better ratios result in more efficient bandwidth usage. Note, though, that HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format. This is also where the majority of similarities end, because although HBase stores data on disk in a column-oriented format, it is distinctly different from traditional columnar databases: whereas columnar databases excel at providing real-time analytical access to data, HBase excels at pro- viding key-based access to a specific cell of data, or a sequential range of cells. The speed at which data is created today is already greatly increased, compared to only just a few years back. We can take for granted that this is only going to increase further, and with the rapid pace of globalization the problem is only exacerbated. Websites like Google, Amazon, eBay, and Facebook now reach the majority of people on this planet. The term planet-size web application comes to mind, and in this case it is fitting. Facebook, for example, is adding more than 15 TB of data into its Hadoop cluster every ‖ day and is subsequently processing it all. One source of this data is click-stream log- ging, saving every step a user performs on its website, or on sites that use the social plug-ins offered by Facebook. This is an ideal case in which batch processing to build machine learning models for predictions and recommendations is appropriate. Facebook also has a real-time component, which is its messaging system, including chat, wall posts, and email. This amounts to 135+ billion messages per month, and ‖ See this note published by Facebook. The Dawn of Big Data 3Figure 1-1. Column-oriented and row-oriented storage layouts storing this data over a certain number of months creates a huge tail that needs to be handled efficiently. Even though larger parts of emails—for example, attachments— are stored in a secondary system, the amount of data generated by all these messages is mind-boggling. If we were to take 140 bytes per message, as used by Twitter, it would See this blog post, as well as this one, by the Facebook engineering team. Wall messages count for 15 billion and chat for 120 billion, totaling 135 billion messages a month. Then they also add SMS and others to create an even larger number. Facebook uses Haystack, which provides an optimized storage infrastructure for large binary objects, such as photos. 4 Chapter 1: Introductiontotal more than 17 TB every month. Even before the transition to HBase, the existing † system had to handle more than 25 TB a month. In addition, less web-oriented companies from across all major industries are collecting an ever-increasing amount of data. For example: Financial Such as data generated by stock tickers Bioinformatics Such as the Global Biodiversity Information Facility ( Smart grid Such as the OpenPDC ( project Sales Such as the data generated by point-of-sale (POS) or stock/inventory systems Genomics Such as the Crossbow ( project Cellular services, military, environmental Which all collect a tremendous amount of data as well Storing petabytes of data efficiently so that updates and retrieval are still performed well is no easy feat. We will now look deeper into some of the challenges. The Problem with Relational Database Systems RDBMSes have typically played (and, for the foreseeable future at least, will play) an integral role when designing and implementing business applications. As soon as you have to retain information about your users, products, sessions, orders, and so on, you are typically going to use some storage backend providing a persistence layer for the frontend application server. This works well for a limited number of records, but with the dramatic increase of data being retained, some of the architectural implementation details of common database systems show signs of weakness. Let us use Hush, the HBase URL Shortener mentioned earlier, as an example. Assume that you are building this system so that it initially handles a few thousand users, and that your task is to do so with a reasonable budget—in other words, use free software. ‡ The typical scenario here is to use the open source LAMP stack to quickly build out a prototype for the business idea. The relational database model normalizes the data into a user table, which is accom- panied by a url, shorturl, and click table that link to the former by means of a foreign † See this presentation, given by Facebook employee and HBase committer, Nicolas Spiegelberg. ‡ Short for Linux, Apache, MySQL, and PHP (or Perl and Python). The Problem with Relational Database Systems 5key. The tables also have indexes so that you can look up URLs by their short ID, or the users by their username. If you need to find all the shortened URLs for a particular list of customers, you could run an SQL JOIN over both tables to get a comprehensive list of URLs for each customer that contains not just the shortened URL but also the customer details you need. In addition, you are making use of built-in features of the database: for example, stored procedures, which allow you to consistently update data from multiple clients while the database system guarantees that there is always coherent data stored in the various tables. Transactions make it possible to update multiple tables in an atomic fashion so that either all modifications are visible or none are visible. The RDBMS gives you the so- § called ACID properties, which means your data is strongly consistent (we will address this in greater detail in “Consistency Models” on page 9). Referential integrity takes care of enforcing relationships between various table schemas, and you get a domain- specific language, namely SQL, that lets you form complex queries over everything. Finally, you do not have to deal with how data is actually stored, but only with higher- level concepts such as table schemas, which define a fixed layout your application code can reference. This usually works very well and will serve its purpose for quite some time. If you are lucky, you may be the next hot topic on the Internet, with more and more users joining your site every day. As your user numbers grow, you start to experience an increasing amount of pressure on your shared database server. Adding more application servers is relatively easy, as they share their state only with the central database. Your CPU and I/O load goes up and you start to wonder how long you can sustain this growth rate. The first step to ease the pressure is to add slave database servers that are used to being read from in parallel. You still have a single master, but that is now only taking writes, and those are much fewer compared to the many reads your website users generate. But what if that starts to fail as well, or slows down as your user count steadily increases? ‖ A common next step is to add a cache—for example, Memcached. Now you can off- load the reads to a very fast, in-memory system—however, you are losing consistency guarantees, as you will have to invalidate the cache on modifications of the original value in the database, and you have to do this fast enough to keep the time where the cache and the database views are inconsistent to a minimum. While this may help you with the amount of reads, you have not yet addressed the writes. Once the master database server is hit too hard with writes, you may replace it with a beefed-up server—scaling up vertically—which simply has more cores, more memory, and faster disks... and costs a lot more money than the initial one. Also note § Short for Atomicity, Consistency, Isolation, and Durability. See “ACID” on Wikipedia. ‖ Memcached is an in-memory, nonpersistent, nondistributed key/value store. See the Memcached project home page. 6 Chapter 1: Introductionthat if you already opted for the master/slave setup mentioned earlier, you need to make the slaves as powerful as the master or the imbalance may mean the slaves fail to keep up with the master’s update rate. This is going to double or triple the cost, if not more. With more site popularity, you are asked to add more features to your application, which translates into more queries to your database. The SQL JOINs you were happy to run in the past are suddenly slowing down and are simply not performing well enough at scale. You will have to denormalize your schemas. If things get even worse, you will also have to cease your use of stored procedures, as they are also simply be- coming too slow to complete. Essentially, you reduce the database to just storing your data in a way that is optimized for your access patterns. Your load continues to increase as more and more users join your site, so another logical step is to prematerialize the most costly queries from time to time so that you can serve the data to your customers faster. Finally, you start dropping secondary indexes as their maintenance becomes too much of a burden and slows down the database too much. You end up with queries that can only use the primary key and nothing else. Where do you go from here? What if your load is expected to increase by another order of magnitude or more over the next few months? You could start sharding (see the sidebar titled “Sharding”) your data across many databases, but this turns into an op- erational nightmare, is very costly, and still does not give you a truly fitting solution. You essentially make do with the RDBMS for lack of an alternative. Sharding The term sharding describes the logical separation of records into horizontal partitions. The idea is to spread data across multiple storage files—or servers—as opposed to having each stored contiguously. The separation of values into those partitions is performed on fixed boundaries: you have to set fixed rules ahead of time to route values to their appropriate store. With it comes the inherent difficulty of having to reshard the data when one of the horizontal partitions exceeds its capacity. Resharding is a very costly operation, since the storage layout has to be rewritten. This entails defining new boundaries and then horizontally splitting the rows across them. Massive copy operations can take a huge toll on I/O performance as well as temporarily elevated storage requirements. And you may still take on updates from the client ap- plications and need to negotiate updates during the resharding process. This can be mitigated by using virtual shards, which define a much larger key parti- tioning range, with each server assigned an equal number of these shards. When you add more servers, you can reassign shards to the new server. This still requires that the data be moved over to the added server. Sharding is often a simple afterthought or is completely left to the operator. Without proper support from the database system, this can wreak havoc on production systems. The Problem with Relational Database Systems 7Let us stop here, though, and, to be fair, mention that a lot of companies are using RDBMSes successfully as part of their technology stack. For example, Facebook—and also Google—has a very large MySQL setup, and for its purposes it works sufficiently. This database farm suits the given business goal and may not be replaced anytime soon. The question here is if you were to start working on implementing a new product and knew that it needed to scale very fast, wouldn’t you want to have all the options avail- able instead of using something you know has certain constraints? Nonrelational Database Systems, Not-Only SQL or NoSQL? Over the past four or five years, the pace of innovation to fill that exact problem space has gone from slow to insanely fast. It seems that every week another framework or project is announced to fit a related need. We saw the advent of the so-called NoSQL solutions, a term coined by Eric Evans in response to a question from Johan Oskarsson, who was trying to find a name for an event in that very emerging, new data storage system space. The term quickly rose to fame as there was simply no other name for this new class of products. It was (and is) discussed heavily, as it was also deemed the nemesis of “SQL”—or was meant to bring the plague to anyone still considering using traditional RDBMSes... just kidding The actual idea of different data store architectures for specific problem sets is not new at all. Systems like Berkeley DB, Coherence, GT.M, and object-oriented database systems have been around for years, with some dating back to the early 1980s, and they fall into the NoSQL group by definition as well. The tagword is actually a good fit: it is true that most new storage systems do not provide SQL as a means to query data, but rather a different, often simpler, API-like interface to the data. On the other hand, tools are available that provide SQL dialects to NoSQL data stores, and they can be used to form the same complex queries you know from relational databases. So, limitations in querying no longer differentiate RDBMSes from their nonrelational kin. The difference is actually on a lower level, especially when it comes to schemas or ACID- like transactional features, but also regarding the actual storage architecture. A lot of these new kinds of systems do one thing first: throw out the limiting factors in truly scalable systems (a topic that is discussed in “Dimensions” on page 10). For example, they often have no support for transactions or secondary indexes. More See “NoSQL” on Wikipedia. 8 Chapter 1: Introductionimportantly, they often have no fixed schemas so that the storage can evolve with the application using it. Consistency Models It seems fitting to talk about consistency a bit more since it is mentioned often through- out this book. On the outset, consistency is about guaranteeing that a database always appears truthful to its clients. Every operation on the database must carry its state from one consistent state to the next. How this is achieved or implemented is not specified explicitly so that a system has multiple choices. In the end, it has to get to the next consistent state, or return to the previous consistent state, to fulfill its obligation. Consistency can be classified in, for example, decreasing order of its properties, or guarantees offered to clients. Here is an informal list: Strict The changes to the data are atomic and appear to take effect instantaneously. This is the highest form of consistency. Sequential Every client sees all changes in the same order they were applied. Causal All changes that are causally related are observed in the same order by all clients. Eventual When no updates occur for a period of time, eventually all updates will propagate through the system and all replicas will be consistent. Weak No guarantee is made that all updates will propagate and changes may appear out of order to various clients. The class of system adhering to eventual consistency can be even further divided into subtler sets, where those sets can also coexist. Werner Vogels, CTO of Amazon, lists them in his post titled “Eventually Consistent”. The article also picks up on the topic of the CAP theorem, which states that a distributed system can only achieve two out of the following three properties: consistency, availability, and partition tolerance. The CAP theorem is a highly discussed topic, and is certainly not the only way to classify, but it does point out that distributed systems are not easy to develop given certain requirements. Vogels, for example, mentions: An important observation is that in larger distributed scale systems, network par- titions are a given and as such consistency and availability cannot be achieved at the same time. This means that one has two choices on what to drop; relaxing consistency will allow the system to remain highly available ... and prioritizing consistency means that under certain conditions the system will not be available. See Eric Brewer’s original paper on this topic and the follow-up post by Coda Hale, as well as this PDF by Gilbert and Lynch. Nonrelational Database Systems, Not-Only SQL or NoSQL? 9Relaxing consistency, while at the same time gaining availability, is a powerful propo- sition. However, it can force handling inconsistencies into the application layer and may increase complexity. There are many overlapping features within the group of nonrelational databases, but some of these features also overlap with traditional storage solutions. So the new sys- tems are not really revolutionary, but rather, from an engineering perspective, are more evolutionary. Even projects like memcached are lumped into the NoSQL category, as if anything that is not an RDBMS is automatically NoSQL. This creates a kind of false dichotomy that obscures the exciting technical possibilities these systems have to offer. And there are many; within the NoSQL category, there are numerous dimensions you could use to classify where the strong points of a particular system lie. Dimensions Let us take a look at a handful of those dimensions here. Note that this is not a com- prehensive list, or the only way to classify them. Data model There are many variations in how the data is stored, which include key/value stores (compare to a HashMap), semistructured, column-oriented stores, and document- oriented stores. How is your application accessing the data? Can the schema evolve over time? Storage model In-memory or persistent? This is fairly easy to decide since we are comparing with RDBMSes, which usually persist their data to permanent storage, such as physical disks. But you may explicitly need a purely in-memory solution, and there are choices for that too. As far as persistent storage is concerned, does this affect your access pattern in any way? Consistency model Strictly or eventually consistent? The question is, how does the storage system achieve its goals: does it have to weaken the consistency guarantees? While this seems like a cursory question, it can make all the difference in certain use cases. It may especially affect latency, that is, how fast the system can respond to read and † write requests. This is often measured in harvest and yield. Physical model Distributed or single machine? What does the architecture look like—is it built from distributed machines or does it only run on single machines with the distri- bution handled client-side, that is, in your own code? Maybe the distribution is † See Brewer: “Lessons from giant-scale services.” Internet Computing, IEEE (2001) vol. 5 (4) pp. 46–55 (http: // 10 Chapter 1: Introductiononly an afterthought and could cause problems once you need to scale the system. And if it does offer scalability, does it imply specific steps to do so? The easiest solution would be to add one machine at a time, while sharded setups (especially those not supporting virtual shards) sometimes require for each shard to be in- creased simultaneously because each partition needs to be equally powerful. Read/write performance You have to understand what your application’s access patterns look like. Are you designing something that is written to a few times, but is read much more often? Or are you expecting an equal load between reads and writes? Or are you taking in a lot of writes and just a few reads? Does it support range scans or is it better suited doing random reads? Some of the available systems are advantageous for only one of these operations, while others may do well in all of them. Secondary indexes Secondary indexes allow you to sort and access tables based on different fields and sorting orders. The options here range from systems that have absolutely no sec- ondary indexes and no guaranteed sorting order (like a HashMap, i.e., you need to know the keys) to some that weakly support them, all the way to those that offer them out of the box. Can your application cope, or emulate, if this feature is missing? Failure handling It is a fact that machines crash, and you need to have a mitigation plan in place that addresses machine failures (also refer to the discussion of the CAP theorem in “Consistency Models” on page 9). How does each data store handle server failures? Is it able to continue operating? This is related to the “Consistency model” dimen- sion discussed earlier, as losing a machine may cause holes in your data store, or even worse, make it completely unavailable. And if you are replacing the server, how easy will it be to get back to being 100% operational? Another scenario is decommissioning a server in a clustered setup, which would most likely be handled the same way. Compression When you have to store terabytes of data, especially of the kind that consists of prose or human-readable text, it is advantageous to be able to compress the data to gain substantial savings in required raw storage. Some compression algorithms can achieve a 10:1 reduction in storage space needed. Is the compression method pluggable? What types are available? Load balancing Given that you have a high read or write rate, you may want to invest in a storage system that transparently balances itself while the load shifts over time. It may not be the full answer to your problems, but it may help you to ease into a high- throughput application design. Nonrelational Database Systems, Not-Only SQL or NoSQL? 11Atomic read-modify-write While RDBMSes offer you a lot of these operations directly (because you are talking to a central, single server), they can be more difficult to achieve in distributed systems. They allow you to prevent race conditions in multithreaded or shared- nothing application server design. Having these compare and swap (CAS) or check and set operations available can reduce client-side complexity. Locking, waits, and deadlocks It is a known fact that complex transactional processing, like two-phase commits, can increase the possibility of multiple clients waiting for a resource to become available. In a worst-case scenario, this can lead to deadlocks, which are hard to resolve. What kind of locking model does the system you are looking at support? Can it be free of waits, and therefore deadlocks? We will look back at these dimensions later on to see where HBase fits and where its strengths lie. For now, let us say that you need to carefully select the dimensions that are best suited to the issues at hand. Be prag- matic about the solution, and be aware that there is no hard and fast rule, in cases where an RDBMS is not working ideally, that a NoSQL system is the perfect match. Evaluate your options, choose wisely, and mix and match if needed. An interesting term to describe this issue is impedance match, which describes the need to find the ideal solution for a given problem. Instead of using a “one-size-fits-all” approach, you should know what else is available. Try to use the system that solves your problem best. Scalability While the performance of RDBMSes is well suited for transactional processing, it is less so for very large-scale analytical processing. This refers to very large queries that scan wide ranges of records or entire tables. Analytical databases may contain hundreds or thousands of terabytes, causing queries to exceed what can be done on a single server in a reasonable amount of time. Scaling that server vertically—that is, adding more cores or disks—is simply not good enough. What is even worse is that with RDBMSes, waits and deadlocks are increasing nonlinearly with the size of the transactions and concurrency—that is, the square of ‡ concurrency and the third or even fifth power of the transaction size. Sharding is often an impractical solution, as it has to be done within the application layer, and may involve complex and costly (re)partitioning procedures. Commercial RDBMSes are available that solve many of these issues, but they are often specialized and only cover certain aspects. Above all, they are very, very expensive. ‡ See “FT 101” by Jim Gray et al. 12 Chapter 1: IntroductionLooking at open source alternatives in the RDBMS space, you will likely have to give up many or all relational features, such as secondary indexes, to gain some level of performance. The question is, wouldn’t it be good to trade relational features permanently for per- formance? You could denormalize (see the next section) the data model and avoid waits and deadlocks by minimizing necessary locking. How about built-in horizontal scala- bility without the need to repartition as your data grows? Finally, throw in fault toler- ance and data availability, using the same mechanisms that allow scalability, and what you get is a NoSQL solution—more specifically, one that matches what HBase has to offer. Database (De-)Normalization At scale, it is often a requirement that we design schema differently, and a good term to describe this principle is Denormalization, Duplication, and Intelligent Keys § (DDI). It is about rethinking how data is stored in Bigtable-like storage systems, and how to make use of it in an appropriate way. Part of the principle is to denormalize schemas by, for example, duplicating data in more than one table so that, at read time, no further aggregation is required. Or the related prematerialization of required views, once again optimizing for fast reads with- out any further processing. There is much more on this topic in Chapter 9, where you will find many ideas on how to design solutions that make the best use of the features HBase provides. Let us look at an example to understand the basic principles of converting a classic relational database model to one that fits the columnar nature of HBase much better. Consider the HBase URL Shortener, Hush, which allows us to map long URLs to short URLs. The entity relationship diagram (ERD) can be seen in Figure 1-2. The full SQL ‖ schema can be found in Appendix E. The shortened URL, stored in the shorturl table, can then be given to others that subsequently click on it to open the linked full URL. Each click is tracked, recording the number of times it was used, and, for example, the country the click came from. This is stored in the click table, which aggregates the usage on a daily basis, similar to a counter. Users, stored in the user table, can sign up with Hush to create their own list of short- ened URLs, which can be edited to add a description. This links the user and short url tables with a foreign key relationship. § The term DDI was coined in the paper “Cloud Data Structure Diagramming Techniques and Design Patterns” by D. Salmen et al. (2009). ‖ Note, though, that this is provided purely for demonstration purposes, so the schema is deliberately kept simple. Nonrelational Database Systems, Not-Only SQL or NoSQL? 13Figure 1-2. The Hush schema expressed as an ERD The system also downloads the linked page in the background, and extracts, for in- stance, the TITLE tag from the HTML, if present. The entire page is saved for later processing with asynchronous batch jobs, for analysis purposes. This is represented by the url table. Every linked page is only stored once, but since many users may link to the same long URL, yet want to maintain their own details, such as the usage statistics, a separate entry in the shorturl is created. This links the url, shorturl, and click tables. This also allows you to aggregate statistics to the original short ID, refShortId, so that you can see the overall usage of any short URL to map to the same long URL. The shortId and refShortId are the hashed IDs assigned uniquely to each shortened URL. For example, in the ID is a23eg. Figure 1-3 shows how the same schema could be represented in HBase. Every shortened URL is stored in a separate table, shorturl, which also contains the usage statistics, storing various time ranges in separate column families, with distinct time-to-live settings. The columns form the actual counters, and their name is a combination of the date, plus an optional dimensional postfix—for example, the country code. The downloaded page, and the extracted details, are stored in the url table. This table uses compression to minimize the storage requirements, because the pages are mostly HTML, which is inherently verbose and contains a lot of text. The user-shorturl table acts as a lookup so that you can quickly find all short IDs for a given user. This is used on the user’s home page, once she has logged in. The user table stores the actual user details. We still have the same number of tables, but their meaning has changed: the clicks table has been absorbed by the shorturl table, while the statistics columns use the date as their key, formatted as YYYYMMDD—for instance, 20110502—so that they can be ac- 14 Chapter 1: IntroductionFigure 1-3. The Hush schema in HBase cessed sequentially. The additional user-shorturl table is replacing the foreign key relationship, making user-related lookups faster. There are various approaches to converting one-to-one, one-to-many, and many-to- many relationships to fit the underlying architecture of HBase. You could implement even this simple example in different ways. You need to understand the full potential of HBase storage design to make an educated decision regarding which approach to take. The support for sparse, wide tables and column-oriented design often eliminates the need to normalize data and, in the process, the costly JOIN operations needed to aggregate the data at query time. Use of intelligent keys gives you fine-grained control over how—and where—data is stored. Partial key lookups are possible, and when Nonrelational Database Systems, Not-Only SQL or NoSQL? 15combined with compound keys, they have the same properties as leading, left-edge indexes. Designing the schemas properly enables you to grow the data from 10 entries to 10 million entries, while still retaining the same write and read performance. Building Blocks This section provides you with an overview of the architecture behind HBase. After giving you some background information on its lineage, the section will introduce the general concepts of the data model and the available storage API, and presents a high- level overview on implementation. Backdrop In 2003, Google published a paper titled “The Google File System”. This scalable dis- tributed file system, abbreviated as GFS, uses a cluster of commodity hardware to store huge amounts of data. The filesystem handled data replication between nodes so that losing a storage server would have no effect on data availability. It was also optimized for streaming reads so that data could be read for processing later on. Shortly afterward, another paper by Google was published, titled “MapReduce: Sim- plified Data Processing on Large Clusters”. MapReduce was the missing piece to the GFS architecture, as it made use of the vast number of CPUs each commodity server in the GFS cluster provides. MapReduce plus GFS forms the backbone for processing massive amounts of data, including the entire search index Google owns. What is missing, though, is the ability to access data randomly and in close to real-time (meaning good enough to drive a web service, for example). Another drawback of the GFS design is that it is good with a few very, very large files, but not as good with millions of tiny files, because the data retained in memory by the master node is ulti- mately bound to the number of files. The more files, the higher the pressure on the memory of the master. So, Google was trying to find a solution that could drive interactive applications, such as Mail or Analytics, while making use of the same infrastructure and relying on GFS for replication and data availability. The data stored should be composed of much smaller entities, and the system would transparently take care of aggregating the small records into very large storage files and offer some sort of indexing that allows the user to retrieve data with a minimal number of disk seeks. Finally, it should be able to store the entire web crawl and work with MapReduce to build the entire search index in a timely manner. Being aware of the shortcomings of RDBMSes at scale (see “Seek Versus Trans- fer” on page 315 for a discussion of one fundamental issue), the engineers approached this problem differently: forfeit relational features and use a simple API that has basic create, read, update, and delete (or CRUD) operations, plus a scan function to iterate 16 Chapter 1: Introductionover larger key ranges or entire tables. The culmination of these efforts was published in 2006 in a paper titled “Bigtable: A Distributed Storage System for Structured Data”, two excerpts from which follow: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. …a sparse, distributed, persistent multi-dimensional sorted map. It is highly recommended that everyone interested in HBase read that paper. It describes a lot of reasoning behind the design of Bigtable and, ultimately, HBase. We will, how- ever, go through the basic concepts, since they apply directly to the rest of this book. HBase is implementing the Bigtable storage architecture very faithfully so that we can explain everything using HBase. Appendix F provides an overview of where the two systems differ. Tables, Rows, Columns, and Cells First, a quick summary: the most basic unit is a column. One or more columns form a row that is addressed uniquely by a row key. A number of rows, in turn, form a table, and there can be many of them. Each column may have multiple versions, with each distinct value contained in a separate cell. This sounds like a reasonable description for a typical database, but with the extra dimension of allowing multiple versions of each cells. But obviously there is a bit more to it. All rows are always sorted lexicographically by their row key. Example 1-1 shows how this will look when adding a few rows with different keys. Example 1-1. The sorting of rows done lexicographically by their key hbase(main):001:0 scan 'table1' ROW COLUMN+CELL row-1 column=cf1:, timestamp=1297073325971 ... row-10 column=cf1:, timestamp=1297073337383 ... row-11 column=cf1:, timestamp=1297073340493 ... row-2 column=cf1:, timestamp=1297073329851 ... row-22 column=cf1:, timestamp=1297073344482 ... row-3 column=cf1:, timestamp=1297073333504 ... row-abc column=cf1:, timestamp=1297073349875 ... 7 row(s) in 0.1100 seconds Note how the numbering is not in sequence as you may have expected it. You may have to pad keys to get a proper sorting order. In lexicographical sorting, each key is com- pared on a binary level, byte by byte, from left to right. Since row-1... is less than row-2..., no matter what follows, it is sorted first. Having the row keys always sorted can give you something like a primary key index known from RDBMSes. It is also always unique, that is, you can have each row key Building Blocks 17only once, or you are updating the same row. While the original Bigtable paper only considers a single index, HBase adds support for secondary indexes (see “Secondary Indexes” on page 370). The row keys can be any arbitrary array of bytes and are not necessarily human-readable. Rows are composed of columns, and those, in turn, are grouped into column families. This helps in building semantical or topical boundaries between the data, and also in applying certain features to them—for example, compression—or denoting them to stay in-memory. All columns in a column family are stored together in the same low- level storage file, called an HFile. Column families need to be defined when the table is created and should not be changed too often, nor should there be too many of them. There are a few known shortcomings in the current implementation that force the count to be limited to the low tens, but in practice it is often a much smaller number (see Chapter 9 for details). The name of the column family must be composed of printable characters, a notable difference from all other names or values. Columns are often referenced as family:qualifier with the qualifier being any arbitrary array of bytes. As opposed to the limit on column families, there is no such thing for the number of columns: you could have millions of columns in a particular column family. There is also no type nor length boundary on the column values. Figure 1-4 helps to visualize how different rows are in a normal database as opposed to the column-oriented design of HBase. You should think about rows and columns not being arranged like the classic spreadsheet model, but rather use a tag metaphor, that is, information is available under a specific tag. The "NULL?" in Figure 1-4 indicates that, for a database with a fixed schema, you have to store NULLs where there is no value, but for HBase’s storage architectures, you simply omit the whole column; in other words, NULLs are free of any cost: they do not occupy any storage space. All rows and columns are defined in the context of a table, adding a few more concepts across all included column families, which we will discuss shortly. Every column value, or cell, either is timestamped implicitly by the system or can be set explicitly by the user. This can be used, for example, to save multiple versions of a value as it changes over time. Different versions of a cell are stored in decreasing time- stamp order, allowing you to read the newest value first. This is an optimization aimed at read patterns that favor more current values over historical ones. The user can specify how many versions of a value should be kept. In addition, there is support for predicate deletions (see “Log-Structured Merge-Trees” on page 316 for You will see in “Column Families” on page 212 that the qualifier also may be left unset. 18 Chapter 1: IntroductionFigure 1-4. Rows and columns in HBase the concepts behind them) allowing you to keep, for example, only values written in the past week. The values (or cells) are also just uninterpreted arrays of bytes, that the client needs to know how to handle. If you recall from the quote earlier, the Bigtable model, as implemented by HBase, is a sparse, distributed, persistent, multidimensional map, which is indexed by row key, column key, and a timestamp. Putting this together, we can express the access to data like so: (Table, RowKey, Family, Column, Timestamp) → Value In a more programming language style, this may be expressed as: SortedMap RowKey, List SortedMap Column, List Value, Timestamp or all in one line: SortedMapRowKey, ListSortedMapColumn, ListValue, Timestamp Building Blocks 19

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.