How is Big Data Related to Business Intelligence

By OliviaCutts
Published: 01-08-2017
INTRODUCTION

The poet A.R. Ammons once wrote, “A word too much repeated falls out of being.” Well, kudos to the term Big Data, because it’s hanging in there, and it’s hard to imagine a term with more hype. Indeed, perhaps it is repeated too much. Big Data Beyond the Hype: A Guide to Conversations for Today’s Data Center is a collection of discussions that take an overused term and break it down into a confluence of technologies: some that have been around for a while, some that are relatively new, and others that are just coming down the pipe or are not yet a market reality.

The book is organized into three parts. Part I, “Opening Conversations About Big Data,” gives you a framework so that you can engage in Big Data conversations in social forums, at keynotes, in architectural reviews, during marketing mix planning, at the office watercooler, or even with your spouse (nothing like a Big Data discussion to inject romance into an evening). Although we talk a bit about what IBM does in this space, the aim of this part is to give you a grounding in cloud service delivery models, NoSQL, Big Data, cognitive computing, what a modern information architecture looks like, and more. This part gives you the constructs and foundations you need to engage in conversations that acknowledge the Big Data hype and then extend beyond it.

In Chapter 1, we briefly tackle, define, and illustrate the term Big Data. Although its use is ubiquitous, we think that many people have used it irresponsibly. For example, some people think Big Data just means Hadoop—and although Hadoop is indeed a critical repository and execution engine in the Big Data world, Hadoop is not solely Big Data. In fact, without analytics, Big Data is, well, just a bunch of data. Others think Big Data just means more data, and although that could be a characteristic, you certainly can engage in Big Data without lots of data.
Big Data certainly doesn’t replace the RDBMS either, and admittedly we do find it ironic that the biggest trend in the NoSQL world is SQL.

We also included in this chapter a discussion of cognitive computing—the next epoch of data analytics. IBM Watson represents a whole new class of industry-specific solutions called Cognitive Systems. It builds upon—but is not meant to replace—the current paradigm of programmatic systems, which will be with us for the foreseeable future. It is often the case that keeping pace with the demands of an increasingly complex business environment requires a paradigm shift in what we should expect from IT. The world needs an approach that recognizes today’s realities as opportunities rather than challenges, and it needs computers that help with probabilistic outcomes. Traditional IT relies on search to find the location of a key phrase. Emerging IT gathers information and combines it for true discovery. Traditional IT can handle only small sets of focused data, whereas today’s IT must live with Big Data. And traditional IT is really just starting to ubiquitously work with machine language, whereas what we as users really need is to be able to interact with machines the way we communicate—by using natural language. All of these considerations, plus the cancer-fighting, Jeopardy-winning, wealth-managing, and retail-assisting gourmet chef known as IBM Watson itself, are discussed in this chapter.

After providing a solid foundation for how to define and understand Big Data, along with an introduction to cognitive computing, we finish this chapter by presenting our Big Data and Analytics platform manifesto. This vendor-neutral discussion lays out the architectural foundation for an information management platform that delivers dividends today and into the future. IBM has built such a platform, and we cover that in Part II of this book.
If you’re taking a journey of discovery into Big Data, no matter what vendor (or vendors) you ultimately partner with, you will need what we outline in this part of Chapter 1.

Chapter 2 introduces you to something we call polyglot persistence. It’s an introduction to the NoSQL world and how that world is meant to complement the relational world. It’s about having access to a vast array of capabilities that are “right-fit, right-purpose” for you to deliver the kinds of solutions you need. There are well over 150 NoSQL databases, so we break them down into types and discuss the “styles” of NoSQL databases. We also discuss things like CAP and ACID, which should give you a general understanding of what differentiates the NoSQL and SQL worlds. We assume you have a pretty good knowledge of the relational world, so we won’t be focusing on relational databases here (although they are definitely part of a polyglot architecture). Entire books have been written about NoSQL, so consider this chapter a primer; that said, if you thought NoSQL was something you put on your resume if you don’t know SQL, then reading this chapter will give you a solid foundation for understanding some of the most powerful forces in today’s IT landscape.

In the cloud, you don’t build applications; you compose them. “Composing Cloud Applications: Why We Love the Bluemix and the IBM Cloud” is the title for our dive into the innovative cloud computing marketplace of composable services (Chapter 3). Three key phrases are introduced here: as a service, as a service, and as a service (that’s not a typo; our editors are too good to have missed that one). In this chapter, we introduce you to different cloud “as a service” models, which you can define by business value or use case; as your understanding of these new service models deepens, you will find that this distinction appears less rigid than you might first expect.
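As a rough sketch of how these models divide responsibility (our own illustration, not an IBM definition; the layer names and the exact split are simplified assumptions), each "as a service" step shifts more of the stack to the provider:

```python
# Illustrative only: which layers of the stack the provider manages
# under each cloud service model. The layer list is a simplification.
STACK = ["facilities", "hardware", "virtualization",
         "operating_system", "runtime", "application", "data"]

PROVIDER_MANAGES = {
    "iaas": {"facilities", "hardware", "virtualization"},
    "paas": {"facilities", "hardware", "virtualization",
             "operating_system", "runtime"},
    "saas": set(STACK),  # the provider runs everything; you consume it
}

def you_manage(model):
    """Layers left to the consumer under a given service model."""
    provided = PROVIDER_MANAGES[model.lower()]
    return [layer for layer in STACK if layer not in provided]

if __name__ == "__main__":
    for model in ("iaas", "paas", "saas"):
        print(model.upper(), "-> you manage:", you_manage(model))
```

Under IaaS you still install and patch the operating system; under PaaS you bring only your application and its data; under SaaS you simply consume the running service. This is the sense in which the models are better defined by business value or use case than by rigid technology boundaries.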
During this journey, we examine IBM SoftLayer’s flexible, bare-metal infrastructure as a service (IaaS), IBM Bluemix’s developer-friendly and enterprise-ready platform as a service (PaaS), and software as a service (SaaS). In our SaaS discussion, we talk about the IBM Cloud marketplace, where you can get started in a freemium way with loads of IBM and partner services to drive your business. We will also talk about some of the subcategories in the SaaS model, such as data warehouse as a service (DWaaS) and database as a service (DBaaS). The IBM dashDB fully managed analytics service is an example of DWaaS. By this point in the book, we will have briefly discussed the IBM NoSQL Cloudant database—an example of DBaaS. We will also tie together how services can be procured in a PaaS or SaaS environment, depending on what you are trying to get done. For example, IBM provides a set of services to help build trust into data; they are critical to building a data refinery. The collection of these services is referred to as IBM DataWorks and can be used in production or to build applications (just like the dashDB service, among others). Finally, we talk about the IBM Cloud and how IBM, non-IBM, and open source services are hosted to suit whatever needs may arise. After finishing this chapter, you will be seeing “as a service” everywhere you look.

Part I ends with a discussion about where any organization that is serious about its analytics is heading: toward a data zones model (Chapter 4). New technologies introduced by the open source community and enhancements developed by various technology vendors are driving a dramatic shift in how organizations manage their information and generate insights.
Organizations have been working hard to establish a single, trusted view of information across the enterprise but are realizing that traditional approaches lead to significant challenges related to agility, cost, and even the depth of the insights that are provided. A next generation of architectures is emerging, one that is enabling organizations to make information available much faster; reduce their overall costs for managing data, fail fast, and archive on day zero; and drive a whole new level of value from their information.

A modern architecture isn’t about the death of the enterprise data warehouse (EDW); it’s about a polyglot environment that delivers tangible economies of scale alongside powerful capabilities as a whole. This confluence results in deeper insights, and these insights are materialized more quickly. So, is the EDW dead? Not even close. However, there is a move away from the traditional idea of EDWs as the “center of the universe” to the concept of an environment in which data is put into different “zones” and more fit-for-purpose services are leveraged on the basis of specific data and analytic requirements.

Chapter 4 talks about the challenges most companies are trying to overcome and the new approaches that are rapidly evolving. Read about sandboxes, landing zones, data lakes (but beware of the data swamp), fail-fast methodologies, day zero archives, data refineries, and more. We also discuss the elephant in the room: Hadoop.

What is a data refinery? It’s a facility for transforming raw data into relevant and actionable information. Data refinement services take the uncertainty out of the data foundation for analysis and operations. Refined data is timely, clean, and well understood. A data refinery is needed to address a critical problem facing today’s journey-bound Big Data organizations: Data seems to be everywhere except where we need it, when we need it, and in the reliable form that we need.
That refinery also has to be available as a set of services such that information flows can be composed in the cloud with discrete pieces of refinery logic—and traditionally available on premise too. The data holds great potential, but it’s not going to deliver on this potential unless it is made available quickly, easily, and cleanly to both people and systems. A data refinery and the zone architecture we cover in this chapter go hand in hand when it comes to a next-generation information management architecture.

Part II, “IBM Watson Foundations,” covers the IBM Big Data and Analytics platform that taps into all relevant data, regardless of source or type, to provide fresh insights in real time and the confidence to act on them. IBM Watson Foundations, as its name implies, is the place where data is prepared for the journey to cognitive computing, in other words, IBM Watson. Clients often ask us how to get started with an IBM Watson project. We tell them to start with the “ground truth”—your predetermined view of good, rational, and trusted insights—because to get to the start line, you need a solid foundation. IBM Watson Foundations enables you to infuse analytics into every decision, every business process, and every system of engagement; indeed, this part of the book gives you details on how IBM can help you get Big Data beyond the hype.

As part of the IBM Big Data and Analytics portfolio, Watson Foundations supports all types of analytics (including discovery, reporting, and analysis) and predictive and cognitive capabilities, and that’s what we cover in Chapter 5. For example, Watson Foundations offers an enterprise-class, nonforked Apache Hadoop distribution that’s optionally “Blue suited” for even more value; there’s also a rich portfolio of workload-optimized systems, analytics on streaming data, text and content analytics, and more.
With governance, privacy, and security services, the platform is open, modular, trusted, and integrated so that you can start small and scale at your own pace. The following illustration shows an overview map of the IBM Watson Foundations capabilities and some of the things we cover in this part of the book.

[Figure: IBM Watson Foundations capability map: Decision Management, Content Analytics, Planning and Forecasting, Discovery and Exploration, Business Intelligence and Predictive Analytics, Data Management and Warehouse, Hadoop System, Stream Computing, Content Management, and Information Integration and Governance.]

Chapter 6 describes Hadoop and its related ecosystem of tools. Hadoop is quickly becoming an established component in today’s data centers and is causing significant disruption (in a good way) to traditional thinking about data processing and storage. In this chapter, you’ll see what the buzz is all about as we delve into Hadoop and some big changes to its underlying architecture. In addition to its capabilities, Hadoop is remarkable because it’s a great example of the power of community-driven open source development. IBM is deeply committed to open source Hadoop and is actively involved in the community. You’ll also learn about IBM’s Hadoop distribution, called IBM InfoSphere BigInsights for Hadoop (BigInsights), which has two main focus areas: enabling analysts to use their existing skills in Hadoop (SQL, statistics, and text analytics, for example) and making Hadoop able to support multitenant environments (by providing rich resource and workload management tools and an optional file system designed for scalability, multitenancy, and flexibility). If you are looking for 100 percent pure open source Hadoop, BigInsights has it (we call it “naked Hadoop”).
Like other vendors (such as Cloudera), IBM offers optional proprietary Hadoop extensions that we think deliver greater value, such as visualization, security, SQL programmability, and more. We like to refer to this as “Blue Suit Hadoop.” So if you want to leverage open source Apache Hadoop without any fancy add-ons but with full support that includes all the back testing and installation complexities handled for you, BigInsights does that—and it also lets you dress it up with enterprise-hardening capabilities.

Chapter 7 describes how you can analyze data before it has landed on disk—it’s the “analytics for data in motion” chapter (sometimes, it seems that analytics for data at rest gets all the attention). This is an area in which IBM has differentiated itself from any other vendor in the marketplace. In fact, few Big Data vendors even talk about data in motion. IBM is well traveled in this area, with a high-velocity Big Data engine called InfoSphere Streams. Think about it: From EDWs to RDBMSs to HDFS, the talk is always centered on harvesting analytics “at rest.” Conversations about Big Data typically end here, forgetting about how organizations can really benefit by moving their at-rest analytics (where they forecast) to the frontier of the business (where they can nowcast). We call this in-the-moment analytics. The irony is that we all operate “in the moment” every time we write an email, yet we don’t demand it of our analytics systems. For example, when you misspell a word, does it get flagged with a disapprovingly red squiggly underline as you type? That’s in the moment. How many of us configure our spelling analytics engine to engage only after the entire email has been written? Not many. In fact, it’s more likely the case that you have autocorrect turned on, and if a spelling error is discovered as you type, it’s automatically corrected.
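This "in the moment" pattern can be sketched with a toy stream processor. To be clear, this is our own illustration, not the InfoSphere Streams API; the tiny dictionary and corrections table are made-up examples. The point is simply that each word is judged the moment it arrives, instead of after the whole message has landed at rest:

```python
# Illustrative only: flag (or auto-correct) words as they stream in,
# rather than batch-checking the finished text afterward.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps"}
AUTOCORRECT = {"teh": "the", "qiuck": "quick"}

def process_stream(words):
    """Yield a (word, action) decision the moment each word arrives."""
    for word in words:
        if word in DICTIONARY:
            yield word, "ok"
        elif word in AUTOCORRECT:
            yield AUTOCORRECT[word], "corrected"   # fixed in motion
        else:
            yield word, "flagged"                  # the red squiggle

if __name__ == "__main__":
    for word, action in process_stream(["teh", "quick", "brwn", "fox"]):
        print(word, action)
```

Because `process_stream` is a generator, every decision is made before the next word even shows up; the analytics move to where the data is created instead of waiting for it to land.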
What’s more, as you spot words the analytics dictionary doesn’t recognize, you can opt to “learn” them (harvesting them at rest), and subsequent occurrences of the same misspelled words are corrected or highlighted (in motion).

Did you know that it takes between 300 and 400 milliseconds to blink the human eye? When you hear that analytics is occurring “in the blink of an eye,” it certainly sounds fast, doesn’t it? What you might not know is that BLU Acceleration analytics, in its sweet spot, is about 700 million times faster than the blink of an eye (yes, you read that right), and this is the focus of Chapter 8. You also might not know that BLU Acceleration is a technology and not a product—so you are going to find it in various on-premise products and off-premise services (we talk about that in this chapter too). BLU Acceleration is a confluence of various big ideas (some new and some that have been around for a while) that together represent what we believe to be the most powerful in-memory analytics system in today’s competitive marketplace. In this chapter, we introduce you to those ideas. BLU Acceleration technology first debuted in an on-premise fashion in the DB2 10.5 release. What we find even more exciting is its availability in both PaaS and SaaS provisioning models as a newly hosted or managed (depending on whether you leverage the service from the IBM Cloud Marketplace or Bluemix) analytics service called dashDB.

One of the things that makes BLU Acceleration so special is that it is optimized to work where the data resides, be it on a traditional spinning disk, a solid-state disk (SSD), in system memory (RAM), or even in the CPU cache (L1, L2, and so on). Let’s face it: In a Big Data world, not all of the data is going to fit into memory. Although some nameplate vendors have solutions that require this, the BLU Acceleration technology does not.
In fact, BLU Acceleration treats system memory as the “new disk.” It really doesn’t want to use that area to persist data unless it needs to, because RAM is too darn slow. (Now consider those vendors who talk about their “great” in-memory system RAM databases.) This is yet another differentiator for the BLU Acceleration technology: It’s built from the ground up to leverage storage mechanisms that are even faster than RAM, such as the CPU caches.

If you are using the BLU Acceleration technology on premise through DB2, you simply must take notice of the fact that it has also been extended to bring reporting directly on top of OLTP systems with shadow table support in the DB2 Cancun release. Clients testing this new capability have seen incredible performance improvements in their latency-driven extract, transform, and load (ETL) protocols, not to mention a jaw-dropping performance boost to their reports and analytics, without sacrificing the performance of their OLTP systems.

The IBM PureData System for Analytics (PDA), powered by Netezza technology, is a simple data appliance for serious analytics. It simplifies and optimizes the performance of data services for analytic applications, enabling complex algorithms to run in minutes, not hours. We don’t mind saying that the Netezza technology that powers this system pretty much started the appliance evolution that every major database vendor now embraces. And although there is a lot of imitation out there, we don’t believe that other offerings are in the PDA class when it comes to speed, simplicity, and ease of operation in a single form factor; deeply integrated analytics; and the flattening of the time-to-value curve. Chapter 9 describes the IBM PureData System for Analytics. In this chapter, you learn about patented data filtering by field-programmable gate arrays (FPGAs).
You will also learn about PDA’s rich built-in analytical infrastructure and extensive library of statistical and mathematical functions—this capability is referred to as IBM Netezza Analytics (INZA). INZA is an embedded, purpose-built, advanced analytics platform—delivered free with every system. It includes more than 200 scalable in-database analytic functions that execute analytics in parallel while removing the complexity of parallel programming from developers, users, and DBAs. In short, it lets you predict with more accuracy, deliver predictions faster, and respond rapidly to changes. In fact, we don’t know of any competing vendor who offers this type of technology in their warehousing appliances. The second best part (it is hard to beat free) is that some of these functions are part of dashDB, and more and more of them will make their way over to this managed analytics service. In fact, so will the Netezza SQL dialect, making dashDB the analytics service to burst Netezza workloads into the cloud.

Chapter 10 is the “build more, grow more, sleep more” chapter that covers IBM Cloudant. This chapter continues the exploration of JSON and NoSQL databases (covered in Chapter 2) with additional detail and a touch of technical depth about the IBM Cloudant DBaaS solution—another kind of SaaS offering. Cloudant’s earliest incarnation was as a data management layer for one of the largest data-generating projects on Earth: the Large Hadron Collider (LHC), operated by the European Organization for Nuclear Research (CERN). Among the many subject areas that particle physicists investigate with the LHC are the origin of the universe and how to replicate the conditions around the Big Bang, to name but a few. This data store technology matured into Cloudant, a fully managed NoSQL data-layer solution that eliminates complexity and risk for developers of fast-growing web and mobile applications.
We examine how the flexibility of a JSON document data model ensures that you can perform powerful indexing and query work (including built-in Apache Lucene and geospatial searches) against nearly any type of structured or unstructured data, on an architecture that is a managed and scaled database as a service for off-premise, on-premise, and occasionally connected mobile environments.

Part III, “Calming the Waters: Big Data Governance,” covers one of the most overlooked parts of any Big Data initiative. In fact, if you’re moving on a Big Data project and you aren’t well versed in what we talk about in this part of the book, you’re going to get beyond the hype all right, but where you end up isn’t going to be what you had in mind when you started. Trust us. In fact, it’s so overlooked that although it’s part of Watson Foundations from a product perspective, we decided to give this topic its own part. We did this out of a concern for what we are seeing in the marketplace as companies run to the pot of gold that supposedly lies at the end of the Big Data rainbow. Although, indeed, there is the potential for gold to be found (assuming you infuse your Big Data with analytics), when you run for the gold with a pair of scissors in your hand, it can end up being painful, with potentially devastating consequences. In this part, we cover the principles of information governance, why security must not be an afterthought, how to manage the Big Data lifecycle, and more.

Chapter 11 describes what really sets IBM apart in the Big Data marketplace: a deep focus on data governance. There is universal recognition in the relational database space that the principles of data governance (such as data quality, access control, lifecycle management, and data lineage) are critical success factors. IBM has made significant investments to ensure that the key principles of data governance can be applied to Big Data technologies.
This is not the chapter to miss—we can’t think of a more important chapter to read before beginning any Big Data conversation. Think of it this way: If you were to plan a trip somewhere, would you get there faster, safer, and more efficiently by using a GPS device or intuition? Governance is about using a GPS device for turn-by-turn directions to effective analytics.

Chapter 12 is the “security is not an afterthought” chapter, in which we discuss the importance of security in a Hadoop environment and the ever-changing world of Big Data. We share some of our personal field experiences on this topic, and we review the core security requirements for your Hadoop environment. If data is sensitive and needs to be protected in an RDBMS, why do some folks think it doesn’t have to be protected in HDFS (Hadoop’s file system)? As more and more organizations adopt Hadoop as a viable platform to augment their current data housing methods, they need to become more knowledgeable about the different data security and protection methods that are available, weighing the pros and cons of each. We finish this chapter with a somewhat detailed introduction to the IBM services that are available for this domain, including their abilities to govern, protect, and secure data in Hadoop environments as it moves through the data lifecycle—on or off premise. We cover the InfoSphere Guardium and InfoSphere Optim family of data lifecycle management services. There is so much talk about “data lakes” these days, but without an understanding of the material that we cover in this chapter, it’s more likely to be a data swamp.

Chapter 13 extends the conversation from the safeguarding of information to making the data trusted—and there’s a big difference.
As lines of business, data scientists, and others get access to more and more data, it’s essential that this data be surfaced to them as artifacts for interrogation, not by its location in the polyglot environment (for example, sentiment data in Hadoop and mobile interaction data in Cloudant). This enables you to see a manifest of data artifacts being surfaced through a Big Data catalog and to access that data through your chosen interface (Excel spreadsheet, R, SQL, and so on). To trust the data, you need to understand its provenance and lineage, and that has everything to do with the metadata. Let’s face it, metadata isn’t sexy, but it’s vitally important. Metadata is the “secret sauce” behind a successful Big Data project because it can answer questions like “Where did this data come from?” and “Who owns this metric?” and “What did you do to present these aggregated measures?” A business glossary that can be used across the polyglot environment is imperative. This information can be surfaced no matter the user or the tool.

Conversations about the glossarization, documentation, and location of data make the data more trusted, and this allows you to broadcast new data assets across the enterprise. Trusting data isn’t solely an on-premise phenomenon; in our social-mobile-cloud world, it’s critical that these capabilities are provided as services, thereby creating a data refinery for trusted data. The set of products within the InfoSphere Information Server family, which is discussed in this chapter, includes the services that make up the IBM data refinery, and they can be used traditionally through on-premise product installation or as individually composable discrete services via the IBM DataWorks catalog.

Master data management (matching data from different repositories) is the focus of Chapter 14. Matching is a critical tool for Big Data environments that facilitate regular reporting and analytics, exploration, and discovery.
Most organizations have many databases, document stores, and log data repositories, not to mention access to data from external sources. Successful organizations in the Big Data era of analytics will effectively match the data between these data sets and build context around them at scale, and this is where IBM’s Big Match technology comes into play. In a Big Data world, traditional matching engines that rely solely on relational technology aren’t going to cut it. IBM’s Big Match, as far as we know, is the only enterprise-capable matching engine that’s built on Hadoop, and this is the focus of the aptly named Chapter 14, “Matching at Scale: Big Match.”

Ready…Set…Go

We understand that when all is said and done, you will spend the better part of a couple of days of your precious time reading this book. But we’re confident that by the time you are finished, you’ll have a better understanding of the requirements for the right Big Data and Analytics platform and a strong foundational knowledge of available IBM technologies to help you tackle the most promising Big Data opportunities. You will be able to get beyond the hype.

Our authoring team has more than 100 years of collective experience, including many thousands of consulting hours and customer interactions. We have experience in research, patents, sales, architecture, development, competitive analysis, management, and various industry verticals. We hope that we have been able to effectively share some of that experience with you to help you on your Big Data journey beyond the hype.

P.S. It’s a never-ending one.

Part I: Opening Conversations About Big Data

Chapter 1: Getting Hype out of the Way: Big Data and Beyond

The term Big Data is a bit of a misnomer.
Truth be told, we’re not even big fans of the term—even though it is so prominently displayed on the cover of this book—because it implies that other data is somehow small (it might be) or that this particular type of data is large in size (it can be, but doesn’t have to be). For this reason, we thought we’d use this chapter to explain exactly what Big Data is, to explore the future of Big Data (cognitive computing), and to offer a manifesto of what constitutes a Big Data and Analytics platform.

There’s Gold in “Them There” Hills

We like to use a gold mining analogy to articulate the opportunity of Big Data. In the “olden days,” miners could easily spot nuggets or veins of gold because they were highly visible to the naked eye. Let’s consider that gold to be “high value-per-byte data.” You can see its value, and therefore you invest resources to extract it. But there is more gold out there, perhaps in the hills nearby or miles away; it just isn’t visible to the naked eye, and trying to find this hidden gold becomes too much of a gambling game. Sure, history has its gold rush fever stories, but nobody ever mobilized millions of people to dig everywhere and anywhere; that would be too expensive. The same is true for the data that resides in your enterprise data warehouse today. That data has been invested in and is trusted. Its value is obvious, so your business invested in cleansing it, transforming it, tracking it, cataloging it, glossarizing it, and so on. It was harvested…with care.

Today’s miners work differently. Gold mining leverages new age capital equipment that can process millions of tons of dirt (low value-per-byte data) to find nearly invisible strands of gold. Ore grades of 30 parts per million are usually needed before gold is visible to the naked eye.
In other words, there’s a great deal of gold (high value-per-byte data) in all of this dirt (low value-per-byte data), and with the right equipment, you can economically process lots of dirt and keep the flakes of gold that you find. The flakes of gold are processed and combined to make gold bars, which are stored and logged in a safe place, governed, valued, and trusted. If this were data, we would call it harvested because it has been processed, is trusted, and is of known quality.

The gold industry is working on chemical washes whose purpose is to reveal even finer gold deposits in previously extracted dirt. The gold analogy for Big Data holds for this innovation as well. New analytic approaches in the future will enable you to extract more insight out of your forever-archived data than you can with today’s technology (we come back to this when we discuss cognitive computing later in this chapter).

This is Big Data in a nutshell: It is the ability to retain, process, and understand data like never before. It can mean more data than what you are using today, but it can also mean different kinds of data—a venture into the unstructured world where most of today’s data resides.

If you’ve ever been to Singapore, you’re surely aware of the kind of downpours that happen in that part of the world; what’s more, you know that it is next to impossible to get a taxi during such a downpour. The reason seems obvious—they are all busy. But when you take the “visible gold” and mix it with the “nearly invisible gold,” you get a completely different story. When Big Data tells the story about why you can’t get a cab in Singapore when it is pouring rain, you find out that it is not because they are all busy. In fact, it is the opposite: Cab drivers pull over and stop driving. Why? Because the deductible on their insurance is prohibitive and not worth the risk of an accident (at fault or not). It was Big Data that found this correlation.
By rigging Singapore cabs with GPS systems to do spatial and temporal analysis of their movements, and combining that with freely available national weather service data, it was found that taxi movements mostly stopped in torrential downpours.

Why Is Big Data Important?

A number of years ago, IBM introduced its Smarter Planet campaign (“Instrumented, Interconnected, and Intelligent,” for those of you who didn’t get the T-shirt). This campaign anticipated the Big Data craze that hit the IT landscape just a few short years later.

From an instrumentation perspective, what doesn’t have some amount of code in it today? In a Big Data world, we can pretty much measure anything we want. Just look at your car; you can’t diagnose a problem these days without hooking it up to a computer. The wearables market is set to explode too; even fashion house Ralph Lauren developed a shirt for the U.S. Open tennis championship that measures its wearer’s physiological markers such as heart rate, breathing, stress levels, and more. You can imagine how wearables could completely change the way society treats cardiac patients, post-traumatic stress disorders, train engineers, security forces, and more. As you can imagine, instrumentation has the ability to generate a heck of a lot of data. For example, one electric car on the market today generates 25GB of data during just one hour of charging.

One important Big Data capability is capturing data that is getting “dropped to the floor.” This type of data can yield incredible insights and results because it enriches the analytics initiatives that are already under way in your organization. Data exhaust is the term we like to use for this kind of data, which is generated in huge amounts (often terabytes per day) but typically isn’t tapped for business insight.
Online storefronts fail to capture terabytes of generated clickstreams that could be used to perform web sessionization, optimize the “last mile” shopping experience, understand why online shopping baskets get abandoned, or simply understand the navigational experience. For example, the popular Zynga game FarmVille collects approximately 25TB of log data per day, and the company designs iterations of the game based on these interactions. We like to think of this data as digital body language because it shows you the path that a client took to reach a destination—be that the purchase of a pair of socks or the sale of cheese in an online game.

Stored log files are a corpus of data describing the operational state (and outages, for that matter) of your most important networks—you could analyze this data for trends when nothing obvious has gone wrong to find the “needle in the stack of needles” that reveals potential downstream problems. There’s an “if” here that tightly correlates with the promise of Big Data: “If you could collect and analyze all the data….” We like to refer to the capability of analyzing all of the data as whole-population analytics. It’s one of the value propositions of Big Data; imagine the kind of predictions and insights your analytic programs could make if they weren’t restricted to samples and subsets of the data.

In the last couple of years, the data that is available in a Big Data world has increased even more, and we refer to this phenomenon as the Internet of Things (IoT). The IoT represents an evolution in which objects are capable of interacting with other objects. For example, hospitals can monitor and regulate pacemakers from afar, factories can automatically address production-line issues, and hotels can adjust temperature and lighting according to their guests’ preferences. IBM’s Smarter Planet agenda anticipated this development with the term interconnected.
This plethora of data sources and data types opens up new opportunities. For example, energy companies can do things that they could not do before. Data gathered from smart meters can provide a better understanding of customer segmentation and behavior and of how pricing influences usage—but only if companies have the ability to use such data. Time-of-use pricing encourages cost-savvy energy consumers to run their laundry facilities, air conditioners, and dishwashers at off-peak times. But the opportunities don’t end there. With the additional information that’s available from smart meters and smart grids, it’s possible to transform and dramatically improve the efficiency of electricity generation and scheduling. It’s also possible to determine which appliances are drawing too much electricity and to use that information to propose rebate-eligible, cost-effective, energy-efficient upgrades wrapped in a compelling business case, improving the conversion yield on an associated campaign.

Now consider the additional impact of social media. A social layer on top of an instrumented and interconnected world generates a massive amount of data too. This data is more complex because most of it is unstructured (images, Twitter feeds, Facebook posts, micro-blog commentaries, and so on). If you eat Frito-Lay SunChips, you might remember the company’s move to the world’s first biodegradable, environmentally friendly chip bag; you might also remember how loud the packaging was. Customers created thousands of YouTube videos showing how noisy the environmentally friendly bag was. A “Sorry, but I can’t hear you over this SunChips bag” Facebook page had hundreds of thousands of Likes, and bloggers let their feelings be known. In the end, Frito-Lay introduced a new, quieter SunChips bag, demonstrating the power and importance of social media. It is hard to miss the careers lost and made over a tweet or video that went viral.
For a number of years, Facebook was adding a new user every three seconds; today these users collectively generate double-digit terabytes of data every day. In fact, in a typical day, Facebook sees more than 3.5 billion posts and about 155 million “Likes.” The format of a Facebook post is indeed structured data: it’s encoded in the JavaScript Object Notation (JSON) format, which we talk about in Chapter 2. However, it’s the unstructured part that has the “golden nugget” of potential value; it holds monetizable intent, reputational decrees, and more. Although the structured data is easy to store and analyze, it is the unstructured components (intent, sentiment, and so on) that are hard to analyze. They have the potential to be very rewarding, if….

Twitter is another phenomenon. The world has taken to generating over 400 million short opinions per day, each 140 characters or less (amounting to double-digit terabytes), along with commentary (often unfiltered) about sporting events, sales, images, politics, and more. Twitter likewise provides enormous amounts of data that’s structured in format, but it’s the unstructured part within the structure that holds most of the untapped value. Perhaps more accurately, it’s the combination of the structured (timestamp, location) and unstructured (the message) data where the ultimate value lies.

The social world is at an inflection point. It is moving from a text-centric mode of communication to a visually centric one. In fact, the fastest growing social sites (such as Vine, Snapchat, Pinterest, and Instagram, among others) are based on video or image communications. For example, one fashion house uses Pinterest to build preference profiles for women (Pinterest’s membership is approximately 95 percent female).
It does this by sending collages of outfits to clients and then extracting, from their likes and dislikes, preferences for color, cut, fabric, and so on; the company is essentially learning through trial and error (a methodology we refer to as fail fast, which we talk about later in this book). Compare this to the traditional method, whereby you might fill in a questionnaire about your favorite colors, cuts, and styles. Whenever you complete a survey, you are guarded and think about your responses. The approach that this fashion house is using represents unfiltered observation of raw human behavior: instant decisions in the form of “likes” and “dislikes.”
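The Facebook and Twitter discussion above turns on the split between a post’s structured envelope (who, when, where) and its unstructured message, where sentiment and intent hide. Here is a minimal Python sketch of that split; the JSON field names and the keyword list are invented for illustration and do not reflect any real social-network API schema, and the toy keyword lookup merely stands in for real sentiment analysis:

```python
import json

# Hypothetical social post; field names are illustrative, not a real API schema.
raw_post = '''
{
    "user_id": 12345,
    "timestamp": "2014-09-01T18:42:07Z",
    "location": "Singapore",
    "message": "Can't get a cab in this downpour again... so frustrating!"
}
'''

post = json.loads(raw_post)

# The structured envelope is trivial to store and query in any database.
structured = {k: post[k] for k in ("user_id", "timestamp", "location")}

# The unstructured payload needs analytics to yield value; a simple
# keyword match stands in for genuine sentiment analysis here.
NEGATIVE_WORDS = {"frustrating", "annoying", "terrible"}
words = {w.strip(".,!?").lower() for w in post["message"].split()}
sentiment = "negative" if words & NEGATIVE_WORDS else "neutral/positive"

print(structured)
print("sentiment:", sentiment)
```

The value, as the chapter notes, comes from combining the two: the timestamp and location tell you when and where, while the mined sentiment tells you what the person actually felt.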
