Big Data Analytics using Artificial Intelligence

GregDeamons, New Zealand, Professional
Published: 03-08-2017
Introduction: Big Data's Big Ideas

The big data space is maturing in dog years: seven years of maturity for each turn of the calendar. In the four years we have been producing our annual Big Data Now, the field has grown from infancy (or, if you prefer the canine imagery, an enthusiastic puppyhood) full of potential (but occasionally still making messes in the house), through adolescence, sometimes awkward as it figures out its place in the world, into young adulthood. Now in its late twenties, big data is not just a productive member of society; it's a leader in some fields, a driver of innovation in others, and in still others it provides the analysis that makes it possible to leverage domain knowledge into scalable solutions.

Looking back at the evolution of our Strata events, and the data space in general, we marvel at the impressive data applications and tools now being employed by companies in many industries. Data is having an impact on business models and profitability. It's hard to find a non-trivial application that doesn't use data in a significant manner. Companies that use data and analytics to drive decision-making continue to outperform their peers. Until recently, access to big data tools and techniques required significant expertise. But tools have improved and communities have formed to share best practices. We're particularly excited about solutions that target new data sets and data types. In an era when the requisite data skill sets cut across traditional disciplines, companies have also started to emphasize the importance of processes, culture, and people.

As we look into the future, here are the main topics that guide our current thinking about the data landscape. We've organized this book around these themes:

Cognitive Augmentation
The combination of big data, algorithms, and efficient user interfaces can be seen in consumer applications such as Waze or Google Now.
Our interest in this topic stems from the many tools that democratize analytics and, in the process, empower domain experts and business analysts. In particular, novel visual interfaces are opening up new data sources and data types.

Intelligence Matters
Bring up the topic of algorithms and a discussion of recent developments in artificial intelligence (AI) is sure to follow. AI is the subject of an ongoing series of posts on O'Reilly Radar. The "unreasonable effectiveness of data" notwithstanding, algorithms remain an important area of innovation. We're excited about the broadening adoption of algorithms like deep learning, and topics like feature engineering, gradient boosting, and active learning. As intelligent systems become common, security and privacy become critical. We're interested in efforts to make machine learning secure in adversarial environments.

The Convergence of Cheap Sensors, Fast Networks, and Distributed Computing
The Internet of Things (IoT) will require systems that can process and unlock massive amounts of event data. These systems will draw from analytic platforms developed for monitoring IT operations. Beyond data management, we're following recent developments in streaming analytics and the analysis of large numbers of time series.

Data (Science) Pipelines
Analytic projects involve a series of steps that often require different tools. There are a growing number of companies and open source projects that integrate a variety of analytic tools into coherent user interfaces and packages. Many of these integrated tools enable replication, collaboration, and deployment. This remains an active area, as specialized tools rush to broaden their coverage of analytic pipelines.

The Evolving, Maturing Marketplace of Big Data Components
Many popular components in the big data ecosystem are open source.
As such, many companies build their data infrastructure and products by assembling components like Spark, Kafka, Cassandra, and ElasticSearch, among others. Contrast that to a few years ago, when many of these components weren't ready (or didn't exist) and companies built similar technologies from scratch. But companies are interested in applications and analytic platforms, not individual components. To that end, demand is high for data engineers and architects who are skilled in maintaining robust data flows and data storage, and in assembling these components.

Design and Social Science
To be clear, data analysts have always drawn from social science (e.g., surveys, psychometrics) and design. We are, however, noticing that many more data scientists are expanding their collaborations with product designers and social scientists.

Building a Data Culture
"Data-driven" organizations excel at using data to improve decision-making. It all starts with instrumentation. "If you can't measure it, you can't fix it," says DJ Patil, VP of product at RelateIQ. In addition, developments in distributed computing over the past decade have given rise to a group of (mostly technology) companies that excel in building data products. In many instances, data products evolve in stages (starting with a "minimum viable product") and are built by cross-functional teams that embrace alternative analysis techniques.

The Perils of Big Data
Every few months, there seems to be an article criticizing the hype surrounding big data. Dig deeper and you find that many of the criticisms point to poor analysis and highlight issues known to experienced data analysts. Our perspective is that issues such as privacy and the cultural impact of models are much more significant.
Cognitive Augmentation

We address the theme of cognitive augmentation first because this is where the rubber hits the road: we build machines to make our lives better, to bring us capacities that we don't otherwise have—or that only some of us would. This chapter opens with Beau Cronin's thoughtful essay on predictive APIs, things that deliver the right functionality and content at the right time, for the right person. The API is the interface that tackles the challenge that Alistair Croll defined as "Designing for Interruption." Ben Lorica then discusses graph analysis, an increasingly prevalent way for humans to gather information from data. Graph analysis is one of the many building blocks of cognitive augmentation; the way that tools interact with each other—and with us—is a rapidly developing field with huge potential.

Challenges Facing Predictive APIs

Solutions to a number of problems must be found to unlock PAPI value

by Beau Cronin

In November, the first International Conference on Predictive APIs and Apps will take place in Barcelona, just ahead of Strata Barcelona. This event will bring together those who are building intelligent web services (sometimes called Machine Learning as a Service) with those who would like to use these services to build predictive apps, which, as defined by Forrester, deliver "the right functionality and content at the right time, for the right person, by continuously learning about them and predicting what they'll need."

This is a very exciting area. Machine learning of various sorts is revolutionizing many areas of business, and predictive services like the ones at the center of predictive APIs (PAPIs) have the potential to bring these capabilities to an even wider range of applications. I co-founded one of the first companies in this space (acquired by Salesforce in 2012), and I remain optimistic about the future of these efforts.
But the field as a whole faces a number of challenges, for which the answers are neither easy nor obvious, that must be addressed before this value can be unlocked. In the remainder of this post, I'll enumerate what I see as the most pressing issues. I hope that the speakers and attendees at PAPIs will keep these in mind as they map out the road ahead.

Data Gravity

It's widely recognized now that for truly large data sets, it makes a lot more sense to move compute to the data rather than the other way around—which conflicts with the basic architecture of cloud-based analytics services such as predictive APIs. It's worth noting, though, that after transformation and cleaning, many machine learning data sets are actually quite small—not much larger than a hefty spreadsheet. This is certainly an issue for the truly big data needed to train, say, deep learning models.

Workflow

The data gravity problem is just the most basic example of a number of issues that arise from the development process for data science and data products. The Strata conferences right now are flooded with proposals from data science leaders who stress the iterative and collaborative nature of this work. And it's now widely appreciated that the preparatory (data preparation, cleaning, transformation) and communication (visualization, presentation, storytelling) phases usually consume far more time and energy than model building itself. The most valuable toolsets will directly support (or at least not disrupt) the whole process, with machine learning and model building closely integrated into the overall flow. So, it's not enough for a predictive API to have solid client libraries and/or a slick web interface: instead, these services will need to become upstanding, fully assimilated citizens of the existing data science stacks.
Crossing the Development/Production Divide

Executing a data science project is one thing; delivering a robust and scalable data product entails a whole new set of requirements. In a nutshell, project-based work thrives on flexible data munging, tight iteration loops, and lightweight visualization; productization emphasizes reliability, efficient resource utilization, logging and monitoring, and solid integration with other pieces of distributed architecture. A predictive API that supports one of these endeavors won't necessarily shine in the other setting. These limitations might be fine if expectations are set correctly; it's fine for a tool to support, say, exploratory work, with the understanding that production use will require re-implementation and hardening. But I do think the reality conflicts with some of the marketing in the space.

Users and Skill Sets

Sometimes it can be hard to tell at whom, exactly, a predictive service is aimed. Sophisticated and competent data scientists—those familiar with the ins and outs of statistical modeling and machine learning methods—are typically drawn to high-quality open source libraries, like scikit-learn, which deliver a potent combination of control and ease of use. For these folks, predictive APIs are likely to be viewed as opaque (if the methods aren't transparent and flexible) or of questionable value (if the same results could be achieved using a free alternative). Data analysts, skilled in data transformation and manipulation but often with limited coding ability, might be better served by a more integrated "workbench" (such as those provided by legacy vendors like SAS and SPSS). In this case, the emphasis is on the overall experience rather than the API. Finally, application developers probably just want to add predictive capabilities to their products, and need a service that doesn't force them to become de facto (and probably subpar) data scientists along the way.
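To make the contrast with hosted predictive APIs concrete, here is a minimal sketch of the open source route described above: a few lines of scikit-learn that train and evaluate a classifier entirely locally, with full control over the estimator. The dataset and model are illustrative choices for this sketch, not anything a particular predictive service offers.

```python
# A hedged sketch of the "free alternative" workflow: everything runs
# locally, and every modeling choice is visible and adjustable.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)  # the estimator is fully inspectable
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)     # evaluation stays in the same flow
print(f"held-out accuracy: {accuracy:.2f}")
```

The point is less the particular model than the workflow: data preparation, training, and evaluation all live in one scriptable environment, which is exactly what makes an opaque remote API a hard sell to this audience.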
These different needs are conflicting, and clear thinking is needed to design products for the different personas. But even that's not enough: the real challenge arises from the fact that developing a single data product or predictive app will often require all three kinds of effort. Even a service that perfectly addresses one set of needs is therefore at risk of being marginalized.

Horizontal versus Vertical

In a sense, all of these challenges come down to the question of value. What aspects of the total value chain does a predictive service address? Does it support ideation, experimentation and exploration, core development, production deployment, or the final user experience? Many of the developers of predictive services that I've spoken with gravitate naturally toward the horizontal aspect of their services. No surprise there: as computer scientists, they are at home with abstraction, and they are intellectually drawn to—even entranced by—the underlying similarities between predictive problems in fields as diverse as finance, health care, marketing, and e-commerce. But this perspective is misleading if the goal is to deliver a solution that carries more value than free libraries and frameworks. Seemingly trivial distinctions in language, as well as more fundamental issues such as appetite for risk, loom ever larger. As a result, predictive API providers will face increasing pressure to specialize in one or a few verticals. At this point, elegant and general APIs become not only irrelevant, but a potential liability, as industry- and domain-specific feature engineering increases in importance and it becomes crucial to present results in the right parlance. Sadly, these activities are not thin adapters that can be slapped on at the end, but instead are ravenous time beasts that largely determine the perceived value of a predictive API.
No single customer cares about the generality and wide applicability of a platform; each is looking for the best solution to the problem as he conceives it.

As I said, I am hopeful that these issues can be addressed—if they are confronted squarely and honestly. The world is badly in need of more accessible predictive capabilities, but I think we need to enlarge the problem before we can truly solve it.

There Are Many Use Cases for Graph Databases and Analytics

Business users are becoming more comfortable with graph analytics

by Ben Lorica

The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people—Cisco estimates 50 billion connected devices by 2020—one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies. This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes and edges).

I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem. Another reason to be optimistic is that tools for graph data are getting tested in many different settings. It's true that social media applications remain natural users of graph databases and analytics. But there are a growing number of applications outside the "social" realm.
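As a small, concrete taste of the graph analytics discussed here, the sketch below computes degree centrality and detects community structure, two of the notions business users are growing comfortable with, on a standard benchmark graph. The networkx library and the karate-club example graph are our illustrative choices, not tools named by the author.

```python
# Hedged sketch: "centrality" and "community structure" on a small graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # classic 34-node social network example

# Degree centrality: which nodes sit at the hubs of the network?
centrality = nx.degree_centrality(G)
hub = max(centrality, key=centrality.get)

# Greedy modularity maximization: one common community-detection method.
communities = greedy_modularity_communities(G)

print(f"most central node: {hub}")
print(f"communities found: {len(communities)}")
```

The same two calls scale conceptually (if not always computationally) to the much larger graphs the article anticipates; for those, specialized graph databases and distributed engines like GraphX take over.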
In his recent Strata Santa Clara talk and book, Neo Technology's founder and CEO Emil Eifrem listed other use cases for graph databases and analytics:

• Network impact analysis (including root cause analysis in data centers)
• Route finding (going from point A to point B)
• Recommendations
• Logistics
• Authorization and access control
• Fraud detection
• Investment management and finance (including securities and debt)

The widening number of applications means that business users are becoming more comfortable with graph analytics. In some domains, network science dashboards are beginning to appear. More recently, analytic tools like GraphLab Create make it easier to unlock and build applications with graph data. Various applications that build upon graph search/traversal are becoming common, and users are beginning to be comfortable with notions like "centrality" and "community structure."

A quick way to immerse yourself in the graph analysis space is to attend the third GraphLab conference in San Francisco—a showcase of the best tools for graph data management, visualization, and analytics, as well as interesting use cases. For instance, MusicGraph will be on hand to give an overview of their massive graph database from the music industry, Ravel Law will demonstrate how they leverage graph tools and analytics to improve search for the legal profession, and Lumiata is assembling a database to help improve medical science using evidence-based tools powered by graph analytics.

Notes:
1. Full disclosure: I am an advisor to Databricks—a startup commercializing Apache Spark.
2. As I noted in a previous post, GraphLab has been extended to handle general machine learning problems (not just graphs).
3. Exhibitors at the GraphLab conference will include creators of several major graph databases, visualization tools, and Python tools for data scientists.

Figure 1-1.
Interactive analyzer of Uber trips across San Francisco's micro-communities

Network Science Dashboards

Network graphs can be used as primary visual objects, with conventional charts used to supply detailed views

by Ben Lorica

With Network Science well on its way to being an established academic discipline, we're beginning to see tools that leverage it. Applications that draw heavily from this discipline make heavy use of visual representations and come with interfaces aimed at business users. For business analysts used to consuming bar and line charts, network visualizations take some getting used to. But with enough practice, and for the right set of problems, they are an effective visualization model.

In many domains, network graphs can be the primary visual objects, with conventional charts used to supply detailed views. I recently got a preview of some dashboards built using Financial Network Analytics (FNA). In the example below, the primary visualization represents correlations among assets across different asset classes (the accompanying charts are used to provide detailed information for individual nodes).

Using the network graph as the centerpiece of a dashboard works well in this instance. And with FNA's tools already being used by a variety of organizations and companies in the financial sector, I think "Network Science dashboards" will become more commonplace in financial services.

Network Science dashboards only work to the extent that network graphs are effective (network graphs tend to get harder to navigate and interpret when the number of nodes and edges gets large).

Notes:
4. This post is based on a recent conversation with Kimmo Soramäki, founder of Financial Network Analytics.
5. Kimmo is an experienced researcher and policy-maker who has consulted and worked for several central banks. Thus FNA's first applications are aimed at financial services.
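The kind of centerpiece described above, correlations among assets rendered as a network, can be approximated in a few lines: compute a correlation matrix from returns and keep only the strongly correlated pairs as edges. The asset names, the 0.6 threshold, and the synthetic data below are illustrative assumptions for this sketch, not FNA's actual method.

```python
# Hypothetical sketch of a correlation network's edge list.
import numpy as np

rng = np.random.default_rng(0)
assets = ["equities", "bonds", "gold", "oil", "fx"]
returns = rng.normal(size=(250, len(assets)))  # stand-in daily returns
returns[:, 3] += returns[:, 0]                 # make "oil" track "equities"

corr = np.corrcoef(returns, rowvar=False)      # pairwise asset correlations

# Keep an edge only for pairs whose |correlation| exceeds the threshold;
# a dashboard would draw these as links between asset nodes.
edges = [(assets[i], assets[j], round(corr[i, j], 2))
         for i in range(len(assets)) for j in range(i + 1, len(assets))
         if abs(corr[i, j]) > 0.6]
print(edges)
```

A real system would also need the "detailed views" the article mentions, e.g. a per-node chart that appears when an asset is selected, but the thresholded edge list is the structural core of such a visualization.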
One workaround is to aggregate nodes and visualize communities rather than individual objects. New ideas may also come to the rescue: the rise of networks and graphs is leading to better techniques for visualizing large networks. (Traditional visual representations of large networks are pejoratively referred to as "hairballs.")

This fits one of the themes we're seeing in Strata: cognitive augmentation. The right combination of data/algorithm(s)/interface allows analysts to make smarter decisions much more efficiently. While much of the focus has been on data and algorithms, it's good to see more emphasis paid to effective interfaces and visualizations.

Intelligence Matters

Artificial intelligence has been "just around the corner" for decades. But it's more accurate to say that our ideas of what we can expect from AI have been sharpening and diversifying since the invention of the computer. Beau Cronin starts off this chapter with consideration of AI's "dueling definitions"—and then resolves the "duel" by considering both artificial and human intelligence as part of a system of knowledge; both parts are vital, and new capacities for both human and machine intelligence are coming. Pete Warden then takes us through deep learning—one form of machine intelligence whose performance has been astounding over the past few years, blasting away expectations, particularly in the field of image recognition. Mike Loukides then brings us back to the big picture: what makes human intelligence is not power, but the desire for betterment.

AI's Dueling Definitions

Why my understanding of AI is different from yours

by Beau Cronin

Let me start with a secret: I feel self-conscious when I use the terms "AI" and "artificial intelligence." Sometimes, I'm downright embarrassed by them. Before I get into why, though, answer this question: what pops into your head when you hear the phrase artificial intelligence?

Figure 2-1.
SoftBank's Pepper, a humanoid robot that takes its surroundings into consideration.

For the layperson, AI might still conjure HAL's unblinking red eye, and all the misfortune that ensued when he became so tragically confused. Others jump to the replicants of Blade Runner or more recent movie robots. Those who have been around the field for some time, though, might instead remember the "old days" of AI—whether with nostalgia or a shudder—when intelligence was thought to primarily involve logical reasoning, and truly intelligent machines seemed just a summer's work away. And for those steeped in today's big-data-obsessed tech industry, "AI" can seem like nothing more than a high-falutin' synonym for the machine-learning and predictive-analytics algorithms that are already hard at work optimizing and personalizing the ads we see and the offers we get—it's the term that gets trotted out when we want to put a high sheen on things.

Like the Internet of Things, Web 2.0, and big data, AI is discussed and debated in many different contexts by people with all sorts of motives and backgrounds: academics, business types, journalists, and technologists. As with these other nebulous technologies, it's no wonder the meaning of AI can be hard to pin down; everyone sees what they want to see. But AI also has serious historical baggage, layers of meaning and connotation that have accreted over generations of university and industrial research, media hype, fictional accounts, and funding cycles. It's turned into a real problem: without a lot of context, it's impossible to know what someone is talking about when they talk about AI.

Let's look at one example. In his 2004 book On Intelligence, Jeff Hawkins confidently and categorically states that AI failed decades ago.
Meanwhile, the data scientist John Foreman can casually discuss the "AI models" being deployed every day by data scientists, and Marc Andreessen can claim that enterprise software products have already achieved AI. It's such an overloaded term that all of these viewpoints are valid; they're just starting from different definitions.

Which gets back to the embarrassment factor: I know what I mean when I talk about AI, at least I think I do, but I'm also painfully aware of all these other interpretations and associations the term evokes. And I've learned over the years that the picture in my head is almost always radically different from that of the person I'm talking to. That is, what drives all this confusion is the fact that different people rely on different primal archetypes of AI. Let's explore these archetypes, in the hope that making them explicit might provide the foundation for a more productive set of conversations in the future.

AI as interlocutor
This is the concept behind both HAL and Siri: a computer we can talk to in plain language, and that answers back in our own lingo. Along with Apple's personal assistant, systems like Cortana and Watson represent steps toward this ideal: they aim to meet us on our own ground, providing answers as good as—or better than—those we could get from human experts. Many of the most prominent AI research and product efforts today fall under this model, probably because it's such a good fit for the search- and recommendation-centric business models of today's Internet giants. This is also the version of AI enshrined in Alan Turing's famous test for machine intelligence, though it's worth noting that direct assaults on that test have succeeded only by gaming the metric.

AI as android
Another prominent notion of AI views disembodied voices, however sophisticated their conversational repertoire, as inadequate: witness the androids from movies like Blade Runner, I, Robot, Alien, The Terminator, and many others.
We routinely transfer our expectations from these fictional examples to real-world efforts like Boston Dynamics' (now Google's) Atlas, or SoftBank's newly announced Pepper. For many practitioners and enthusiasts, AI simply must be mechanically embodied to fulfill the true ambitions of the field. While there is a body of theory to motivate this insistence, the attachment to mechanical form seems more visceral, based on a collective gut feeling that intelligences must move and act in the world to be worthy of our attention. It's worth noting that, just as recent Turing test results have highlighted the degree to which people are willing to ascribe intelligence to conversation partners, we also place unrealistic expectations on machines with human form.

AI as reasoner and problem-solver
While humanoid robots and disembodied voices have long captured the public's imagination, whether empathic or psychopathic, early AI pioneers were drawn to more refined and high-minded tasks—playing chess, solving logical proofs, and planning complex tasks. In a much-remarked collective error, they mistook the tasks that were hardest for smart humans to perform (those that seemed by introspection to require the most intellectual effort) for those that would be hardest for machines to replicate. As it turned out, computers excel at these kinds of highly abstract, well-defined jobs. But they struggle at the things we take for granted—things that children and many animals perform expertly, such as smoothly navigating the physical world. The systems and methods developed for games like chess are completely useless for real-world tasks in more varied environments. Taken to its logical conclusion, though, this is the scariest version of AI for those who warn about the dangers of artificial superintelligence.
This stems from a definition of intelligence as "an agent's ability to achieve goals in a wide range of environments." What if an AI was as good at general problem-solving as Deep Blue is at chess? Wouldn't that AI be likely to turn those abilities to its own improvement?

AI as big-data learner
This is the ascendant archetype, with massive amounts of data being inhaled and crunched by Internet companies (and governments). Just as an earlier age equated machine intelligence with the ability to hold a passable conversation or play chess, many current practitioners see AI in the prediction, optimization, and recommendation systems that place ads, suggest products, and generally do their best to cater to our every need and commercial intent. This version of AI has done much to propel the field back into respectability after so many cycles of hype and relative failure—partly due to the profitability of machine learning on big data. But I don't think the predominant machine-learning paradigms of classification, regression, clustering, and dimensionality reduction contain sufficient richness to express the problems that a sophisticated intelligence must solve. This hasn't stopped AI from being used as a marketing label—despite the lingering stigma, this label is reclaiming its marketing mojo.

This list is not exhaustive. Other conceptualizations of AI include the superintelligence that might emerge—through mechanisms never made clear—from a sufficiently complex network like the Internet, or the result of whole-brain emulation (i.e., mind uploading). Each archetype is embedded in a deep mesh of associations, assumptions, and historical and fictional narratives that work together to suggest the technologies most likely to succeed, the potential applications and risks, the timeline for development, and the "personality" of the resulting intelligence.
I'd go so far as to say that it's impossible to talk and reason about AI without reference to some underlying characterization. Unfortunately, even sophisticated folks who should know better are prone to switching mid-conversation from one version of AI to another, resulting in arguments that descend into contradiction or nonsense. This is one reason that much AI discussion is so muddled—we quite literally don't know what we're talking about.

For example, some of the confusion about deep learning stems from it being placed in multiple buckets: the technology has proven itself successful as a big-data learner, but this achievement leads many to assume that the same techniques can form the basis for a more complete interlocutor, or the basis of intelligent robotic behavior. This confusion is spurred by the Google mystique, including Larry Page's stated drive for conversational search.

It's also important to note that there are possible intelligences that fit none of the most widely held stereotypes: that are not linguistically sophisticated; that do not possess a traditional robot embodiment; that are not primarily goal driven; and that do not sort, learn, and optimize via traditional big data.

Which of these archetypes do I find most compelling? To be honest, I think they all fall short in one way or another. In my next post, I'll put forth a new conception: AI as model-building. While you might find yourself disagreeing with what I have to say, I think we'll at least benefit from having this debate explicitly, rather than talking past each other.

In Search of a Model for Modeling Intelligence

True artificial intelligence will require rich models that incorporate real-world phenomena

by Beau Cronin

Figure 2-2. An orrery, a runnable model of the solar system that allows us to make predictions. Photo: Wikimedia Commons.

In my last post, we saw that AI means a lot of things to a lot of people.
These dueling definitions each have a deep history—OK, fine, baggage—that has massed and layered over time. While they're all legitimate, they share a common weakness: each one can apply perfectly well to a system that is not particularly intelligent. As just one example, the chatbot that was recently touted as having passed the Turing test is certainly an interlocutor (of sorts), but it was widely criticized as not containing any significant intelligence.

Let's ask a different question instead: What criteria must any system meet in order to achieve intelligence—whether an animal, a smart robot, a big-data cruncher, or something else entirely?