Big data and NoSQL databases

Seminar on big data management
Lecturer: Jiaheng Lu
Spring 2016
www.helsinki.fi 25.1.2016

Information on preparing Presentation and Report
Goals for presentation and report are different:
1. Presentation: Help the audience understand your topic;
2. Report: Show your own critical thinking and new ideas.

Contents of Presentation (Length: 35-40 minutes)
• 1. Introduction: please make a clear introduction
  • 1.1 Why you are interested in this topic: what kind of problems do you hope to solve
  • 1.2 How the problem has been studied before
  • 1.3 What the application of this problem is for big data
• 2. Related works:
  • 2.1 Make sure you leave sufficient time to present all related prior work. Do not assume that the audience knows the prior work.
  • 2.2 Present it on an intuitive level.

Faculty of Science / Big data management / www.helsinki.fi 25.1.2016 Jiaheng Lu

Contents of Presentation (Cont.)
• 3. Main algorithms and contributions
  • 3.1 Show the main solutions of the paper(s).
  • 3.2 Present them with examples. The examples are quite important for understanding.
• 4. Your own comments and conclusion
  • 4.1 Present your own comments about the paper(s).
  • 4.2 It would be very good to identify the weak points of the paper(s) through your own critical thinking.

Contents of Report (6-8 pages, single column)
• 1. What are the research problems
• 2. What are the strengths of the paper(s)
• 3. What are the main weaknesses of the paper(s)
• 4. If you were to solve this problem, what would you do
• 5. Why do you like/dislike the paper(s)
• 6. Conclusion and summary of your report

Opponent
• Carefully listen to the presentation
• Ask questions after the presentation
• Complete an opponent assessment form and submit it to the teacher after the presentation

Big data and NoSQL databases

Data storage and history
Before the 1950s
• Data was stored as paper records
• A lot of time was wasted, e.g. when searching, so this was inefficient

Magnetic tapes and hard disk
• 1950s and early 1960s: Data processing using magnetic tapes for storage
• Late 1960s and 1970s: Hard disks allow direct access to data
• Data stored in files

Drawbacks of file system
• Each program has its own data format
• Programs are written in different languages, and so cannot easily access each other's files
• Any new requirement needs a new program

Database Approach
• 1960s: Network databases
• 1970s: Relational databases
• 1990s: Object-oriented and object-relational
• 1995+: XML, Mobile, GeoDB, Embedded DB
• 2005+: NoSQL DB, NewSQL DB

History of databases: Turing awards
• 1973 Charles W. Bachman: Network databases
• 1981 Edgar F. Codd: Relational databases
• 1998 Jim Gray: Distributed databases and transactions
• 2014 Michael Stonebraker: Modern databases (object-relational model, column stores, …)

Network Model
• Physical file pointers are used to model the relations between files
• Most suitable for large databases with well-defined queries and well-defined applications

Relational model
• E. F. Codd introduced the relational model in 1970
• Supports relational algebra and operations
• Data and program are separated
• Improved data sharing and better integration
• DB2, Oracle and SQL Server are the most prominent commercial DBMS products

Object-oriented data model (1990s)
• The purpose of an OODBMS is to store object-oriented programming objects in a database without having to transform them into relational format

Object-relational model
• Extends the relational data model by including object orientation
• Allows attributes of tuples to have complex types, including non-atomic values such as nested relations

Big Data Challenge

5 V's of big data
• Volume ‒ TB → PB → EB
• Variety ‒ Text, audio, video
• Velocity ‒ Real-time operational / analytic applications
• Value ‒ Extract value from big data, complex analytics
• Veracity ‒ Biases, noise and abnormality in data

Limitation for relational databases (1)
• Different types of data: data variety

Limitation for relational databases (2)
• What are Big Analytics
• Not only simple "group by" aggregation, but also
  ‒ Machine learning, artificial intelligence
  ‒ Data mining, natural language processing
  ‒ Social network analysis and search
  ‒ ……

What are Big Analytics
• Aster Data works on Graph

Limitation for relational databases (3)
• Designed for relational data, but not suitable for
  ‒ Graph data, geospatial data, unstructured data
• Limited scalability
  ‒ No RDBMS has been deployed onto a cluster of more than 1000 nodes
• Separation of data storage and data analytics
  ‒ Data migration
  ‒ Difficulty of parallel processing
  ‒ ……

Limitation for relational databases (4)
• Extending relational databases
• Relational table sharding
  ‒ Depends on the application program
  ‒ When the data size increases, resharding is needed
• Denormalization of relational tables to improve performance
  ‒ Adds more redundant data
  ‒ Increases the cost of maintaining data consistency

Relational databases cannot solve those challenges.
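
To make the resharding cost concrete, here is a minimal Python sketch. It assumes the simplest possible placement rule (shard = key modulo the number of shards); the slides do not specify a sharding scheme, so the function names and the 4-to-5 shard scenario are illustrative assumptions only.

```python
def shard_for(key: int, num_shards: int) -> int:
    """Hypothetical placement rule: pick a shard by taking the key modulo the shard count."""
    return key % num_shards


def moved_keys(keys, old_shards: int, new_shards: int) -> int:
    """Count the keys whose shard changes when the number of shards changes."""
    return sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )


# Assumed scenario: 1000 integer keys, cluster grows from 4 to 5 shards.
keys = range(1000)
moved = moved_keys(keys, old_shards=4, new_shards=5)
print(f"{moved} of {len(keys)} keys must be migrated")  # 800 of 1000
```

Under this naive rule, adding a single shard relocates 80% of the keys, which is one reason manual sharding of relational tables becomes expensive as data grows.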
We need new types of databases.

NoSQL databases

NoSQL
DEFINITION:
• Next-generation databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable
• Non-SQL or Not only SQL
• Watch a video about NoSQL from Jens Dittrich: "Say No No and No", CIDR 2013

Types and examples of NoSQL databases
Types        Examples
Column       Accumulo, Cassandra, Druid, HBase, Vertica
Document     HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
Key-value    Aerospike, CouchDB, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database
Graph        Allegro, InfiniteGraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
Multi-model  Alchemy Database, ArangoDB, CortexDB, FoundationDB, MarkLogic, OrientDB

Column stores
• A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data.
• A column-oriented DBMS has advantages for data warehouses, clinical data analysis, customer relationship management (CRM) systems, library card catalogs, and other ad hoc inquiry systems.

Example of column stores
RowId  EmpId  Name    Age
1      123    Anna    34
2      456    Mikko   30
3      789    Emilia  44
Row-oriented storage: 1:123,Anna,34; 2:456,Mikko,30; 3:789,Emilia,44
Column-oriented storage: 123:1,456:2,789:3; Anna:1,Mikko:2,Emilia:3; 34:1,30:2,44:3

Key-value stores
• Key-value (KV) stores use the associative array as their fundamental data model.
• In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection.

Example of key-value stores
RowId  EmpId  Name    Age
1      123    Anna    34
2      456    Mikko   30
3      789    Emilia  44
1: (123,Anna,34); 2: (456,Mikko,30); 3: (789,Emilia,44)

Insertion of a column and a record in key-value stores
RowId  EmpId  Name    Age  Salary
1      123    Anna    34
2      456    Mikko   30
3      789    Emilia  44
4      147    Joha    28   3000
1: (123,Anna,34); 2: (456,Mikko,30); 3: (789,Emilia,44); 4: (147,Joha,28,3000)

Document store
• The central concept of a document store is the notion of a "document".
• Encodings in use include XML, YAML, and JSON as well as binary forms like BSON.
• Documents are addressed in the database via a unique key that represents that document.

Example of document store
Address: University of Helsinki, Yliopistonkatu 4, 00100 Helsinki, Finland
XML:
  <contact>
    <company>University of Helsinki</company>
    <address>Yliopistonkatu 4</address>
    <city>Helsinki</city>
    <zip>00100</zip>
    <country>Finland</country>
  </contact>
JSON:
  {
    "contact": {
      "company": "University of Helsinki",
      "address": "Yliopistonkatu 4",
      "city": "Helsinki",
      "zip": "00100",
      "country": "Finland"
    }
  }

Graph stores
• Designed for graph data
• Applications: social relations, public transport links, road maps or network topologies, etc.
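
The four storage models covered so far (column, key-value, document, graph) can be sketched with plain Python data structures. This is only an in-memory illustration using the employee table and contact record from the slides; it is not how any particular NoSQL product stores data. The document key `contact:1` and the "knows" edges in the graph are hypothetical values added for the example.

```python
import json

# Employee table from the slides: (RowId, EmpId, Name, Age)
rows = [
    (1, 123, "Anna", 34),
    (2, 456, "Mikko", 30),
    (3, 789, "Emilia", 44),
]

# Row-oriented layout: all fields of one record are kept together.
row_store = {row_id: (emp_id, name, age) for row_id, emp_id, name, age in rows}

# Column-oriented layout: one sequence per column, each value tagged with its RowId.
column_store = {
    "EmpId": [(emp_id, row_id) for row_id, emp_id, _, _ in rows],
    "Name": [(name, row_id) for row_id, _, name, _ in rows],
    "Age": [(age, row_id) for row_id, _, _, age in rows],
}

# An aggregate over one column only has to read that column.
ages = [age for age, _ in column_store["Age"]]
average_age = sum(ages) / len(ages)

# Key-value layout: each key appears at most once; values need not share a schema,
# so a record with an extra Salary field can be added without altering the others.
kv_store = dict(row_store)
kv_store[4] = (147, "Joha", 28, 3000)

# Document layout: a self-describing document addressed by a unique key.
contact_doc = {
    "contact": {
        "company": "University of Helsinki",
        "address": "Yliopistonkatu 4",
        "city": "Helsinki",
        "zip": "00100",
        "country": "Finland",
    }
}
doc_store = {"contact:1": json.dumps(contact_doc)}  # hypothetical key

# Graph layout: nodes with adjacency sets (hypothetical "knows" relations).
graph = {
    "Anna": {"Mikko"},
    "Mikko": {"Anna", "Emilia"},
    "Emilia": {"Mikko"},
}

print(average_age)
print(kv_store[4])
print(json.loads(doc_store["contact:1"])["contact"]["city"])
print(sorted(graph["Mikko"]))
```

The same three employee records appear in the row, column, and key-value layouts; only the grouping of values changes, which is what determines which access patterns are cheap.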

Multi-model stores
• Support multiple data models against a single, integrated backend: document, graph, relational, and key-value models are examples of data models

Database   Key-value  SQL  Document  Graph  Object  Transaction
OrientDB   Yes        Yes  Yes       Yes    Yes     Full ACID, even distributed
CouchDB    Yes        Yes  Yes       No     Yes
MarkLogic  Yes        Yes  Yes       Yes    No      Full ACID

Summary
• Relational databases are very successful at managing tabular and relational data, but they have limitations for managing big data.
• NoSQL database is a general term that covers five types of data stores.
• NoSQL databases are starting to gain market traction.