Question? Leave a message!

Big Data Data Warehousing

Big Data Data Warehousing
Ghislain Fourny Big Data 13. Data Warehousing fotoreactor / 123RF Stock PhotoAurelio Scetta / 123RF Stock Photo The road to analytics 2Another history of data management (T. Hofmann) 1970s – 2000s Age of Transactions Age of Business Intelligence 1995 2000s Age of Big Data 3Paradigms vs. OLTP OLAP 4OnLine Transaction Processing Consistent and Reliable RecordKeeping 5OnLine Transaction Processing Transactions and results on small portions of data 6OnLine Transaction Processing Lots of transactions on small portions of data 7OnLine Transaction Processing Normalized Data 8OnLine Analytical Processing Databased Decision Support arturaliev / 123RF Stock Photo 9OLAP is Big Large portions of the data Possibly many joins Few long heavy queries 10OLAP Examples Web analytics Sales analytics Management support Statistical analysis (census) Scientific databases (e.g., bioinformatics) 11OLTP vs. OLAP Historical Summarized Consolidated Data Aurelio Scetta / 123RF Stock Photo vs. Detailed Individual Records OLTP OLAP 12OLTP vs. OLAP vs. Lots of writes Lots of reads OLTP OLAP 13OLTP vs. OLAP vs. Analysis Small sets over big chunks of records OLTP OLAP 14OLTP vs. OLAP fully interactive ( 1s) vs. Slow interactive OLTP OLAP 15OLTP vs. OLAP Redundancy Redundancy Redundancy Consistency OLTP OLAP 16Aurelio Scetta / 123RF Stock Photo OLAP 17A data warehouse ... is a subjectoriented integrated timevariant nonvolatile collection of data in support of management's decisionmaking process 18Subjectoriented customers sales products events 19Integrated 20Timevariant Time in data warehouses is paramount (not so in OLTP systems) 21Timevariant Y9 Y8 Y7 Y6 Y5 Y4 Y3 Y2 Y1 2016 Often past 510 years 22Nonvolatile Load. Access. no updates Period. Milosh Kojadinovich / 123RF Stock Photo 23Architecture ERP Analyze CRM ETL Report OLTP Files Mine 24OLAP: Redundancy Materialized views (denormalized) 251st Normal Form (tabular) – The Key 262nd Normal Form (not joined) – The Whole Key 273rd Normal Form – Nothing But The Key 28Why materialize Operational data sources are too heterogeneous 29OLAP: Specialpurpose indices 30OLAP: Derived data 31Querying OLAP Slow interactive 6 5 4 3 vs. 2 1 Continuous 0 Category 1 Category 2 Category 3 Category 4 monitoring/tracking Series 1 Series 2 Series 3 1 10s hours 32Summary of differences OLTP OLAP Source Original (operational) Derived (consolidated) Purpose Business tasks Decision support Interface Snapshot Multidimensional views Writing short and fast, by end user period refreshes, by batch jobs Queries Simple, small results Complex and aggregating Design Many normalized tables Few denormalized cubes Precision ACID Sampling, confidence intervals Freshness Serializability Reproducibility Speed Very fast Often slow Optimization Interquery Intraquery Space Small, archiving old data Large, less space efficient Backup Very important ReETL 33Data Model 34Data Cubes Data is stored in multidimensional hypercubes 35Data Cubes Year 36Data Cubes Country 37Data Cubes Product 38Fact 2016 CH Server 39Dimensions Which currency Where What When Who etc. 40Fact table Where Germany 2016 Peter 1,000 Germany 2015 Mary 15,000 Switzerland 2016 Mary 1,500 Switzerland 2015 Peter 3,000 Australia 2015 Peter 6,000 China 2015 Mary 1,000 41Aggregation Where Germany 2016 Peter 1,000 Germany 2015 Mary 15,000 Switzerland 2016 Mary 1,500 Switzerland 2015 Peter 3,000 Australia 2015 Peter 6,000 China 2015 Mary 1,000 42Aggregation 43Aggregation Where Germany 2016 Peter 1,000 Germany 2015 Mary 15,000 Switzerland 2016 Mary 1,500 Switzerland 2015 Peter 3,000 Australia 2015 Peter 6,000 China 2015 Mary 1,000 44Aggregation 2016 Peter 1,000 2015 Mary 16,000 2016 Mary 1,500 2015 Peter 9,000 45Slicing 46Slicers and Dicers Dicers Slicers 47Slicers and Dicers Usually between 1 and 3 dicers, often 2 Dicers Slicers 48Servers Slicers and Dicers World Slicers USD 49Servers Slicers and Dicers World Slicers USD Dicers 2014 2015 2016 Peter 1,000,000 1,500,000 1,400,000 Mary 2,000,000 2,300,000 2,200,000 50Products: the big three Essbase Cognos Analysis Services 51ETLing 52OLAP: Derived data 53OLAP: Derived data ETL 54ETL Extract Transform Load 55Extract Triggers Gateways Incremental updates Log extraction 56Transform Herr Mister Derivation Value transformation Cleaning Filter, split, merge, join 57Load Integrity constraints Sorting Partition Build indices 58Considerations Granularity When Infrastructure 59Implementation 60Two flavors of OLAP ROLAP MOLAP 61Fact table (ROLAP) Dim1 Dim2 Dim3 Dim4 Dim5 Value 62Star Schema Dim1 Dim2 Dim3 Dim4 Dim5 Value 63Snowflake schema Dim1 Dim2 Dim3 Dim4 Dim5 Value Normalize More 64Querying 65Querying cubes Tables: Cubes: SQL MDX 66MDX stands for... MultiDimensional eXpressions 67Measures Amount of licenses Revenues Taxes paid ... 68Dimensions Quarter Salesperson Product Country 69In short... A cube is a list of dimensions indexing a list of measures 70Hierarchies Dimension values are organized in hierarchies. Location Geo Economy i.e., slice and aggregate i.e., slice and aggregate by geographic region, etc by economic partnership, etc 71Members Members correspond to levels in a hierarchy. Africa Ocenia Geo America Asia Europe Canada Switzerland China USA ZH India Brazil BE ... ... Germany ... 72Identifying a member Location.Geo.Europe.Switzerland.ZH.Zurich 73Tuples A list of members (Location.Geo.Europe.Switzerland.ZH.Zurich, Salesmen.People.John, Time.Year.2016.Q4) Associated with a dimensionality (list of hierarchies) (Location.Geo, Salesmen.People Time.Year) 74Sets A set of tuples with same dimensionality (Location.Geo.Europe.Switzerland.ZH.Zurich, Salesmen.People.John, Time.Year.2016.Q4), (Location.Geo.Europe.Switzerland.BE.Bärn, Salesmen.People.Mary, Time.Year.2016.Q4), (Location.Geo.Europe.Germany.Berlin, Salesmen.People.John, Time.Year.2016.Q3) 75MDX statements: dicing SELECT Measures.Members ON COLUMNS, Location.Geo.Members ON ROWS FROM Sales 76MDX statements: slicing SELECT Measures.Members ON COLUMNS, Location.Geo.Members ON ROWS FROM Sales WHERE Products.Line.Laptops.MBP 77Syntax 78XBRL Architecture Linkbase (.xml) Schema (.xsd) Instance (.xml) Discoverable Taxonomy Set 79Technologies XML XML Names XML Schema XML Link 80Fact Dimension Value What Assets Who Coca Cola When Dec 31, 2011 Of what USD usgaap:Assets contextRef="FI2012Q4" decimals="6" id="Fact600212FD4D06E63B4F8F6874C6E5BE74" unitRef="usd" 86174000000 /usgaap:Assets 81Context xbrli:context id="FI2011Q4" xbrli:entity xbrli:identifier scheme="" 0000021344 /xbrli:identifier /xbrli:entity xbrli:period xbrli:instant20111231/xbrli:instant December 2011 /xbrli:period 1 2 3 4 /xbrli:context 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 82Unit xbrli:unit id="usd" xbrli:measureiso4217:USD/xbrli:measure /xbrli:unit 83Concept (XML Schema) xs:element id='usgaapAssets' name='Assets' nillable='true' substitutionGroup='xbrli:item' type='xbrli:monetaryItemType' xbrli:balance='debit' xbrli:periodType='instant' / 84Graphs 85DAGs 86Trees 87Node: locator loc xlink:href=" gaap20130131.xsdusgaapAssets" xlink:label="locus gaapAssets102D7A4D204ED45AC0DEDA6BBC78F38 6" xlink:type="locator" / 88Node: resource link:label id="labkoNetChangeInOperatingAssetsAndLiabilitiesDisclosureAbstrac tA6469A522E35CBF355816876394722EElabelenUS" xlink:label="labkoNetChangeInOperatingAssetsAndLiabilitiesDisclosur eAbstractA6469A522E35CBF355816876394722EE" xlink:role="" xlink:type="resource" xml:lang="enUS" NET CHANGE IN OPERATING ASSETS AND LIABILITIES DISCLOSURE Abstract /link:label 89Edge presentationArc order="10" preferredLabel="" xlink:arcrole="" xlink:from="locus gaapAssetsAbstract2F55ECB2BF7C1A62009CDA6BBC757094" xlink:to="locus gaapAssets102D7A4D204ED45AC0DEDA6BBC78F386" xlink:type="arc" / 90Summary 91Architecture ERP Analyze CRM ETL Report OLTP Files Mine 92
Document Information
User Name:
User Type:
Uploaded Date: