How data warehouse is utilized for Business intelligence

relationship between data warehousing and business intelligence for eCommerce. benefits of data warehousing and business intelligence pdf free download
dr.naveensingla Profile Pic
Published Date:18-07-2017
Your Website URL(Optional)
Introduction to Data Warehousing and Business Intelligence Slides kindly borrowed from the course “Data Warehousing and Machine Learning” Aalborg University, Denmark Christian S. Jensen Torben Bach Pedersen Christian Thomsen csj,tbp, Course Structure • Business intelligence  Extract knowledge from large amounts of data collected in a modern enterprise  Data warehousing, machine learning • Purpose  Acquire theoretical background in lectures and literature studies  Obtain practical experience on (industrial) tools in practical exercises Data warehousing: construction of a database with only data analysis purpose Business Machine learning: find Intelligence (BI) patterns automatically in databases 2 •1Literature • Multidimensional Databases and Data Warehousing, Christian S. Jensen, Torben Bach Pedersen, Christian Thomsen, Morgan & Claypool Publishers, 2010 • Data Warehouse Design: Modern Principles and Methodologies, Golfarelli and Rizzi, McGraw-Hill, 2009 • Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications, Elzbieta Malinowski, Esteban Zimányi, Springer, 2008 • The Data Warehouse Lifecycle Toolkit, Kimball et al., Wiley 1998 nd • The Data Warehouse Toolkit, 2 Ed., Kimball and Ross, Wiley, 2002 3 Overview • Why Business Intelligence? • Data analysis problems • Data Warehouse (DW) introduction • DW topics  Multidimensional modeling  ETL  Performance optimization 4 •2What is Business Intelligence (BI)? • From Encyclopedia of Database Systems: “BI refers to a set of tools and techniques that enable a company to transform its business data into timely and accurate information for the decisional process, to be made available to the right persons in the most suitable form.” 5 What is Business Intelligence (BI)? • BI is different from Artificial Intelligence (AI)  AI systems make decisions for the users  BI systems help the users make the right decisions, based on available data • Combination of technologies  Data Warehousing (DW)  On-Line Analytical Processing (OLAP)  Data Mining (DM)  …… 6 •3Why is BI Important? • Worldwide BI revenue in 2005 = US 5.7 billion  10% growth each year  A market where players like IBM, Microsoft, Oracle, and SAP compete and invest • BI is not only for large enterprises  Small and medium-sized companies can also benefit from BI • The financial crisis has increased the focus on BI  You cannot afford not to use the “gold” in your data 7 BI and the Web • The Web makes BI even more useful  Customers do not appear “physically” in a store; their behaviors cannot be observed by traditional methods  A website log is used to capture the behavior of each customer, e.g., sequence of pages seen by a customer, the products viewed  Idea: understand your customers using data and BI  Utilize website logs, analyze customer behavior in more detail than before (e.g., what was not bought?)  Combine web data with traditional customer data 8 •4Case Study of an Enterprise • Example of a chain (e.g., fashion stores or car dealers)  Each store maintains its own customer records and sales records  Hard to answer questions like: “find the total sales of Product X from stores in Aalborg”  The same customer may be viewed as different customers for different stores; hard to detect duplicate customer information  Imprecise or missing data in the addresses of some customers  Purchase records maintained in the operational system for limited time (e.g., 6 months); then they are deleted or archived  The same “product” may have different prices, or different discounts in different stores • Can you see the problems of using those data for business analysis? 9 Data Analysis Problems • The same data found in many different systems  Example: customer data across different stores and departments  The same concept is defined differently • Heterogeneous sources  Relational DBMS, On-Line Transaction Processing (OLTP)  Unstructured data in files (e.g., MS Word)  Legacy systems  … 10 •5Data Analysis Problems (cont’) • Data is suited for operational systems  Accounting, billing, etc.  Do not support analysis across business functions • Data quality is bad  Missing data, imprecise data, different use of systems • Data are “volatile”  Data deleted in operational systems (6 months)  Data change over time – no historical information 11 Data Warehousing • Solution: new analysis environment (DW) where data are  Subject oriented (versus function oriented)  Integrated (logically and physically)  Time variant (data can always be related to time)  Stable (data not deleted, several versions)  Supporting management decisions (different organization) • Data from the operational systems are  Extracted  Cleansed  Transformed  Aggregated (?)  Loaded into the DW • A good DW is a prerequisite for successful BI 12 •6DW: Purpose and Definition • DW is a store of information organized in a unified data model • Data collected from a number of different sources  Finance, billing, website logs, personnel, … • Purpose of a data warehouse (DW): support decision making • Easy to perform advanced analysis  Ad-hoc analysis and reports  We will cover this soon ……  Data mining: discovery of hidden patterns and trends  You will study this in another course 13 DW Architecture – Data as Materialized Views Existing databases and systems (OLTP) New databases Appl. and systems (OLAP) OLAP DB DM Appl. Data DB mining DM Trans. Appl. DW DB (Global) Data Warehouse Appl. Visua- lization DM DB (Local) Appl. Data Marts DB Analogy: (data) producers warehouse (data) consumers 14 •7Function vs. Subject Orientation Function-oriented systems Subject-oriented Sales Appl. systems D-Appl. DM DB Appl. D-Appl. DB DM Trans. Costs Appl. DW DB All subjects, Profit integrated Appl. D-Appl. DM DB Selected Appl. subjects DB 15 Top-down vs. Bottom-up Appl. D-Appl. DB DM Appl. D-Appl. DB DM Trans. Appl. DW DB In-between: Appl. D-Appl. 1. Design of DW for DM1 DM DB 2. Design of DM2 and Bottom-up: Appl. integration with DW 1. Design of DMs Top-down:DB 3. Design of DM3 and 2. Maybe integration 1. Design of DW integration with DW of DMs in DW 2. Design of DMs 4. ... 3. Maybe no DW 16 •8Hard/Infeasible Queries for OLTP • Why not use the existing databases (OLTP) for business analysis? • Business analysis queries  In the past five years, which product is the most profitable?  Which public holiday we have the largest sales?  Which week we have the largest sales?  Does the sales of dairy products increase over time? • Difficult to express these queries in SQL rd  3 query: may extract the “week” value using a function  But the user has to learn many transformation functions … th  4 query: use a “special” table to store IDs of all dairy products, in advance  There can be many different dairy products; there can be many other product types as well … • The need of multidimensional modeling 17 Multidimensional Modeling • Example: sales of supermarkets • Facts and measures  Each sales record is a fact, and its sales value is a measure • Dimensions  Group correlated attributes into the same dimension  easier for analysis tasks  Each sales record is associated with its values of Product, Store, Time Product Type Category Store City County Date Sales Top Beer Beverage Trøjborg Århus Århus 25 May, 2009 5.75 Product Store Time 18 •9Multidimensional Modeling • How do we model the Time dimension?  Hierarchies with multiple levels  Attributes, e.g., holiday, event tid day day week month year work … T day 1 January 1 1 1 2009 No … Year Week 1st 2009 Month 2 January 2 1 1 2009 Yes … 2nd 2009 Day … … … … … … … • Advantage of this model?  Easy for query (more about this later) • Disadvantage?  Data redundancy (but controlled redundancy is acceptable) 19 Quick Review: Normalized Database Product ID Type Category Price 001 Beer Beverage 6.00 002 Rice Cereal 4.00 003 Beer Beverage 7.00 004 Wheat Cereal 5.00 Product ID TypeID Price TypeID Type CategoryID CategoryID Category 001 013 6.00 013 Beer 042 042 Beverage 002 052 4.00 052 Rice 099 099 Cereal 003 013 7.00 067 Wheat 099 004 067 5.00 • Normalized database avoids  Redundant data  Modification anomalies • How to get the original table? (join them) • No redundancy in OLTP, controlled redundancy in OLAP 20 •10OLTP vs. OLAP OLTP OLAP Target operational needs business analysis Data small, operational data large, historical data Model normalized denormalized/ multidimensional Query language SQL not unified – but MDX is used by many Queries small large Updates frequent and small infrequent and batch Transactional recovery necessary not necessary Optimized for update operations query operations 21 OLAP Data Cube Store Product Time Sales • Data cube Aalborg Bread 2000 57  Useful data analysis tool in DW Aalborg Milk 2000 56  Generalized GROUP BY queries Copenhagen Bread 2000 123  Aggregate facts based on chosen … … … … dimensions  Product, store, time dimensions  Sales measure of sale facts • Why data cube?  Good for visualization (i.e., text results hard to understand)  Multidimensional, intuitive  Support interactive OLAP operations • How is it different from a spreadsheet? 22 •11On-Line Analytical Processing (OLAP) 102 250 • On-Line Analytical Processing  Interactive analysis All Time  Explorative discovery  Fast response times required • OLAP operations/queries  Aggregation, e.g., SUM 9 10 11 15  Starting level, (Year, City)  Roll Up: Less detail  Drill Down: More detail  Slice/Dice: Selection, Year=2000 23 Advanced Multidimensional Modeling • Changing dimensions  Some dimensions are not static. They change over time.  A store moves to a new location with more space  The name of a product changes  A customer moves from Aalborg Øst to Hasseris  How do we handle these changes? • Large-scale dimensional modeling  How do we coordinate the dimensions in different data cubes and data marts? Dimensions Time Customer Product Supplier Sales + + + Data Costs + + marts Bus architecture Profit + + + + 24 •12Extract, Transform, Load (ETL) • “Getting multidimensional data into the DW” • Problems 1. Data from different sources 2. Data with different formats 3. Handling of missing data and erroneous data 4. Query performance of DW • ETL  Extract (for problem 1)  Transformations / cleansing (for problems 2, 3)  Load (for problem 4) • The most time-consuming process in DW development  80% of development time spent on ETL 25 Performance Optimization Sales tid pid locid sales 1 1 1 10 2 1 1 20 • The data warehouse contains GBytes or 3 2 3 40 even TBytes of data … … … … 1 billion rows • OLAP users require fast query response time  They don’t want to wait for the result for 1 hour  Acceptable: answer within 10 seconds • Idea: precompute some partial result in advance and store it  At query time, such partial result can be utilized to derive the final result very fast 26 •13Materialization Example • Imagine 1 billion sales rows, 1000 products, 100 locations Sales • CREATE VIEW TotalSales (pid, locid, total) AS SELECT, s.locid, SUM(s.sales) tid pid locid sales FROM Sales s 1 1 1 10 GROUP BY, s.locid 2 1 1 20 3 2 3 40 • The materialized view has 100,000 rows … … … … 1 billion rows • Wish to answer the query:  SELECT p.category, SUM(s.sales) FROM Products p, Sales s WHERE VIEW TotalSales GROUP BY p.category pid locid sales • Rewrite the query to use the view: 1 1 30  SELECT p.category, SUM( 2 3 40 FROM Products p, TotalSales t … … … WHERE GROUP BY p.category 100,000 rows  Query becomes 10,000 times faster 27 Data Warehouse Architecture • Central • Federated • Tiered 28 •14Central DW Architecture Clients • All data in one, central DW • All client queries directly on the central DW • Pros  Simplicity  Easy to manage Central • Cons DW  Bad performance due to no redundancy/workload distribution Source Source 29 Federated DW Architecture Clients • Data stored in separate data marts, aimed at special departments • Logical DW (i.e., virtual) Finance Distr. Mrktng • Data marts contain detail data mart mart mart • Pros  Performance due to distribution • Cons Logical  More complex DW Source Source 30 •15Tiered Architecture • Central DW is materialized • Data is distributed to data marts in one or more tiers • Only aggregated data in cube tiers • Data is aggregated/reduced as it moves through tiers • Pros Central  Best performance due to DW redundancy and distribution • Cons  Most complex  Hard to manage 31 Common DW Issues • Metadata management  Need to understand data = metadata needed  Greater need in OLAP than in OLTP as “raw” data is used  Need to know about:  Data definitions, dataflow, transformations, versions, usage, security • DW project management  DW projects are large and different from ordinary SW projects  12-36 months and US 1+ million per project  Data marts are smaller and “safer” (bottom up approach)  Reasons for failure  Lack of proper design methodologies  High HW+SW cost  Deployment problems (lack of training)  Organizational change is hard… (new processes, data ownership,..)  Ethical issues (security, privacy,…) 32 •16Topics not Covered in the Course • Privacy/security of data during ETL  Encryption may not work  During extraction/transformation, someone may need to know original values in order to check whether ETL performs correctly • Data Visualization (VIS) • Decision Analysis (What-if) • Customer Relationship Management (CRM) 33 Summary • Why Business Intelligence? • Data analysis problems • Data Warehouse (DW) introduction • DW Topics  Multidimensional modeling  ETL  Performance optimization • BI provide many advantages to your organization  A good DW is a prerequisite for BI 34 •17Multidimensional Databases Overview • Cubes: Dimensions, Facts, Measures • OLAP Queries • Relational Implementation • Redundancy 2 1ER Model vs. Multidimensional Model • Why don’t we use the ER model in data warehousing? • ER model: a data model for general purposes  All types of data are “equal”, difficult to identify the data that is important for business analysis  No difference between:  What is important  What just describes the important  Normalized databases spread information  When analyzing data, the information must be integrated again  Hard to overview a large ER diagram (e.g., over 100 entities/relations for an enterprise) 3 ER Model vs. Multidimensional Model • The multidimensional model  Its only purpose: data analysis  It is not suitable for OLTP systems  More built in “meaning”  What is important  What describes the important  What we want to optimize  Easy for query operations • Recognized by OLAP/BI tools  Tools offer powerful query facilities based on MD design  Example: TARGIT Analysis 4 2The multidimensional model • Data is divided into:  Facts  Dimensions • Facts are the important entity: a sale • Facts have measures that can be aggregated: sales price • Dimensions describe facts  A sale has the dimensions Product, Store and Time • Facts “live” in a multidimensional cube (dice)  Think of an array from programming languages • Goal for dimensional modeling:  Surround facts with as much context (dimensions) as possible  Hint: redundancy may be ok (in well-chosen places)  But you should not try to model all relationships in the data (unlike E/R and OO modeling) 5 Cube Example Milk 56 67 Bread Aalborg 57 45 211 Copenhagen 123 127 2000 2001 6 3

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.