Question? Leave a message!

Data Warehousing and Data Mining

Data Warehousing and Data Mining 45
dr.naveensingla Profile Pic
Published Date:18-07-2017
Website URL
Data Warehousing and Data Mining – Introduction – General introduction to DWDM Business intelligence OLTP vs. OLAP Data integration Methodological framework DW definition Acknowledgements: I am indebted to Michael Böhlen and Stefano Rizzi for providing me their slides, upon which these lecture notes are based. J. Gamper, Free University of Bolzano, DWDM 2012/13The Big Picture of DWDM/1 • What's important for researchers: DB Algorithms and Theory Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 2The Big Picture of DWDM/2 • What's important for real world applications: Systems Database (and System Integration) Customer Algorithms J. Gamper, Free University of Bolzano, DWDM 2012/13 3The Big Picture of DWDM/3 • What's important for businesses: Algorithms  Database Customer Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 4Computer Science and Decision Making • An exponential increase in operational data has made computers the only tools suitable for providing data for decision-making performed by business managers. • The massive use of techniques for analyzing enterprise data made information systems a key factor to achieve business goals. J. Gamper, Free University of Bolzano, DWDM 2012/13 5Remarks about the DW part • We learn how to design, build, and use a data warehouse. • Relevance to the real world is an important guideline. • Not only/mainly crisp algorithms, theorems, etc. • We will look at a number of concrete and important case studies. • A good way to prepare and learn the subject is to participate to lectures. J. Gamper, Free University of Bolzano, DWDM 2012/13 6Content of the DW Part 1) Data warehousing: business intelligence, data integration, data warehouse, facts, dimensions, DW design 2) SQL OLAP extensions: analytical functions, crosstab, group by extensions, hierarchical cube, moving windows 3) Generalized multi-dimensional join: GMDJ, evaluation, subqueries, optimization rules, distributed evaluation 4) DW performance: pre-aggregation, lattice framework, view selection, view maintenance, bitmap indexing 5) ETL and advanced modeling: ETL process, handling changes in dimensions J. Gamper, Free University of Bolzano, DWDM 2012/13 7What is Business Intelligence?/1 • BI is a set of processes, tools, and technologies to transform business data into timely and accurate information to support decisional processes  Data Warehousing (DW)  On-Line Analytical Processing (OLAP)  Data Mining (DM) and Data Visualization (VIS)  Decision Analysis (what-if)  Customer Relationship Management (CRM) • BI systems are used by decision makers to get a comprehensive knowledge of the business and to define and support their business strategies. • The goal is to enable data-based decisions aimed at gaining competitive advantage, improving operative performance, responding more quickly to changes, increasing profitability, and, in general, creating added value for the company. J. Gamper, Free University of Bolzano, DWDM 2012/13 8What is Business Intelligence?/2 • BI is the “opposite” of Artificial Intelligence (AI)  AI systems make decisions for the users  BI systems help users make the right decisions, based on the available data  Many BI techniques have roots in AI, though. J. Gamper, Free University of Bolzano, DWDM 2012/13 9The BI Pyramid decisions WHAT-IF ANALYSIS simulation models DATA MINING knowledge learning models INFORMATION EXPLORATION statistical techniques information OLAP ANALYSIS data warehouse OPERATIONAL APPLICATIONS data data sources J. Gamper, Free University of Bolzano, DWDM 2012/13 10Example BI Queries • Q1: On October 11, 2000, find the 5 top-selling products for each product subcategory that contributes more than 20% of the sales within its product category. • Q2: As of March 15, 1995, determine shipping priority and potential gross revenue of the orders that have the 10 largest gross revenues among the orders that had not yet been shipped. Consider orders from the book market segment only. • Regular database models and systems are not suitable for this type of queries. J. Gamper, Free University of Bolzano, DWDM 2012/13 11BI is Crucial and Growing/1 • Meta Group: DW alone = 15 Bio. in 2000 • Palo Alto Management Group: BI = 113 Bio. in 2002 • The Web made BI more necessary:  Customers do not appear ”physically” in the store  Customers can change to other stores more easily • Thus:  You have to know your customers using data and BI.  Web logs make is possible to analyze customer behavior in more detailed than before (what was not bought?)  Combine web data with traditional customer data • Wireless Internet adds further to this:  Customers are always ”online”  Customer’s position is known  Combine position and knowledge about customer = very valuable J. Gamper, Free University of Bolzano, DWDM 2012/13 12BI is Crucial and Growing/2 • Gartner, 2009:  Organizations will expect IT leaders in charge of BI and performance management initiatives to help transform and significantly improve their business  Because of lack of information, processes, and tools, through 2012, more than 35% of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.  By 2010, 20% of organizations will have an industry- specific analytic application delivered via software as a standard service of their business intelligence portfolio.  In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities. • S. Chaudhuri, U. Dayal, V. Narasayya, CACM 2011:  Today, it is difficult to find a successful enterprise that has not leveraged BI technology for their business. J. Gamper, Free University of Bolzano, DWDM 2012/13 13BI: Key Problems 1) Complex and unusable models  Many DB models are difficult to understand  DB models do not focus on a single clear business purpose 2) Same data found in many different systems  Example: customer data in many different systems  The same concept is defined differently 3) Data is suited for operational systems  Accounting, billing, etc.  Do not support analysis across business functions 4) Data quality is bad  Missing data, imprecise data, different use of systems 5) Data are ”volatile”  Data deleted in operational systems (6 months)  Data change over time – no historical information J. Gamper, Free University of Bolzano, DWDM 2012/13 14BI: Solution • A new analysis environment with a data warehouse at the core, where data is  Integrated (logically and physically)  Subject oriented (versus function oriented)  Supporting management decisions (different organization)  Stable (data is not deleted, several versions)  Time variant (data can always be related to time) J. Gamper, Free University of Bolzano, DWDM 2012/13 15Definition of a Data Warehouse/1 • Barry Devlin, IBM Consultant A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context. J. Gamper, Free University of Bolzano, DWDM 2012/13 16Definition of a Data Warehouse/2 • W. H. Inmon, Building the Data Warehouse A data warehouse is a - subject-oriented, - integrated, - time-varying, - non-volatile collection of data that is used primarily in - organizational decision making. J. Gamper, Free University of Bolzano, DWDM 2012/13 17DW Architecture/1 • Basic elements of a Data Warehouse environment J. Gamper, Free University of Bolzano, DWDM 2012/13 18DW Architecture/2 EXTRACTION, TRANSFORMATION, AND Source layer LOADING: Operational data ETL processes extract data External data Data staging from sources, transform and ETL tools clean them, and finally load them in the ODS and in the Operational data warehouse Reconciled layer Data Store OPERATIONAL DATA Loading STORE: Operational data obtained after integrating and cleansing source data. As a Data Warehouse result, those data are Data warehouse integrated, consistent, Data marts layer appropriate, current, and detailed DATA MART: A subset or an aggregation of Analysis the data stored to a primary data warehouse. It includes a set of information pieces Reporting What-if analysis tools relevant to a specific tools Data mining OLAP business area, corporate tools tools department, or category of users J. Gamper, Free University of Bolzano, DWDM 2012/13 19Data Integration Problem: • Different interfaces • Different data representations • Duplicated information • Inconsistent information Integrated System Goal: • Collect and combine information • Provide an integrated view • Provide a uniform user interface • Support sharing of data J. Gamper, Free University of Bolzano, DWDM 2012/13 20