Question? Leave a message!




Data Warehousing and Data Mining

Data Warehousing and Data Mining 45
Data Warehousing and Data Mining – Introduction – General introduction to DWDM Business intelligence OLTP vs. OLAP Data integration Methodological framework DW definition Acknowledgements: I am indebted to Michael Böhlen and Stefano Rizzi for providing me their slides, upon which these lecture notes are based. J. Gamper, Free University of Bolzano, DWDM 2012/13The Big Picture of DWDM/1 • What's important for researchers: DB Algorithms and Theory Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 2The Big Picture of DWDM/2 • What's important for real world applications: Systems Database (and System Integration) Customer Algorithms J. Gamper, Free University of Bolzano, DWDM 2012/13 3The Big Picture of DWDM/3 • What's important for businesses: Algorithms  Database Customer Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 4Computer Science and Decision Making • An exponential increase in operational data has made computers the only tools suitable for providing data for decisionmaking performed by business managers. • The massive use of techniques for analyzing enterprise data made information systems a key factor to achieve business goals. J. Gamper, Free University of Bolzano, DWDM 2012/13 5Remarks about the DW part • We learn how to design, build, and use a data warehouse. • Relevance to the real world is an important guideline. • Not only/mainly crisp algorithms, theorems, etc. • We will look at a number of concrete and important case studies. • A good way to prepare and learn the subject is to participate to lectures. J. Gamper, Free University of Bolzano, DWDM 2012/13 6Content of the DW Part 1) Data warehousing: business intelligence, data integration, data warehouse, facts, dimensions, DW design 2) SQL OLAP extensions: analytical functions, crosstab, group by extensions, hierarchical cube, moving windows 3) Generalized multidimensional join: GMDJ, evaluation, subqueries, optimization rules, distributed evaluation 4) DW performance: preaggregation, lattice framework, view selection, view maintenance, bitmap indexing 5) ETL and advanced modeling: ETL process, handling changes in dimensions J. Gamper, Free University of Bolzano, DWDM 2012/13 7What is Business Intelligence/1 • BI is a set of processes, tools, and technologies to transform business data into timely and accurate information to support decisional processes  Data Warehousing (DW)  OnLine Analytical Processing (OLAP)  Data Mining (DM) and Data Visualization (VIS)  Decision Analysis (whatif)  Customer Relationship Management (CRM) • BI systems are used by decision makers to get a comprehensive knowledge of the business and to define and support their business strategies. • The goal is to enable databased decisions aimed at gaining competitive advantage, improving operative performance, responding more quickly to changes, increasing profitability, and, in general, creating added value for the company. J. Gamper, Free University of Bolzano, DWDM 2012/13 8What is Business Intelligence/2 • BI is the “opposite” of Artificial Intelligence (AI)  AI systems make decisions for the users  BI systems help users make the right decisions, based on the available data  Many BI techniques have roots in AI, though. J. Gamper, Free University of Bolzano, DWDM 2012/13 9The BI Pyramid decisions WHATIF ANALYSIS simulation models DATA MINING knowledge learning models INFORMATION EXPLORATION statistical techniques information OLAP ANALYSIS data warehouse OPERATIONAL APPLICATIONS data data sources J. Gamper, Free University of Bolzano, DWDM 2012/13 10Example BI Queries • Q1: On October 11, 2000, find the 5 topselling products for each product subcategory that contributes more than 20 of the sales within its product category. • Q2: As of March 15, 1995, determine shipping priority and potential gross revenue of the orders that have the 10 largest gross revenues among the orders that had not yet been shipped. Consider orders from the book market segment only. • Regular database models and systems are not suitable for this type of queries. J. Gamper, Free University of Bolzano, DWDM 2012/13 11BI is Crucial and Growing/1 • Meta Group: DW alone = 15 Bio. in 2000 • Palo Alto Management Group: BI = 113 Bio. in 2002 • The Web made BI more necessary:  Customers do not appear ”physically” in the store  Customers can change to other stores more easily • Thus:  You have to know your customers using data and BI.  Web logs make is possible to analyze customer behavior in more detailed than before (what was not bought)  Combine web data with traditional customer data • Wireless Internet adds further to this:  Customers are always ”online”  Customer’s position is known  Combine position and knowledge about customer = very valuable J. Gamper, Free University of Bolzano, DWDM 2012/13 12BI is Crucial and Growing/2 • Gartner, 2009:  Organizations will expect IT leaders in charge of BI and performance management initiatives to help transform and significantly improve their business  Because of lack of information, processes, and tools, through 2012, more than 35 of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.  By 2010, 20 of organizations will have an industry specific analytic application delivered via software as a standard service of their business intelligence portfolio.  In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities. • S. Chaudhuri, U. Dayal, V. Narasayya, CACM 2011:  Today, it is difficult to find a successful enterprise that has not leveraged BI technology for their business. J. Gamper, Free University of Bolzano, DWDM 2012/13 13BI: Key Problems 1) Complex and unusable models  Many DB models are difficult to understand  DB models do not focus on a single clear business purpose 2) Same data found in many different systems  Example: customer data in many different systems  The same concept is defined differently 3) Data is suited for operational systems  Accounting, billing, etc.  Do not support analysis across business functions 4) Data quality is bad  Missing data, imprecise data, different use of systems 5) Data are ”volatile”  Data deleted in operational systems (6 months)  Data change over time – no historical information J. Gamper, Free University of Bolzano, DWDM 2012/13 14BI: Solution • A new analysis environment with a data warehouse at the core, where data is  Integrated (logically and physically)  Subject oriented (versus function oriented)  Supporting management decisions (different organization)  Stable (data is not deleted, several versions)  Time variant (data can always be related to time) J. Gamper, Free University of Bolzano, DWDM 2012/13 15Definition of a Data Warehouse/1 • Barry Devlin, IBM Consultant A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context. J. Gamper, Free University of Bolzano, DWDM 2012/13 16Definition of a Data Warehouse/2 • W. H. Inmon, Building the Data Warehouse A data warehouse is a subjectoriented, integrated, timevarying, nonvolatile collection of data that is used primarily in organizational decision making. J. Gamper, Free University of Bolzano, DWDM 2012/13 17DW Architecture/1 • Basic elements of a Data Warehouse environment J. Gamper, Free University of Bolzano, DWDM 2012/13 18DW Architecture/2 EXTRACTION, TRANSFORMATION, AND Source layer LOADING: Operational data ETL processes extract data External data Data staging from sources, transform and ETL tools clean them, and finally load them in the ODS and in the Operational data warehouse Reconciled layer Data Store OPERATIONAL DATA Loading STORE: Operational data obtained after integrating and cleansing source data. As a Data Warehouse result, those data are Data warehouse integrated, consistent, Data marts layer appropriate, current, and detailed DATA MART: A subset or an aggregation of Analysis the data stored to a primary data warehouse. It includes a set of information pieces Reporting Whatif analysis tools relevant to a specific tools Data mining OLAP business area, corporate tools tools department, or category of users J. Gamper, Free University of Bolzano, DWDM 2012/13 19Data Integration Problem: • Different interfaces • Different data representations • Duplicated information • Inconsistent information Integrated System Goal: • Collect and combine information • Provide an integrated view • Provide a uniform user interface • Support sharing of data J. Gamper, Free University of Bolzano, DWDM 2012/13 20QueryDriven Data Integration • Data is integrated on demand (lazy) • PROS  Access to most uptodate data (all source data directly available)  No duplication of data • CONS  Delay in query processing  Slow (or currently unavailable) information sources  Complex filtering and integration  Inefficient and expensive for frequent queries  Competes with local processing at sources  Data loss at the sources (e.g., historical data) cannot be recovered • Has not caught on in industry J. Gamper, Free University of Bolzano, DWDM 2012/13 21WarehouseDriven Data Integration • Data is integrated in advance (eager) • Data is stored in DW for querying and analysis • PROS  High query performance  Does not interfere with local processing at sources  Assumes that data warehouse update is possible during downtime of local processing  Complex queries are run at the data warehouse  OLTP queries are run at the source systems • CONS  Duplication of data  The most current source data is not available • Has caught on in industry J. Gamper, Free University of Bolzano, DWDM 2012/13 22OLTP versus OLAP/1 • OnLine Transaction Processing (OLTP)  Many ”small” queries on a small number of tuples from many tables that need to be joined  Frequent updates  The system is always available for both updates and reads  Smaller data volume (few historical data)  Complex data model (normalized) • OnLine Analytical Processing (OLAP)  Fewer, but ”bigger” queries that typically need to scan a huge amount of records and doing some aggregation  Frequent reads, infrequent updates (daily, weekly)  2phase operation: either reading or updating  Larger data volumes (collection of historical data)  Simple data model (multidimensional/denormalized) J. Gamper, Free University of Bolzano, DWDM 2012/13 23OLTP versus OLAP/2 • A mix of analytical queries (OLAP) with transactional routine queries (OLTP) inevitably slows down the system, and this does not meet the needs of users of both types of queries. • Separate OLAP from OLTP by creating a new repository that integrates data from various sources and then makes data available for analysis and evaluation aimed at decisionmaking processes J. Gamper, Free University of Bolzano, DWDM 2012/13 24OLTP versus OLAP/3 Existing databases and systems (OLTP) New databases Appl. and systems (OLAP) OLAP DM DB Appl. Data DB mining DM Trans. Appl. DW DB ”Global” Data Warehouse Appl. Visua lization DM DB Data Marts Appl. DB J. Gamper, Free University of Bolzano, DWDM 2012/13 25OLTP versus OLAP/4 Functionoriented Systems (OLTP) Subjectoriented Appl. Systems (OLAP) Appl. DM DB Appl. Appl. DB DM Trans. Appl. DW DB All subjects, integrated Appl. Appl. DM DB Selected Appl. subjects DB J. Gamper, Free University of Bolzano, DWDM 2012/13 26OLTP Example: CS Dept/1 1, 2, 3, 4 pstatus persinfs pstatusid hostgrpnms persinfid fqdns hostgrps hostgrpnmid prsnstatus persinf fqdnid hostgrpid hostgrpnmn personidpinf fqdn fqdnidhgr hostgrpnmreason persinfcrtdato fqdncrtdato hstgrnmidhgr hstgrncrtdato persinfexpdato fqdnexpdato persons hstgrnexpdato personID name employees firstname users employeeid homeaddress uids pgecos userid personidemp homeemail uidid pgecoid nameidusr positionidemp, homephone ugididuid personidpge home initials personcrtdato idcatiduid useridpge disklimit emplstdato pstatusidps hostgrpiduid userstatidusr emplexpdato uidcrtdata uididusr uidexpdato hostgrpidusr usercrtdato personwrkgroups personsgroups uguests userexpdato personsgroupid personwrkgroupid uguestid personidprw personidprs personidugu wrkgroupidprw sgroupidprs ughost idcats ugids semesteridprs uguestcrtdato idcatid ugidid uguestexp names idcat ugid nameid userstats wrkgroups osname userstatid wrkgroupid namecrtdato userstat wrkgroup nameexpdato J. Gamper, Free University of Bolzano, DWDM 2012/13 27OLTP Example: CS Dept/2 semesters positioncats semesterid statusempls positions positioncatID semester statusemplid positionID positioncat employees statusempl position employeeid engposition personidemp emplocations positionidemp, sgroupsems persons emplocationid initials sgroupsemid personID employeeidelo emplstdato sgroupidsgs name roomidelo emplexpdato semesteridsgs firstname homeaddress homeemail sgroups homephone rooms sgroupid phones personcrtdato roomid sgroup phoneid pstatusidps roomname sgroupcrtdato employeeidpho roomalias sgroupexpdato phonenridpho roomcatidrom supervisor phonecats phonecatid phonenrs roomcats phonecat phonelocs sgrplocs phonenrid roomcatid phonelocid sgrplocid phonenr roomcat roomidphl roomidsgl phonecatidphn owners phonenridphl sgridsgl owneridphn ownerid owner J. Gamper, Free University of Bolzano, DWDM 2012/13 28OLTP Example: OncoNet • OncoNet is a system for the management of patients undergoing a cancer therapy  200 tables • Wellsuited for daily management of patients • But: statistical analysis are expensive  takes up to 12 hours  tables are locked for that time  run queries over weekend • A DW approach reduced the runtime of the same queries to a few seconds (BScthesis of A. Heinisch) J. Gamper, Free University of Bolzano, DWDM 2012/13 29Methodological/Design Framework • Building a DW is a very complex task, which requires an accurate planning aimed at devising satisfactory answers to organizational and architectural questions. • A large number of organizations lack experience and skills that are required to meet the challenges involved in DW projects. • The reports of DW project failures state that a major cause lies in the absence of a global view of the design process: in other terms, in the absence of a design methodology. • Methodologies are created by closely studying similar experiences and minimizing the risks for failure by basing new approaches on a constructive analysis of the mistakes made previously. J. Gamper, Free University of Bolzano, DWDM 2012/13 30Many Ways not to Do/1 Appl. App DM DB Trans. Appl. DB App DM Appl. DB Trans. Appl. App DM DB Trans. Appl. DB J. Gamper, Free University of Bolzano, DWDM 2012/13 31Many Ways not to Do/2 DAppl. Appl. DM DB Trans. Appl. DB Appl. DM Appl. DB Appl. DM Appl. Trans. DW DB Appl. DB Appl. DM J. Gamper, Free University of Bolzano, DWDM 2012/13 32Topdown Approach • Analyze global business needs, plan how to develop a data warehouse, design it, and implement it as a whole  This procedure is promising: it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well integrated data warehouses.  Highcost estimates with longterm implementations discourage company managers from embarking on these kind of projects.  Analyzing and integrating all relevant sources at the same time is a very difficult task, even because it is not very likely that they are all available and stable at the same time.  It is extremely difficult to forecast the specific needs of every department involved in a project, which can result in the analysis process coming to a standstill.  Since no working system is going to be delivered in the short term, users cannot check for this project to be useful, so they lose trust and interest in it. J. Gamper, Free University of Bolzano, DWDM 2012/13 33Bottomup Approach • DWs are incrementally built and several data marts are iteratively created. Each data mart is based on a set of facts that are linked to a specific department and that can be interesting for a user group  Leads to concrete results in a short time  Does not require huge investments  Enables designers to investigate one area at a time  Gives managers a quick feedback about the actual benefits of the system being built  Keeps the interest for the project constantly high  May determine a partial vision of the business domain J. Gamper, Free University of Bolzano, DWDM 2012/13 34Topdown vs. Bottomup Approach Appl. Appl. DM DB Appl. Appl. DB DM Trans. Appl. DW DB ”Inbetween”: 1. Design of DW for Appl. Appl. DM1 DM 2. Design of DM2 and DB integration with DW Bottomup: Appl. 3. Design of DM3 and 1. Design of DMs DB integration with DW 2. Integration of Topdown: 4. ... DMs in DW 1. Design of DW 3. Maybe no 2. Design of DMs physical DW J. Gamper, Free University of Bolzano, DWDM 2012/13 35Topdown vs. Bottomup Approach Appl. Appl. DM DB Appl. Appl. DB DM Trans. Appl. DW DB ”Inbetween”: 1. Design of DW for Appl. Appl. DM1 DM 2. Design of DM2 and DB integration with DW Bottomup: Appl. 3. Design of DM3 and 1. Design of DMs DB integration with DW 2. Integration of Topdown: 4. ... DMs in DW 1. Design of DW 3. Maybe no 2. Design of DMs physical DW J. Gamper, Free University of Bolzano, DWDM 2012/13 36The Lifecycle/1 • set system goals, borders, and size • select an approach for design and implementation Goal setting and planning • estimate costs and benefits • analyze risks and expectations Infrastructure design • examine the skills of the working team Design and developm. of data marts J. Gamper, Free University of Bolzano, DWDM 2012/13 37The Lifecycle/2 • analyze and compare the Goal setting and planning possible architectural solutions • assess the available Infrastructure design technologies and tools • create a preliminary plan of Design and developm. the whole system of data marts J. Gamper, Free University of Bolzano, DWDM 2012/13 38The Lifecycle/3 Goal setting and planning Infrastructure design Every iteration causes a new Design and developm. data mart and new applications of data marts to be created and progressively added to the DW system J. Gamper, Free University of Bolzano, DWDM 2012/13 39Data mart design phases db administrator Source analysis and integration Requirement analysis Conceptual design designer Workload and data volume business user Logical design ETL design Physical design J. Gamper, Free University of Bolzano, DWDM 2012/13 40The First Data Mart  is the one playing the most strategic role for the enterprise  should be a DM4 DM2 backbone for the whole DW DM1  should lean on available and DM3 DM5 consistent data sources Source 3 Source 2 Source 1 J. Gamper, Free University of Bolzano, DWDM 2012/13 41Summary • BI is wellrecognized and is a combination of a number of techniques to support decision making. • DW is at the core of BI that  provides a complete, consistent, subjectoriented and timevarying collection of the data;  allows to separte OLTP from OLAP. • Applications that use the DW include OLAP, data mining, visualization • BI can provide many advantages to an organization  Creates added value by transforming data into information  Provides comprehensive knowledge about your business  A good DW is a prerequisite for BI  But, a DW is a means rather than a goal … it is only a success if it is heavily used • Following a clear design methodology is important. J. Gamper, Free University of Bolzano, DWDM 2012/13 42