Question? Leave a message!

Data Warehousing and Data Mining

Data Warehousing and Data Mining
Dr.JakeFinlay Profile Pic
Published Date:22-07-2017
Website URL
Data Warehousing and Data Mining A.A. 04-05 Datawarehousing & Datamining 1Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining • Association rules • Sequential patterns • Classification • Clustering A.A. 04-05 Datawarehousing & Datamining 2Introduction and Terminology Evolution of database technology File processing (60s) Relational DBMS (70s) Advanced data models e.g., Web-Based Repositories (90s) Object-Oriented, application-oriented (80s) Data Warehousing and Data Mining (90s) Global/Integrated Information Systems (2000s) A.A. 04-05 Datawarehousing & Datamining 3Introduction and Terminology Major types of information systems within an organization Few EXECUTIVE sophisticated SUPPORT users SYSTEMS MANAGEMENT Decision Support Systems (DSS) LEVEL Management Information Sys. (MIS) SYSTEMS KNOWLEDGE Office automation, Groupware, Content LEVEL Distribution systems, Workflow mangement systems SYSTEMS TRANSACTION Enterprise Resource Planning (ERP) Many un- PROCESSING sophisticated Customer Relationship Management (CRM) SYSTEMS users A.A. 04-05 Datawarehousing & Datamining 4Introduction and Terminology Transaction processing systems:  Support the operational level of the organization, possibly integrating needs of different functional areas (ERP);  Perform and record the daily transactions necessary to the conduct of the business  Execute simple read/update operations on traditional databases, aiming at maximizing transaction throughput  Their activity is described as: OLTP (On-Line Transaction Processing) A.A. 04-05 Datawarehousing & Datamining 5Introduction and Terminology Knowledge level systems: provide digital support for managing documents (office automation), user cooperation and communication (groupware), storing and retrieving information (content distribution), automation of business procedures (workflow management) Management level systems: support planning, controlling and semi-structured decision making at management level by providing reports and analyses of current and historical data Executive support systems: support unstructured decision making at the strategic level of the organization A.A. 04-05 Datawarehousing & Datamining 6Introduction and Terminology Multi-Tier architecture for Management level and Executive support systems Decision Making PRESENTATION Data Presentation Reporting/Visualization engines BUSINESS Data Analysis LOGIC OLAP, Data Mining engines Data Warehouses / Data Marts DATA Data Sources Transactional DB, ERP, CRM, Legacy systems A.A. 04-05 Datawarehousing & Datamining 7Introduction and Terminology OLAP (On-Line Analytical Processing):  Reporting based on (multidimensional) data analysis  Read-only access on repositories of moderate-large size (typically, data warehouses), aiming at maximizing response time Data Mining:  Discovery of novel, implicit patterns from, possibly heterogeneous, data sources  Use a mix of sophisticated statistical and high- performance computing techniques A.A. 04-05 Datawarehousing & Datamining 8Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining • Association rules • Sequential patterns • Classification • Clustering A.A. 04-05 Datawarehousing & Datamining 9Data Warehousing DATA WAREHOUSE Database with the following distinctive characteristics:  Separate from operational databases  Subject oriented: provides a simple, concise view on one or more selected areas, in support of the decision process  Constructed by integrating multiple, heterogeneous data sources  Contains historical data: spans a much longer time horizon than operational databases  (Mostly) Read-Only access: periodic, infrequent updates A.A. 04-05 Datawarehousing & Datamining 10Data Warehousing Types of Data Warehouses  Enterprise Warehouse: covers all areas of interest for an organization  Data Mart: covers a subset of corporate-wide data that is of interest for a specific user group (e.g., marketing).  Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed) A.A. 04-05 Datawarehousing & Datamining 11Data Warehousing Multi-Tier Architecture Analysis Cleaning Data Warehouse Reporting Server extraction Data Mining DB DB OLAP Front-End Data sources Data Storage engine Tools A.A. 04-05 Datawarehousing & Datamining 12Data Warehousing Multidimensional (logical) Model Data are organized around one or more FACT TABLEs. Each Fact Table collects a set of omogeneous events (facts) characterized by dimensions and dependent attributes Example: Sales at a chain of stores Product Supplier Store Period Units Sales P1 S1 St1 1qtr 30 1500€ P2 S1 St3 2qtr 100 9000€ dimensions dependent attributes A.A. 04-05 Datawarehousing & Datamining 13Data Warehousing Multidimensional (logical) Model (cont’d) Each dimension can in turn consist of a number of attributes. In this case the value in the fact table is a foreign key referring to an appropriate dimension table Store product Product Code Code Supplier dimension Name Description table Store Address dimension Period table supplier Manager Units Code Sales Name dimension STAR table Fact table SCHEMA Address A.A. 04-05 Datawarehousing & Datamining 14Data Warehousing Conceptual Star Schema (E-R Model) PERIOD (0,N) (1,1) (1,1) (1,1) (0,N) (0,N) PRODUCT STORE FACTs (1,1) (0,N) SUPPLIER A.A. 04-05 Datawarehousing & Datamining 15Data Warehousing OLAP Server Architectures They are classified based on the underlying storage layouts ROLAP (Relational OLAP): uses relational DBMS to store and manage warehouse data (i.e., table-oriented organization), and specific middleware to support OLAP queries. MOLAP (Multidimensional OLAP): uses array-based data structures and pre-computed aggregated data. It shows higher performance than OLAP but may not scale well if not properly implemented HOLAP (Hybird OLAP): ROLAP approach for low-level raw data, MOLAP approach for higher-level data (aggregations). A.A. 04-05 Datawarehousing & Datamining 16Data Warehousing MOLAP Approach Total sales of PCs in Period europe in the 4th 1Qtr 2Qtr 3Qtr 4Qtr quarter of the year TV Europe PC DVD CD North America Middle East Far East DATA CUBE representing a Fact Table A.A. 04-05 Datawarehousing & Datamining 17 Product RegionData Warehousing OLAP Operations: SLICE Fix values for one or more dimensions E.g. Product = CD A.A. 04-05 Datawarehousing & Datamining 18Data Warehousing OLAP Operations: DICE Fix ranges for one or more dimensions E.g. Product = CD or DVD Period = 1Qrt or 2Qrt Region = Europe or North America A.A. 04-05 Datawarehousing & Datamining 19Data Warehousing OLAP Operations: Roll-Up Aggregate data by grouping along one (or more) Period year dimensions TV Europe PC E.g.: group quarters DVD CD North America Middle East Far East Region -1 Drill-Down = (Roll-Up) A.A. 04-05 Datawarehousing & Datamining 20 ro c P du t