Question? Leave a message!




Data Warehousing and Data Mining

Data Warehousing and Data Mining
Data Warehousing and Data Mining A.A. 0405 Datawarehousing Datamining 1Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining • Association rules • Sequential patterns • Classification • Clustering A.A. 0405 Datawarehousing Datamining 2Introduction and Terminology Evolution of database technology File processing (60s) Relational DBMS (70s) Advanced data models e.g., WebBased Repositories (90s) ObjectOriented, applicationoriented (80s) Data Warehousing and Data Mining (90s) Global/Integrated Information Systems (2000s) A.A. 0405 Datawarehousing Datamining 3Introduction and Terminology Major types of information systems within an organization Few EXECUTIVE sophisticated SUPPORT users SYSTEMS MANAGEMENT Decision Support Systems (DSS) LEVEL Management Information Sys. (MIS) SYSTEMS KNOWLEDGE Office automation, Groupware, Content LEVEL Distribution systems, Workflow mangement systems SYSTEMS TRANSACTION Enterprise Resource Planning (ERP) Many un PROCESSING sophisticated Customer Relationship Management (CRM) SYSTEMS users A.A. 0405 Datawarehousing Datamining 4Introduction and Terminology Transaction processing systems:  Support the operational level of the organization, possibly integrating needs of different functional areas (ERP);  Perform and record the daily transactions necessary to the conduct of the business  Execute simple read/update operations on traditional databases, aiming at maximizing transaction throughput  Their activity is described as: OLTP (OnLine Transaction Processing) A.A. 0405 Datawarehousing Datamining 5Introduction and Terminology Knowledge level systems: provide digital support for managing documents (office automation), user cooperation and communication (groupware), storing and retrieving information (content distribution), automation of business procedures (workflow management) Management level systems: support planning, controlling and semistructured decision making at management level by providing reports and analyses of current and historical data Executive support systems: support unstructured decision making at the strategic level of the organization A.A. 0405 Datawarehousing Datamining 6Introduction and Terminology MultiTier architecture for Management level and Executive support systems Decision Making PRESENTATION Data Presentation Reporting/Visualization engines BUSINESS Data Analysis LOGIC OLAP, Data Mining engines Data Warehouses / Data Marts DATA Data Sources Transactional DB, ERP, CRM, Legacy systems A.A. 0405 Datawarehousing Datamining 7Introduction and Terminology OLAP (OnLine Analytical Processing):  Reporting based on (multidimensional) data analysis  Readonly access on repositories of moderatelarge size (typically, data warehouses), aiming at maximizing response time Data Mining:  Discovery of novel, implicit patterns from, possibly heterogeneous, data sources  Use a mix of sophisticated statistical and high performance computing techniques A.A. 0405 Datawarehousing Datamining 8Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining • Association rules • Sequential patterns • Classification • Clustering A.A. 0405 Datawarehousing Datamining 9Data Warehousing DATA WAREHOUSE Database with the following distinctive characteristics:  Separate from operational databases  Subject oriented: provides a simple, concise view on one or more selected areas, in support of the decision process  Constructed by integrating multiple, heterogeneous data sources  Contains historical data: spans a much longer time horizon than operational databases  (Mostly) ReadOnly access: periodic, infrequent updates A.A. 0405 Datawarehousing Datamining 10Data Warehousing Types of Data Warehouses  Enterprise Warehouse: covers all areas of interest for an organization  Data Mart: covers a subset of corporatewide data that is of interest for a specific user group (e.g., marketing).  Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed) A.A. 0405 Datawarehousing Datamining 11Data Warehousing MultiTier Architecture Analysis Cleaning Data Warehouse Reporting Server extraction Data Mining DB DB OLAP FrontEnd Data sources Data Storage engine Tools A.A. 0405 Datawarehousing Datamining 12Data Warehousing Multidimensional (logical) Model Data are organized around one or more FACT TABLEs. Each Fact Table collects a set of omogeneous events (facts) characterized by dimensions and dependent attributes Example: Sales at a chain of stores Product Supplier Store Period Units Sales P1 S1 St1 1qtr 30 1500€ P2 S1 St3 2qtr 100 9000€ dimensions dependent attributes A.A. 0405 Datawarehousing Datamining 13Data Warehousing Multidimensional (logical) Model (cont’d) Each dimension can in turn consist of a number of attributes. In this case the value in the fact table is a foreign key referring to an appropriate dimension table Store product Product Code Code Supplier dimension Name Description table Store Address dimension Period table supplier Manager Units Code Sales Name dimension STAR table Fact table SCHEMA Address A.A. 0405 Datawarehousing Datamining 14Data Warehousing Conceptual Star Schema (ER Model) PERIOD (0,N) (1,1) (1,1) (1,1) (0,N) (0,N) PRODUCT STORE FACTs (1,1) (0,N) SUPPLIER A.A. 0405 Datawarehousing Datamining 15Data Warehousing OLAP Server Architectures They are classified based on the underlying storage layouts ROLAP (Relational OLAP): uses relational DBMS to store and manage warehouse data (i.e., tableoriented organization), and specific middleware to support OLAP queries. MOLAP (Multidimensional OLAP): uses arraybased data structures and precomputed aggregated data. It shows higher performance than OLAP but may not scale well if not properly implemented HOLAP (Hybird OLAP): ROLAP approach for lowlevel raw data, MOLAP approach for higherlevel data (aggregations). A.A. 0405 Datawarehousing Datamining 16Data Warehousing MOLAP Approach Total sales of PCs in Period europe in the 4th 1Qtr 2Qtr 3Qtr 4Qtr quarter of the year TV Europe PC DVD CD North America Middle East Far East DATA CUBE representing a Fact Table A.A. 0405 Datawarehousing Datamining 17 Product RegionData Warehousing OLAP Operations: SLICE Fix values for one or more dimensions E.g. Product = CD A.A. 0405 Datawarehousing Datamining 18Data Warehousing OLAP Operations: DICE Fix ranges for one or more dimensions E.g. Product = CD or DVD Period = 1Qrt or 2Qrt Region = Europe or North America A.A. 0405 Datawarehousing Datamining 19Data Warehousing OLAP Operations: RollUp Aggregate data by grouping along one (or more) Period year dimensions TV Europe PC E.g.: group quarters DVD CD North America Middle East Far East Region 1 DrillDown = (RollUp) A.A. 0405 Datawarehousing Datamining 20 ro c P du tData Warehousing Cube Operator: summaries for each subset of dimensions SUM 1Qtr 2Qtr 3Qtr 4Qtr TV PC Europe DVD CD North America SUM Middle East Far East SUM Yearly sales of PCs in the middle east Yearly sales of electronics in the middle east A.A. 0405 Datawarehousing Datamining 21Data Warehousing Cube Operator: it is equivalent to computing the following lattice of cuboids  product period region product, period product, region period, region product, period, region FACT TABLE A.A. 0405 Datawarehousing Datamining 22Data Warehousing Cube Operator In SQL: SELECT Product, Period, Region, SUM(Total Sales) FROM FACTTABLE GROUP BY Product, Period, Region WITH CUBE A.A. 0405 Datawarehousing Datamining 23Data Warehousing Product Period Region Tot. Sales CD 1qtr Europe All combinations of Product, Period and Region … … … CD 1qtr ALL All combinations of Product … … ALL and Period ALL 1qtr Europe ALL … … CD ALL Europe … ALL … CD ALL ALL All combinations of Product … … … ALL 1qtr ALL … … … ALL ALL Europe ALL ALL … ALL ALL ALL A.A. 0405 Datawarehousing Datamining 24Data Warehousing Rollup (= partial cube) Operator In SQL: SELECT Product, Period, Region, SUM(Total Sales) FROM FACTTABLE GROUP BY Product, Period, Region WITH ROLLUP A.A. 0405 Datawarehousing Datamining 25Data Warehousing Product Period Region Tot. Sales CD 1qtr Europe All combinations of Product, Period and Region … … … CD 1qtr ALL All combinations of Product … … ALL and Period CD ALL ALL All combinations of Product … ALL ALL ALL ALL ALL Reduces the complexity from exponential to linear in the number of dimensions A.A. 0405 Datawarehousing Datamining 26Data Warehousing it is equivalent to computing the following subset of the lattice of cuboids  product period region product, period product, region period, region product, period, region A.A. 0405 Datawarehousing Datamining 27Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining • Association rules • Sequential patterns • Classification • Clustering A.A. 0405 Datawarehousing Datamining 28Data Mining Data Explosion: tremendous amount of data accumulated in digital repositories around the world (e.g., databases, data warehouses, web, etc.) Production of digital data /Year: 18  35 Exabytes (10 bytes) in 2002  30 increase per year (9902) See: www.sims.berkeley.edu/howmuchinfo We are drowning in data, but starving for knowledge A.A. 0405 Datawarehousing Datamining 29Data Mining KNOWLEDGE Knowledge Discovery in Databases Data Mining Evaluation presentation Selection transformation Training data Data cleaning integration Domain Data Warehouse knowledge DB DB Information repositories A.A. 0405 Datawarehousing Datamining 30Data Mining Typologies of input data:  Unaggregated data (e.g., records, transactions)  Aggregated data (e.g., summaries)  Spatial, geographic data  Data from timeseries databases  Text, video, audio, web data A.A. 0405 Datawarehousing Datamining 31Data Mining DATA MINING Process of discovering interesting patterns or knowledge from a (typically) large amount of data stored either in databases, data warehouses, or other information repositories Interesting: nontrivial, implicit, previously unknown, potentially useful Alternative names: knowledge discovery/extraction, information harvesting, business intelligence In fact, data mining is a step of the more general process of knowledge discovery in databases (KDD) A.A. 0405 Datawarehousing Datamining 32Data Mining Interestingness measures Purpose: filter irrelevant patterns to convey concise and useful knowledge. Certain data mining tasks can produce thousands or millions of patterns most of which are redundant, trivial, irrelevant. Objective measures: based on statistics and structure of patterns (e.g., frequency counts) Subjective measures: based on user’s belief about the data. Patterns may become interesting if they confirm or contradict a user’s hypothesis, depending on the context. Interestingness measures can employed both after and during the pattern discovery. In the latter case, they improve the search efficiency A.A. 0405 Datawarehousing Datamining 33Data Mining Multidisciplinarity of Data Mining Algorithms High Performance Statistics Computing Data Mining Machine learning Visualization DB Technology A.A. 0405 Datawarehousing Datamining 34Data Mining Data Mining Problems Association Rules: discovery of rules XY (“objects that satisfy condition X are also likely to satisfy condition Y”). The problem first found application in market basket or transaction data analysis, where “objects” are transactions and “conditions” are containment of certain itemsets A.A. 0405 Datawarehousing Datamining 35Association Rules Statement of the Problem I =Set of items D =Set of transactions: t D then t I For an itemset X I, support(X) = fraction or number of transactions containing X ASSOCIATION RULE: XY    Y, with Y X I support = support(X) confidence = support(X)/support(XY) PROBLEM: find all association rules with support than min sup and confidence than min confidence A.A. 0405 Datawarehousing Datamining 36Association Rules Market Basket Analysis: Items = products sold in a store or chain of stores Transactions = customers’ shopping baskets Rule XY   Y = customers who buy items in XY are likely to buy items in Y Of diapers and beer ….. Analysis of customers behaviour in a supermarket chain has revealed that males who on thursdays and saturdays buy diapers are likely to buy also beer …. That’s why these two items are found close to each other in most stores A.A. 0405 Datawarehousing Datamining 37Association Rules Applications:  Cross/Upselling (especially in ecomm., e.g., Amazon)  Crossselling: push complemetary products  Upselling: push similar products  Catalog design  Store layout (e.g., diapers and beer close to each other)  Financial forecast  Medical diagnosis A.A. 0405 Datawarehousing Datamining 38Association Rules Def.: Frequent itemset = itemset with support min sup General Strategy to discover all association rules: 1. Find all frequent itemsets 2.  frequent itemset X, output all rules XY   Y, with YX, which satisfy the min confindence constraint Observation: Min sup and min confidence are objective measures of interestingness. Their proper setting, however, requires user’s domain knowledge. Low values may yield exponentially (in I) many rules, high values may cut off interesting rules A.A. 0405 Datawarehousing Datamining 39Association Rules Example Transactionid Items bought Frequent Itemsets Support 10 A, B, C A 75 20 A, C B 50 30 A, D C 50 40 B, E, F A, C 50 Min. sup 50 Min. confidence 50 For rule A C : support = support(A,C) = 50 confidence = support(A,C)/support(A) = 66.6 For rule C A (same support as A C): confidence = support(A,C)/support(C) = 100 A.A. 0405 Datawarehousing Datamining 40Association Rules Dealing with Large Outputs Observation: depending on the values of min sup and on the dataset, the number frequent itemsets can be exponentially large but may contain a lot of redundancy Goal: determine a subset of frequent itemsets of considerably smaller size, which provides the same information content, i.e., from which the complete set of frequent itemsets can be derived without further information. A.A. 0405 Datawarehousing Datamining 41Association Rules Notion of Closedness I (items), D (transactions), min sup (support threshold) Closed Frequent Itemsets = XI : supp(X)min sup supp(Y)supp(X) IY X  It’s a subset of all frequent itemsets  For every frequent itemset X there exists a closed frequent itemset YX such that supp(Y)=supp(X), i.e., Y and X occur in exactly the same transcations  All frequent itemsets and their frequencies can be derived from the closed frequent itemsets without further information A.A. 0405 Datawarehousing Datamining 42Association Rules Notion of Maximality Maximal Frequent Itemsets = XI : supp(X)min sup supp(Y)min sup IY X  It’s a subset of the closed frequent itemsets, hence of all frequent itemsets  For every (closed) frequent itemset X there exists a maximal frequent itemset YX  All frequent itemsets can be derived from the maximal frequent itemsets without further information, however their frequencies must be determined from D     information loss A.A. 0405 Datawarehousing Datamining 43Association Rules Tid Items 10 B, A, D, F, G, H 20 B, L Example of closed and 30 B, A, D, F, L, H maximal frequent itemsets 40 B, L, G, H 50 B, A, D, F 60 B, A, D, F, L, G DATASET Min sup = 3 Closed Frequent Maximal Support Supporting Itemsets Transactions B NO 6 all B, A, D, F YES 4 10, 30, 50, 60 B, L YES 4 20, 30, 40, 60 B, G YES 3 10, 40, 60 B, H YES 3 10, 30, 40 A.A. 0405 Datawarehousing Datamining 44Association Rules (All frequent) VS (closed frequent) VS (maximal frequent) (•,6) (B,6) (A,4) (D,4) (F,4) (L,4) (G,3) (H,3) (B,4) (B,4) (A,4) (B,4) (A,4) (D,4) (B,4) (B,3) (B,3) (B,4) (B,4) (B,4) (A,4) = closed maximal = closed but not maximal (B,4) A.A. 0405 Datawarehousing Datamining 45Sequential Patterns Data Mining Problems Sequential Patterns: discovery of frequent subsequences in a collection of sequences (sequence database), each representing a set of events occurring at subsequent times. The ordering of the events in the subsequences is relevant. A.A. 0405 Datawarehousing Datamining 46Sequential Patterns Sequential Patterns PROBLEM: Given a set of sequences, find the complete set of frequent subsequences A sequence: (ef)(ab)(df)(c)(b) sequence database SID sequence An element may contain a set of items. Items within an element are unordered 10 (a)(abc)(ac)(d)(cf) and are listed alphabetically. 20 (ad)(c)(bc)(ae) 30 (ef)(ab)(df)(c)(b) (a)(bc)(d)(c) 40 (e)(g)(af)(c)(b)(c) is a subsequence of (a)(abc)(ac)(d)(cf) ( For min sup = 2, (ab)(c) is a frequent subsequence A.A. 0405 Datawarehousing Datamining 47Sequential Patterns Applications:  Marketing  Natural disaster forecast  Analysis of web log data  DNA analysis A.A. 0405 Datawarehousing Datamining 48Classification Data Mining Problems Classification/Regression: discovery of a model or function that maps objects into predefined classes (classification) or into suitable values (regression). The model/function is computed on a training set (supervised learning) A.A. 0405 Datawarehousing Datamining 49Classification Statement of the Problem Training Set: T = t1, …, tn set of n examples Each example ti  characterized by m features (ti(A ), …, ti(A )) 1 m  belongs to one of k classes (C : 1 i k) i GOAL From the training data find a model to describe the classes accurately and synthetically using the data’s features. The model will then be used to assign class labels to unknown (previously unseen) records A.A. 0405 Datawarehousing Datamining 50Classification Applications:  Classification of (potential) customers for: Credit approval, risk prediction, selective marketing  Performance prediction based on selected indicators  Medical diagnosis based on symptoms or reactions to therapy A.A. 0405 Datawarehousing Datamining 51Classification Classification process BUILD Training Set Unseen Data model Test Data VALIDATE PREDICT class labels A.A. 0405 Datawarehousing Datamining 52Classification Observations:  Features can be either categorical if belonging to unordered domains (e.g., Car Type), or continuous, if belonging to ordered domains (e.g., Age)  The class could be regarded as an additional attribute of the examples, which we want to predict  Classification vs Regression: Classification: builds models for categorical classes Regression: builds models for continuous classes  Several types of models exist: decision trees, neural networks, bayesan (statistical) classifiers. A.A. 0405 Datawarehousing Datamining 53Classification Classification using decision trees Definition: Decision Tree for a training set T  Labeled tree  Each internal node v represents a test on a feature. The edges from v to its children are labeled with mutually exclusive results of the test  Each leaf w represents the subset of examples of T whose features values are consistent with the test results found along the path from the root to w. The leaf is labeled with the majority class of the examples it contains A.A. 0405 Datawarehousing Datamining 54Classification Example: examples = car insurance applicants, class = insurance risk features class Age 25 Age Car Type Risk Car typesports 23 family high High 18 sports high 43 sports high Low High 68 family low 32 truck low Model 20 family high (decision tree) A.A. 0405 Datawarehousing Datamining 55Clustering Data Mining Problems Clustering: grouping objects into classes with the objective of maximizing intraclass similarity and minimizing interclass similarity (unsupervised learning) A.A. 0405 Datawarehousing Datamining 56Clustering Statement of the Problem GIVEN: N objects, each characterized by p attributes (a.k.a. variables) GROUP: the objects into K clusters featuring  High intracluster similarity  Low intercluster similarity Remark: Clustering is an instance of unsupervised learning or learning by observations, as opposed to supervised learning or learning by examples (classification) A.A. 0405 Datawarehousing Datamining 57Clustering Example outlier attribute Y attribute Y object attribute X attribute X A.A. 0405 Datawarehousing Datamining 58Clustering Several types of clustering problems exist depending on the specific input/output requirements, and on the notion of similarity:  The number of clusters K may be provided in input or not  As output, for each cluster one may want a representative object, or a set of aggregate measurements, or the complete set of objects belonging to the cluster  Distancebased clustering: similarity of objects is related to some kind of geometric distance  Conceptual clustering: a group of objects forms a cluster if they define a certain concept A.A. 0405 Datawarehousing Datamining 59Clustering Applications:  Marketing: identify groups of customers based on their purchasing patterns  Biology: categorize genes with similar functionalities  Image processing: clustering of pixels  Web: clustering of documents in metasearch engines  GIS: identification of areas of similar land use A.A. 0405 Datawarehousing Datamining 60Clustering Challenges:  Scalability: strategies to deal with very large datasets  Variety of attribute types: defining a good notion of similarity is hard in the presence of different types of attributes and/or different scales of values  Variety of cluster shapes: common distance measures provide only spherical clusters  Noisy data: outliers may affect the quality of the clustering  Sensitivity to input ordering: the ordering of the input data should not affect the output (or its quality) A.A. 0405 Datawarehousing Datamining 61Clustering Main DistanceBased Clustering Methods Partitioning Methods: create an initial partition of the objects into K clusters, and refine the clustering using iterative relocation techniques. A cluster is represented either by the mean value of its component objects or by a centrally located component object (medoid) Hierarchical Methods: start with all objects belonging to distinct clusters and then successively merge the pair of closest clusters/objects into one single cluster (agglomerative approach); or start with all objects belonging to one cluster and successively split up a cluster into smaller ones (divisive approach) A.A. 0405 Datawarehousing Datamining 62
Website URL
Comment