Question? Leave a message!




Data Warehousing and OLAP Technology for Data Mining

Data Warehousing and OLAP Technology for Data Mining
Data Warehouses and OLAP www.ThesisScientist.comMultiTiered Architecture Monitor OLAP Server Metadata other Integrator sources Analysis Extract Query Operational Serve Transform Data Reports DBs Load Warehouse Data mining Refresh Data Marts www.ThesisScientist.com Data Sources Data Storage OLAP Engine FrontEnd ToolsOLAP Server Architectures  Relational OLAP (ROLAP)  Use relational or extendedrelational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services  greater scalability  Multidimensional OLAP (MOLAP)  Arraybased multidimensional storage engine (sparse matrix techniques)  fast indexing to precomputed summarized data  Hybrid OLAP (HOLAP)  User flexibility, e.g., low level: relational, highlevel: array  Specialized SQL servers  specialized support for SQL queries over star/snowflake www.ThesisScientist.com schemasData Warehousing and OLAP Technology for Data Mining  What is a data warehouse  A multidimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.comEfficient Data Cube Computation  Data cube can be viewed as a lattice of cuboids  The bottommost cuboid is the base cuboid  The topmost cuboid (apex) contains only one cell  How many cuboids in an ndimensional cube with L levels n T (L1)  i i1  Materialization of data cube  Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc. www.ThesisScientist.comCube Operations  Cube definition and computation in DMQL define cube salesitem, city, year: sum(salesindollars) compute cube sales  Transform it into a SQLlike language (with a new operator cube by, introduced by Gray et al.’96) () SELECT item, city, year, SUM (amount) FROM SALES (city) (item) (year) CUBE BY item, city, year  Need compute the following GroupBys (date, product, customer), (date,product),(date, customer), (product, customer), (city, item) (city, year) (item, year) (date), (product), (customer) () www.ThesisScientist.com (city, item, year)Cube Computation: ROLAPBased Method  Efficient cube computation methods  ROLAPbased cubing algorithms (Agarwal et al’96)  Arraybased cubing algorithm (Zhao et al’97)  Bottomup computation method (Bayer Ramarkrishnan’99)  ROLAPbased cubing algorithms  Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples  Grouping is performed on some subaggregates as a “partial grouping step”  Aggregates may be computed from previously computed aggregates, rather than from the base fact table www.ThesisScientist.comMultiway Array Aggregation for Cube Computation  Partition arrays into chunks (a small subcube which fits in memory).  Compressed sparse array addressing: (chunkid, offset)  Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the of times to visit each cell, and reduces memory access and storage cost. c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 What is the best c 0 60 traversing order B 13 14 15 16 b3 44 to do multiway 28 56 9 b2 40 aggregation B 24 52 5 b1 36 20 1 2 3 4 b0 www.ThesisScientist.com a0 a1 a2 a3 AMultiway Array Aggregation for Cube Computation c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 B 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A www.ThesisScientist.comMultiway Array Aggregation for Cube Computation c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 B 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A www.ThesisScientist.comMultiWay Array Aggregation for Cube Computation (Cont.)  Method: the planes should be sorted and computed according to their size in ascending order.  See the details of Example 4.4  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane  Limitation of the method: works well only for a small number of dimensions  If there are a large number of dimensions, “bottomup computation” and iceberg cube computation methods can be explored www.ThesisScientist.comIndexing OLAP Data: Bitmap Index  Index on a particular column  Each value in the column has a bit vector: bitop is fast  The length of the bit vector: of records in the base table  The ith bit is set if the ith row of the base table has the value for the indexed column  not suitable for high cardinality domains Base table Index on Region Index on Type Cust Region Type RecID Retail Dealer RecIDAsia Europe America C1 Asia Retail 1 1 0 1 1 0 0 C2 Europe Dealer 2 0 1 0 2 0 1 C3 Asia Dealer 3 1 0 0 3 0 1 C4 America Retail 4 0 0 1 4 1 0 5 0 1 0 5 0 1 C5 Europe Dealer www.ThesisScientist.comIndexing OLAP Data: Join Indices  Join index: JI(Rid, Sid) where R (Rid, …)  S (Sid, …)  Traditional indices map the values to a list of record ids  It materializes relational join in JI file and speeds up relational join — a rather costly operation  In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table.  E.g. fact table: Sales and two dimensions city and product  A join index on city maintains for each distinct city a list of RIDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions www.ThesisScientist.comEfficient Processing OLAP Queries  Determine which operations should be performed on the available cuboids:  transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection  Determine to which materialized cuboid(s) the relevant operations should be applied.  Exploring indexing structures and compressed vs. dense array structures in MOLAP www.ThesisScientist.comMetadata Repository  Meta data is the data defining warehouse objects. It has the following kinds  Description of the structure of the warehouse  schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational metadata  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data  business terms and definitions, ownership of data, charging policies www.ThesisScientist.comData Warehouse BackEnd Tools and Utilities  Data extraction:  get data from multiple, heterogeneous, and external sources  Data cleaning:  detect errors in the data and rectify them when possible  Data transformation:  convert data from legacy or host format to warehouse format  Load:  sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions  Refresh  propagate the updates from the data sources to the warehouse www.ThesisScientist.comData Warehousing and OLAP Technology for Data Mining  What is a data warehouse  A multidimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.comDiscoveryDriven Exploration of Data Cubes  Hypothesisdriven: exploration by user, huge search space  Discoverydriven (Sarawagi et al.’98)  precompute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell  Computation of exception indicator (modeling fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction www.ThesisScientist.comExamples: DiscoveryDriven Data Cubes www.ThesisScientist.comData Warehousing and OLAP Technology for Data Mining  What is a data warehouse  A multidimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.comData Warehouse Usage  Three kinds of data warehouse applications  Information processing  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs  Analytical processing  multidimensional analysis of data warehouse data  supports basic OLAP operations, slicedice, drilling, pivoting  Data mining  knowledge discovery from hidden patterns  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. www.ThesisScientist.com  Differences among the three tasksFrom OnLine Analytical Processing to On Line Analytical Mining (OLAM)  Why online analytical mining  High quality of data in data warehouses  DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses  ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAPbased exploratory data analysis  mining with drilling, dicing, pivoting, etc.  Online selection of data mining functions  integration and swapping of multiple mining functions, algorithms, and tasks.  Architecture of OLAM www.ThesisScientist.comAn OLAM Architecture Layer4 Mining query Mining result User Interface User GUI API Layer3 OLAM OLAP OLAP/OLAM Engine Engine Data Cube API Layer2 MDDB MDDB Meta Data Database API FilteringIntegration Filtering Layer1 Data cleaning Data Data Databases www.ThesisScientist.com Warehouse Data integration RepositorySummary  Data warehouse  A subjectoriented, integrated, timevariant, and nonvolatile collection of data in support of management’s decisionmaking process  A multidimensional model of a data warehouse  Star schema, snowflake schema, fact constellations  A data cube consists of dimensions measures  OLAP operations: drilling, rolling, slicing, dicing and pivoting  OLAP servers: ROLAP, MOLAP, HOLAP  Efficient computation of data cubes  Partial vs. full vs. no materialization  Multiway array aggregation  Bitmap index and join index implementations  Further development of data cube technology  Discoverydrive and multifeature cubes  From OLAP to OLAM (online analytical mining) www.ThesisScientist.comReferences (I)  S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, 506521, Bombay, India, Sept. 1996.  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. In Proc. 1997 ACMSIGMOD Int. Conf. Management of Data, 417427, Tucson, Arizona, May 1997.  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACMSIGMOD Int. Conf. Management of Data, 94105, Seattle, Washington, June 1998.  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. 1997 Int. Conf. Data Engineering, 232243, Birmingham, England, April 1997.  K. Beyer and R. Ramakrishnan. BottomUp Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACMSIGMOD Int. Conf. Management of Data (SIGMOD'99), 359370, Philadelphia, PA, June 1999.  S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:6574, 1997.  OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.  J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, www.ThesisScientist.com and H. Pirahesh. Data cube: A relational aggregation operator generalizing groupby, crosstab and subtotals. Data Mining and Knowledge Discovery, 1:2954, 1997.References (II)  V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACMSIGMOD Int. Conf. Management of Data, pages 205216, Montreal, Canada, June 1996.  Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.  K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases, 116125, Athens, Greece, Aug. 1997.  K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), 263 277, Valencia, Spain, March 1998.  S. Sarawagi, R. Agrawal, and N. Megiddo. Discoverydriven exploration of OLAP data cubes. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), pages 168 182, Valencia, Spain, March 1998.  E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley Sons, 1997.  Y. Zhao, P. M. Deshpande, and J. F. Naughton. An arraybased algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACMSIGMOD Int. Conf. www.ThesisScientist.com Management of Data, 159170, Tucson, Arizona, May 1997.Selection of tables, attributes, domains in the DW design process  If you are asked to design a data warehouse for a set of operational databases how would you do it  Use specifications of requirements to design the data warehouse schema and select:  The central theme(s) of the analysis (e.g., sales)  The measures on the central themes (e.g., sum(dollars))  The dimensions used by analytical processing  The attributes and hierarchies of the dimensions  Clean, transform, and integrate information www.ThesisScientist.comExample  A large company which sells engine parts  Database 1 (Los Angeles)  employee(id, name, dept, lot, salary, age)  department(id, name, type, managerid)  part(id, name, type, brand, manufacturer, color)  customer(id, name, type, age, city, state, zip, tel)  sales(id, partid, customerid, quantity, price)  Database 2 (New York)  employee(id, ename, deptid, salary, age)  department(id, name, type, manager)  part(id, title, type, brand, manufacturer, color)  customer(id, name, type, zip, tel)  location(zip, cityid)  city(cityid, state, country) www.ThesisScientist.com  sales(id, partid, customerid, quantity, price)Example: requirements and selection  Requirements of the data warehouse  we want to analyze the total sales in dollars and the average price of sold units with respect to time, part, customer.  Selection of the basic features of the warehouse  central theme(s): sales  measure(s): sum(salesindollars), avg(pricesoldunits)  dimensions: time, part, customer www.ThesisScientist.comExample: selection of hierarchies  To determine the dimension hierarchies we have to select which dimensional attributes are required to include for analysis  We go back to the requirements and ask the analyst:  time: day, week, month, quarter, year  part: name, type, color, brand, manufacturer  customer: name, type, city, state, country www.ThesisScientist.comExercise  Find the hierarchies for time, part, customer •time: day, week, month, quarter, year •part: name, type, color, brand, manufacturer •customer: name, type, city, state, country www.ThesisScientist.comExample: selection of hierarchies  Definition of the hierarchies: year manufacturer country state quarter type color brand type city month week name name day customer part time www.ThesisScientist.comExample: is the information that we need to analyze available  We have to check if the required information for analysis exists in the databases to be integrated  All requested attributes exist, except from the time. This can be determined by accessing the transaction logs of the databases. www.ThesisScientist.comExample: Design the DW schema  We use the star schema in this example Time Fact table timeid Customer day timeid custid week name partid month type custid quarter city year quantity state price country Part partid name We need just quantity and (unit) price to derive type sum(salesindollars) = quantityprice color avg(pricesoldunits) = brand www.ThesisScientist.com Σ(quantityprice)/ Σ(quantity) manufacturerIntegration tasks convert:  A large company which sells engine parts •attribute names  Database 1 (Los Angeles) •attribute types  employee(id, name, dept, lot, salary, age) join tables  department(id, name, type, managerid)  part(id, name, type, brand, manufacturer, color) derive data not  customer(id, name, type, age, city, state, zip, tel) stored explicitly  sales(id, partid, customerid, quantity, price) in the databases  Database 2 (New York) fill in missing  employee(id, ename, deptid, salary, age) values  department(id, name, type, manager)  part(id, title, type, brand, manufacturer, color) ignore irrelevant data:  customer(id, name, type, zip, tel) •tables  location(zip, cityid) •attributes  city(cityid, state, country) www.ThesisScientist.com  sales(id, partid, customerid, quantity, price)Example: how to populate the DW Load, Clean, Integrate  Convert attribute names and types  E.g., part.name = part.title  Convert values to be consistent  Join tables if necessary  Join customer,location,city from “New York” DB  Derive time if not present  Check transaction log for sales table to get the time and convert it to the required format  Complete missing values  The “Los Angeles” database does not record customer country information because all its customers are from US. In the integrated data from LA country value is set to “USA”  Ignore irrelevant tables and attributes  Tables employee, department are ignored  Attributes zip, tel, id are ignored. www.ThesisScientist.comHow many cuboids  How do we compute the total number of cuboids of a data cube  Compute the product of the number of levels for each dimension  Number of cuboids = (levels for time) (levels for part) (levels for customer) = 545 = 100 www.ThesisScientist.comWhat is the data cube  The data cube is NOT a cube  Multiple dimensions, variable range and interpretations of the cells, does not look always like a “cube”  The data cube is the set of all nonredundand, multidimensional views from which we can analyze the measures on the central theme(s)  Remember: the terms multidimensional view and cuboid have the same meaning  A multidimensional view is nonredundant, if there are no hierarchical relationships between its dimensional attributes www.ThesisScientist.comRedundancy in views  A nonredundant combination of month attributes: 23 12 23 11 5067  (month, parttype) 10 20 43 33 1322 58 18 72 25 30 25 23 12 0 40 9012 23 45 43 33 13 1 2 0 56 23 6 25  A redundant (not useful) partbrand combination of attributes: 23 12 0 0 0 0  (partbrand, partmanufacturer) 0 0 43 33 13 0 0 0 0 0 0 25 www.ThesisScientist.com parttype manufacturerExercise  Find the nonredundant combinations of the attributes of part. manufacturer type color brand name part www.ThesisScientist.comVisualization of nonredundant attribute combinations for part ALL manufacturer type color brand (type,color) (type, manufacturer) (color, manufacturer) (color,brand) (type,brand) (type,color,manufacturer) (type,color,brand) name part www.ThesisScientist.comHow many (nonredundant) cuboids  How do we compute the total number of cuboids of a data cube  Compute the product of the number of combinations for each dimension  Number of cuboids = (combinations for time) (combinations for part) (combinations for customer) = 6139 = 702  Note: sometimes we use the term cuboid to also denote multidimensional views with selections:  total sales for each (part.type,customer.city) for year = “2001”  We do not count such cuboids in the computation above www.ThesisScientist.comCube: A Lattice of Cuboids E.g., (location) is dependent on all (time,item,location) time item location supplier time,item time,location item,location location,supplier time,supplier item,supplier time,location,supplier time,item,location time,item,supplier item,location,supplier www.ThesisScientist.com time, item, location, supplierHow to answer queries from a set of materialized cuboids  In reallife examples it is not possible to materialize the whole data cube  Typically, a small subset of cuboids is materialized  To answer a query we have to select the cuboid that results in the minimum cost for the query  A query typically consists of  a set of groupby attributes  a set of selection clauses  E.g., compute the total sales per part.type, cust.city for year=2002.  The factors for selecting the best cuboid for a particular query are:  the size of the cuboid www.ThesisScientist.com  any indexes on the attributes of the “select” clause of the queryCube: A Lattice of Cuboids How would you compute the = materialized following queries all cuboid Q1: time,item Q2: supplier time item location supplier Q3: location Q4: time,supplier Q5: item time,item time,location item,location location,supplier time,supplier item,supplier time,location,supplier time,item,location time,item,supplier item,location,supplier www.ThesisScientist.com time, item, location, supplierWhich cuboids should we materialize  In reallife examples it is not possible to materialize the whole data cube  We have to select the most beneficial cuboids to materialize  This depends mainly on the size of the cuboids and their usage by queries  Thus to select we need information about (i) the size of cuboids, (ii) the queries and their frequency  The base cuboid almost always corresponds to the fact table, which is already materialized. For example, if the products have unique name, and customers unique name, we can use:  timeid to derive the day  partid to derive part.name  custid to derive custo wmer. ww.Thesin sS ame cientist.comWhich cuboids should we materialize  Example  candidate cuboids:  (day,pname,cname)– 100GB (already materialized)  (day,pname) – 60GB  (day,cname) – 20GB  (pname,cname) – 1GB  (day) – 10GB Exercise: Which views  (pname) – 200MB should we materialize if  (cname) – 30MB the available space is:  (ALL) – 8 bytes 1. 10GB  queries (with equal probability) 2. 1GB  Q1: total sales per (pname,cname)  Q2: total sales per (pname) 3. 100MB  Q3: total sales per (cname) www.ThesisScientist.comWhich cuboids should we materialize  Case 1: Available space = 10GB  We can materialize all three views (pname,cname) , (pname), and (cname)  The cost of Q1 is reading 1GB  The cost of Q2 is reading 200MB  The cost of Q3 is reading 30MB  Average query cost = (1230MB)/3=410 MB/query www.ThesisScientist.comWhich cuboids should we materialize  Case 2: Available space = 1GB  We have two choices:  materialize (pname,cname) using 1GB  Q1 costs 1GB, Q2 costs 1GB, Q3 costs 1GB  Average query cost = 1GB  materialize (pname) and (cname) using 230MB  Q1 costs 100GB, Q2 costs 200MB, Q3 costs 30MB  Average query cost = (100,230MB)/3 = 34GB  First choice is better than the second www.ThesisScientist.comWhich cuboids should we materialize  Case 3: Available space = 100MB  We can only materialize (cname)  Q1 costs 100GB  Q2 costs 100GB  Q3 costs 30MB  Average query cost = (200,030MB)/3 = 67GB www.ThesisScientist.comBitmap Indexes  The bitmap index is used to index attributes with small domains  For each attribute value, a bitmap is defined to indicate the rows of the table that contain this value  Bitmaps are useful especially when we want to join some attribute values  Example: find the total sales of red parts to female customers www.ThesisScientist.comBitmap Indexes Example 100 bytes date pcolor cname cgender sales 10/10/00 red Smith M 21 12/10/00 green Jones F 13 15/10/00 green Kane M 14 20/10/00 blue Nike M 23 1 billion rows 22/11/00 red Ellis F 9 26/11/00 red Jones F 92 ... ... ... ... ... index for pcolor index for gender Exercise: red green blue M F 1. what are the sizes of 1 0 0 1 0 the table and indexes 0 1 0 0 1 2. what is the cost of the 0 1 0 1 0 query: “find the total sales of 0 0 1 1 0 red parts to female 1 0 0 0 1 customers” if: 1 0 0 0 1 a) the table is used www.ThesisScientist.com ... ... ... ... ... b) the indexes are usedBitmap Indexes Example  The size of the table is 100GB=100bytes1billion  The size of the indexes are:  3bits1billion = 3000Mbits = 375MB  2bits1billion = 2000Mbits = 250MB  The cost of evaluating the query directly on the table is reading 100GB and for each of the 1 billion tuples perform a comparison with “red” and “F” (expensive)  The cost of evaluating the query using the indexes is:  read bitmap for red, read bitmap for F and join them.  This costs reading 1Gbits+1Gbits = 250MB  for each join result accumulate the sales for the corresponding rid  this retrieves (1/3)(1/2) = 1/6 tuples (estimated) and accumulates them  we will probably read the whole table (since we want to avoid random www.ThesisScientist.com accesses), but we will avoid any comparisons.
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.JakeFinlay
User Type:
Teacher
Country:
Germany
Uploaded Date:
22-07-2017