Question? Leave a message!




Data Warehousing and OLAP Technology for Data Mining

Data Warehousing and OLAP Technology for Data Mining
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Website URL
Comment
Data Warehouses and OLAP www.ThesisScientist.comMulti-Tiered Architecture Monitor OLAP Server & Metadata other Integrator sources Analysis Extract Query Operational Serve Transform Data Reports DBs Load Warehouse Data mining Refresh Data Marts www.ThesisScientist.com Data Sources Data Storage OLAP Engine Front-End ToolsOLAP Server Architectures  Relational OLAP (ROLAP)  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services  greater scalability  Multidimensional OLAP (MOLAP)  Array-based multidimensional storage engine (sparse matrix techniques)  fast indexing to pre-computed summarized data  Hybrid OLAP (HOLAP)  User flexibility, e.g., low level: relational, high-level: array  Specialized SQL servers  specialized support for SQL queries over star/snowflake www.ThesisScientist.com schemasData Warehousing and OLAP Technology for Data Mining  What is a data warehouse?  A multi-dimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.comEfficient Data Cube Computation  Data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  How many cuboids in an n-dimensional cube with L levels? n T (L1)  i i1  Materialization of data cube  Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc. www.ThesisScientist.comCube Operations  Cube definition and computation in DMQL define cube salesitem, city, year: sum(sales_in_dollars) compute cube sales  Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) () SELECT item, city, year, SUM (amount) FROM SALES (city) (item) (year) CUBE BY item, city, year  Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (city, item) (city, year) (item, year) (date), (product), (customer) () www.ThesisScientist.com (city, item, year)Cube Computation: ROLAP-Based Method  Efficient cube computation methods  ROLAP-based cubing algorithms (Agarwal et al’96)  Array-based cubing algorithm (Zhao et al’97)  Bottom-up computation method (Bayer & Ramarkrishnan’99)  ROLAP-based cubing algorithms  Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples  Grouping is performed on some sub-aggregates as a “partial grouping step”  Aggregates may be computed from previously computed aggregates, rather than from the base fact table www.ThesisScientist.comMulti-way Array Aggregation for Cube Computation  Partition arrays into chunks (a small subcube which fits in memory).  Compressed sparse array addressing: (chunk_id, offset)  Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the of times to visit each cell, and reduces memory access and storage cost. c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 What is the best c 0 60 traversing order B 13 14 15 16 b3 44 to do multi-way 28 56 9 b2 40 aggregation? B 24 52 5 b1 36 20 1 2 3 4 b0 www.ThesisScientist.com a0 a1 a2 a3 AMulti-way Array Aggregation for Cube Computation c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 B 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A www.ThesisScientist.comMulti-way Array Aggregation for Cube Computation c3 61 62 63 64 C c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 B 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A www.ThesisScientist.comMulti-Way Array Aggregation for Cube Computation (Cont.)  Method: the planes should be sorted and computed according to their size in ascending order.  See the details of Example 4.4  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane  Limitation of the method: works well only for a small number of dimensions  If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored www.ThesisScientist.comIndexing OLAP Data: Bitmap Index  Index on a particular column  Each value in the column has a bit vector: bit-op is fast  The length of the bit vector: of records in the base table  The i-th bit is set if the i-th row of the base table has the value for the indexed column  not suitable for high cardinality domains Base table Index on Region Index on Type Cust Region Type RecID Retail Dealer RecIDAsia Europe America C1 Asia Retail 1 1 0 1 1 0 0 C2 Europe Dealer 2 0 1 0 2 0 1 C3 Asia Dealer 3 1 0 0 3 0 1 C4 America Retail 4 0 0 1 4 1 0 5 0 1 0 5 0 1 C5 Europe Dealer www.ThesisScientist.comIndexing OLAP Data: Join Indices  Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …)  Traditional indices map the values to a list of record ids  It materializes relational join in JI file and speeds up relational join — a rather costly operation  In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table.  E.g. fact table: Sales and two dimensions city and product  A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions www.ThesisScientist.comEfficient Processing OLAP Queries  Determine which operations should be performed on the available cuboids:  transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection  Determine to which materialized cuboid(s) the relevant operations should be applied.  Exploring indexing structures and compressed vs. dense array structures in MOLAP www.ThesisScientist.comMetadata Repository  Meta data is the data defining warehouse objects. It has the following kinds  Description of the structure of the warehouse  schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational meta-data  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data  business terms and definitions, ownership of data, charging policies www.ThesisScientist.comData Warehouse Back-End Tools and Utilities  Data extraction:  get data from multiple, heterogeneous, and external sources  Data cleaning:  detect errors in the data and rectify them when possible  Data transformation:  convert data from legacy or host format to warehouse format  Load:  sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions  Refresh  propagate the updates from the data sources to the warehouse www.ThesisScientist.comData Warehousing and OLAP Technology for Data Mining  What is a data warehouse?  A multi-dimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.comDiscovery-Driven Exploration of Data Cubes  Hypothesis-driven: exploration by user, huge search space  Discovery-driven (Sarawagi et al.’98)  pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell  Computation of exception indicator (modeling fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction www.ThesisScientist.comExamples: Discovery-Driven Data Cubes www.ThesisScientist.comData Warehousing and OLAP Technology for Data Mining  What is a data warehouse?  A multi-dimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining www.ThesisScientist.com