Mining data streams a review

mining and storing data streams for reliability analysis and querying and mining data streams
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Data Mining for Data Streams July 20, 2017 WWW.ThesisScientist.com 1 1Mining Data Streams  What is stream data? Why Stream Data Systems?  Stream data management systems: Issues and solutions  Stream data cube and multidimensional OLAP analysis  Stream frequent pattern analysis  Stream classification  Stream cluster analysis  Sketching 2 July 20, 2017 WWW.ThesisScientist.comCharacteristics of Data Streams  Data Streams Model:  Data enters at a high speed rate  The system cannot store the entire stream, but only a small fraction  How do you make critical calculations about the stream using a limited amount of memory?  Characteristics  Huge volumes of continuous data, possibly infinite  Fast changing and requires fast, real-time response  Random access is expensive—single scan algorithms(can only have one look) 3 July 20, 2017 WWW.ThesisScientist.comArchitecture: Stream Query Processing SDMS (Stream Data User/Application Management System) Continuous Query Results Multiple streams Stream Query Processor Scratch Space (Main memory and/or Disk) 4 July 20, 2017 WWW.ThesisScientist.comStream Data Applications  Telecommunication calling records  Business: credit card transaction flows  Network monitoring and traffic engineering  Financial market: stock exchange  Engineering & industrial processes: power supply & manufacturing  Sensor, monitoring & surveillance: video streams, RFIDs  Web logs and Web page click streams  Massive data sets (even saved but random access is too expensive) 5 July 20, 2017 WWW.ThesisScientist.comDBMS versus DSMS  Persistent relations Transient streams  One-time queries Continuous queries  Random access Sequential access  ―Unbounded‖ disk store Bounded main memory  Only current state matters Historical data is important  No real-time services Real-time requirements  Relatively low update rate Possibly multi-GB arrival rate  Data at any granularity Data at fine granularity  Assume precise data Data stale/imprecise  Access plan determined by  Unpredictable/variable data query processor, physical DB arrival and characteristics design Ack. From Motwani’s PODS tutorial slides 6 July 20, 2017 WWW.ThesisScientist.comMining Data Streams  What is stream data? Why Stream Data Systems?  Stream data management systems: Issues and solutions  Stream data cube and multidimensional OLAP analysis  Stream frequent pattern analysis  Stream classification  Stream cluster analysis 7 July 20, 2017 WWW.ThesisScientist.comProcessing Stream Queries  Query types  One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)  Predefined query vs. ad-hoc query (issued on-line)  Unbounded memory requirements  For real-time response, main memory algorithm should be used  Memory requirement is unbounded if one will join future tuples  Approximate query answering  With bounded memory, it is not always possible to produce exact answers  High-quality approximate answers are desired  Data reduction and synopsis construction methods  Sketches, random sampling, histograms, wavelets, etc. 8 July 20, 2017 WWW.ThesisScientist.comMethodologies for Stream Data Processing  Major challenges  Keep track of a large universe, e.g., pairs of IP address, not ages  Methodology  Synopses (trade-off between accuracy and storage) k  Use synopsis data structure, much smaller (O(log N) space) than their base data set (O(N) space)  Compute an approximate answer within a small error range (factor ε of the actual answer)  Major methods  Random sampling  Histograms  Sliding windows  Multi-resolution model  Sketches  Radomized algorithms 9 July 20, 2017 WWW.ThesisScientist.comStream Data Processing Methods (1)  Random sampling (but without knowing the total length in advance)  Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the element seen so far in the stream. As the data stream flow, every new element has a certain probability (s/N) of replacing an old element in the reservoir.  Sliding windows  Make decisions based only on recent data of sliding window size w  An element arriving at time t expires at time t + w  Histograms  Approximate the frequency distribution of element values in a stream  Partition data into a set of contiguous buckets  Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)  Multi-resolution models  Popular models: balanced binary trees, micro-clusters, and wavelets 10 July 20, 2017 WWW.ThesisScientist.comStream Data Mining vs. Stream Querying  Stream mining—A more challenging task in many cases  It shares most of the difficulties with stream querying  But often requires less ―precision‖, e.g., no join, grouping, sorting  Patterns are hidden and more general than querying  It may require exploratory analysis  Not necessarily continuous queries  Stream data mining tasks  Multi-dimensional on-line analysis of streams  Mining outliers and unusual patterns in stream data  Clustering data streams  Classification of stream data 11 July 20, 2017 WWW.ThesisScientist.comMining Data Streams  What is stream data? Why Stream Data Systems?  Stream data management systems: Issues and solutions  Stream data cube and multidimensional OLAP analysis  Stream frequent pattern analysis  Stream classification  Stream cluster analysis  Research issues 12 July 20, 2017 WWW.ThesisScientist.comChallenges for Mining Dynamics in Data Streams  Most stream data are at pretty low-level or multi- dimensional in nature: needs ML/MD processing  Analysis requirements  Multi-dimensional trends and unusual patterns  Capturing important changes at multi-dimensions/levels  Fast, real-time detection and response  Comparing with data cube: Similarity and differences  Stream (data) cube or stream OLAP: Is this feasible?  Can we implement it efficiently? 13 July 20, 2017 WWW.ThesisScientist.comMulti-Dimensional Stream Analysis: Examples  Analysis of Web click streams  Raw data at low levels: seconds, web page addresses, user IP addresses, …  Analysts want: changes, trends, unusual patterns, at reasonable levels of details  E.g., Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.‖  Analysis of power consumption streams  Raw data: power consumption flow for every household, every minute  Patterns one may find: average hourly power consumption surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago 14 July 20, 2017 WWW.ThesisScientist.comA Stream Cube Architecture  A tilted time frame  Different time granularities  second, minute, quarter, hour, day, week, …  Critical layers  Minimum interest layer (m-layer)  Observation layer (o-layer)  User: watches at o-layer and occasionally needs to drill-down down to m-layer  Partial materialization of stream cubes  Full materialization: too space and time consuming  No materialization: slow response at query time  Partial materialization… 15 July 20, 2017 WWW.ThesisScientist.comA Titled Time Model  Natural tilted time frame:  Example: Minimal: quarter, then 4 quarters  1 hour, 24 hours  day, … 4 qtrs 24 hours 12 months 31 days time  Logarithmic tilted time frame:  Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, … 64t 32t 16t 8t 4t 2t t t Time 16 July 20, 2017 WWW.ThesisScientist.comTwo Critical Layers in the Stream Cube (, theme, quarter) o-layer (observation) (user-group, URL-group, minute) m-layer (minimal interest) (individual-user, URL, second) (primitive) stream data layer 17 July 20, 2017 WWW.ThesisScientist.comOn-Line Partial Materialization vs. OLAP Processing  On-line materialization  Materialization takes precious space and time  Only incremental materialization (with tilted time frame)  Only materialize ―cuboids‖ of the critical layers?  Online computation may take too much time  Preferred solution:  popular-path approach: Materializing those along the popular drilling paths  H-tree structure: Such cuboids can be computed and stored efficiently using the H-tree structure  Online aggregation vs. query-based computation  Online computing while streaming: aggregating stream cubes  Query-based computation: using computed cuboids 18 July 20, 2017 WWW.ThesisScientist.comMining Data Streams  What is stream data? Why Stream Data Systems?  Stream data management systems: Issues and solutions  Stream data cube and multidimensional OLAP analysis  Stream frequent pattern analysis  Stream classification  Stream cluster analysis 19 July 20, 2017 WWW.ThesisScientist.comMining Approximate Frequent Patterns  Mining precise freq. patterns in stream data: unrealistic  Even store them in a compressed form, such as FPtree  Approximate answers are often sufficient (e.g., trend/pattern analysis)  Example: a router is interested in all flows:  whose frequency is at least 1% (s) of the entire traffic stream seen so far  and feels that 1/10 of s (ε = 0.1%) error is comfortable  How to mine frequent patterns with good approximation?  Lossy Counting Algorithm (Manku & Motwani, VLDB’02)  Based on Majority Voting… 20 July 20, 2017 WWW.ThesisScientist.com