Question? Leave a message!




Sampling data from a stream in Big data

Sampling data from a stream in Big data
Dr.GordenMorse Profile Pic
Dr.GordenMorse,France,Professional
Published Date:22-07-2017
Website URL
Comment
Big Data Analytics CSCI 4030Infinite High dim. Graph Machine Apps data data learning data Locality Filtering PageRank, Recommen sensitive data SVM SimRank der systems hashing streams Community Queries on Decision Association Clustering Detection streams Trees Rules Dimensional Duplicate Spam Web Perceptron, ity document Detection advertising kNN reduction detection Big Data Analytics CSCI 4030 2 In many data mining situations, we do not know the entire data set in advance  Stream Management is important when the input rate is controlled externally:  Google queries  Twitter or Facebook status updates  We can think of the data as infinite and non-stationary (the distribution changes over time) Big Data Analytics CSCI 4030 3 Input elements enter at a rapid rate, at one or more input ports (i.e., streams)  We call elements of the stream tuples  The system cannot store the entire stream accessibly  Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory? Big Data Analytics CSCI 4030 4Ad-Hoc Queries Standing . . . 1, 5, 2, 7, 0, 9, 3 Queries . . . a, r, v, t, y, h, b Output Processor . . . 0, 0, 1, 0, 1, 1, 0 time Streams Entering. Each is stream is composed of elements/tuples Limited Working Archival Storage Storage Big Data Analytics CSCI 4030 5 Stochastic Gradient Descent (SGD) is an example of a stream algorithm  In Machine Learning we call this: Online Learning  Allows for modeling problems where we have a continuous stream of data  We want an algorithm to learn from it and slowly adapt to the changes in data  Idea: Do slow updates to the model  SGD (SVM, Perceptron) makes small updates  So: First train the classifier on training data.  Then: For every example from the stream, we slightly update the model (using small learning rate) Big Data Analytics CSCI 4030 6 Types of queries one wants on answer on a data stream:  Sampling data from a stream  Construct a random sample  Queries over sliding windows  Number of items of type x in the last k elements of the stream Big Data Analytics CSCI 4030 7 Types of queries one wants answer on a data stream:  Filtering a data stream  Select elements with property x from the stream  Counting distinct elements  Number of distinct elements in the last k elements of the stream  Finding frequent elements Big Data Analytics CSCI 4030 8 Mining query streams  Google wants to know what queries are more frequent today than yesterday  Mining click streams  Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour  Mining social network news feeds  E.g., look for trending topics on Twitter, Facebook Big Data Analytics CSCI 4030 9 Sensor Networks  Many sensors feeding into a central controller  Telephone call records  Data feeds into customer bills  IP packets monitored at a switch  Gather information for optimal routing  Detect denial-of-service attacks Big Data Analytics CSCI 4030 10As the stream grows the sample also gets bigger Since we can not store the entire stream, one obvious approach is to store a sample  Two different problems:  (1) Sample a fixed proportion of elements in the stream (say 1 in 10)  (2) Maintain a random sample of fixed size over a potentially infinite stream  At any “time” k we would like a random sample of s elements  What is the property of the sample we want to maintain? For all time steps k, each of k elements seen so far has equal prob. of being sampled Big Data Analytics CSCI 4030 12 Problem 1: Sampling fixed proportion  Scenario: Search engine query stream  Stream of tuples: (user, query, time)  Answer questions such as: How often did a user run the same query in single days th  Have space to store 1/10 of query stream  Naïve solution:  Generate a random integer in 0..9 for each query  Store the query if the integer is 0, otherwise discard Big Data Analytics CSCI 4030 13 Simple question: What fraction of queries by a search engine average user are duplicates?  Suppose each user issues x queries once and d queries twice (total of x+2d queries)  Correct answer: d/(x+d)  Proposed solution: We keep 10% of the queries 𝒅  The sample-based answer is.. 𝟏𝟎𝒙 +𝟏𝟗𝒅 Big Data Analytics CSCI 4030 14Solution: th  Pick 1/10 of users and take all their searches in the sample  Use a hash function that hashes the user name or user id uniformly into 10 buckets st  consider users that only hash only 1 bucket Big Data Analytics CSCI 4030 15As the stream grows, the sample is of fixed size Problem 2: Fixed-size sample  Suppose we need to maintain a random sample S of size exactly s tuples  E.g., main memory size constraint  Why? Don’t know length of stream in advance  Suppose at time n we have seen n items  Each item is in the sample S with equal prob. s/n How to think about the problem: say s = 2 Stream: a x c y z k c d e g… At n= 5, each of the first 5 tuples is included in the sample S with equal prob. At n= 7, each of the first 7 tuples is included in the sample S with equal prob. Impractical solution would be to store all the n tuples seen so far and out of them pick s at random Big Data Analytics CSCI 4030 17 Algorithm (a.k.a. Reservoir Sampling)  Store all the first s elements of the stream to S  Suppose we have seen n-1 elements, and now th the n element arrives (n s) th  With probability s/n, keep the n element, else discard it th  If we picked the n element, then it replaces one of the s elements in the sample S, picked uniformly at random  Claim: This algorithm maintains a sample S with the desired property:  After n elements, the sample contains each element seen so far with probability s/n Big Data Analytics CSCI 4030 18 A useful model of stream processing is that queries are about a window of length N: the N most recent elements received  Interesting case: N is so large that the data cannot be stored in memory, or even on disk  Or, there are so many streams that windows for all cannot be stored  Amazon example:  For every product X we keep 0/1 stream of whether that product was sold in the n-th transaction  We want answer queries, how many times have we sold X in the last k sales Big Data Analytics CSCI 4030 20