Data Sampling for Big Data

Risto Tuomainen
March 31, 2016

Table of Contents
- Introduction
- Problems of uniform sampling and a new hope
- An algorithm for sparse data sampling

Big Data and Sampling
- Handling small data is easy while handling big data is difficult, so why not make big data small?
- From statistics we know that sampling often gives results that are practically as good as the exact values
- The methods here come not from the actual Big Data literature, but rather from earlier OLAP research
- However, even if the methods themselves predate Big Data, they have recently been put to use, for example in the BlinkDB system for Big Data
- Central to big data applications would be sampling stream data. We will not discuss that here, however, but rather focus on static datasets

Sampling
- Sampling has always been central to statistics and has been researched extensively
- In survey research collecting data is expensive (while analyzing it is cheap), so we sample to limit the amount of data
- In big data analyzing data is expensive (while collecting it is cheap), and again we sample to limit the amount of data
- The most straightforward idea is to just sample uniformly at random

Contrast to classical setting
- Often in statistics we think about sampling values at random from some parametric distribution in order to estimate its parameters
- For example, we might sample a normal distribution to estimate the mean, and in this case we know very well how the sample mean behaves (Student's t-distribution etc.)
- Here we are doing something different: sampling from a finite collection which we can, if we so desire, scan several times. We can, for example, find the minimum and maximum values in this collection.
- This difference in setting leads to some theory that is less familiar from elementary statistics

Problems with uniform sampling
- Uniform sampling will sometimes yield abysmal results
- In this presentation we are concerned with one specific failure mode:
- If the data is spread over a large interval, obtaining useful estimates can require large samples
- We present a very specific method for handling this situation

Sampling sparse data
- Sparse here means that the values are spread over a long interval (not to be confused with the usual definition of sparseness)
- [Figure: scatter plot of the values (roughly -1000 to 3000) against observation index (0 to 1000), and a histogram of the same values]

How badly does uniform sampling work for sparse datasets?
- Pretty badly: the plot shows estimation errors in red, the blue line marks 0
- [Figure: estimation error on the mean (roughly -50 to 100) against sample size (0 to 1000)]

Stratified sampling
- The idea is to split the dataset into buckets and ensure that each is represented in the sample
- This way we can have outliers (which are disproportionately important for estimates) represented
- [Figure: histogram of the data split into four sampling buckets]

Using a stratified sample
- Suppose we have $k$ buckets, and we sample $N_i$ points from bucket $i$.
- An estimator for the mean is then
  $\frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + \cdots + N_k \bar{X}_k}{\sum_{i=1}^{k} N_i}$
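As a small concrete companion to the estimator above, here is a minimal R sketch (the function name and interface are mine, not from the slides): given one sample vector per bucket and the weights $N_i$ from the formula, the estimate is just the weighted average of the per-bucket sample means.

# Weighted average of per-bucket sample means, following the formula above.
# samples: a list of numeric vectors, one sample per bucket
# N:       the weights N_i from the formula, one per bucket
stratified_mean <- function(samples, N) {
  sum(N * sapply(samples, mean)) / sum(N)
}

# Hypothetical usage with three buckets:
# stratified_mean(list(c(1.2, 0.8, 1.1), c(55, 60), c(2400)), N = c(900, 80, 20))

In standard stratified sampling the weights are taken proportional to each bucket's share of the full dataset, which keeps the estimate unbiased even when the buckets are sampled at very different rates.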
Choosing a good sampling scheme
- In the previous toy example we just used some buckets that looked good with respect to the distribution
- How do we do this systematically?
- How do we reason about the behaviour of the resulting estimates?

The Hoeffding bound
- Let us fix an error bound $\epsilon$, the distance between the true value and the estimate that we are willing to tolerate
- Also, let $\delta$ be the probability that the error is greater than some $t$
- Hoeffding's inequality gives $t = (b - a)\sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$
- where $a$ is the minimum of the dataset and $b$ is the maximum
- The dependence on the range of the data is intuitive
- We can solve for the sample size required to achieve a given error bound, as sketched below
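To make the last bullet concrete, here is a small R helper (a sketch; the function name and the example parameters are my own) that inverts the bound above to give the sample size needed for a target error and failure probability:

# Smallest n such that, with probability at least 1 - delta, the sample mean of
# data lying in [a, b] is within epsilon of the true mean. Obtained by solving
# epsilon = (b - a) * sqrt(log(2 / delta) / (2 * n)) for n.
required_sample_size <- function(a, b, epsilon, delta) {
  ceiling((b - a)^2 * log(2 / delta) / (2 * epsilon^2))
}

# Example: for data spread over [-1000, 3000] with epsilon = 5 and delta = 0.1,
# this gives roughly 9.6e5 draws; the range term (b - a)^2 is what hurts.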
Finding good buckets
- Finding the optimal buckets: split the dataset into parts, each of which is characterized by its error, and minimize the sample size (think rod cutting)
- The total error is obtained as a weighted average of the errors over all buckets: $\epsilon = \sum_{i=1}^{K} N_i \epsilon_i \big/ \sum_{i=1}^{K} N_i$
- Unfortunately the dynamic programming solution runs in $O(N^4)$ and is of no practical relevance for big data (and of dubious relevance for any purpose)
- A more practical solution runs in $O(N \log N)$ and in practice can deliver good results

The algorithm
- Choosing an optimal error for each bucket separately is difficult, so instead we fix a single error bound $\epsilon_0$
- The total error is then clearly bounded by the maximum per-bucket error, i.e. by $\epsilon_0$
- A greedy approximation of the optimal scheme: traverse the dataset, and at each step check whether it seems better to add the element to the previous bucket or to create a new bucket with that element alone (a code sketch is given in the appendix at the end)
- Never worse than uniform sampling, and often better even if not optimal

Resulting buckets
- The same dataset, using my own R implementation with $\delta = 0.9$ and $\epsilon_0 = 5$
- [Figure: histogram of the data split into six sampling buckets, with sample sizes 275, 5, 4, 1, 1 and 6]

Comparison on 150 000 samples of 300
- [Figure: histogram of the error on estimating the mean over repeated samples, errors roughly between -30 and 30]

Inserting new records I
- Suppose we want to append a new record to the data
- Often it is possible to avoid recomputing everything from scratch
- Algorithm: find the bucket to which the new record belongs
- Compute the new error bound, taking into account the updated range and sample size of that bucket
- If the global error bound requirement is still satisfied, no further computation is needed
- The record is added to the bucket's sample according to reservoir sampling
- We can be sure of satisfying the global error requirement, but optimality is not considered; then again, our algorithm is an approximation in any case

Inserting new records II
- It is possible that the record falls outside the range of the existing buckets
- In this case the only possibility is to create a new bucket with that record alone
- If this happens very often, all is lost

Conclusions
- A nice method, but the limitations are severe
- Notice that computing the mean or sum of any dataset is $O(N)$; here we are sampling with an $O(N \log N)$ method, so what sense does that make?
- If the samples can somehow be reused across many queries, it can still be worth it
- Or if the samples can be updated incrementally, this might be useful
- Methods for updating the sample incrementally are relatively easy
- Also notice that the method relies on theoretical error bounds established for sums of random variables (and thus the mean); what about arbitrary functions?
- In any case, certainly not a silver bullet for handling big data
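Appendix: a sketch of the greedy bucketing pass

The slides do not reproduce the author's R implementation, so the following is only a rough sketch under my own assumptions: the data is scanned in sorted order, the cost of a bucket is the Hoeffding-based sample size for the fixed error bound epsilon0 (at least 1 and at most the bucket size), and a value starts a new bucket whenever that looks cheaper than widening the current one.

# Greedy bucketing sketch (an approximation in the spirit of "The algorithm"
# slide, not the implementation behind the "Resulting buckets" figure).
greedy_buckets <- function(x, epsilon0, delta) {
  x <- sort(x)
  # Sample size a bucket spanning [lo, hi] with 'size' elements would need by
  # the Hoeffding bound, capped at the bucket size and bounded below by 1.
  cost <- function(lo, hi, size) {
    max(1, min(size, ceiling((hi - lo)^2 * log(2 / delta) / (2 * epsilon0^2))))
  }
  buckets <- list()
  cur <- x[1]                                  # values of the bucket being built
  for (v in x[-1]) {
    extend <- cost(cur[1], v, length(cur) + 1)           # widen the current bucket
    split  <- cost(cur[1], max(cur), length(cur)) + 1    # close it; {v} alone costs 1
    if (extend <= split) {
      cur <- c(cur, v)
    } else {
      buckets[[length(buckets) + 1]] <- cur
      cur <- v
    }
  }
  buckets[[length(buckets) + 1]] <- cur
  buckets
}

# Hypothetical usage with the parameters from the "Resulting buckets" slide:
# greedy_buckets(x, epsilon0 = 5, delta = 0.9)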