Question? Leave a message!




Big Data finding similar items

Big Data finding similar items
Big Data Analytics CSCI 4030High dim Graph Infinite Machine Apps data data data learning Locality Filtering PageRank, Recommen sensitive data SVM SimRank der systems hashing streams Network Web Decision Association Clustering Analysis advertising Trees Rules Dimensional Duplicate Spam Queries on Perceptron, ity document Detection streams kNN reduction detection Big Data Analytics CSCI 4030 2Hays and Efros, SIGGRAPH 2007 Big Data Analytics CSCI 4030 3Hays and Efros, SIGGRAPH 2007 Big Data Analytics CSCI 4030 4Hays and Efros, SIGGRAPH 2007 10 nearest neighbors from a collection of 20,000 images Big Data Analytics CSCI 4030 5Hays and Efros, SIGGRAPH 2007 10 nearest neighbors from a collection of 2 million images Big Data Analytics CSCI 4030 6 Many problems can be expressed as finding “similar” sets:  Find nearneighbors in highdimensional space  Examples:  Pages with similar words  For duplicate detection (Mirror Pages, Plagiarism)  Customers who purchased similar products  Online Purchases (Amazon)  Images with similar features  Users who visited similar websites Big Data Analytics CSCI 4030 7 Given: High dimensional data points 𝒙,𝒙,… 𝟏𝟐  For example: Image is a long vector of pixel colors 121 →121021010 021 010  And some distance function 𝒅(𝒙,𝒙) 𝟏𝟐  Which quantifies the “distance” between 𝒙 and 𝒙 𝟏𝟐  Goal: Find all pairs of data points (𝒙,𝒙) that are 𝒊𝒋 within some distance threshold 𝒅𝒙,𝒙≤𝒔 𝒊𝒋 𝟐  Note: Naïve solution would take 𝑶𝑵 where 𝑵 is the number of data points  MAGIC: This can be done in 𝑶𝑵 How Big Data Analytics CSCI 4030 8Today’s lecture: Find pairs of similar docs Main idea: Candidates Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket Pass 2: Only compare documents that are candidates (i.e., they hashed to a same bucket) 2 Benefits: Instead of O(N ) comparisons, we need O(N) comparisons to find similar documents Big Data Analytics CSCI 4030 9 Goal: Find nearneighbors in highdim. space  We formally define “near neighbors” as points that are a “small distance” apart  For each application, we first need to define what “distance” means  Today: Jaccard distance/similarity  The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C , C ) = CC /CC 1 2 1 2 1 2  Jaccard distance: d(C , C ) = 1 CC /CC 1 2 1 2 1 2 3 in intersection 8 in union Jaccard similarity= 3/8 Jaccard distance = 5/8 Big Data Analytics CSCI 4030 11 Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs  Applications:  Mirror websites, or approximate mirrors  Don’t want to show both in search results  Similar news articles at many news sites  Cluster articles by “same story”  Problems:  Many small pieces of one document can appear out of order in another  Too many documents to compare all pairs  Documents are so large or so many that they cannot fit in main memory Big Data Analytics CSCI 4030 121. Shingling: Convert documents to sets 2. MinHashing: Convert large sets to short signatures, while preserving similarity 3. LocalitySensitive Hashing: Focus on pairs of signatures likely to be from similar documents  Candidate pairs Big Data Analytics CSCI 4030 13Candidate pairs: Locality those pairs Document Sensitive of signatures Hashing that we need to test for similarity The set Signatures: of strings short integer of length k vectors that that appear represent the in the doc sets, and ument reflect their similarity Big Data Analytics CSCI 4030 14Docu ment The set of strings of length k that appear in the doc ument Step 1: Shingling: Convert documents to sets Step 1: Convert documents to sets  Simple approaches:  Document = set of words appearing in document  Document = set of “important” words (eliminate stop words: “and”, “a”, “the”, “to”, “you” and so on)  Don’t work well for this application. Why  Need to account for ordering of words  A different way: Shingles Big Data Analytics CSCI 4030 16 A kshingle (or kgram) for a document is a sequence of k tokens that appears in the doc  Tokens can be characters, words or something else, depending on the application  Assume tokens = characters for examples  Example: k=2; document D = abcab 1 Set of 2shingles: S(D ) = ab, bc, ca 1  Option: Shingles as a bag (multiset), count ab twice: S’(D ) = ab, bc, ca, ab 1 Big Data Analytics CSCI 4030 17 To compress long shingles, we can hash (or simply map) them to (say) 4 bytes  Represent a document by the set of hash values of its kshingles  Idea: Rare (or none) collisions of shingles  Example: k=2; document D = abcab 1 Set of 2shingles: S(D ) = ab, bc, ca 1 Hash the singles: h(D ) = 1, 5, 7 1 Big Data Analytics CSCI 4030 18 Document D is a set of its kshingles C =S(D ) 1 1 1  Equivalently, each document is a 0/1 vector in the space of kshingles  Each unique shingle is a dimension  Vectors are very sparse  A natural similarity measure is the Jaccard similarity: sim(D , D ) = CC /CC 1 2 1 2 1 2 Big Data Analytics CSCI 4030 19 Documents that have lots of shingles in common have similar text, even if the text appears in different order  Caveat: You must pick k large enough, or most documents will have most shingles  k = 5 is OK for short documents  k = 10 is better for long documents Big Data Analytics CSCI 4030 20 Suppose we need to find nearduplicate documents among 𝑵=𝟏 million documents  Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs 11 𝑵(𝑵−𝟏)/𝟐 ≈ 510 comparisons 5 6  At 10 secs/day and 10 comparisons/sec, it would take 5 days  For 𝑵= million, it takes more than a year… Big Data Analytics CSCI 4030 21 𝟏𝟎Docu ment The set Signatures: of strings short integer of length k vectors that that appear represent the in the doc sets, and ument reflect their similarity Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity Many similarity problems can be formalized as finding subsets that have significant intersection  Encode sets using 0/1 (bit, boolean) vectors  One dimension per element in the universal set  Interpret set intersection as bitwise AND, and set union as bitwise OR  Example: C = 10111; C = 10011 1 2  Size of intersection = 3; size of union = 4,  Jaccard similarity (not distance) = 3/4  Distance: d(C ,C ) = 1 – (Jaccard similarity) = 1/4 1 2 Big Data Analytics CSCI 4030 23 Rows = elements (shingles)  Columns = sets (documents) Documents  1 in row e and column s if and only 1 1 1 0 if e is a member of s 1 1 0 1  Column similarity is the Jaccard similarity of the corresponding 0 1 0 1 sets (rows with value 1) 0 0 0 1  Typical matrix is sparse 1 0 0 1  Each document is a column:  Example: sim(C ,C ) = 1 1 1 0 1 2  Size of intersection = 3; size of union = 6, 1 0 1 0 Jaccard similarity (not distance) = 3/6  d(C ,C ) = 1 – (Jaccard similarity) = 3/6 1 2 Big Data Analytics CSCI 4030 24 Shingles So far:  Documents  Sets of shingles  Represent sets as boolean vectors in a matrix  Next goal: Find similar columns while computing small signatures  Similarity of columns == similarity of signatures Big Data Analytics CSCI 4030 25 Next Goal: Find similar columns, Small signatures  Approach:  1) Signatures of columns: small summaries of columns  2) Examine pairs of signatures to find similar columns  Essential: Similarities of signatures and columns are related  3) Optional: Check that columns with similar signatures are really similar  Warnings:  Comparing all pairs may take too much time: Job for LSH  These methods can produce false negatives, and even false positives (if the optional check is not made) Big Data Analytics CSCI 4030 26 Key idea: “hash” each column C to a small signature h(C), such that:  (1) h(C) is small enough that the signature fits in RAM  (2) sim(C , C ) is the same as the “similarity” of 1 2 signatures h(C ) and h(C ) 1 2  Goal: Find a hash function h(·) such that:  If sim(C ,C ) is high, then with high prob. h(C ) = h(C ) 1 2 1 2  If sim(C ,C ) is low, then with high prob. h(C ) ≠ h(C ) 1 2 1 2  Hash docs into buckets. Expect that “most” pairs of near duplicate docs hash into the same bucket Big Data Analytics CSCI 4030 27 Goal: Find a hash function h(·) such that:  if sim(C ,C ) is high, then with high prob. h(C ) = h(C ) 1 2 1 2  if sim(C ,C ) is low, then with high prob. h(C ) ≠ h(C ) 1 2 1 2  Clearly, the hash function depends on the similarity metric:  Not all similarity metrics have a suitable hash function  There is a suitable hash function for the Jaccard similarity: It is called MinHashing Big Data Analytics CSCI 4030 28 Imagine the rows of the boolean matrix permuted under random permutation   Define a “hash” function h (C) = the index of  the first (in the permuted order ) row in which column C has value 1:  Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column Big Data Analytics CSCI 4030 29nd 2 element of the permutation is the first to map to a 1 Input matrix (Shingles x Documents) Permutation  Signature matrix M 4 3 1 0 1 0 2 2 1 2 1 1 0 0 1 2 4 3 2 1 4 1 0 1 0 1 1 7 7 1 2 1 2 0 1 0 1 3 2 6 th 0 1 0 1 4 element of the permutation 6 6 1 is the first to map to a 1 7 1 1 0 1 0 5 5 5 1 0 1 0 4 Big Data Analytics CSCI 4030 300 0 0 0 1 1  Choose a random permutation  0 0  Claim: Prh (C ) = h (C ) = sim(C , C )  1 2 1 2 = C1C2/C1C2 0 1 1 0 Big Data Analytics CSCI 4030 31 We know: Prh (C ) = h (C ) = sim(C , C )  1 2 1 2  Now generalize to multiple hash functions  The similarity of two signatures is the fraction of the hash functions in which they agree  Note: Because of the MinHash property, the similarity of columns is “almost the same” as the expected similarity of their signatures Big Data Analytics CSCI 4030 32Permutation  Input matrix (Shingles x Documents) Signature matrix M 2 4 3 1 0 1 0 2 1 2 1 1 0 0 1 3 2 4 2 1 4 1 0 1 0 1 7 1 7 1 2 1 2 0 1 0 1 6 3 2 Similarities: 0 1 0 1 1 6 6 13 24 12 34 5 7 1 1 0 1 0 Col/Col 0.75 0.75 0 0 Sig/Sig 0.67 1.00 0 0 4 5 5 1 0 1 0 Big Data Analytics CSCI 4030 33 Pick K=100 random permutations of the rows  Think of sig(C) as a column vector  sig(C)i = according to the ith permutation, the index of the first row that has a 1 in column C Note: The sketch (signature) of document C is small bytes  We achieved our goal We “compressed” long bit vectors into short signatures Big Data Analytics CSCI 4030 34 𝟏𝟎𝟎 Permuting rows even once is prohibitive  Row hashing  Pick K = 100 hash functions k i  Ordering under k gives a random row permutation i  Onepass implementation  For each column C and hashfunc. k keep a “slot” for i the minhash value  Initialize all sig(C)i =  How to pick a random hash function h(x)  Scan rows looking for 1s Universal hashing:  Suppose row j has 1 in column C h (x)=((a·x+b) mod p) mod N a,b where:  Then for each k : i a,b … random integers p … prime number (p N)  If k (j) sig(C)i, then sig(C)i  k (j) i i Big Data Analytics CSCI 4030 35Candidate pairs: Locality those pairs Docu Sensitive of signatures ment Hashing that we need to test for similarity The set Signatures: of strings short integer of length k vectors that that appear represent the in the doc sets, and ument reflect their similarity Step 3: LocalitySensitive Hashing: Focus on pairs of signatures likely to be from similar documents2 1 4 1 1 2 1 2 2 1 2 1  Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s=0.8)  LSH – General idea: Use a function f(x,y) that tells whether x and y is a candidate pair: a pair of elements whose similarity must be evaluated  For MinHash matrices:  Hash columns of signature matrix M to many buckets  Each pair of documents that hashes into the same bucket is a candidate pair Big Data Analytics CSCI 4030 372 1 4 1 1 2 1 2 2 1 2 1  Pick a similarity threshold s (0 s 1)  Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows: M (i, x) = M (i, y) for at least frac. s values of i  We expect documents x and y to have the same (Jaccard) similarity as their signatures Big Data Analytics CSCI 4030 382 1 4 1 1 2 1 2 2 1 2 1  Big idea: Hash columns of signature matrix M several times  Arrange that (only) similar columns are likely to hash to the same bucket, with high probability  Candidate pairs are those that hash to the same bucket Big Data Analytics CSCI 4030 392 1 4 1 1 2 1 2 2 1 2 1 r rows per band b bands One signature Signature matrix M Big Data Analytics CSCI 4030 40 Divide matrix M into b bands of r rows  For each band, hash its portion of each column to a hash table with k buckets  Make k as large as possible  Candidate column pairs are those that hash to the same bucket for ≥ 1 band  Tune b and r to catch most similar pairs, but few nonsimilar pairs Big Data Analytics CSCI 4030 41Columns 2 and 6 Buckets are probably identical (candidate pair) Columns 6 and 7 are probably different. Matrix M b bands r rows Big Data Analytics CSCI 4030 42 There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band  Hereafter, we assume that “same bucket” means “identical in that band”  Assumption needed only to simplify analysis, not for correctness of algorithm Big Data Analytics CSCI 4030 432 1 4 1 1 2 1 2 2 1 2 1 Assume the following case:  Suppose 100,000 columns of M (100k docs)  Signatures of 100 integers (rows)  Therefore, signatures take 40Mb  Choose b = 20 bands of r = 5 integers/band  Goal: Find pairs of documents that are at least s = 0.8 similar Big Data Analytics CSCI 4030 442 1 4 1 1 2 1 2 2 1 2 1  Find pairs of  s=0.8 similarity, set b=20, r=5  Assume: sim(C , C ) = 0.8 1 2  Since sim(C , C )  s, we want C , C to be a candidate 1 2 1 2 pair: We want them to hash to at least 1 common bucket (at least one band is identical)  Probability C , C identical in one particular 1 2 5 band: (0.8) = 0.328  Probability C , C are not similar in all of the 20 1 2 20 bands: (10.328) = 0.00035  i.e., about 1/3000th of the 80similar column pairs are false negatives (we miss them)  We would find 99.965 pairs of truly similar documents Big Data Analytics CSCI 4030 452 1 4 1 1 2 1 2 2 1 2 1  Find pairs of  s=0.8 similarity, set b=20, r=5  Assume: sim(C , C ) = 0.3 1 2  Since sim(C , C ) s we want C , C to hash to NO 1 2 1 2 common buckets (all bands should be different)  Probability C , C identical in one particular 1 2 5 band: (0.3) = 0.00243  Probability C , C identical in at least 1 of 20 1 2 20 bands: 1 (1 0.00243) = 0.0474  In other words, approximately 4.74 pairs of docs with similarity 0.3 end up becoming candidate pairs  They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s Big Data Analytics CSCI 4030 462 1 4 1 1 2 1 2 2 1 2 1  Pick:  The number of MinHashes (rows of M)  The number of bands b, and  The number of rows r per band to balance false positives/negatives  Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up Big Data Analytics CSCI 4030 47 Columns C and C have similarity t 1 2  Pick any band (r rows) r  Prob. that all rows in band equal = t r  Prob. that some row in band unequal = 1 t r b  Prob. that no band identical = (1 t )  Prob. that at least 1 band identical = r b 1 (1 t ) Big Data Analytics CSCI 4030 48 Similarity threshold s  Prob. that at least 1 band is identical: r b s 1(1s ) .2 .006 .3 .047 .4 .186 .5 .470 .6 .802 .7 .975 .8 .9996 Big Data Analytics CSCI 4030 49 Picking r and b to get the best Scurve  50 hashfunctions (r=5, b=10) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Blue area: False Negative rate 0.1 Green area: False Positive rate 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Similarity Big Data Analytics CSCI 4030 50 Prob. sharing a bucket Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures  Check in main memory that candidate pairs really do have similar signatures  Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents Big Data Analytics CSCI 4030 51 Shingling: Convert documents to sets  We used hashing to assign each shingle an ID  MinHashing: Convert large sets to short signatures, while preserving similarity  We used similarity preserving hashing to generate signatures with property Prh (C ) = h (C ) = sim(C , C )  1 2 1 2  We used hashing to get around generating random permutations  LocalitySensitive Hashing: Focus on pairs of signatures likely to be from similar documents  We used hashing to find candidate pairs of similarity  s Big Data Analytics CSCI 4030 52 There are two sets S and T in the figure below. What is their Jaccard similarity Big Data Analytics CSCI 4030 53 We see two sets S and T in the figure below. What is their Jaccard similarity  Ans: There are three elements in their intersection and a total of eight elements that appear in S or T or both. Therefore, sim(S,T) = 3/7 Big Data Analytics CSCI 4030 54 Assume we use k = 9 shingles. Is there some lexicographical similarity in the sentences:  “The plane was ready for touch down”  “The quarterback scored a touchdown”  How about if we eliminate white spaces Big Data Analytics CSCI 4030 55 Assume we use k = 9 shingles. Is there some lexicographical similarity in the sentences:  “The plane was ready for touch down”  “The quarterback scored a touchdown”  How about if we eliminate white spaces  Ans:  If we retain the blanks, then the first sentence has shingles “touch dow” and “ouch down”, while the second has “touchdown” .  If we eliminated the blanks, then both would have “touchdown”. Big Data Analytics CSCI 4030 56 Our corpus of documents is emails. Assume we choose k = 1 shingles. How is going to affect emails similarity  What would be your recommendation for k if corpus of documents is emails Big Data Analytics CSCI 4030 57 Our corpus of documents is emails. Assume we choose k = 1 shingles. How is going to affect emails similarity  What would be your recommendation for k if corpus of documents is emails  Ans:  This is an extreme case. If we use k = 1, most of the email documents will have most of the common characters, so almost all emails will have high similarity.  Thus, if our corpus of documents is emails, picking k = 5 5 should be fine. Then there would be 27 = 14,348,907 possible shingles assuming are 27 characters. Since typical email is smaller that should work well. Big Data Analytics CSCI 4030 58 The matrix representing four sets 𝑆,…,𝑆 is 14 presented below. Suppose we pick the permutation of rows beadc. What is the value of minhash function h(𝑆 ) 𝑖 Big Data Analytics CSCI 4030 59 The matrix representing four sets 𝑆,…,𝑆 is 14 presented below. Suppose we pick the permutation of rows beadc. What is the value of minhash function h(𝑆 ) 𝑖  Ans: In this matrix, we read off the values of h by scanning from the top until we come to a 1. Thus, we see that h(𝑆 ) = a, h(𝑆 ) = c, h(𝑆 ) = b and h(𝑆 ) = a. 1234  Therefore, the answer is h(𝑆 ) = a, c, b, a. 𝑖 Big Data Analytics CSCI 4030 60 The matrix representing four sets 𝑆,…,𝑆 is 14 presented below. Suppose, we pick the permutation ℎ = x + 1 mod 5 and 3x + 1 mod 5. 1  Compute hash functions for the matrix. Big Data Analytics CSCI 4030 61 The matrix representing four sets 𝑆,…,𝑆 is 14 presented below. Suppose, we pick the permutation ℎ = x + 1 mod 5 and ℎ = 3x + 1 mod 5. 12  Compute hash functions for the matrix.  Ans: Big Data Analytics CSCI 4030 62 Compute the signature matrix with single pass over two permutations established by the hash functions in the previous task. Big Data Analytics CSCI 4030 63 Initially, this matrix consists of all ∞’s  First we consider row 0 of the matrix. We see that the values of ℎ0 and ℎ0 are both 1. The row 12 numbered 0 has 1’s in the columns for sets 𝑆 and 𝑆 , 14 so only these columns of the signature matrix can change. As 1 is less than ∞ signature matrix is: Big Data Analytics CSCI 4030 64 Current signature matrix  Next we consider row 1 of the matrix. This row has 1 only in 𝑆 and its hash values are ℎ1=2 and 31 ℎ1 = 4. The new signature matrix is: 2 Big Data Analytics CSCI 4030 65 Current signature matrix  Next we consider row 2 of the matrix. This row has 1 in 𝑆 and 𝑆 and its hash values are ℎ2 = 3 and 241 ℎ2 = 2. Note that for 𝑆 of the signature matrix, 24 1,1 are each less than 3,2. The new signature matrix is: Big Data Analytics CSCI 4030 66 Current signature matrix  Next we consider row 3 of the matrix. This row has 1 in 𝑆 and 𝑆 and its hash values are ℎ2 = 4 and 131 ℎ2 = 0. Note that 0 1 for 𝑆 and 0 4 for 𝑆 . 213 The new signature matrix is: Big Data Analytics CSCI 4030 67 Current signature matrix  Finally we consider row 4 of the matrix. This row has 1 in 𝑆 and its hash values are ℎ2 = 0 and ℎ2 = 312 3. Note that 0 2 for 𝑆 . The updated final signature 3 matrix is: Big Data Analytics CSCI 4030 68 Estimate the Jaccard similarities of the underlying sets 𝑆 and 𝑆 from the signature matrix. 14  What is the actual Jaccard similarity between 𝑆 and 1 𝑆 4 Big Data Analytics CSCI 4030 69 Estimate the Jaccard similarities of the underlying sets 𝑆 and 𝑆 from the signature matrix. 14  What is the actual Jaccard similarity between 𝑆 and 1 𝑆 4  Ans  The estimated Jaccard similarity is 2/3  The actual Jaccard similarity is 1 Big Data Analytics CSCI 4030 70𝑟𝑏  Evaluate the Scurve 1−(1−𝑠) for s = 0.2, . . . , 0.8, for the following values of r and b:  r = 3 and b = 10. You can write a script to calculate it. 𝑟𝑏 s 1−(1−𝑠) .2 .3 .4 .5 .6 .7 .8 Big Data Analytics CSCI 4030 71𝑟𝑏  Evaluate the Scurve 1−(1−𝑠) for s = 0.2, . . . , 0.8, for the following values of r and b:  r = 3 and b = 10. You can write a script to calculate it.  Ans: 𝑟𝑏 s 1−(1−𝑠) .2… .3… .4… .5… .6… .7… .8… Big Data Analytics CSCI 4030 72 The Jaccard similarity of sets is the ratio of the size of the intersection of the sets to the size of the union.  This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers. Big Data Analytics CSCI 4030 73 A kshingle is any k characters that appear consecutively in a document.  If we represent a document by its set of kshingles, then the Jaccard similarity of the shingle sets measures the textual similarity of documents.  Sometimes, it is useful to hash shingles to bit strings of shorter length, and use sets of hash values to represent documents. Big Data Analytics CSCI 4030 74 A minhash function on sets is based on a permutation of the universal set.  Given any such permutation, the minhash value for a set is that element of the set that appears first in the permuted order. Big Data Analytics CSCI 4030 75 It is not really possible to generate random permutations.  Thus, it is normal to simulate a permutation by  picking a random hash function and  taking the minhash value for a set to be the least hash value of any of the set’s members.. Big Data Analytics CSCI 4030 76 Locality sensitive hashing technique allows us to avoid computing the similarity of every pair of sets or their minhash signatures.  If we are given signatures for the sets, we may divide them into bands, and only measure the similarity of a pair of sets if they are identical in at least one band.  By choosing the size of bands appropriately, we can eliminate from consideration most of the pairs that do not meet our threshold of similarity. Big Data Analytics CSCI 4030 77 Finish Quiz unless you did over the lecture.  Review slides  Read Chapter 3 from course book.  You can find electronic version of the book on Blackboard. Big Data Analytics CSCI 4030 78