Benefits of big data analysis

Applications of big data analysis, and open-source databases for big data
Dr. Gorden Morse, France, Professional
Published: 22-07-2017
Big Data Analytics CSCI 4030

Course topic map, by data type:
- High-dimensional data: locality-sensitive hashing, clustering, dimensionality reduction
- Graph data: PageRank, SimRank, network analysis, spam detection
- Infinite data: filtering data streams, web advertising, queries on streams
- Machine learning: SVM, decision trees, perceptron, kNN
- Apps: recommender systems, association rules, duplicate document detection

[Figures: Hays and Efros, SIGGRAPH 2007 -- scene-completion results showing the 10 nearest neighbors from a collection of 20,000 images versus the 10 nearest neighbors from a collection of 2 million images.]

Many problems can be expressed as finding "similar" sets, i.e., finding near-neighbors in high-dimensional space. Examples:
- Pages with similar words, for duplicate detection (mirror pages, plagiarism)
- Customers who purchased similar products, e.g., online purchases (Amazon)
- Images with similar features
- Users who visited similar websites

The formal setting:
- Given: high-dimensional data points x1, x2, ... For example, an image is a long vector of pixel colors:
    [1 2 1]
    [0 2 1]  ->  [1 2 1 0 2 1 0 1 0]
    [0 1 0]
- And some distance function d(x1, x2), which quantifies the "distance" between x1 and x2.
- Goal: find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) <= s.
- Note: the naive solution takes O(N^2) comparisons, where N is the number of data points. The magic: this can be done in O(N). How?
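The naive O(N^2) baseline above can be sketched as follows. This is a minimal illustration only; the toy points and the absolute-difference distance are assumptions for the example, since the lecture leaves d(xi, xj) abstract.

```python
from itertools import combinations

def naive_near_neighbors(points, d, s):
    """Return all index pairs (i, j) with d(points[i], points[j]) <= s.

    Compares every pair of points, so it costs O(N^2) distance
    evaluations -- exactly the cost locality-sensitive hashing avoids.
    """
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if d(points[i], points[j]) <= s]

# Toy example: 1-D points with absolute-difference distance.
points = [0.0, 0.1, 5.0, 5.05]
pairs = naive_near_neighbors(points, lambda a, b: abs(a - b), s=0.2)
# pairs == [(0, 1), (2, 3)]
```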
Today's lecture: find pairs of similar documents.

Main idea: candidates.
- Pass 1: take the documents and hash them to buckets, such that similar documents hash to the same bucket.
- Pass 2: compare only documents that are candidates (i.e., that hashed to the same bucket).
Benefit: instead of O(N^2) comparisons, we need only O(N) comparisons to find similar documents.

Goal: find near-neighbors in high-dimensional space.
- We formally define "near neighbors" as points that are a "small distance" apart.
- For each application, we first need to define what "distance" means.
- Today: Jaccard distance/similarity.

The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
    sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
The Jaccard distance is its complement:
    d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
Example: 3 elements in the intersection and 8 in the union give Jaccard similarity 3/8 and Jaccard distance 5/8.

Goal: given a large number N of documents (in the millions or billions), find "near-duplicate" pairs.
Applications:
- Mirror websites, or approximate mirrors: we don't want to show both in search results.
- Similar news articles at many news sites: cluster articles by "same story."
Problems:
- Many small pieces of one document can appear out of order in another.
- There are too many documents to compare all pairs.
- Documents are so large, or so many, that they cannot all fit in main memory.

The three essential steps for finding similar documents:
1. Shingling: convert documents to sets.
2. Min-Hashing: convert large sets to short signatures, while preserving similarity.
3.
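The Jaccard formulas above translate directly into code. A small sketch using Python sets, checked against the slide's 3-in-intersection, 8-in-union example (the concrete sets below are an assumption chosen to produce those counts):

```python
def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two sets."""
    if not c1 and not c2:
        return 1.0  # convention: two empty sets are identical
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)

# Sets with intersection {4, 5, 6} (3 elements) and union {1..8} (8 elements):
c1 = {1, 2, 3, 4, 5, 6}
c2 = {4, 5, 6, 7, 8}
assert jaccard_similarity(c1, c2) == 3 / 8
assert jaccard_distance(c1, c2) == 5 / 8
```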
Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents; these are the candidate pairs.

The pipeline: Document -> Shingling -> the set of strings of length k that appear in the document -> Min-Hashing -> signatures: short integer vectors that represent the sets and reflect their similarity -> Locality-Sensitive Hashing -> candidate pairs: those pairs of signatures that we need to test for similarity.

Step 1: Shingling -- convert documents to sets.
Simple approaches:
- Document = set of words appearing in the document.
- Document = set of "important" words (eliminate stop words: "and", "a", "the", "to", "you", and so on).
These don't work well for this application. Why? We need to account for the ordering of words. A different way: shingles.

A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the document. Tokens can be characters, words, or something else, depending on the application; assume tokens = characters for the examples.
Example: k = 2; document D1 = abcab. Set of 2-shingles: S(D1) = {ab, bc, ca}.
Option: treat shingles as a bag (multiset) and count ab twice: S'(D1) = {ab, bc, ca, ab}.

To compress long shingles, we can hash (or simply map) them to, say, 4 bytes, and represent a document by the set of hash values of its k-shingles. Idea: collisions of shingles are rare (or absent).
Example: k = 2; document D1 = abcab. Set of 2-shingles: S(D1) = {ab, bc, ca}. Hash the shingles: h(D1) = {1, 5, 7}.

Document D1 is the set of its k-shingles, C1 = S(D1). Equivalently, each document is a 0/1 vector in the space of k-shingles: each unique shingle is a dimension, and the vectors are very sparse. A natural similarity measure is the Jaccard similarity:
    sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

Documents that have lots of shingles in common have similar text, even if the text appears in a different order.
Caveat: you must pick k large enough, or most documents will share most shingles.
- k = 5 is OK for short documents.
- k = 10 is better for long documents.
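The shingling step can be sketched as follows, using the lecture's example (k = 2, D1 = abcab, character tokens). The second document d2 and the use of Python's built-in hash() in place of the 4-byte shingle hash are assumptions for illustration.

```python
def shingles(doc, k):
    """Return the set of k-shingles (character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc, k, buckets=2**32):
    """Represent a document by hashed shingles in the range [0, buckets).

    Python's hash() stands in for the 4-byte hash in the notes; with a
    large bucket count, collisions between distinct shingles are rare.
    """
    return {hash(s) % buckets for s in shingles(doc, k)}

d1 = "abcab"
assert shingles(d1, 2) == {"ab", "bc", "ca"}

# Jaccard similarity of two documents via their shingle sets:
d2 = "abcd"
s1, s2 = shingles(d1, 2), shingles(d2, 2)
sim = len(s1 & s2) / len(s1 | s2)  # {ab, bc} over {ab, bc, ca, cd} = 2/4
```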