Graph Mining lecture notes

mining data with complex structures and lecture notes on data mining concepts and techniques. lecture notes on association analysis in data mining pdf free download
Dr.NavneetSingh Profile Pic
Dr.NavneetSingh,India,Teacher
Published Date:21-07-2017
Your Website URL(Optional)
Comment
VLDB Database School (China) 2010 August 3-7, 2010, Shenyang Lecture Notes Part 1 Mining and Searching Complex Structures Anthony K.H. Tung(邓锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/atung Mining and Searching Complex Structures Chapter 1 Introduction Mining and Searching Complex Structures Introduction Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/atung Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900 What is data mining? Really nothing different from what scientists had been doing for Correct, useful Generate model data Collect data and verify or Nobel Real World construct model of real world Prize Output most likely model based on some statistical Feed in data measure What’s new? Systematically and efficiently test many statistical models 1Mining and Searching Complex Structures Chapter 1 Introduction Components of data mining Structure of model geneA=high and geneB=low === cancer geneA, geneB and geneC exhibit strong correlation Statistical Score for the model Accuracy of rule 1 is 90% Similarity function: Are they sufficiently similar group of records that support a certain model or hypothesis? Search method for the correct model parameters Given 200 genes, there could be 2200 rules. Which rule give the best prediction power? Database access method Given 1 million records, how to quickly find relevant records to compute the accuracy of a rule? The Apriori Algorithm • Bottom-up, breadth first a,b,c,e search • Only read is perform on b,c,e b,c,e b,c,e the databases a,b,c a,b,e a,c,e • Store candidates in memory to simulate the lattice search a,b c,e a,b a,c a,c a,c a,e a,e b,c b,c b,c b,e b,e b,e c,e c,e • Iteratively follow the two steps: a a a b b b c c c e e e –generate candidates –count and get actual frequent items start 4 2Mining and Searching Complex Structures Chapter 1 Introduction The K-Means Clustering Method • Given k, the k-means algorithm is implemented in 4 steps: –Partition objects into k nonempty subsets –Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster. –Assign each object to the cluster with the nearest seed point. –Go back to Step 2, stop when no more new assignment. 5 The K-Means Clustering Method • Example 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 01 23 45 678 9 10 01 23 45 678 9 10 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 012 34 56 7 8 9 10 01 2345 6 789 10 6 3Mining and Searching Complex Structures Chapter 1 Introduction Training Dataset (Decision Tree) Outlook Temp Humid Wind PlayTennis Sunny Hot High Weak No Sunny Hot High Strong No Overcast Hot High Weak Yes Rain Mild High Weak Yes Rain Cool Normal Weak Yes Rain Cool Normal Strong No Overcast Cool Normal Strong Yes Sunny Mild High Weak No Sunny Cool Normal Weak Yes Rain Mild Normal Weak Yes Sunny Mild Normal Strong Yes Overcast Mild High Strong Yes Overcast Hot Normal Weak Yes Rain Mild High Strong No 7 Selecting the Next Attribute S=9+,5- S=9+,5- E=0.940 E=0.940 Humidity Wind High Normal Weak Strong 3+, 4- 6+, 1- 6+, 2- 3+, 3- E=0.985 E=0.592 E=0.811 E=1.0 Gain(S,Wind) Gain(S,Humidity) =0.940-(8/14)0.811 =0.940-(7/14)0.985 – (6/14)1.0 – (7/14)0.592 =0.048 =0.151 8 4Mining and Searching Complex Structures Chapter 1 Introduction Selecting the Next Attribute S=9+,5- E=0.940 Outlook Over Rain Sunny cast 3+, 2- 2+, 3- 4+, 0 E=0.971 E=0.971 E=0.0 Gain(S,Outlook) =0.940-(5/14)0.971 -(4/14)0.0 – (5/14)0.0971 =0.247 9 ID3 Algorithm D1,D2,…,D14 Outlook 9+,5- Sunny Overcast Rain S =D1,D2,D8,D9,D11 D3,D7,D12,D13 D4,D5,D6,D10,D14 sunny 2+,3- 4+,0- 3+,2- Yes ? ? Gain(S , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970 sunny Gain(S , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570 sunny Gain(S , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019 sunny 10 5Mining and Searching Complex Structures Chapter 1 Introduction Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes 11 Can we fit what we learn into the framework? Apriori K-means ID3 clustering classification rule pattern discovery task association rules clusters decision tree structure of the model or pattern lattice of all possible choice of any k all possible search space points as center combination of combination of items m size=infinity decision tree size= 2 size= potentially infinity support, confidence square error accuracy, score function information gain breadth first with gradient descent greedy search /optimization pruning method TBD TBD TBD data management 12 technique 6Mining and Searching Complex Structures Chapter 1 Introduction Components of data mining(II) Models Enumeration Algorithm Statistical Score Function Similarity/Search Function Database Access Method Database Background knowledge • We assume you have some basic knowledge about data mining, some of the slides here will be very useful for this purpose • Association Rule Mining http://www.comp.nus.edu.sg/atung/Renmin56.pdf • Classification and Regression http://www.comp.nus.edu.sg/atung/Renmin67.pdf • Clustering http://www.comp.nus.edu.sg/atung/Renmin78.pdf 7Mining and Searching Complex Structures Chapter 1 Introduction IT Trend Processors are cheap and will become cheaper(multi-core processor, graphic cards) Storage will be cheap but might not be fast Bandwidth will be growing What can we do with this? Play more realistic games Not exactly a joke since any technologies that speed up games can speed up scientific simulation Smarter (more intensive) computation Can store more personal semantic/ontology People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent The AI dream now have the support of much better hardwares Essentially, data mining can be made much more simple for the man on the street Data mining should be human-centered, not machine centered 2010-7-31 15 What is complex data? What is “simple” data? z What are complex data? Regular tabular table, with small number of attributes (of the same type), no Test1 Gene1 Progress comments Pos 2.0 Fever …… missing values. Neg -0.3 Unconscious N.A 5.7 High dimensional data: Lots of attributes with different data types with missing values Sequences/ time series Trees Graphs 8Mining and Searching Complex Structures Chapter 1 Introduction Why complex data? They come naturally in many applications. Bring research nearer to real world Lots of challenges which mean more fun Some fundamental challenges: How do you compare complex objects effectively and efficiently? How do you find special subset in the data that is interesting? Test1 Gene1 Progress comments Pos 2.0 Fever …… What new type of models and score function must you used? Neg -0.3 Unconscious How do you handle noise and error ? N.A 5.7 a a b be c d d cd b d c d c e T T 1 2 Personalized Semantic for Personal Data Management everyone will own terabytes of data soon improve query/search interface by mining and extracting personalized semantics like entities and their relationship etc. by comparing them against high quality tagged databases Query by Query by Query by Query by video documents audio/music photographs/images Wikipedia singers authors High Quality semantic Data actor/actress songs Sources layer papers places movies Personal Data documents audio video Webpage/Blogs/Bookmarks photographs/i music mages 9Mining and Searching Complex Structures Chapter 1 Introduction Integrated Approach to Mining Software Engineering Data software engineering data: code base, change history, bug reports, runtime trace integrated into a data warehouse to support decision making and mining, Example: Which code module should I modify to create a new function? Which module need maintenance? programming defect detection testing debugging maintenance … software engineering tasks helped by data mining association/ classification clustering … patterns Data Warehouse code change program structural bug … bases history states entities reports/nl software engineering data WikiScience Web 2.0: Facebook for scientists Collaborative platform for scientist to build scientific models/hypothesis and share data, applications Based on some articles, I make some changes to Model A supporting articles tagged to to create Model B Model B Centralized, Centralized, Hybrid Model Model A Model B Hybrid Model Model A Model B C Constructed C Constructed by System by System supporting dataset tagged to Model A This is my model of the solar system base on my supporting dataset 10Mining and Searching Complex Structures Chapter 1 Introduction Hey, why not Cloud Computing, Map/Reduce? • These are platform for scaling up services to large number of users on large amount of data • But what exactly do you want to scale up? • Services that provide useful and semantically correct information to the users • We have too many scalable data mining algorithms that find nothing or too many things • Let’s focus on finding useful things first (assuming we have lot’s of processing power) and then try to scale it up Schedule of the Course Date/Time Content Lesson 1 Introduction Lesson 2 Mining and Search High Dimensional Data I Lesson 3 Mining and Search High Dimensional Data II Lesson 4 Mining and Search High Dimensional Data III Lesson 5 Similarity Search for Sequences and Trees I Lesson 6 Similarity Search for Sequences and Trees III Lesson 7 Similarity Search for Graph I Lesson 8 Similarity Search for Graph II Lesson 9 Similarity Search for Graph III Lesson 10 Mining Massive Graph I Lesson 11 Mining Massive Graph II Lesson 12 Mining Massive Graph III 11Mining and Searching Complex Structures Chapter 1 Introduction Focus of the course • Techniques that can handle high dimensional, complex structures –Providing semantics to similarity search –Shotgun and Assembly: Column/Feature Wise Processing using Inverted Index –Row-wise Enumeration –Using local properties to infer global properties • Throughout the course, please try to think of how these techniques are applicable across different type of complex structures Databases Queries z To start off, we will consider something very basic call ranking queries since we need ranking any similarity search (usually from most similar to most dissimilar) z In relational database, SQL returns all results at one go z How many tuples can be fitted in one screen? z How many tuples can you remember? z Options: z Summarize the results z Display representative tuples z How to select representative tuples? 12Mining and Searching Complex Structures Chapter 1 Introduction Retrieve Relevant Information z Search videos related to Shanghai Expo z Too many results: as long as you click “next”, there are 20 more new results z Are we interested in all results? z No, only most relevant ones z Search engines have to rank the results, out of which they make money from Question: How to Select a Small Result Set z Selecting the most representative or most interesting results is not trivial z Find an apartment with rental cheaper than 1000, the cheaper the better z The result tuples can be sorted in the ascending order of rental prices, those in front are more favorable z Find an apartment with rental cheaper than 1000 near NEU, the lower the better, the nearer the better z Apartment with lower rent may not be near, nearer one may not be cheap z Order by prices? Order by distances? 13Mining and Searching Complex Structures Chapter 1 Introduction Top-k Queries z Define a scoring function, which maps a tuple to a real number, as a score z The higher the score is, the more favorable the tuple is z Define an integer k z Answer: k objects with highest scores z Different scoring function may give different top-k result Price Distance to NEU Apartment A 800 500 meter Apartment B 1200 200 meters z Given k = 1, if the score function is defined as the sum of price and distance, the first tuple is better; if it is defined as the product, the second tuple is better Brute Force Top-k z Compute scores for each result tuple z Sort the tuples according to the descending order of the scores z Select the first k tuples z What if the number of tuples is unlimited? Search engines can give unlimited number of results z Even if the number of tuples is limited, it is too slow to compute score for each tuple z We have to do it efficiently 14Mining and Searching Complex Structures Chapter 1 Introduction Outline z Two well-known top-k algorithms z Fagin's Algorithm (FA) z The Threshold Algorithm (TA) z Take random access into consideration z No Random Access Algorithm (NRA) z The Combined Algorithm (CA) Monotonicity z A score function f is monotone if f(x ,x ,...,x )≤f(y ,y ,...,y ) 1 2 m 1 2 m whenever x≤y for every i i i z Select top-3 students with highest total score in mathematics, physics and computer science: • select name, math+phys+comp as score from student order by score desc limit 3 z sum(x.math,x.phys,x.comp)≤sum(y.math,y.phys,y.comp) if x.math≤y.math and x.phys≤y.phys and x.comp≤y.comp 15Mining and Searching Complex Structures Chapter 1 Introduction Sorted Lists z We shall think of a database consisting of m sorted lists L , 1 L , … L 2 m L L L math phys comp Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Outline z Two well-known top-k algorithms z Fagin's Algorithm (FA) z The Threshold Algorithm (TA) z Take random access into consideration z No Random Access Algorithm (NRA) z The Combined Algorithm (CA) 16Mining and Searching Complex Structures Chapter 1 Introduction Fagin's Algorithm (I) z Do sequential access until there are at least k matches Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... z Sequential accesses are stopped when 3 students are seen, i.e. Ann, Hugh and Kurt Fagin's Algorithm (II) z For each object that has been seen, do random accesses on other lists to compute its score Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... z Random accesses need to be done for Ben, Carl, Jane and Ryan 17Mining and Searching Complex Structures Chapter 1 Introduction Fagin's Algorithm (III) z Select the k objects with highest score as top-k result Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Why is FA correct? (I) z There are at least k objects seen on all attributes when sequential access is stopped z By monotonicity, those objects that are not seen do not have higher score than the above k objects Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 18Mining and Searching Complex Structures Chapter 1 Introduction Why is FA correct? (II) z For those that have been seen, it is either all attributes has been seen, or random accesses are performed to know all attributes z The k objects with highest scores are therefore the top-k result Ann 98 Hugh 97 Kurt 96 Ben 96 Ryan 94 Ann 95 Kurt 93 Ann 92 Jane 95 Hugh 91 Kurt 91 Ben 93 Carl 90 Jane 89 Hugh 92 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Outline z Two well-known top-k algorithms z Fagin's Algorithm (FA) z The Threshold Algorithm (TA) z Take random access into consideration z No Random Access Algorithm (NRA) z The Combined Algorithm (CA) 19

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.