Web graph in graph theory

Web as a directed graph and mining web graphs for recommendations pdf and web based graph
Dr.GordenMorse Profile Pic
Dr.GordenMorse,France,Professional
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Big Data Analytics CSCI 4030 High dim. Graph Infinite Machine Apps data data data learning Locality Filtering PageRank, Recommen sensitive data SVM SimRank der systems hashing streams Community Web Decision Association Clustering Detection advertising Trees Rules Dimensional Duplicate Spam Queries on Perceptron, ity document Detection streams kNN reduction detection Big Data Analytics CSCI 4030 2  A graph is a representation of a set of objects (e.g., users, computers, …) where some pairs of objects are connected by links  Objects are called nodes or vertices  Links are called edges  The edges may be directed or undirected Big Data Analytics CSCI 4030 3 Facebook social graph 4-degrees of separation Backstrom-Boldi-Rosa-Ugander-Vigna, 2011 Big Data Analytics CSCI 4030 4 Connections between political blogs Polarization of the network Adamic-Glance, 2005 Big Data Analytics CSCI 4030 5 Citation networks and Maps of science Börner et al., 2012 Big Data Analytics CSCI 4030 6 domain2 domain1 router domain3 Internet Big Data Analytics CSCI 4030 7  Web as a directed graph:  Nodes: Webpages  Edges: Hyperlinks I teach a class on Databases. CSCI3030: Classes are in the UA building Computer Science Department at UOIT UOIT Big Data Analytics CSCI 4030 8  Web as a directed graph:  Nodes: Webpages  Edges: Hyperlinks I teach a class on Databases. CSCI 3030: Classes are in the UA building Computer Science Department at UOIT UOIT J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9 Big Data Analytics CSCI 4030 10  How to make the Web accessible?  First try: Human curated Web directories  Yahoo, DMOZ, LookSmart  Second try: Web Search  Information Retrieval investigates: Find relevant docs in a small and trusted set  Newspaper articles, Patents, etc.  But: Web is huge, full of untrusted documents, random things, web spam, etc. Big Data Analytics CSCI 4030 11 2 challenges of web search:  (1) Web contains many sources of information Who to “trust”?  Trick: Trustworthy pages may point to each other  (2) What is the “best” answer to query “newspaper”?  No single right answer  Trick: Pages that actually know about newspapers might all be pointing to many newspapers Big Data Analytics CSCI 4030 12  All web pages are not equally “important” www.joe-schmoe.com vs. www.stanford.edu  Let’s rank the pages by the link structure Big Data Analytics CSCI 4030 13  We will cover the following Link Analysis approaches for computing importances of nodes in a graph:  Page Rank  Topic-Specific (Personalized) Page Rank  Web Spam Detection Algorithms Big Data Analytics CSCI 4030 14  Idea: Links as votes  Page is more important if it has more links  In-coming links? Out-going links?  Think of in-links as votes:  www.stanford.edu has 23,400 in-links  www.joe-schmoe.com has 1 in-link  Are all in-links are equal?  Links from important pages count more  Recursive question Big Data Analytics CSCI 4030 16 A B C 3.3 38.4 34.3 D E F 3.9 8.1 3.9 1.6 1.6 1.6 1.6 1.6 Big Data Analytics CSCI 4030 17  Each link’s vote is proportional to the importance of its source page  If page j with importance r has n out-links, j each link gets r / n votes j  Page j’s own importance is the sum of the votes on its in-links i k r /3 i r /4 k j r /3 j r = r /3+r /4 j i k r /3 r /3 j j Big Data Analytics CSCI 4030 18 The web in 1839  A “vote” from an important y/2 page is worth more  A page is important if it is y pointed to by other important a/2 pages y/2  Define a “rank” r for page j m j a m a/2 r i “Flow” equations: r   j r = r /2 + r /2 y y a d i j i r = r /2 + r a y m r = r /2 m a 𝒅 … out-degree of node 𝒊 𝒊 Big Data Analytics CSCI 4030 19 Flow equations:  3 equations, 3 unknowns, r = r /2 + r /2 y y a no constants r = r /2 + r a y m  No unique solution r = r /2 m a  Additional constraint forces uniqueness: 𝒓 +𝒓 + 𝒓 = 𝟏 𝒚 𝒂 𝒎 𝟐 𝟐 𝟏  Solution: 𝒓 = ,𝒓 = ,𝒓 = 𝒚 𝒂 𝒎 𝟓 𝟓 𝟓  Gaussian elimination method (an algorithm for solving linear equations) works for small examples, but we need a better method for large web-size graphs  We need a new formulation Big Data Analytics CSCI 4030 20