Question? Leave a message!

Graph Data Management

Graph Data Management
Dr.GordenMorse Profile Pic
Published Date:22-07-2017
Website URL
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn LeitowOverview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of frameworks a) Framework implementation 4) Framework analysis a) Performance comparison b) Optimization techniques by Fynn Leitow 5) Conclusion & CriticismMotivation ➢How to implement graph analysis algorithms on huge graphs? ➢How do they perform? ➢How about parallel computing?Application for big data ➢Social Networks & Web of Data ➢extremely large & dynamic ➢can’t be handled by legacy programsSocial Networks ➢vertices: people, pictures, videos ➢edges: relations among nodes (friendship, follower) ➢scale-free: follow power-law distribution (few vertices have high popularity)Web of Data ➢Large-scale structured data by governments, researchers, companies ➢Publishing principles: ○ Unique ressource identifier ○ publication at this URI in RDF triples (ressource-data-framework, statements similar to entity- relationship model but more general) ○ links to similar online ressources ➢RDF example: “The sky has the color blue” subject: predicate: object: “the sky” “has” “the color blue”Typical queries ➢ social networks: ○ punctual updates (of vertices, adding edges) ○ transitive closures (“other people you might know”, ...) ○ betweenness queries (“common friends”, shortest path, ...) ➢ Web of data: ○ bulk inserts ○ joins ○ logical inference (deductions) All men are Socrates Therefore, mortal is a man Socrates is mortal.Objective of the paper ➢ Analyse existing graph frameworks - also for machine learning ➢ native, hand-optimized implementation as reference point ➢ give suggestions to improve performance gapAlgorithms PageRank (statistics) Breadth-first search (graph traversal) Triangle Counting (statistics) Collaborative filtering (machine learning)1.PageRank (site popularity) ➢how many links go to this this site? ➢technically: probability for a random walk to end on this vertex t - iteration r - probability for random jump e - set of directed edges degree(j) - number of outgoing edges2. Breadth-first search (BFS) - Graph traversal ➢calculates “distance” (smallest number of edges) from start to any other vertex ➢Distance(start) initialized to zero, all others to infinity ➢iteratively computes for neighboring vertices:3. Triangle Counting - graph statistics ➢Triangle := two vertices are both neighbors of a common third vertex ➢Algorithm: ○ each vertex shares his neighborhood list with its neighbors ○ do neighbors overlap with the neighborhood lists? - triangle k i jCollaborative filtering - machine learning ➢Recommendation system: predicts user ratings based on an incomplete set of (user, item) ratings (a)Given a rating matrix R, find P and Q so that R = PQ is approximated best. (b) find the bipartite graph with edge weights R u,vOverview Algorithm Application Graph type Vertex property Message size (Bytes/edge) PageRank Graph statistics Directed, Double (pagerank) Constant (8) unweighted edges Breadth First Graph traversal Undirected, Int (distance) Constant (4) Search unweighted edges Collaborative Machine learning Bipartite graph; Array of Doubles Constant (8K) Filtering Undirected, (p or q ) u v weighted edges 6 Triangle Counting Graph statistics Directed, Long ( triangles) Variable (0-10 ) unweighted edges ➢We focus on PageRank and BFSFrameworks Graphlab CombinatorialBLAS SocialLite Galois GiraphExplanation: Framework & Ninja gap ➢Framework: layered structure indicating what kind of programs can/should be built and how they should interrelate ➢can include programs, specify interfaces, offer programming tools,... ➢Ninja performance gap: “Performance gap between naively written C++/C code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors” 1Choice of Frameworks: (Austin Texas University) (Stanford University) COMBINATORIAL_BLAS (UCSB)Overview Framework Programming Language Graph Communication model Partitioning layer GraphLab Vertex C++ 1-D (vertex-part.) Sockets CombBLAS Sparse (adjacency) matrix C++ 2-D (edge-part.) MPI SociaLite Datalog (declarative, Java 1-D Sockets deductive database tables) Galois Task-based C/C++ N/A N/A Giraph (Hadoop) Vertex Java 1-D Netty ➢Galois: parallel computing framework for irregular data structuresPageRank - vertex programming ➢Graphlab & Giraph, similar for Galois ➢runs on a single vertex and communicates with adjacent vertices ➢Vertex program for one iteration:PageRank - sparse matrix (CombBLAS) ➢single iteration of PageRank: A - adjacency matrix p - vector of PageRank values t p = p / vector of vertex out-degrees t t