
Optimization of Distributed Queries

Chapter 7: Optimization of Distributed Queries

• Basic Concepts
• Distributed Cost Model
• Database Statistics
• Joins and Semijoins
• Query Optimization Algorithms

Acknowledgements: I am indebted to Arturas Mazeika for providing me his slides of this course.

Basic Concepts

• Query optimization: the process of producing an optimal (or close to optimal) query execution plan, which represents an execution strategy for the query
  – The main task in query optimization is to consider different orderings of the operations
• Centralized query optimization:
  – Find the best query execution plan in the space of equivalent query trees
  – Minimize an objective cost function
  – Gather statistics about relations
• Distributed query optimization brings additional issues:
  – Linear query trees are not necessarily a good choice
  – Bushy query trees are not necessarily a bad choice
  – What and where to ship the relations
  – How to ship relations (ship as a whole, or ship as needed)
  – When to use semijoins instead of joins

Basic Concepts . . .

• Search space: the set of alternative query execution plans (query trees)
  – Typically very large
  – The main issue is to optimize the joins
  – For N relations, there are O(N!) equivalent join trees that can be obtained by applying commutativity and associativity rules
• Example: 3 equivalent query trees (join trees) of the joins in the following query:

  SELECT ENAME, RESP
  FROM EMP, ASG, PROJ
  WHERE EMP.ENO = ASG.ENO
  AND ASG.PNO = PROJ.PNO

Basic Concepts . . .

• Reduction of the search space
  – Restrict by means of heuristics
    ∗ Perform unary operations before binary operations, etc.
  – Restrict the shape of the join tree
    ∗ Consider the type of trees (linear trees vs. bushy ones)

[Figure: linear join tree vs. bushy join tree]

Basic Concepts . . .

• There are two main strategies to scan the search space:
  – Deterministic
  – Randomized
• Deterministic scan of the search space
  – Start from base relations and build plans by adding one relation at each step
  – Breadth-first strategy: build all possible plans before choosing the "best" plan (dynamic programming approach)
  – Depth-first strategy: build only one plan (greedy approach; see the sketch below)

Basic Concepts . . .

• Randomized scan of the search space
  – Search for optimal solutions around a particular starting point, e.g., with iterative improvement or simulated annealing techniques
  – Trades optimization time for execution time
    ∗ Does not guarantee that the best solution is obtained, but avoids the high cost of optimization
  – This strategy is better when more than 5-6 relations are involved
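To make the deterministic, depth-first (greedy) strategy concrete, the sketch below builds a single linear join order by always adding the relation that keeps the estimated intermediate result smallest. The relation names, cardinalities, and join selectivity factors are invented for illustration; a real optimizer would take them from the catalog statistics discussed in the following sections.

```python
# Minimal sketch of a greedy (depth-first) join-ordering strategy.
# All statistics below (CARD, SF_JOIN) are invented placeholder values.

CARD = {"EMP": 4_000, "ASG": 10_000, "PROJ": 1_000}   # estimated base cardinalities

# Estimated join selectivity factors for joinable pairs; missing pairs are
# treated as Cartesian products (selectivity factor 1.0).
SF_JOIN = {
    frozenset({"EMP", "ASG"}): 1 / 4_000,
    frozenset({"ASG", "PROJ"}): 1 / 1_000,
}

def est_join_card(card_left, right, joined):
    """Estimate card(left ⋈ right) = SF_⋈ * card(left) * card(right)."""
    sf = min((SF_JOIN[frozenset({r, right})]
              for r in joined if frozenset({r, right}) in SF_JOIN),
             default=1.0)
    return sf * card_left * CARD[right]

def greedy_join_order(relations):
    """Build one plan only: start with the smallest relation, then repeatedly
    join in the relation that yields the smallest estimated intermediate result."""
    order = [min(relations, key=CARD.get)]
    card = CARD[order[0]]
    remaining = set(relations) - set(order)
    while remaining:
        best = min(remaining, key=lambda r: est_join_card(card, r, set(order)))
        card = est_join_card(card, best, set(order))
        order.append(best)
        remaining.remove(best)
    return order

print(greedy_join_order(["EMP", "ASG", "PROJ"]))   # -> ['PROJ', 'ASG', 'EMP']
```

A breadth-first (dynamic programming) optimizer would instead keep the best plan for every subset of relations before choosing among them; the greedy version trades plan quality for a much smaller search effort.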
Distributed Cost Model

• Two different types of cost functions can be used:
  – Reduce total time
    ∗ Reduce each cost component (in terms of time) individually, i.e., do as little as possible for each cost component
    ∗ Optimizes the utilization of the resources (i.e., increases system throughput)
  – Reduce response time
    ∗ Do as many things in parallel as possible
    ∗ May increase total time because of increased total activity

Distributed Cost Model . . .

• Total time: sum of the time of all individual components
  – Local processing time: CPU time + I/O time
  – Communication time: fixed time to initiate a message + time to transmit the data

  Total time = T_CPU * #instructions + T_I/O * #I/Os + T_MSG * #messages + T_TR * #bytes

• The individual components of the total cost have different weights:
  – Wide area networks
    ∗ Message initiation and transmission costs are high
    ∗ Local processing cost is low (fast mainframes or minicomputers)
    ∗ Ratio of communication to I/O costs is 20:1
  – Local area networks
    ∗ Communication and local processing costs are more or less equal
    ∗ Ratio of communication to I/O costs is 1:1.6 (for a 10 MB/s network)

Distributed Cost Model . . .

• Response time: elapsed time between the initiation and the completion of a query

  Response time = T_CPU * #seq_instructions + T_I/O * #seq_I/Os + T_MSG * #seq_messages + T_TR * #seq_bytes

  – where #seq_x (x in instructions, I/Os, messages, bytes) is the maximum number of x which must be done sequentially
• Any processing and communication done in parallel is ignored

Distributed Cost Model . . .

• Example: a query at site 3 uses data from sites 1 and 2 (x bytes are shipped from site 1 and y bytes from site 2)
  – Assume that only the communication cost is considered
  – Total time = 2 * T_MSG + T_TR * (x + y)
  – Response time = max{ T_MSG + T_TR * x, T_MSG + T_TR * y }

Database Statistics

• The primary cost factor is the size of intermediate relations
  – that are produced during the execution and
  – must be transmitted over the network, if a subsequent operation is located on a different site
• It is costly to compute the size of the intermediate relations precisely
• Instead, global statistics of relations and fragments are computed and used to provide approximations

Database Statistics . . .

• Let R(A_1, A_2, ..., A_n) be a relation fragmented into R_1, R_2, ..., R_r
• Relation statistics
  – min and max values of each attribute: min(A_i), max(A_i)
  – length of each attribute: length(A_i)
  – number of distinct values of each attribute (cardinality): card(A_i), and the cardinality of its domain: card(dom(A_i))
• Fragment statistics
  – cardinality of the fragment: card(R_i)
  – cardinality of each attribute of each fragment: card(Π_{A_j}(R_i))

Database Statistics . . .

• Selectivity factor of an operation: the proportion of tuples of an operand relation that participate in the result of that operation
• Assumption: independent attributes and a uniform distribution of attribute values
• Selectivity factor of a selection:

  SF_σ(A = value) = 1 / card(Π_A(R))
  SF_σ(A > value) = (max(A) − value) / (max(A) − min(A))
  SF_σ(A < value) = (value − min(A)) / (max(A) − min(A))

• Properties of the selectivity factor of the selection (see the code sketch at the end of this section):

  SF_σ(p(A_i) ∧ p(A_j)) = SF_σ(p(A_i)) * SF_σ(p(A_j))
  SF_σ(p(A_i) ∨ p(A_j)) = SF_σ(p(A_i)) + SF_σ(p(A_j)) − SF_σ(p(A_i)) * SF_σ(p(A_j))
  SF_σ(A ∈ values) = SF_σ(A = value) * card(values)

Database Statistics . . .

• Cardinality of intermediate results
  – Selection: card(σ_P(R)) = SF_σ(P) * card(R)
  – Projection
    ∗ More difficult: duplicates and correlations between projected attributes are unknown
    ∗ Simple if the projected attribute A is a key: card(Π_A(R)) = card(R)
  – Cartesian product: card(R × S) = card(R) * card(S)
  – Union
    ∗ upper bound: card(R ∪ S) ≤ card(R) + card(S)
    ∗ lower bound: card(R ∪ S) ≥ max{card(R), card(S)}
  – Set difference
    ∗ upper bound: card(R − S) = card(R)
    ∗ lower bound: 0
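As a sanity check on the formulas above, the following sketch implements the selection selectivity factors and the selection cardinality estimate under the stated assumptions (independent attributes, uniform value distributions). The statistics and the example numbers are invented; in practice they come from the catalog.

```python
# Hedged sketch of the selection selectivity factors; all statistics are invented.

def sf_eq(distinct_values):
    """SF_σ(A = value) = 1 / card(Π_A(R))."""
    return 1.0 / distinct_values

def sf_gt(value, min_a, max_a):
    """SF_σ(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)

def sf_lt(value, min_a, max_a):
    """SF_σ(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)

def sf_and(sf1, sf2):
    """SF_σ(p(A_i) ∧ p(A_j)) = SF_σ(p(A_i)) * SF_σ(p(A_j))."""
    return sf1 * sf2

def sf_or(sf1, sf2):
    """SF_σ(p(A_i) ∨ p(A_j)) = SF_σ(p(A_i)) + SF_σ(p(A_j)) - their product."""
    return sf1 + sf2 - sf1 * sf2

def card_selection(sf, card_r):
    """card(σ_P(R)) = SF_σ(P) * card(R)."""
    return sf * card_r

# Invented example: card(PROJ) = 1_000, BUDGET ranges from 100_000 to 500_000,
# and LOC has 50 distinct values.
sf = sf_and(sf_eq(50), sf_gt(300_000, 100_000, 500_000))
print(card_selection(sf, 1_000))   # 1/50 * 0.5 * 1000 = 10.0 expected tuples
```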
Database Statistics . . .

• Selectivity factor for joins:

  SF_⋈ = card(R ⋈ S) / (card(R) * card(S))

• Cardinality of joins
  – Upper bound: the cardinality of the Cartesian product: card(R ⋈ S) ≤ card(R) * card(S)
  – General case (if SF_⋈ is given): card(R ⋈ S) = SF_⋈ * card(R) * card(S)
  – Special case: R.A is a key of R and S.A is a foreign key of S
    ∗ each S-tuple matches with at most one tuple of R: card(R ⋈_{R.A=S.A} S) = card(S)

Database Statistics . . .

• Selectivity factor for semijoins: the fraction of R-tuples that join with S-tuples
  – An approximation is the selectivity of A in S:

    SF_⋉(R ⋉_A S) = SF_⋉(S.A) = card(Π_A(S)) / card(dom(A))

• Cardinality of the semijoin (general case):

    card(R ⋉_A S) = SF_⋉(S.A) * card(R)

• Example: R.A is a foreign key referencing S.A (S.A is a primary key)
  – Then SF_⋉ = 1 and the result size corresponds to the size of R
  – (A cost comparison of join vs. semijoin shipping is sketched at the end of this section)

Join Ordering in Fragment Queries

• Join ordering is an important aspect in centralized DBMSs, and it is even more important in a DDBMS, since joins between fragments that are stored at different sites may increase the communication time.
• Two approaches exist:
  – Optimize the ordering of joins directly
    ∗ INGRES and distributed INGRES
    ∗ System R and System R*
  – Replace joins by combinations of semijoins in order to minimize the communication costs
    ∗ Hill Climbing and SDD-1

Join Ordering in Fragment Queries . . .

• Direct join ordering of two relations/fragments located at different sites
  – Move the smaller relation to the other site
  – We have to estimate the sizes of R and S

Join Ordering in Fragment Queries . . .

• Direct join ordering of queries involving more than two relations is substantially more complex
• Example: Consider the following query and the respective join graph, where we also make assumptions about the locations of the three relations/fragments (EMP at site 1, ASG at site 2, PROJ at site 3):

  PROJ ⋈_PNO ASG ⋈_ENO EMP

Join Ordering in Fragment Queries . . .

• Example (contd.): The query can be evaluated in at least 5 different ways.
  – Plan 1: EMP → Site 2; Site 2 computes EMP' = EMP ⋈ ASG; EMP' → Site 3; Site 3 computes EMP' ⋈ PROJ
  – Plan 2: ASG → Site 1; Site 1 computes EMP' = EMP ⋈ ASG; EMP' → Site 3; Site 3 computes EMP' ⋈ PROJ
  – Plan 3: ASG → Site 3; Site 3 computes ASG' = ASG ⋈ PROJ; ASG' → Site 1; Site 1 computes ASG' ⋈ EMP
  – Plan 4: PROJ → Site 2; Site 2 computes PROJ' = PROJ ⋈ ASG; PROJ' → Site 1; Site 1 computes PROJ' ⋈ EMP
  – Plan 5: EMP → Site 2; PROJ → Site 2; Site 2 computes EMP ⋈ PROJ ⋈ ASG
• To select a plan, a lot of information is needed, including
  – size(EMP), size(ASG), size(PROJ), size(EMP ⋈ ASG), size(ASG ⋈ PROJ)
  – Possibilities of parallel execution if response time is used
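To illustrate the choice between shipping a whole relation for a join and reducing it first with a semijoin, the following sketch compares the number of bytes transferred in both cases, using the semijoin selectivity estimate SF_⋉(S.A) = card(Π_A(S)) / card(dom(A)). All sizes and statistics are invented placeholders, and only communication is counted, as in the communication-only cost model above.

```python
# Hedged sketch: bytes shipped for a direct join (send R to S's site) versus a
# semijoin-based reduction (send Π_A(S) to R's site, then ship R ⋉_A S back).
# All numbers are invented for illustration.

def join_transfer(card_r, width_r):
    """Ship the whole relation R to the site of S."""
    return card_r * width_r

def semijoin_transfer(card_r, width_r, distinct_a_in_s, dom_a, width_a):
    """Ship Π_A(S) to R's site, then ship R ⋉_A S to S's site.
    card(R ⋉_A S) = SF_⋉(S.A) * card(R), with SF_⋉(S.A) = card(Π_A(S)) / card(dom(A))."""
    sf_semijoin = distinct_a_in_s / dom_a
    reduced_card_r = sf_semijoin * card_r
    return distinct_a_in_s * width_a + reduced_card_r * width_r

# Invented statistics: card(R) = 10_000 tuples of 100 bytes each; S contains
# 500 of the 10_000 possible join-attribute values; the attribute is 4 bytes wide.
print(join_transfer(10_000, 100))                       # 1_000_000 bytes
print(semijoin_transfer(10_000, 100, 500, 10_000, 4))   # 2_000 + 50_000 = 52_000 bytes
```

With these (invented) numbers the semijoin pays off because the join attribute is highly selective; when most R-tuples participate in the join, the extra transfer of Π_A(S) can make the direct join cheaper, which is exactly the trade-off the semijoin-based algorithms (Hill Climbing, SDD-1) evaluate.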