Done, your profile is created.Finish your profile by filling in the following fields
Forgot Password Earn Money,Free Notes
Password sent to your Email Id, Please Check your Mail
Updating Cart........ Please Wait........
Optimization of Distributed Queries
Optimization of Distributed Queries
Chapter 7: Optimization of Distributed
• Basic Concepts
• Distributed Cost Model
• Database Statistics
• Joins and Semijoins
• Query Optimization Algorithms
Acknowledgements: I am indebted to Arturas Mazeika for providing me his slides of this course.
DDB 2008/09 J. Gamper Page 1Basic Concepts
• Query optimization: Process of producing an op-
timal (close to optimal) query execution plan which
represents an execution strategy for the query
– The main task in query optimization is to con-
sider different orderings of the operations
• Centralized query optimization:
– Find (the best) query execution plan in the
space of equivalent query trees
– Minimize an objective cost function
– Gather statistics about relations
• Distributed query optimization brings additional issues
– Linear query trees are not necessarily a good choice
– Bushy query trees are not necessarily a bad choice
– What and where to ship the relations
– How to ship relations (ship as a whole, ship as needed)
– When to use semi-joins instead of joins
DDB 2008/09 J. Gamper Page 2Basic Concepts . . .
• Search space: The set of alternative query execution plans (query trees)
– Typically very large
– The main issue is to optimize the joins
– ForN relations, there areO(N) equivalent join trees that can be obtained by
applying commutativity and associativity rules
• Example: 3 equivalent query trees (join trees) of the joins in the following query
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO
DDB 2008/09 J. Gamper Page 3Basic Concepts . . .
• Reduction of the search space
– Restrict by means of heuristics
∗ Perform unary operations before binary operations, etc
– Restrict the shape of the join tree
∗ Consider the type of trees (linear trees, vs. bushy ones)
Linear Join Tree Bushy Join Tree
DDB 2008/09 J. Gamper Page 4Basic Concepts . . .
• There are two main strategies to scan the search space
• Deterministic scan of the search space
– Start from base relations and build plans by adding one relation at each step
– Breadth-ﬁrst strategy: build all possible plans before choosing the “best” plan
(dynamic programming approach)
– Depth-ﬁrst strategy: build only one plan (greedy approach)
DDB 2008/09 J. Gamper Page 5Basic Concepts . . .
• Randomized scan of the search space
– Search for optimal solutions around a particular starting point
– e.g., iterative improvement or simulated annealing techniques
– Trades optimization time for execution time
∗ Does not guarantee that the best solution is obtained, but avoid the high cost of
– The strategy is better when more than 5-6 relations are involved
DDB 2008/09 J. Gamper Page 6Distributed Cost Model
• Two different types of cost functions can be used
– Reduce total time
∗ Reduce each cost component (in terms of time) individually, i.e., do as little for each
cost component as possible
∗ Optimize the utilization of the resources (i.e., increase system throughput)
– Reduce response time
∗ Do as many things in parallel as possible
∗ May increase total time because of increased total activity
DDB 2008/09 J. Gamper Page 7Distributed Cost Model . . .
• Total time: Sum of the time of all individual components
– Local processing time: CPU time + I/O time
– Communication time: ﬁxed time to initiate a message + time to transmit the data
Total time =T ∗instructions +T ∗I/Os +
T ∗messages +T ∗bytes
• The individual components of the total cost have different weights:
– Wide area network
∗ Message initiation and transmission costs are high
∗ Local processing cost is low (fast mainframes or minicomputers)
∗ Ratio of communication to I/O costs is 20:1
– Local area networks
∗ Communication and local processing costs are more or less equal
∗ Ratio of communication to I/O costs is 1:1.6 (10MB/s network)
DDB 2008/09 J. Gamper Page 8Distributed Cost Model . . .
• Response time: Elapsed time between the initiation and the completion of a query
Response time =T ∗seq instructions +T ∗seq I/Os +
T ∗seq messages +T ∗seq bytes
– whereseq x (x in instructions, I/O, messages, bytes) is the maximum number of
x which must be done sequentially.
• Any processing and communication done in parallel is ignored
DDB 2008/09 J. Gamper Page 9Distributed Cost Model . . .
• Example: Query at site 3 with data from sites 1 and 2.
– Assume that only the communication cost is considered
– Total time =T ∗2+T ∗(x+y)
– Response time = maxT +T ∗x,T +T ∗y
MSG TR MSG TR
DDB 2008/09 J. Gamper Page 10Database Statistics
• The primary cost factor is the size of intermediate relations
– that are produced during the execution and
– must be transmitted over the network, if a subsequent operation is located on a
• It is costly to compute the size of the intermediate relations precisely.
• Instead global statistics of relations and fragments are computed and used to
DDB 2008/09 J. Gamper Page 11Database Statistics . . .
• LetR(A ,A ,...,A ) be a relation fragmented intoR ,R ,...,R .
1 2 1 2 r
• Relation statistics
– min and max values of each attribute: minA,maxA.
– length of each attribute:length(A )
– number of distinct values in each fragment (cardinality):card(A ),
• Fragment statistics
– cardinality of the fragment:card(R )
– cardinality of each attribute of each fragment:card(Π (R ))
DDB 2008/09 J. Gamper Page 12Database Statistics . . .
• Selectivity factor of an operation: the proportion of tuples of an operand relation that
participate in the result of that operation
• Assumption: independent attributes and uniform distribution of attribute values
• Selectivity factor of selection
SF (A =value) =
SF (Avalue) =
SF (Avalue) =
• Properties of the selectivity factor of the selection
SF (p(A )∧p(A )) =SF (p(A ))∗SF (p(A ))
σ i j σ i σ j
SF (p(A )∨p(A )) =SF (p(A ))+SF (p(A ))−(SF (p(A ))∗SF (p(A ))
σ i j σ i σ j σ i σ j
SF (A∈values) =SF (A =value)∗card(values)
DDB 2008/09 J. Gamper Page 13Database Statistics . . .
• Cardinality of intermediate results
card(σ (R)) =SF (P)∗card(R)
∗ More difﬁcult: duplicates, correlations between projected attributes are unknown
∗ Simple if the projected attribute is a key
card(Π (R)) =card(R)
– Cartesian Product
∗ upper bound:card(R∪S)≤card(R)+card(S)
∗ lower bound:card(R∪S)≥ maxcard(R),card(S)
– Set Difference
∗ upper bound:card(R−S) =card(R)
∗ lower bound: 0
DDB 2008/09 J. Gamper Page 14Database Statistics . . .
• Selectivity factor for joins
• Cardinality of joins
– Upper bound: cardinality of Cartesian Product
– General case (if SF is given):
card(R⋊⋉S) =SF ∗card(R)∗card(S)
– Special case:R.A is a key ofR andS.A is a foreign key ofS;
∗ eachS-tuple matches with at most one tuple ofR
card(R⋊⋉ S) =card(S)
DDB 2008/09 J. Gamper Page 15Database Statistics . . .
• Selectivity factor for semijoins: fraction of R-tuples that join with S-tuples
– An approximation is the selectivity ofA inS
SF (R⊲ S) =SF (S.A) =
⊲ A ⊲
• Cardinality of semijoin (general case):
card(R⊲ S) =SF (S.A)∗card(R)
• Example:R.A is a foreign key inS (S.A is a primary key)
ThenSF = 1 and the result size corresponds to the size ofR
DDB 2008/09 J. Gamper Page 16Join Ordering in Fragment Queries
• Join ordering is an important aspect in centralized DBMS, and it is even more
important in a DDBMS since joins between fragments that are stored at different sites
may increase the communication time.
• Two approaches exist:
– Optimize the ordering of joins directly
∗ INGRES and distributed INGRES
∗ SystemR and SystemR
– Replace joins by combinations of semijoins in order to minimize the communication
∗ Hill Climbing and SDD-1
DDB 2008/09 J. Gamper Page 17Join Ordering in Fragment Queries . . .
• Direct join odering of two relation/fragments located at different sites
– Move the smaller relation to the other site
– We have to estimate the size ofR andS
DDB 2008/09 J. Gamper Page 18Join Ordering in Fragment Queries . . .
• Direct join ordering of queries involving more than two relations is substantially more
• Example: Consider the following query and the respective join graph, where we make
also assumptions about the locations of the three relations/fragments
PROJ⋊⋉ ASG⋊⋉ EMP
DDB 2008/09 J. Gamper Page 19Join Ordering in Fragment Queries . . .
• Example (contd.): The query can be evaluated in at least 5 different ways.
– Plan 1: EMP→Site 2
Site 2: EMP’=EMP⋊⋉ASG
Site 3: EMP’⋊⋉PROJ
– Plan 2: ASG→Site 1
Site 1: EMP’=EMP⋊⋉ASG
– Plan 4: PROJ→Site 2
Site 2: PROJ’=PROJ⋊⋉ASG
Site 3: EMP’⋊⋉PROJ
Site 1: PROJ’⋊⋉EMP
– Plan 3: ASG→Site 3
Site 3: ASG’=ASG⋊⋉PROJ
– Plan 5: EMP→Site 2
Site 1: ASG’⋊⋉EMP
Site 2: EMP⋊⋉PROJ⋊⋉ASG
• To select a plan, a lot of information is needed, including
– size(EMP),size(ASG),size(PROJ),size(EMP ⋊⋉ASG),
– Possibilities of parallel execution if response time is used
DDB 2008/09 J. Gamper Page 20