



Optimization of Distributed Queries

Chapter 7: Optimization of Distributed Queries
• Basic Concepts
• Distributed Cost Model
• Database Statistics
• Joins and Semijoins
• Query Optimization Algorithms

Acknowledgements: I am indebted to Arturas Mazeika for providing me with his slides for this course.

DDB 2008/09, J. Gamper

Basic Concepts
• Query optimization: the process of producing an optimal (or close to optimal) query execution plan, which represents an execution strategy for the query
  – The main task in query optimization is to consider different orderings of the operations
• Centralized query optimization:
  – Find the (best) query execution plan in the space of equivalent query trees
  – Minimize an objective cost function
  – Gather statistics about relations
• Distributed query optimization brings additional issues:
  – Linear query trees are not necessarily a good choice
  – Bushy query trees are not necessarily a bad choice
  – What and where to ship the relations
  – How to ship relations (ship as a whole, or ship as needed)
  – When to use semijoins instead of joins

Basic Concepts . . .
• Search space: the set of alternative query execution plans (query trees)
  – Typically very large
  – The main issue is to optimize the joins
  – For N relations, there are O(N!) equivalent join trees that can be obtained by applying commutativity and associativity rules
• Example: 3 equivalent query trees (join trees) of the joins in the following query

      SELECT ENAME, RESP
      FROM   EMP, ASG, PROJ
      WHERE  EMP.ENO = ASG.ENO
      AND    ASG.PNO = PROJ.PNO

Basic Concepts . . .
• Reduction of the search space
  – Restrict by means of heuristics
    ∗ Perform unary operations before binary operations, etc.
  – Restrict the shape of the join tree
    ∗ Consider the type of trees (linear trees vs. bushy ones)

  [Figures: Linear Join Tree, Bushy Join Tree]

Basic Concepts . . .
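To make the size of the join-order search space concrete, here is a minimal sketch (relation names are arbitrary) that counts the linear join orders, one per permutation of the relations; bushy tree shapes add even more alternatives on top of this factorial growth.

```python
from itertools import permutations

def count_join_orders(relations):
    """Count linear join orders: one per permutation of the relations.

    This is N! for N relations; bushy trees multiply the count further,
    which is why optimizers restrict the search space with heuristics."""
    return sum(1 for _ in permutations(relations))

# The 3-relation query EMP ⋈ ASG ⋈ PROJ already has 3! = 6 linear orders.
print(count_join_orders(["EMP", "ASG", "PROJ"]))          # 6
print(count_join_orders(["R%d" % i for i in range(8)]))   # 40320 for 8 relations
```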
• There are two main strategies to scan the search space
  – Deterministic
  – Randomized
• Deterministic scan of the search space
  – Start from base relations and build plans by adding one relation at each step
  – Breadth-first strategy: build all possible plans before choosing the “best” plan (dynamic programming approach)
  – Depth-first strategy: build only one plan (greedy approach)

Basic Concepts . . .
• Randomized scan of the search space
  – Search for optimal solutions around a particular starting point
  – e.g., iterative improvement or simulated annealing techniques
  – Trades optimization time for execution time
    ∗ Does not guarantee that the best solution is obtained, but avoids the high cost of optimization
  – This strategy is better when more than 5–6 relations are involved

Distributed Cost Model
• Two different types of cost functions can be used
  – Reduce total time
    ∗ Reduce each cost component (in terms of time) individually, i.e., do as little as possible for each cost component
    ∗ Optimizes the utilization of the resources (i.e., increases system throughput)
  – Reduce response time
    ∗ Do as many things in parallel as possible
    ∗ May increase total time because of increased total activity

Distributed Cost Model . . .
• Total time: sum of the time of all individual components
  – Local processing time: CPU time + I/O time
  – Communication time: fixed time to initiate a message + time to transmit the data

      Total time = T_CPU * #instructions + T_I/O * #I/Os
                 + T_MSG * #messages + T_TR * #bytes

• The individual components of the total cost have different weights:
  – Wide area network
    ∗ Message initiation and transmission costs are high
    ∗ Local processing cost is low (fast mainframes or minicomputers)
    ∗ Ratio of communication to I/O costs is 20:1
  – Local area networks
    ∗ Communication and local processing costs are more or less equal
    ∗ Ratio of communication to I/O costs is 1:1.6 (10 MB/s network)
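The two cost functions can be contrasted in a short sketch. The parameter values below (T_MSG, T_TR, and the byte counts x and y received from two sites) are made-up placeholders, not figures from the course; the formulas are the ones above.

```python
def total_time(n_instr, n_ios, n_msgs, n_bytes, t_cpu, t_io, t_msg, t_tr):
    """Total time = T_CPU * #instructions + T_I/O * #I/Os
                  + T_MSG * #messages + T_TR * #bytes (all work is summed)."""
    return t_cpu * n_instr + t_io * n_ios + t_msg * n_msgs + t_tr * n_bytes

def response_time(seq_instr, seq_ios, seq_msgs, seq_bytes,
                  t_cpu, t_io, t_msg, t_tr):
    """Same formula, but applied only to the sequential (non-parallelizable)
    counts; anything done in parallel is ignored."""
    return (t_cpu * seq_instr + t_io * seq_ios
            + t_msg * seq_msgs + t_tr * seq_bytes)

# Communication-only example: the query site receives x bytes from site 1
# and y bytes from site 2; the two transfers can run in parallel.
t_msg, t_tr = 1.0, 0.01
x, y = 10_000, 20_000
total = total_time(0, 0, 2, x + y, 0.0, 0.0, t_msg, t_tr)
resp = max(response_time(0, 0, 1, x, 0.0, 0.0, t_msg, t_tr),
           response_time(0, 0, 1, y, 0.0, 0.0, t_msg, t_tr))
print(total, resp)   # total ≈ 302.0, response ≈ 201.0
```

Minimizing response time thus favors parallel transfers even though total time (and system load) may grow.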
Distributed Cost Model . . .
• Response time: elapsed time between the initiation and the completion of a query

      Response time = T_CPU * seq#instructions + T_I/O * seq#I/Os
                    + T_MSG * seq#messages + T_TR * seq#bytes

  – where seq#x (x in instructions, I/Os, messages, bytes) is the maximum number of x which must be done sequentially
• Any processing and communication done in parallel is ignored

Distributed Cost Model . . .
• Example: query at site 3 with data from sites 1 and 2 (x bytes from site 1, y bytes from site 2)
  – Assume that only the communication cost is considered
  – Total time = 2 * T_MSG + T_TR * (x + y)
  – Response time = max{T_MSG + T_TR * x, T_MSG + T_TR * y}

Database Statistics
• The primary cost factor is the size of intermediate relations
  – that are produced during the execution and
  – must be transmitted over the network, if a subsequent operation is located on a different site
• It is costly to compute the size of the intermediate relations precisely.
• Instead, global statistics of relations and fragments are computed and used to provide approximations

Database Statistics . . .
• Let R(A_1, A_2, ..., A_k) be a relation fragmented into R_1, R_2, ..., R_r
• Relation statistics
  – min and max values of each attribute: min(A_i), max(A_i)
  – length of each attribute: length(A_i)
  – number of distinct values of each attribute (cardinality): card(A_i), and the cardinality of its domain: card(dom(A_i))
• Fragment statistics
  – cardinality of the fragment: card(R_j)
  – cardinality of each attribute of each fragment: card(Π_{A_i}(R_j))

Database Statistics . . .
• Selectivity factor of an operation: the proportion of tuples of an operand relation that participate in the result of that operation
• Assumption: independent attributes and uniform distribution of attribute values
• Selectivity factor of selection

      SF_σ(A = value) = 1 / card(Π_A(R))
      SF_σ(A > value) = (max(A) − value) / (max(A) − min(A))
      SF_σ(A < value) = (value − min(A)) / (max(A) − min(A))

• Properties of the selectivity factor of the selection

      SF_σ(p(A_i) ∧ p(A_j)) = SF_σ(p(A_i)) * SF_σ(p(A_j))
      SF_σ(p(A_i) ∨ p(A_j)) = SF_σ(p(A_i)) + SF_σ(p(A_j)) − SF_σ(p(A_i)) * SF_σ(p(A_j))
      SF_σ(A ∈ values)      = SF_σ(A = value) * card(values)

Database Statistics . . .
• Cardinality of intermediate results
  – Selection: card(σ_P(R)) = SF_σ(P) * card(R)
  – Projection
    ∗ More difficult: duplicates and correlations between projected attributes are unknown
    ∗ Simple if the projected attribute is a key: card(Π_A(R)) = card(R)
  – Cartesian product: card(R × S) = card(R) * card(S)
  – Union
    ∗ upper bound: card(R ∪ S) ≤ card(R) + card(S)
    ∗ lower bound: card(R ∪ S) ≥ max{card(R), card(S)}
  – Set difference
    ∗ upper bound: card(R − S) ≤ card(R)
    ∗ lower bound: 0

Database Statistics . . .
• Selectivity factor for joins

      SF_⋈ = card(R ⋈ S) / (card(R) * card(S))

• Cardinality of joins
  – Upper bound: cardinality of the Cartesian product: card(R ⋈ S) ≤ card(R) * card(S)
  – General case (if SF_⋈ is given): card(R ⋈ S) = SF_⋈ * card(R) * card(S)
  – Special case: R.A is a key of R and S.A is a foreign key of S
    ∗ each S-tuple matches with at most one tuple of R: card(R ⋈_{R.A=S.A} S) = card(S)

Database Statistics . . .
• Selectivity factor for semijoins: the fraction of R-tuples that join with S-tuples
  – An approximation is the selectivity of A in S:

      SF_⋉(R ⋉_A S) = SF_⋉(S.A) = card(Π_A(S)) / card(dom[A])

• Cardinality of semijoin (general case): card(R ⋉_A S) = SF_⋉(S.A) * card(R)
• Example: R.A is a foreign key in S (S.A is a primary key)
  – Then SF_⋉ = 1, and the result size corresponds to the size of R
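The selectivity and cardinality formulas above translate directly into a small estimator. The statistics used at the bottom (attribute range, distinct counts, relation cardinalities) are invented values for illustration, not from the course.

```python
def sf_eq(distinct):          # SF_σ(A = value) = 1 / card(Π_A(R))
    return 1.0 / distinct

def sf_gt(value, lo, hi):     # SF_σ(A > value), uniform distribution assumed
    return (hi - value) / (hi - lo)

def sf_lt(value, lo, hi):     # SF_σ(A < value)
    return (value - lo) / (hi - lo)

def sf_and(s1, s2):           # conjunction of independent predicates
    return s1 * s2

def sf_or(s1, s2):            # disjunction (inclusion-exclusion)
    return s1 + s2 - s1 * s2

def card_select(sf, card_r):  # card(σ_P(R)) = SF_σ(P) * card(R)
    return sf * card_r

def card_join(sf_join, card_r, card_s):
    # general case: card(R ⋈ S) = SF_⋈ * card(R) * card(S)
    return sf_join * card_r * card_s

def card_semijoin(card_proj_a_s, card_dom_a, card_r):
    # card(R ⋉_A S) = (card(Π_A(S)) / card(dom[A])) * card(R)
    return (card_proj_a_s / card_dom_a) * card_r

# Invented statistics: an attribute ranging over [0, 48] with 12 distinct
# values, in a relation of 10000 tuples; estimate a conjunction of a range
# predicate and an equality predicate.
sel = sf_and(sf_gt(36, 0, 48), sf_eq(12))
est = card_select(sel, 10_000)
print(est)   # 10000 * 0.25 * (1/12) ≈ 208.3
```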
Join Ordering in Fragment Queries
• Join ordering is an important aspect in centralized DBMSs, and it is even more important in a DDBMS, since joins between fragments that are stored at different sites may increase the communication time.
• Two approaches exist:
  – Optimize the ordering of joins directly
    ∗ INGRES and distributed INGRES
    ∗ System R and System R*
  – Replace joins by combinations of semijoins in order to minimize the communication costs
    ∗ Hill-Climbing and SDD-1

Join Ordering in Fragment Queries . . .
• Direct join ordering of two relations/fragments located at different sites
  – Move the smaller relation to the other site
  – We have to estimate the sizes of R and S

Join Ordering in Fragment Queries . . .
• Direct join ordering of queries involving more than two relations is substantially more complex
• Example: Consider the following query and the respective join graph, where we also make assumptions about the locations of the three relations/fragments (EMP at site 1, ASG at site 2, PROJ at site 3)

      PROJ ⋈_PNO ASG ⋈_ENO EMP

Join Ordering in Fragment Queries . . .
• Example (contd.): The query can be evaluated in at least 5 different ways.
  – Plan 1: EMP → Site 2; Site 2: EMP’ = EMP ⋈ ASG; EMP’ → Site 3; Site 3: EMP’ ⋈ PROJ
  – Plan 2: ASG → Site 1; Site 1: EMP’ = EMP ⋈ ASG; EMP’ → Site 3; Site 3: EMP’ ⋈ PROJ
  – Plan 3: ASG → Site 3; Site 3: ASG’ = ASG ⋈ PROJ; ASG’ → Site 1; Site 1: ASG’ ⋈ EMP
  – Plan 4: PROJ → Site 2; Site 2: PROJ’ = PROJ ⋈ ASG; PROJ’ → Site 1; Site 1: PROJ’ ⋈ EMP
  – Plan 5: EMP → Site 2; PROJ → Site 2; Site 2: EMP ⋈ PROJ ⋈ ASG
• To select a plan, a lot of information is needed, including
  – size(EMP), size(ASG), size(PROJ), size(EMP ⋈ ASG), size(ASG ⋈ PROJ)
  – Possibilities of parallel execution, if response time is used
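Given estimated sizes for the base relations and the intermediate results, the communication cost of the five plans above can be compared mechanically. The sizes below are illustrative placeholders (not from the course), and local processing cost is ignored, as in the communication-only cost model.

```python
# Assumed sizes (in transfer-cost units) for base relations and
# intermediate join results; purely illustrative numbers.
size = {
    "EMP": 400, "ASG": 1000, "PROJ": 200,
    "EMP⋈ASG": 600, "PROJ⋈ASG": 300,
}

# Each plan is modeled as the list of relations it ships between sites.
plans = {
    "Plan 1 (EMP→2, EMP'→3)":   ["EMP", "EMP⋈ASG"],
    "Plan 2 (ASG→1, EMP'→3)":   ["ASG", "EMP⋈ASG"],
    "Plan 3 (ASG→3, ASG'→1)":   ["ASG", "PROJ⋈ASG"],
    "Plan 4 (PROJ→2, PROJ'→1)": ["PROJ", "PROJ⋈ASG"],
    "Plan 5 (EMP→2, PROJ→2)":   ["EMP", "PROJ"],
}

costs = {name: sum(size[t] for t in transfers)
         for name, transfers in plans.items()}
for name, c in costs.items():
    print(name, "cost =", c)
print("cheapest:", min(costs, key=costs.get))   # Plan 4 with these sizes
```

With different size estimates a different plan wins, which is exactly why the optimizer needs the statistics listed above.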
Semijoin Based Algorithms
• Semijoins can be used to efficiently implement joins
  – The semijoin acts as a size reducer (similar to a selection), such that smaller relations need to be transferred
• Consider two relations: R located at site 1 and S located at site 2
  – Solution with semijoins: replace one or both operand relations/fragments by a semijoin, using the following rules:

      R ⋈_A S ⇐⇒ (R ⋉_A S) ⋈_A S
              ⇐⇒ R ⋈_A (S ⋉_A R)
              ⇐⇒ (R ⋉_A S) ⋈_A (S ⋉_A R)

• The semijoin is beneficial if the cost to produce and send it to the other site is less than the cost of sending the whole operand relation and of doing the actual join.

Semijoin Based Algorithms . . .
• Cost analysis of R ⋈_A S vs. (R ⋉_A S) ⋈ S, assuming that size(R) < size(S)
  – Perform the join R ⋈ S:
    ∗ R → Site 2
    ∗ Site 2 computes R ⋈ S
  – Perform the semijoin (R ⋉ S) ⋈ S:
    ∗ S′ = Π_A(S)
    ∗ S′ → Site 1
    ∗ Site 1 computes R′ = R ⋉ S
    ∗ R′ → Site 2
    ∗ Site 2 computes R′ ⋈ S
  – The semijoin is better if: size(Π_A(S)) + size(R ⋉ S) < size(R)
• The semijoin approach is better if the semijoin acts as a sufficient reducer (i.e., only a few tuples of R participate in the join)
• The join approach is better if almost all tuples of R participate in the join

INGRES Algorithm
• INGRES uses a dynamic query optimization algorithm that recursively breaks a query into smaller pieces. It is based on the following ideas:
  – An n-relation query q is decomposed into n subqueries q_1 → q_2 → ··· → q_n
    ∗ Each q_i is a mono-relation (mono-variable) query
    ∗ The output of q_i is consumed by q_{i+1}
  – For the decomposition, two basic techniques are used: detachment and substitution
  – There is a processor that can efficiently process mono-relation queries
    ∗ It optimizes each query independently for the access to a single relation

INGRES Algorithm . . .
• Detachment: break a query q into q′ → q′′, based on a common relation that is the result of q′, i.e.,
  – The query q:

        SELECT R_2.A_2, ..., R_n.A_n
        FROM   R_1, R_2, ..., R_n
        WHERE  P_1(R_1.A_1)
        AND    P_2(R_1.A_1, ..., R_n.A_n)

  – is decomposed by detachment of the common relation R_1 into

        q′:  SELECT R_1.A_1 INTO R_1′
             FROM   R_1
             WHERE  P_1(R_1.A_1)

        q′′: SELECT R_2.A_2, ..., R_n.A_n
             FROM   R_1′, R_2, ..., R_n
             WHERE  P_2(R_1′.A_1, ..., R_n.A_n)

• Detachment reduces the size of the relation on which the query q′′ is defined.

INGRES Algorithm . . .
• Example: Consider query q_1: “Names of employees working on the CAD/CAM project”

      q_1: SELECT EMP.ENAME
           FROM   EMP, ASG, PROJ
           WHERE  EMP.ENO = ASG.ENO
           AND    ASG.PNO = PROJ.PNO
           AND    PROJ.PNAME = "CAD/CAM"

• Decompose q_1 into q_11 → q′:

      q_11: SELECT PROJ.PNO INTO JVAR
            FROM   PROJ
            WHERE  PROJ.PNAME = "CAD/CAM"

      q′:   SELECT EMP.ENAME
            FROM   EMP, ASG, JVAR
            WHERE  EMP.ENO = ASG.ENO
            AND    ASG.PNO = JVAR.PNO

INGRES Algorithm . . .
• Example (contd.): Successive detachments may transform q′ into q_12 → q_13:

      q_12: SELECT ASG.ENO INTO GVAR
            FROM   ASG, JVAR
            WHERE  ASG.PNO = JVAR.PNO

      q_13: SELECT EMP.ENAME
            FROM   EMP, GVAR
            WHERE  EMP.ENO = GVAR.ENO

• q_1 is now decomposed by detachment into q_11 → q_12 → q_13
• q_11 is a mono-relation query
• q_12 and q_13 are multi-relation queries, which cannot be further detached
  – they are also called irreducible

INGRES Algorithm . . .
• Tuple substitution allows converting an irreducible query q into mono-relation queries:
  – Choose a relation R_1 in q for tuple substitution
  – For each tuple in R_1, replace the R_1-attributes referred to in q by their actual values, thereby generating a set of subqueries q′ with n − 1 relations, i.e.,
    q(R_1, R_2, ..., R_n) is replaced by q′(t_1i, R_2, ..., R_n), for each t_1i ∈ R_1
• Example (contd.): Assume GVAR consists only of the tuples E1 and E2.
Then q_13 is rewritten with tuple substitution in the following way:

      q_13:  SELECT EMP.ENAME
             FROM   EMP, GVAR
             WHERE  EMP.ENO = GVAR.ENO

      q_131: SELECT EMP.ENAME
             FROM   EMP
             WHERE  EMP.ENO = "E1"

      q_132: SELECT EMP.ENAME
             FROM   EMP
             WHERE  EMP.ENO = "E2"

  – q_131 and q_132 are mono-relation queries

Distributed INGRES Algorithm
• The distributed INGRES query optimization algorithm is very similar to the centralized INGRES algorithm.
  – In addition to the centralized INGRES, the distributed one breaks up each query q_i into subqueries that operate on fragments; only horizontal fragmentation is handled.
  – Optimization is done with respect to a combination of communication cost and response time.

System R Algorithm
• The System R (centralized) query optimization algorithm
  – Performs static query optimization based on an “exhaustive search” of the solution space and a cost function (I/O cost + CPU cost)
    ∗ Input: relational algebra tree
    ∗ Output: optimal relational algebra tree
    ∗ A dynamic programming technique is applied to reduce the number of alternative plans
  – The optimization algorithm consists of two steps:
    1. Predict the best access method for each individual relation (mono-relation query)
       ∗ Consider using index, file scan, etc.
    2. For each relation R, estimate the best join ordering
       ∗ R is first accessed using its best single-relation access method
       ∗ Efficient access to the inner relation is crucial
  – Considers two different join strategies
    ∗ (Indexed) nested loop join
    ∗ Sort-merge join

System R Algorithm . . .
• Example: Consider query q_1: “Names of employees working on the CAD/CAM project”

      PROJ ⋈_PNO ASG ⋈_ENO EMP

  – Join graph
  – Indexes
    ∗ EMP has an index on ENO
    ∗ ASG has an index on PNO
    ∗ PROJ has an index on PNO and an index on PNAME

System R Algorithm . . .
• Example (contd.): Step 1 – Select the best single-relation access paths
  – EMP: sequential scan (because there is no selection on EMP)
  – ASG: sequential scan (because there is no selection on ASG)
  – PROJ: index on PNAME (because there is a selection on PROJ based on PNAME)

System R Algorithm . . .
• Example (contd.): Step 2 – Select the best join ordering for each relation
  – (EMP × PROJ) and (PROJ × EMP) are pruned because they are Cartesian products
  – (ASG × PROJ) is pruned because we assume it has higher cost than (PROJ × ASG); similarly for (PROJ × EMP)
  – Best total join order: ((PROJ ⋈ ASG) ⋈ EMP), since it uses the indexes best
    ∗ Select PROJ using the index on PNAME
    ∗ Join with ASG using the index on PNO
    ∗ Join with EMP using the index on ENO

Distributed System R* Algorithm
• The System R* query optimization algorithm is an extension of the System R query optimization algorithm with the following main characteristics:
  – Only whole relations can be distributed, i.e., fragmentation and replication are not considered
  – Query compilation is a distributed task, coordinated by a master site, where the query is initiated
  – The master site makes all intersite decisions, e.g., selection of the execution sites, join ordering, method of data transfer, ...
  – The local sites do the intrasite (local) optimizations, e.g., local joins, access paths
• Join ordering and data transfer between different sites are the most critical issues to be considered by the master site

Distributed System R* Algorithm . . .
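Before moving to the distributed variant, the centralized dynamic-programming step of System R can be sketched as a search over subsets of relations, keeping the cheapest plan per subset. The cardinalities and join selectivities below are made-up stand-ins for the real I/O+CPU cost model, and only linear (left-deep) trees are enumerated.

```python
from itertools import combinations

# Made-up cardinalities and pairwise join selectivities (no EMP-PROJ predicate).
card = {"EMP": 400, "ASG": 1000, "PROJ": 200}
sf = {frozenset(["EMP", "ASG"]): 0.0025,
      frozenset(["ASG", "PROJ"]): 0.005}

def join_card(rels):
    """Estimated cardinality of joining a set of relations: product of
    cardinalities times every applicable join selectivity (Cartesian
    product if no predicate applies)."""
    c = 1.0
    for r in rels:
        c *= card[r]
    for pair, s in sf.items():
        if pair <= rels:
            c *= s
    return c

def best_plan(relations):
    """Selinger-style DP: for each subset, keep the cheapest (cost, order),
    where cost is the sum of intermediate-result cardinalities."""
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for r in subset:                       # r = last relation joined in
                sub_cost, sub_order = best[subset - {r}]
                cost = sub_cost + join_card(subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, sub_order + (r,))
    return best[frozenset(relations)]

cost, order = best_plan(["EMP", "ASG", "PROJ"])
print(order, cost)   # avoids starting with the Cartesian pair EMP-PROJ
```

The DP table has one entry per subset (2^N of them), which is far smaller than the N! orders it implicitly compares.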
• Two methods for intersite data transfer
  – Ship whole: the entire relation is shipped to the join site and stored in a temporary relation
    ∗ Larger data transfer
    ∗ Smaller number of messages
    ∗ Better if relations are small
  – Fetch as needed: the external relation is sequentially scanned, and for each tuple the join value is sent to the site of the inner relation, and the matching inner tuples are sent back (i.e., a semijoin)
    ∗ Number of messages = O(cardinality of the outer relation)
    ∗ Data transfer per message is minimal
    ∗ Better if relations are large and the selectivity is good

Distributed System R* Algorithm . . .
• Four main join strategies for R ⋈ S:
  – R is the outer relation
  – S is the inner relation
• Notation:
  – LT denotes local processing time
  – CT denotes communication time
  – s denotes the average number of S-tuples that match an R-tuple
• Strategy 1: Ship the entire outer relation to the site of the inner relation, i.e.,
  – Retrieve outer tuples
  – Send them to the inner relation site
  – Join them as they arrive

      Total cost = LT(retrieve card(R) tuples from R)
                 + CT(size(R))
                 + LT(retrieve s tuples from S) * card(R)

Distributed System R* Algorithm . . .
• Strategy 2: Ship the entire inner relation to the site of the outer relation. We cannot join the tuples as they arrive; they need to be stored.
  – The inner relation S needs to be stored in a temporary relation T

      Total cost = LT(retrieve card(S) tuples from S)
                 + CT(size(S))
                 + LT(store card(S) tuples in T)
                 + LT(retrieve card(R) tuples from R)
                 + LT(retrieve s tuples from T) * card(R)

Distributed System R* Algorithm . . .
• Strategy 3: Fetch tuples of the inner relation as needed for each tuple of the outer relation.
  – For each R-tuple, the join attribute A is sent to the site of S
  – The s matching S-tuples are retrieved and sent to the site of R

      Total cost = LT(retrieve card(R) tuples from R)
                 + CT(length(A)) * card(R)
                 + LT(retrieve s tuples from S) * card(R)
                 + CT(s * length(S)) * card(R)
Distributed System R* Algorithm . . .
• Strategy 4: Move both relations to a third site and compute the join there.
  – The inner relation S is first moved to the third site and stored in a temporary relation T.
  – Then the outer relation is moved to the third site, and its tuples are joined as they arrive.

      Total cost = LT(retrieve card(S) tuples from S)
                 + CT(size(S))
                 + LT(store card(S) tuples in T)
                 + LT(retrieve card(R) tuples from R)
                 + CT(size(R))
                 + LT(retrieve s tuples from T) * card(R)

Hill-Climbing Algorithm
• Hill-Climbing query optimization algorithm
  – Refinements of an initial feasible solution are recursively computed until no more cost improvements can be made
  – Semijoins, data replication, and fragmentation are not used
  – Devised for wide area point-to-point networks
  – The first distributed query processing algorithm

Hill-Climbing Algorithm . . .
• The Hill-Climbing algorithm proceeds as follows:
  1. Select an initial feasible execution strategy ES0
     – i.e., a global execution schedule that includes all intersite communication
     – Determine the candidate result sites, where a relation referenced in the query exists
     – Compute the cost of transferring all the other referenced relations to each candidate site
     – ES0 = candidate site with minimum cost
  2. Split ES0 into two strategies: ES1 followed by ES2
     – ES1: send one of the relations involved in the join to the other relation’s site
     – ES2: send the join result to the final result site
  3. Replace ES0 with the split schedule if
     cost(ES1) + cost(local join) + cost(ES2) < cost(ES0)
  4. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
  5. Check for redundant transmissions in the final plan and eliminate them

Hill-Climbing Algorithm . . .
• Example: What are the salaries of engineers who work on the CAD/CAM project?

      Π_SAL(PAY ⋈_TITLE EMP ⋈_ENO (ASG ⋈_PNO (σ_{PNAME=“CAD/CAM”}(PROJ))))

  – Schemas: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR), PROJ(PNO, PNAME, BUDGET, LOC), PAY(TITLE, SAL)
  – Statistics

        Relation  Size  Site
        EMP       8     1
        PAY       4     2
        PROJ      1     3
        ASG       10    4

  – Assumptions:
    ∗ The size of a relation is defined as its cardinality
    ∗ Minimize total cost
    ∗ The transmission cost between two sites is 1 (per size unit)
    ∗ Ignore local processing cost
    ∗ size(EMP ⋈ PAY) = 8, size(PROJ ⋈ ASG) = 2, size(ASG ⋈ EMP) = 10

Hill-Climbing Algorithm . . .
• Example (contd.): Determine the initial feasible execution strategy
  – Alternative 1: Result site is site 1
    Total cost = cost(PAY → Site 1) + cost(ASG → Site 1) + cost(PROJ → Site 1) = 4 + 10 + 1 = 15
  – Alternative 2: Result site is site 2
    Total cost = 8 + 10 + 1 = 19
  – Alternative 3: Result site is site 3
    Total cost = 8 + 4 + 10 = 22
  – Alternative 4: Result site is site 4
    Total cost = 8 + 4 + 1 = 13
  – Therefore: ES0 = EMP → Site 4; PAY → Site 4; PROJ → Site 4

Hill-Climbing Algorithm . . .
• Example (contd.): Candidate split
  – Alternative 1: ES1, ES2, ES3
    ∗ ES1: EMP → Site 2
    ∗ ES2: (EMP ⋈ PAY) → Site 4
    ∗ ES3: PROJ → Site 4
    Total cost = cost(EMP → Site 2) + cost((EMP ⋈ PAY) → Site 4) + cost(PROJ → Site 4) = 8 + 8 + 1 = 17
  – Alternative 2: ES1, ES2, ES3
    ∗ ES1: PAY → Site 1
    ∗ ES2: (PAY ⋈ EMP) → Site 4
    ∗ ES3: PROJ → Site 4
    Total cost = cost(PAY → Site 1) + cost((PAY ⋈ EMP) → Site 4) + cost(PROJ → Site 4) = 4 + 8 + 1 = 13
• Neither alternative is better than ES0, so keep it (or take alternative 2, which has the same cost)

Hill-Climbing Algorithm . . .
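The choice of the initial feasible strategy ES0 in the example can be reproduced mechanically from the size/site table (transmission cost 1 per size unit, local processing ignored, as assumed in the example):

```python
# Relation sizes (= cardinalities) and home sites from the example.
size = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
site = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}

def initial_cost(result_site):
    """Cost of shipping every relation not already at result_site there."""
    return sum(size[r] for r in size if site[r] != result_site)

costs = {s: initial_cost(s) for s in (1, 2, 3, 4)}
print(costs)                                      # {1: 15, 2: 19, 3: 22, 4: 13}
best = min(costs, key=costs.get)
print("ES0: ship all other relations to site", best)   # site 4
```

The greedy split step then only ever compares refinements of this schedule, which is why the cheaper plan that routes PROJ ⋈ ASG through site 1 (total cost 5) is never examined.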
• Problems
  – The greedy algorithm determines an initial feasible solution and iteratively improves it
  – If there are local minima, it may not find the global minimum
  – An optimal schedule with a high initial cost will not be found, since it won’t be chosen as the initial feasible solution
• Example: A better schedule is
  – PROJ → Site 4
  – ASG’ = (PROJ ⋈ ASG) → Site 1
  – (ASG’ ⋈ EMP) → Site 2
  – Total cost = 1 + 2 + 2 = 5

SDD-1
• The SDD-1 query optimization algorithm improves the Hill-Climbing algorithm in a number of directions:
  – Semijoins are considered
  – More elaborate statistics are used
  – The initial plan is selected better
  – A post-optimization step is introduced

Conclusion
• Distributed query optimization is more complex than centralized query processing, since
  – bushy query trees are not necessarily a bad choice
  – one needs to decide what, where, and how to ship the relations between the sites
• Query optimization searches for the optimal query plan (tree)
• For N relations, there are O(N!) equivalent join trees. To cope with this complexity, heuristics and/or restricted types of trees are considered
• There are two main strategies in query optimization: deterministic and randomized
• (A few) semijoins can be used to implement a join. The semijoins require more operations to perform, but the amount of data transferred is reduced
• INGRES, System R, Hill-Climbing, and SDD-1 are distributed query optimization algorithms