Question? Leave a message!

Overview of Query Processing

Overview of Query Processing
Chapter 5: Overview of Query Processing • Query Processing Overview • Query Optimization • Distributed Query Processing Steps Acknowledgements: I am indebted to Arturas Mazeika for providing me his slides of this course. DDB 2008/09 J. Gamper Page 1Query Processing Overview • Query processing: A 3step process that transforms a highlevel query (of relational calculus/SQL) into an equivalent and more efficient lowerlevel query (of relational algebra). 1. Parsing and translation – Check syntax and verify relations. – Translate the query into an equivalent relational algebra expression. 2. Optimization – Generate an optimal evaluation plan (with lowest cost) for the query plan. 3. Evaluation – The queryexecution engine takes an (optimal) evaluation plan, executes that plan, and returns the answers to the query. DDB 2008/09 J. Gamper Page 2Query Processing . . . • The success of RDBMSs is due, in part, to the availability – of declarative query languages that allow to easily express complex queries without knowing about the details of the physical data organization and – of advanced query processing technology that transforms the highlevel user/application queries into efficient lowerlevel query execution strategies. • The query transformation should achieve both correctness and efficiency – The main difficulty is to achieve the efficiency – This is also one of the most important tasks of any DBMS • Distributed query processing: Transform a highlevel query (of relational calculus/SQL) on a distributed database (i.e., a set of global relations) into an equivalent and efficient lowerlevel query (of relational algebra) on relation fragments. • Distributed query processing is more complex – Fragmentation/replication of relations – Additional communication costs – Parallel execution DDB 2008/09 J. Gamper Page 3Query Processing Example • Example: Transformation of an SQLquery into an RAquery. Relations: EMP(ENO, ENAME, TITLE), ASG(ENO,PNO,RESP,DUR) Query: Find the names of employees who are managing a project – High level query SELECT ENAME FROM EMP,ASG WHERE EMP.ENO = ASG.ENO AND DUR 37 – Two possible transformations of the query are: ∗ Expression 1: Π (σ (EMP ×ASG)) ENAME DUR37∧EMP.ENO=ASG.ENO ∗ Expression 2: Π (EMP ⋊⋉ (σ (ASG))) ENAME ENO DUR37 – Expression 2 avoids the expensive and large intermediate Cartesian product, and therefore typically is better. DDB 2008/09 J. Gamper Page 4Query Processing Example . . . • We make the following assumptions about the data fragmentation – Data is (horizontally) fragmented: ∗ Site1:ASG1 =σ (ASG) ENO≤”E3” ∗ Site2:ASG2 =σ (ASG) ENO”E3” ∗ Site3:EMP1 =σ (EMP) ENO≤”E3” ∗ Site4:EMP2 =σ (EMP) ENO”E3” ∗ Site5: Result – Relations ASG and EMP are fragmented in the same way – Relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively DDB 2008/09 J. Gamper Page 5Query Processing Example . . . • Now consider the expressionΠ (EMP ⋊⋉ (σ (ASG))) ENAME ENO DUR37 • Strategy 1 (partially parallel execution): ′ – Produce ASG and move to Site 3 1 ′ – Produce ASG and move to Site 4 2 ′ – Join ASG with EMP at Site 3 and 1 1 move the result to Site 5 ′ – Join ASG with EMP at Site 4 and 2 2 move the result to Site 5 – Union the result in Site 5 • Strategy 2: – Move ASG and ASG to Site 5 1 2 – Move EMP and EMP to Site 5 1 2 – Select and join at Site 5 • For simplicity, the final projection is omitted. DDB 2008/09 J. Gamper Page 6Query Processing Example . . . • Calculate the cost of the two strategies under the following assumptions: – Tuples are uniformly distributed to the fragments; 20 tuples satisfyDUR37 – size(EMP) = 400, size(ASG) = 1000 – tuple access cost = 1 unit; tuple transfer cost = 10 units – ASG and EMP have a local index on DUR and ENO • Strategy 1 – Produce ASG’s: (10+10) tuple access cost 20 – Transfer ASG’s to the sites of EMPs: (10+10) tuple transfer cost 200 – Produce EMP’s: (10+10) tuple access cost 2 40 – Transfer EMP’s to result site: (10+10) tuple transfer cost 200 – Total cost 460 • Strategy 2 – Transfer EMP , EMP to site 5: 400 tuple transfer cost 4,000 1 2 – Transfer ASG , ASG to site 5: 1000 tuple transfer cost 10,000 1 2 – Select tuples from ASG ∪ ASG : 1000 tuple access cost 1,000 1 2 – Join EMP and ASG’: 400 20 tuple access cost 8,000 – Total cost 23,000 DDB 2008/09 J. Gamper Page 7Query Optimization • Query optimization is a crucial and difficult part of the overall query processing • Objective of query optimization is to minimize the following cost function: I/O cost+ CPU cost+ communication cost • Two different scenarios are considered: – Wide area networks ∗ Communication cost dominates · low bandwidth · low speed · high protocol overhead ∗ Most algorithms ignore all other cost components – Local area networks ∗ Communication cost not that dominant ∗ Total cost function should be considered DDB 2008/09 J. Gamper Page 8Query Optimization . . . • Ordering of the operators of relational algebra is crucial for efficient query processing • Rule of thumb: move expensive operators at the end of query processing • Cost of RA operations: Operation Complexity Select, Project O(n) (without duplicate elimination) Project O(nlogn) (with duplicate elimination) Group Join Semijoin O(nlogn) Division Set Operators 2 Cartesian Product O(n ) DDB 2008/09 J. Gamper Page 9Query Optimization Issues Several issues have to be considered in query optimization • Types of query optimizers – wrt the search techniques (exhaustive search, heuristics) – wrt the time when the query is optimized (static, dynamic) • Statistics • Decision sites • Network topology • Use of semijoins DDB 2008/09 J. Gamper Page 10Query Optimization Issues . . . • Types of Query Optimizers wrt Search Techniques – Exhaustive search ∗ Costbased ∗ Optimal ∗ Combinatorial complexity in the number of relations – Heuristics ∗ Not optimal ∗ Regroups common subexpressions ∗ Performs selection, projection first ∗ Replaces a join by a series of semijoins ∗ Reorders operations to reduce intermediate relation size ∗ Optimizes individual operations DDB 2008/09 J. Gamper Page 11Query Optimization Issues . . . • Types of Query Optimizers wrt Optimization Timing – Static ∗ Query is optimized prior to the execution ∗ As a consequence it is difficult to estimate the size of the intermediate results ∗ Typically amortizes over many executions – Dynamic ∗ Optimization is done at run time ∗ Provides exact information on the intermediate relation sizes ∗ Have to reoptimize for multiple executions – Hybrid ∗ First, the query is compiled using a static algorithm ∗ Then, if the error in estimate sizes greater than threshold, the query is reoptimized at run time DDB 2008/09 J. Gamper Page 12Query Optimization Issues . . . • Statistics – Relation/fragments ∗ Cardinality ∗ Size of a tuple ∗ Fraction of tuples participating in a join with another relation/fragment – Attribute ∗ Cardinality of domain ∗ Actual number of distinct values ∗ Distribution of attribute values (e.g., histograms) – Common assumptions ∗ Independence between different attribute values ∗ Uniform distribution of attribute values within their domain DDB 2008/09 J. Gamper Page 13Query Optimization Issues . . . • Decision sites – Centralized ∗ Single site determines the ”best” schedule ∗ Simple ∗ Knowledge about the entire distributed database is needed – Distributed ∗ Cooperation among sites to determine the schedule ∗ Only local information is needed ∗ Cooperation comes with an overhead cost – Hybrid ∗ One site determines the global schedule ∗ Each site optimizes the local subqueries DDB 2008/09 J. Gamper Page 14Query Optimization Issues . . . • Network topology – Wide area networks (WAN) pointtopoint ∗ Characteristics · Low bandwidth · Low speed · High protocol overhead ∗ Communication cost dominate; all other cost factors are ignored ∗ Global schedule to minimize communication cost ∗ Local schedules according to centralized query optimization – Local area networks (LAN) ∗ Communication cost not that dominant ∗ Total cost function should be considered ∗ Broadcasting can be exploited (joins) ∗ Special algorithms exist for star networks DDB 2008/09 J. Gamper Page 15Query Optimization Issues . . . • Use of Semijoins – Reduce the size of the join operands by first computing semijoins – Particularly relevant when the main cost is the communication cost – Improves the processing of distributed join operations by reducing the size of data exchange between sites – However, the number of messages as well as local processing time is increased DDB 2008/09 J. Gamper Page 16Distributed Query Processing Steps DDB 2008/09 J. Gamper Page 17Conclusion • Query processing transforms a high level query (relational calculus) into an equivalent lower level query (relational algebra). The main difficulty is to achieve the efficiency in the transformation • Query optimization aims to mimize the cost function: I/O cost+ CPU cost+ communication cost • Query optimizers vary by search type (exhaustive search, heuristics) and by type of the algorithm (dynamic, static, hybrid). Different statistics are collected to support the query optimization process • Query optimizers vary by decision sites (centralized, distributed, hybrid) • Query processing is done in the following sequence: query decomposition→data localization→global optimization→ local optimization DDB 2008/09 J. Gamper Page 18
Document Information
User Name:
User Type:
Uploaded Date: