



Code Optimization

Spring 2010

Introduction to Code Optimization: Instruction Scheduling

Outline
• Modern architectures
• Introduction to instruction scheduling
• List scheduling
• Resource constraints
• Scheduling across basic blocks
• Trace scheduling

Saman Amarasinghe 6.035 ©MIT Fall 1998

Simple Machine Model
• Instructions are executed in sequence
  – Fetch, decode, execute, store results
  – One instruction at a time
• For branch instructions, start fetching from a different location if needed
  – Check branch condition
  – Next instruction may come from a new location given by the branch instruction

Simple Execution Model
• 5-stage pipeline: fetch, decode, execute, memory, writeback
• Fetch: get the next instruction
• Decode: figure out what that instruction is
• Execute: perform ALU operation
  – address calculation in a memory op
• Memory: do the memory access in a mem. op
• Write Back: write the results back

Simple Execution Model

        time →
Inst 1  IF  DE  EXE MEM WB
Inst 2      IF  DE  EXE MEM WB
Inst 3          IF  DE  EXE MEM WB
Inst 4              IF  DE  EXE MEM WB
Inst 5                  IF  DE  EXE MEM WB

From a Simple Machine Model to a Real Machine Model
• Many pipeline stages
  – Pentium              5
  – Pentium Pro         10
  – Pentium IV (130nm)  20
  – Pentium IV (90nm)   31
  – Core 2 Duo          14
• Different instructions taking different amounts of time to execute
• Hardware to stall the pipeline if an instruction uses a result that is not ready

Real Machine Model cont.
• Most modern processors have multiple cores
  – Will deal with multicores next week
• Each core has multiple execution units (superscalar)
  – If the instruction sequence is correct, multiple operations will happen in the same cycle
  – Even more important to have the right instruction sequence

Instruction Scheduling
• Reorder instructions so that pipeline stalls are minimized

Constraints On Scheduling
• Data dependencies
• Control dependencies
• Resource constraints

Data Dependency between Instructions
• If two instructions access the same variable, they can be dependent
• Kinds of dependencies
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What to do if two instructions are dependent
  – The order of execution cannot be reversed
  – Reduces the possibilities for scheduling

Computing Dependencies
• For basic blocks, compute dependencies by walking through the instructions
• Identifying register dependencies is simple
  – is it the same register?
• For memory accesses
  – simple: base + offset1 ?= base + offset2
  – data dependence analysis: a[2i] ?= a[2i+1]
  – interprocedural analysis: global ?= parameter
  – pointer alias analysis: p1→foo ?= p2→foo

Representing Dependencies
• Use a dependence DAG, one per basic block
• Nodes are instructions, edges represent dependencies

1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1

(edges 1→3, 2→3, and 1→4, each labeled 2)

• An edge is labeled with its latency:
  – v(i → j) = delay required between the initiation times of i and j, minus the execution time required by i

Example
1: r2 = *(r1 + 4)
2: r3 = *(r2 + 4)
3: r4 = r2 + r3
4: r5 = r2 - 1

(edges 1→2, 1→3, 1→4, and 2→3, each labeled 2)

Another Example
1: r2 = *(r1 + 4)
2: *(r1 + 4) = r3
3: r3 = r2 + r3
4: r5 = r2 - 1

(true dependences 1→3 and 1→4 labeled 2; anti dependences 1→2 and 2→3 labeled 1)

Control Dependencies and Resource Constraints
• For now, let's only worry about basic blocks
• For now, let's look at simple pipelines

Example
1: lea var_a, %rax
2: add 4, %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx
7: imul %rax, %rbx

Example — Results In
1: lea var_a, %rax      1 cycle
2: add 4, %rax          1 cycle
3: inc %r11             1 cycle
4: mov 4(%rsp), %r10    3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx   4 cycles
7: imul %rax, %rbx      3 cycles
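A dependence DAG like the ones above can be built mechanically by a pairwise walk over the basic block. Below is a minimal Python sketch covering register dependencies only; the tuple encoding of instructions and the per-instruction latencies are illustrative assumptions, not part of the slides.

```python
# Build a dependence DAG for a basic block, register dependencies only.
# Each instruction is (dest, set_of_source_regs, latency); this encoding
# and the latencies are assumptions for illustration.

def build_dag(instrs):
    """Return {(i, j): delay} edges for true, anti, and output deps."""
    edges = {}
    def add(i, j, delay):
        edges[(i, j)] = max(edges.get((i, j), 0), delay)
    for j, (dst_j, srcs_j, _) in enumerate(instrs):
        for i, (dst_i, srcs_i, lat_i) in enumerate(instrs[:j]):
            if dst_i in srcs_j:          # true: i writes, j reads
                add(i, j, lat_i)
            if dst_j in srcs_i:          # anti: i reads, j writes
                add(i, j, 1)
            if dst_i == dst_j:           # output: both write
                add(i, j, 1)
    return edges

# 1: r2 = *(r1+4)  2: r3 = *(r1+8)  3: r4 = r2+r3  4: r5 = r2-1
# (loads assumed to have latency 2, ALU ops latency 1)
block = [("r2", {"r1"}, 2), ("r3", {"r1"}, 2),
         ("r4", {"r2", "r3"}, 1), ("r5", {"r2"}, 1)]
print(build_dag(block))   # {(0, 2): 2, (1, 2): 2, (0, 3): 2}
```

The recovered edges match the first DAG above: both loads feed instruction 3, and instruction 1 also feeds instruction 4. Memory and aliasing dependencies would need the extra analyses listed on the slide.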
Schedule: 1 2 3 4 st st 5 6 st st st 7

List Scheduling Algorithm
• Idea
  – Do a topological sort of the dependence DAG
  – Consider when an instruction can be scheduled without causing a stall
  – Schedule the instruction if it causes no stall and all its predecessors are already scheduled
• Optimal list scheduling is NP-complete
  – Use heuristics when necessary

List Scheduling Algorithm
• Create a dependence DAG of a basic block
• Topological Sort

READY = nodes with no predecessors
Loop until READY is empty
    Schedule each node in READY when no stalling
    Update READY

Heuristics for selection
• Heuristics for selecting from the READY list
  – pick the node with the longest path to a leaf in the dependence graph
  – pick a node with the most immediate successors
  – pick a node that can go to a less busy pipeline (in a superscalar)

Heuristics for selection
• Pick the node with the longest path to a leaf in the dependence graph
• Algorithm (for node x)
  – If no successors: d_x = 0
  – d_x = MAX(d_y + c_xy) for all successors y of x
  – reverse breadth-first visitation order

Heuristics for selection
• Pick a node with the most immediate successors
• Algorithm (for node x):
  – f_x = number of successors of x
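Both heuristics can be computed in a single reverse pass over the DAG. The sketch below uses the 9-instruction example's dependence edges (reconstructed from the d and f values shown on the slides); since every edge there runs from a lower-numbered node to a higher one, descending numeric order is a valid reverse topological order.

```python
# Compute the selection priorities d_x (longest latency-weighted path
# to a leaf) and f_x (number of immediate successors) for each node.

def priorities(n, edges):
    succs = {x: [] for x in range(1, n + 1)}
    for (i, j), c in edges.items():
        succs[i].append((j, c))
    d, f = {}, {}
    for x in range(n, 0, -1):                 # reverse topological order here
        f[x] = len(succs[x])
        d[x] = max((d[y] + c for y, c in succs[x]), default=0)
    return d, f

# Edge set of the 9-instruction example (labels are latencies)
edges = {(1, 2): 1, (2, 7): 1, (4, 5): 3,
         (6, 7): 4, (7, 8): 3, (7, 9): 1}
d, f = priorities(9, edges)
print(d[6], f[7])   # 7 2 -- node 6 sits on the critical path
```

The result reproduces the annotations on the example slide: d = 5, 4, 0, 3, 0, 7, 3, 0, 0 for nodes 1–9, so node 6 (the `and` feeding the long-latency `imul`) is picked first.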
Example — Results In
1: lea var_a, %rax      1 cycle
2: add 4, %rax          1 cycle
3: inc %r11             1 cycle
4: mov 4(%rsp), %r10    3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx   4 cycles
7: imul %rax, %rbx      3 cycles
8: mov %rbx, 16(%rsp)
9: lea var_b, %rax

Example
Dependence DAG (edge labels are latencies):
1 → 2 (1), 2 → 7 (1), 4 → 5 (3), 6 → 7 (4), 7 → 8 (3), 7 → 9 (1)

Example
Priorities for each node (d = longest path to a leaf, f = number of immediate successors):
1: d=5, f=1    2: d=4, f=1    3: d=0, f=0
4: d=3, f=1    5: d=0, f=0    6: d=7, f=1
7: d=3, f=2    8: d=0, f=0    9: d=0, f=0

Example
READY = {1, 3, 4, 6}  → schedule 6   (largest d)
READY = {1, 4, 3}     → schedule 1
READY = {2, 4, 3}     → schedule 2
READY = {7, 4, 3}     → schedule 4   (7 would stall: it must wait on 6)
READY = {7, 3, 5}     → schedule 7
READY = {3, 5, 8, 9}  → schedule 3
READY = {5, 8, 9}     → schedule 5, then 8
READY = {9}           → schedule 9

Final schedule: 6 1 2 4 7 3 5 8 9
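The walkthrough above can be checked by simulating a single-issue, in-order pipeline in which an instruction must issue at least `latency` cycles after each DAG predecessor. This is a simplification — it charges the full latency even to the final stores, so the unscheduled order comes out at 16 cycles here rather than the 14 counted on the slide — but it confirms the scheduled order runs stall-free.

```python
# Count issue cycles for an instruction order on a single-issue,
# in-order machine: instruction x stalls until `latency` cycles after
# each of its DAG predecessors has issued. Simplified model (charges
# full latency even before stores).

def cycles(order, edges):
    issue, t = {}, 0
    for x in order:
        earliest = t + 1                      # next free issue slot
        for (i, j), c in edges.items():
            if j == x and i in issue:
                earliest = max(earliest, issue[i] + c)
        issue[x] = t = earliest
    return t

# Edge set of the 9-instruction example (labels are latencies)
edges = {(1, 2): 1, (2, 7): 1, (4, 5): 3,
         (6, 7): 4, (7, 8): 3, (7, 9): 1}
print(cycles([1, 2, 3, 4, 5, 6, 7, 8, 9], edges))   # 16 (with stalls)
print(cycles([6, 1, 2, 4, 7, 3, 5, 8, 9], edges))   # 9  (no stalls)
```

Issuing the long-latency `and` and load early hides their latencies behind independent work, which is exactly what the critical-path heuristic buys.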
Example — Results In
Original order:  1 2 3 4 st st 5 6 st st st 7 8 9   (14 cycles)
Scheduled order: 6 1 2 4 7 3 5 8 9                  (9 cycles)

Resource Constraints
• Modern machines have many resource constraints
• Superscalar architectures:
  – can run a few parallel operations
  – But have constraints

Resource Constraints of a Superscalar Processor
• Example:
  – One fully pipelined reg-to-reg unit
    • All integer operations take one cycle
  In parallel with
  – One fully pipelined memory-to/from-reg unit
    • Data loads take two cycles
    • Data stores take one cycle

List Scheduling Algorithm with resource constraints
• Represent the superscalar architecture as multiple pipelines
  – Each pipeline represents some resource
• Example
  – One single-cycle reg-to-reg ALU unit
  – One two-cycle pipelined reg-to/from-memory unit

    ALU | MEM 1 | MEM 2

List Scheduling Algorithm with resource constraints
• Create a dependence DAG of a basic block
• Topological Sort

READY = nodes with no predecessors
Loop until READY is empty
    Let n ∈ READY be the node with the highest priority
    Schedule n in the earliest slot that satisfies
        precedence + resource constraints
    Update READY

Example
1: lea var_a, %rax
2: add 4(%rsp), %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: mov %r10, 8(%rsp)
6: and 0x00ff, %rbx
7: imul %rax, %rbx
8: lea var_b, %rax
9: mov %rbx, 16(%rsp)

Resulting schedule:
ALUop: 1 6 3 7 8
MEM 1: 4 2 5 9
MEM 2:   4 2
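The "earliest slot that satisfies precedence + resource constraints" step can be sketched with a reservation table: each resource accepts at most one new operation per cycle, and a fully pipelined unit only needs its issue slot reserved. The three-instruction demo, its priorities, and its latencies below are made-up illustrations, not the slide's example.

```python
# List scheduling with a reservation table. busy holds
# (resource, cycle) pairs already claimed by an issue slot.

def schedule(units, edges, priority):
    """units: {instr: resource}; edges: {(i, j): latency}."""
    placed, busy = {}, set()
    ready = [x for x in units if not any(j == x for (_, j) in edges)]
    while ready:
        x = max(ready, key=priority.get)      # highest-priority ready node
        ready.remove(x)
        t = 1
        for (i, j), c in edges.items():       # precedence constraints
            if j == x:
                t = max(t, placed[i] + c)
        while (units[x], t) in busy:          # resource constraint
            t += 1
        placed[x] = t
        busy.add((units[x], t))
        for (i, j) in edges:                  # successors now ready?
            if i == x and j not in placed and j not in ready \
                    and all(k in placed for (k, l) in edges if l == j):
                ready.append(j)
    return placed

# Hypothetical demo: a load feeding an ALU op feeding a store
units = {"ld": "MEM", "add": "ALU", "st": "MEM"}
edges = {("ld", "add"): 2, ("add", "st"): 1}
priority = {"ld": 2, "add": 1, "st": 0}
print(schedule(units, edges, priority))   # {'ld': 1, 'add': 3, 'st': 4}
```

With more instructions, two operations competing for the same unit in the same cycle would push the lower-priority one to the next free slot, which is how the ALU and MEM rows in the schedule above interleave.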
Scheduling across basic blocks
• Number of instructions in a basic block is small
  – Cannot keep multiple units with long pipelines busy by just scheduling within a basic block
• Need to handle control dependence
  – Scheduling constraints across basic blocks
  – Scheduling policy

Moving across basic blocks
• Downward to adjacent basic block: move an instruction from block A into a successor block B
  – Is there a path to B that does not execute A?
• Upward to adjacent basic block: move an instruction from block C into a predecessor block A
  – Is there a path from C that does not reach A?

Control Dependencies
• Constraints in moving instructions across basic blocks

  if ( . . . )
      a = b op c

  if ( c != 0 )
      a = b / c        NO
  if ( . . . )
      d = *(a + 1)

  if ( valid address? )
      d = *(a + 1)

Trace Scheduling
• Find the most common trace of basic blocks
  – Use profile information
• Combine the basic blocks in the trace and schedule them as one block
• Create clean-up code if the execution goes off-trace

Large Basic Blocks via Code Duplication
• Create large extended basic blocks by duplication
• Schedule the larger blocks

Next
• Scheduling for loops
• Loop unrolling
• Software pipelining
• Interaction with register allocation
• Hardware vs. Compiler

MIT OpenCourseWare
http://ocw.mit.edu

6.035 Computer Language Engineering
Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.