Question? Leave a message!




Loop Optimizations

Loop Optimizations
Spring Spring 2010 2010 Loop Optimizations Instruction Instruction Scheduling Scheduling5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IId nductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling g Loop p s •Loop p bodies are small • But, lot of time is spend in loops due to large number number o of f iterations iterations • Need better ways to schedule loops Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop p Examp p le • Machine – One load/store unit • load 2 cycles • store store 2 2 cycles cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both Both units units are are p pipelined ipelined (initiate (initiate one one op op each each cycle) cycle) • Source Code for i = 1 to N Ai = Ai b Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop p Examp p le • Source Code for i = 1 to N base Ai = Ai b off ffset • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov mov %r10, %r10, (%rdi,%rax) (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 5 6.035 ©MIT Fall 1998 � mov d=7 Looppp Example 2 d=5 imul • Assembly Code 3 loop: mov (%rdi,%rax), %r10 d=2 mov imul %r11,, %r10 0 0 mov %r10, (%rdi,%rax) d=2 sub sub 4, %rax jz jz loop loop 2 2 d=0 Schedule (9 cycles per iteration) jz mov mov mov mov imul bge imul bge imul sub sub Saman Amarasinghe 6 6.035 ©MIT Fall 19985 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • Id Inductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop p Unrolling g • Unroll the loop p y y bod few times • Pros: – Create Create a a much much larger larger basic basic block block for for the the body body – Eliminate few loop bounds checks • • Cons: Cons: – Much larger program – S Setup cod de ( ( of f i iterati ions unroll ll ff actor)) – beginning and end of the schedule can still have unuseddl slotts Saman Amarasinghe 8 6.035 ©MIT Fall 1998 Loop p Examp ple loop: loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov mov %r10, %r10, (%rdi,%rax) (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 9 6.035 ©MIT Fall 1998 Loop p Examp ple loop: loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 10 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example 3 loop: loop: mov d=9 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=9 mov %r10, (%rdi,%rax) 2 2 sub 4, %rax mov d=7 mov (%rdi,%rax), %r10 2 mul imul %r11, %r10 d=5 3 mov %r10, (%rdi,%rax) mov d=2 sub 4, %rax 0 jz loop sub d=2 2 jz d=0 • Schedule (8 cycles per iteration) mov mov mov mov mo mov v mo mov v mo mov v mo mov v imul imul bge imul imul bge imul imul sub sub sub sub Saman Amarasinghe 11 6.035 ©MIT Fall 1998Loop p Unrolling g • Rename reg gisters – Use different registers in different iterations Saman Amarasinghe 12 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example loop: loop: 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %1 %r100, (% (%rdi di,%%rax)) 2 mov d=7 sub 4, %rax 2 mov (%rdi,%rax), %r10 mul mul d=5 d=5 3 imul %r11, %r10 mov d=2 mov %r10, (%rdi,%rax) 0 sub sub 4, 4, % %rax rax sub sub dd2 =2 2 jz loop jz d=0 Saman Amarasinghe 13 6.035 ©MIT Fall 1998mov d=14 2 mul d=12 Looppp Example loop: loop: 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %1 %r100, (% (%rdi di,%%rax)) 2 mov d=7 sub 4, %rax 2 mov (%rdi,%rax), %rcx mul mul d=5 d=5 3 imul %r11, %rcx mov d=2 mov %rcx, (%rdi,%rax) 0 sub sub 4, 4, % %rax rax sub sub dd2 =2 2 jz loop jz d=0 Saman Amarasinghe 14 6.035 ©MIT Fall 1998Loop p Unrolling g • Rename reg gisters – Use different registers in different iterations • Eliminate unnecessary dependencies – again again , use use more more registers registers to to eliminate eliminate true true , anti anti and and output dependencies – eliminate eliminate dependent dependent-chains chains of of calculations calculations when when possible Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example 3 loop: loop: mov d=9 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=9 2 2 mov %r%1 100,( (%%rdi di,%%rax)) mov d=7 sub 4, %rax 2 mul d=5 mov (%rdi,%rax), %rcx 3 imul %r11, %rcx mov d=2 0 mov %rcx, (%rdi,%rax) sub d=2 sub sub 4, 4, %%rax rax 2 jz loop jz d=0 Saman Amarasinghe 16 6.035 ©MIT Fall 1998mov d=5 2 mul d=3 Looppp Example 3 loop: loop: mov d=0 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=0 2 2 mov %r%1 100,( (%%rdi di,%%rax)) mov d=7 sub 8, %rax 2 mul d=5 mov (%rdi,%rbx), %rcx 3 imul %r11, %rcx mov d=2 0 mov %rcx, (%rdi,%rbx) sub d=2 sub sub 8 8,, %%rbx rbx 2 jz loop jz d=0 Saman Amarasinghe 17 6.035 ©MIT Fall 1998mov d=5 2 mul d=3 Looppp Example 3 loop: loop: mov d=0 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=0 mov %r10, (%rdi,%rax) 2 2 sub 8, %rax mov d=7 mov (%rdi,%rbx), %rcx 2 mul imul %r11, %rcx d=5 3 mov %rcx, (%rdi,%rbx) mov d=2 sub 8, %rbx 0 jz loop sub d=2 2 • S Sch hed dul le (4 (4.5 5 cycl les per it iterati tion jz d=0 mov mov mov mov mo mov v mo mov v mo mov v mo mov v imul imul jz imul imul jz imul imul su sub b su sub b sub sub Saman Amarasinghe 18 6.035 ©MIT Fall 19985 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • l loop i invari iant cod de moti ion • Induction Variable Recognition Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pip pelining g •Try y p p p ple iterations so that the to overla multi slots will be filled • Find Find the the steady steady-state state window window so so that: that: – all the instructions of the loop body is executed – but but from from d different ifferent iterations iterations Saman Amarasinghe 20 6.035 ©MIT Fall 1998 Loop p Examp p le • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax jjz loopp • Schedule mov mov mov mov mul jz mul jz mul sub sub Saman Amarasinghe 21 6.035 ©MIT Fall 1998 � Looppp Example • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax jjz loopp Schedule mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 1 mov2 2 mov mov3 3 mov1 1 mov4 4 mov2 2 ld ld55 mov3 3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3 Saman Amarasinghe 22 6.035 ©MIT Fall 1998� Looppp Example • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax jjz loopp Schedule (2 cycles per iteration) mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 1 mov2 2 mov mov3 3 mov1 1 mov4 4 mov2 2 ld ld55 mov3 3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3 Saman Amarasinghe 23 6.035 ©MIT Fall 1998Looppp Example • 4 iterations are overlapped mov4 mov2 – value of %r11 don’t change mov1 mov4 mul3 jz1 – 4 regs for (%rdi,%rax) jz mul3 – each addr. incremented by 44 mu mull22 sub2 – 4 regs to keep value %r10 sub1 – Same registers can be reused loop: after 4 of these blocks mov (%rdi,%rax), %r10 g, generate code for 4 blocks, imul imul %r11 %r11, %r10 %r10 otherwise need to move mov %r10, (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 24 6.035 ©MIT Fall 1998Software Pip pelining g • Optimal use of resources • NNd eed a l lot off regi isters – Values in multiple iterations need to be kept • Issues Issues in in dependencies dependencies – Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it it should should not not have) have) – Loads and stores are issued out-of-order (need to figure-out dependencies before doing this) • • Code Code generation generation issues issues – Generate pre-amble and post-amble code – Multiple blocks so no register copy is needed Saman Amarasinghe 25 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • Id Inductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 26 6.035 ©MIT Fall 1998 Register Allocation and dI Instt ructi ti on S S ch h ed d uli li ng • If reg gister allocation is before instruction scheduling – restricts the choices for scheduling g Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Examp ple 1: mov 4(%rbp), %rax 2 2: ad dd d %% rax, % %rb bx 3: mov 8(%rbp), %rax 4: 4: add add %rax, %rax, %rcx %rcx Saman Amarasinghe 28 6.035 ©MIT Fall 1998 Examp ple 1 1: mov 4(%rbp), %rax 22d : addd %%% rax, %rbxb 3 3: mov 8(%rbp), %rax 1 1 2 4: 4: add add %rax, %rax, %rcx %rcx 1 1 1 3 3 4 Saman Amarasinghe 29 6.035 ©MIT Fall 1998Examp ple 1 1: mov 4(%rbp), %rax 22d : addd %%% rax, %rbxb 3 3: mov 8(%rbp), %rax 1 1 2 4: 4: add add %rax, %rax, %rcx %rcx 1 1 1 3 3 ALUop 2 4 4 MEM 1 1 1 3 3 MEM 2 1 3 Saman Amarasinghe 30 6.035 ©MIT Fall 1998Examp ple 1 1: mov 4(%rbp), %rax 22d : addd %%rax, %%b rbx 3 3: mov 8(%rbp), %rax 1 1 2 4: 4: add add %rax %rax, , %rcx %rcx 1 1 1 3 Anti-dependence How about a different register? 3 4 Saman Amarasinghe 31 6.035 ©MIT Fall 1998Examp ple 1 1: mov 4(%rbp), %rax 22d : addd %%% rax, %rbxb 3 3: mov 8(%rbp), %r10 2 4: 4: add add %r10 %r10, , %rcx %rcx 3 Anti-dependence How about a different register? 3 4 Saman Amarasinghe 32 6.035 ©MIT Fall 1998Examp ple 1 1: mov 4(%rbp), %rax 22d : addd %% %rax, %rbxb 3 3: mov 8(%rbp), %r10 2 4: 4: add add %r10, %r10, %rcx %rcx 3 3 ALUop 2 4 4 MEM 1 1 1 3 3 MEM 2 1 3 Saman Amarasinghe 33 6.035 ©MIT Fall 1998Register Allocation and dI Instt ructi ti on S S ch h ed d uli li ng • If reg gister allocation is before instruction scheduling – restricts the choices for scheduling g Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Register Allocation andI d Inst tructi ti on S Sch hed duli li ng • If reg gister allocation is before instruction scheduling – restricts the choices for scheduling g • If If instruction instruction scheduling scheduling b before efore register register allocation – Register Register allocation allocation may may spill spill registers registers – Will change the carefully done schedule Saman Amarasinghe 35 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • Id Inductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 36 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transi i st t ors gone? ? • Out of order execution – If an instruction stalls, go beyond that and start executing non-dependent instructions –Pros: • Hardware scheduling • Tolerates unpredictable latencies – Cons: • Instruction window is small Saman Amarasinghe 37 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transi i st t ors gone? ? •Reg gister renaming g – If there is an anti or output dependency of a register that stalls the pipeline, use a different hardware register –Pros: • Avoids anti and output dependencies – Cons: • Cannot do more complex transformations to eliminate dependencies Saman Amarasinghe 38 6.035 ©MIT Fall 1998 Hardware vs. Comp piler • In a superscalar, hardware and compiler scheduling can work hand-in-hand • Hardware can reduce the burden when not predictable by the compiler • Compiler can still greatly enhance the performance – Large instruction window for scheduling – Many program transformations that increase parallelism • C C ompil il er i i s even more criti iti cal l wh h en no h h ard dware support – VLIW VLIW machines machines (Itanium (Itanium, DSPs) DSPs) Saman Amarasinghe 39 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IId nducti ion V Vari iab bll e R Recogni iti ion • loop invariant code motion Saman Amarasinghe 40 6.035 ©MIT Fall 1998 Induction Variables •Examp ple i = 200 for jj = 1 to 100 a(i) = 0 i = i - 1 Saman Amarasinghe 41 6.035 ©MIT Fall 1998 Induction Variables •Examp ple i = 200 for jj = 1 to 100 a(i) = 0 i = i - 1 Basic Induction variable: J J = =1 1, 2 2 , 3 3, 4 4, ….. Index Variable i in a(i): I = 200, 199, 198, 197…. Saman Amarasinghe 42 6.035 ©MIT Fall 1998 Induction Variables • Examp ple i = 200 ffor orj=1t j = 1 to o1100 00 a(i) = 0 i = i - 1 Basic Induction variable: J J =1 = 1, 2 2, 3 3, 4 4, ….. Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 43 6.035 ©MIT Fall 1998 Induction Variables • Examp ple i = 200 ffor orj=1t j = 1 to o1100 00 a(201 - j) = 0 i = i - 1 Basic Induction variable: J J = =1 1, 2 2 , 3 3 , 4 4, ….. Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 44 6.035 ©MIT Fall 1998 Induction Variables • Examp ple ffor orj=1t j = 1 to o1100 00 a(201 - j) = 0 Basic Induction variable: J J =1 = 1, 2 2, 3 3, 4 4, ….. Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 45 6.035 ©MIT Fall 1998 What are induction variables? • x is an induction variable of a loop p L if – variable changes its value every iteration of the loop – the value is a function of number of iterations of the loop • In compilers this function is normally a linear function function – Example: for loop index variable j, function cj + d Saman Amarasinghe 46 6.035 ©MIT Fall 1998 What can we do with induction varib iabll es? ? • • Use Use them them t to o perform perform strength strength reduction reduction • Get rid of them Saman Amarasinghe 47 6.035 ©MIT Fall 1998 Classification of induction variables • Basic induction variables – Explicitly modified by the same constant amount once during each iteration of the loop – Example: loop index variable • Dependent induction variables – Can Can be be expressed expressed in in the the form: form: ax ax + + b b where where a a and and be are loop invariant and x is an induction variable – Example: Example: 202 202 - 22j j Saman Amarasinghe 48 6.035 ©MIT Fall 1998 Classification of induction variables • Class of induction variables: All induction variables with same basic variable in their linear eq quations • • Basis Basis of of a a class: class: the the basic basic variable variable that that determines that class Saman Amarasinghe 49 6.035 ©MIT Fall 1998 Finding g Basic Induction Variables • Look inside loop p nodes • Find variables whose only modification is of the the f form orm jj = j+d j + d where where d d is is a a loop loop constant Saman Amarasinghe 50 6.035 ©MIT Fall 1998 Finding g Dep p endent Induction Variables • Find all the basic induction variables • Search variable k with a single assignment in the loop • Variable assignments of the form k = e op j or k = -j where j is an induction variable and e is loop invariant Saman Amarasinghe 51 6.035 ©MIT Fall 1998 Finding g Dep p endent Induction Variables • Example for i = 1 to 100 j = ic k = j+1 Saman Amarasinghe 52 6.035 ©MIT Fall 1998 A sp pecial case t = 202 for j = 1 to 100 t = t - 2 a(jj) = t t = t - 2 b(j) (j) = t Saman Amarasinghe 53 6.035 ©MIT Fall 1998 A sp pecial case u1 = 200 t = 202 u2 = 202 for j = 1 to 100 for jj = 1 to 100 t = t - 2 u1 = u1 - 4 a(jj ) = t a(j) a(j) = u1 u1 t = t - 2 u2 = u2 - 4 b(j) (j) = t b(j) b(j) = = uu2 2 Saman Amarasinghe 54 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IId nductii on VVariiabl bl e RR ecogniitii on • Loop invariant code motion Saman Amarasinghe 55 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p p roduces the same value in every loop iteration, move it out of the loop Saman Amarasinghe 56 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp p utation p p roduces the same value in every loop iteration, move it out of the loop for i = 1 to N x x = x+1 x + 1 for j = 1 to N a(i,j) = 100N + 10i + j + x Saman Amarasinghe 57 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p p roduces the same value in every loop iteration, move it out of the loop for i = 1 to N xx = x+1 x + 1 for j = 1 to N a(i,j) = 100N + 10i + j + x Saman Amarasinghe 58 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p produces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N x x = x+1 x + 1 for j = 1 to N a(i,j) = 100N + 10i + j + x Saman Amarasinghe 59 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp p utation p p roduces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N x x = x+1 x + 1 for j = 1 to N a(i,j) = t1 + 10i + j + x Saman Amarasinghe 60 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p p roduces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N xx = x+1 x + 1 for j = 1 to N a(i,j) = t1 + 10i + j + x Saman Amarasinghe 61 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p produces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N x x = x+1 x + 1 for j = 1 to N a(i,j) = t1 + 10i + j + x Saman Amarasinghe 62 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p produces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N xx = x+1 x + 1 t2 = t1 + 10i + x for j = 1 to N a(( i,j) ,j) = t1 + 10i + j j + x Saman Amarasinghe 63 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation p p roduces the same value in every loop iteration, move it out of the loop t1 = 100N for i = 1 to N x x = x+1 x + 1 t2 = t1 + 10i + x for j = 1 to N a( (i,j) ,j) = t2 + j j Saman Amarasinghe 64 6.035 ©MIT Fall 1998 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Website URL
Comment
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.MasonHanks
User Type:
Teacher
Country:
Germany
Uploaded Date:
23-07-2017