Question? Leave a message!




Loop Optimizations

Loop Optimizations
Dr.MasonHanks Profile Pic
Dr.MasonHanks,Germany,Teacher
Published Date:23-07-2017
Website URL
Comment
Spring Spring 2010 2010 Loop Optimizations Instruction Instruction Scheduling Scheduling5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IId nductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling g Loop p s •Loop p bodies are small • But, lot of time is spend in loops due to large number number o of f iterations iterations • Need better ways to schedule loops Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop p Examp p le • Machine – One load/store unit • load 2 cycles • store store 2 2 cycles cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both Both units units are are p pipelined ipelined (initiate (initiate one one op op each each cycle) cycle) • Source Code for i = 1 to N Ai = Ai b Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop p Examp p le • Source Code for i = 1 to N base Ai = Ai b off ffset • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov mov %r10, %r10, (%rdi,%rax) (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 5 6.035 ©MIT Fall 1998 � mov d=7 Looppp Example 2 d=5 imul • Assembly Code 3 loop: mov (%rdi,%rax), %r10 d=2 mov imul %r11,, %r10 0 0 mov %r10, (%rdi,%rax) d=2 sub sub 4, %rax jz jz loop loop 2 2 d=0 Schedule (9 cycles per iteration) jz mov mov mov mov imul bge imul bge imul sub sub Saman Amarasinghe 6 6.035 ©MIT Fall 19985 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • Id Inductii on VVariiabl bl e RR ecogniitii on • loop invariant code motion Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop p Unrolling g • Unroll the loop p y y bod few times • Pros: – Create Create a a much much larger larger basic basic block block for for the the body body – Eliminate few loop bounds checks • • Cons: Cons: – Much larger program – S Setup cod de ( ( of f i iterati ions unroll ll ff actor)) – beginning and end of the schedule can still have unuseddl slotts Saman Amarasinghe 8 6.035 ©MIT Fall 1998 Loop p Examp ple loop: loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov mov %r10, %r10, (%rdi,%rax) (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 9 6.035 ©MIT Fall 1998 Loop p Examp ple loop: loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub 4, %rax jz loop Saman Amarasinghe 10 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example 3 loop: loop: mov d=9 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=9 mov %r10, (%rdi,%rax) 2 2 sub 4, %rax mov d=7 mov (%rdi,%rax), %r10 2 mul imul %r11, %r10 d=5 3 mov %r10, (%rdi,%rax) mov d=2 sub 4, %rax 0 jz loop sub d=2 2 jz d=0 • Schedule (8 cycles per iteration) mov mov mov mov mo mov v mo mov v mo mov v mo mov v imul imul bge imul imul bge imul imul sub sub sub sub Saman Amarasinghe 11 6.035 ©MIT Fall 1998Loop p Unrolling g • Rename reg gisters – Use different registers in different iterations Saman Amarasinghe 12 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example loop: loop: 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %1 %r100, (% (%rdi di,%%rax)) 2 mov d=7 sub 4, %rax 2 mov (%rdi,%rax), %r10 mul mul d=5 d=5 3 imul %r11, %r10 mov d=2 mov %r10, (%rdi,%rax) 0 sub sub 4, 4, % %rax rax sub sub dd2 =2 2 jz loop jz d=0 Saman Amarasinghe 13 6.035 ©MIT Fall 1998mov d=14 2 mul d=12 Looppp Example loop: loop: 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %1 %r100, (% (%rdi di,%%rax)) 2 mov d=7 sub 4, %rax 2 mov (%rdi,%rax), %rcx mul mul d=5 d=5 3 imul %r11, %rcx mov d=2 mov %rcx, (%rdi,%rax) 0 sub sub 4, 4, % %rax rax sub sub dd2 =2 2 jz loop jz d=0 Saman Amarasinghe 14 6.035 ©MIT Fall 1998Loop p Unrolling g • Rename reg gisters – Use different registers in different iterations • Eliminate unnecessary dependencies – again again , use use more more registers registers to to eliminate eliminate true true , anti anti and and output dependencies – eliminate eliminate dependent dependent-chains chains of of calculations calculations when when possible Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 Looppp Example 3 loop: loop: mov d=9 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=9 2 2 mov %r%1 100,( (%%rdi di,%%rax)) mov d=7 sub 4, %rax 2 mul d=5 mov (%rdi,%rax), %rcx 3 imul %r11, %rcx mov d=2 0 mov %rcx, (%rdi,%rax) sub d=2 sub sub 4, 4, %%rax rax 2 jz loop jz d=0 Saman Amarasinghe 16 6.035 ©MIT Fall 1998mov d=5 2 mul d=3 Looppp Example 3 loop: loop: mov d=0 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=0 2 2 mov %r%1 100,( (%%rdi di,%%rax)) mov d=7 sub 8, %rax 2 mul d=5 mov (%rdi,%rbx), %rcx 3 imul %r11, %rcx mov d=2 0 mov %rcx, (%rdi,%rbx) sub d=2 sub sub 8 8,, %%rbx rbx 2 jz loop jz d=0 Saman Amarasinghe 17 6.035 ©MIT Fall 1998mov d=5 2 mul d=3 Looppp Example 3 loop: loop: mov d=0 mov (%rdi,%rax), %r10 0 imul %r11, %r10 sub d=0 mov %r10, (%rdi,%rax) 2 2 sub 8, %rax mov d=7 mov (%rdi,%rbx), %rcx 2 mul imul %r11, %rcx d=5 3 mov %rcx, (%rdi,%rbx) mov d=2 sub 8, %rbx 0 jz loop sub d=2 2 • S Sch hed dul le (4 (4.5 5 cycl les per it iterati tion jz d=0 mov mov mov mov mo mov v mo mov v mo mov v mo mov v imul imul jz imul imul jz imul imul su sub b su sub b sub sub Saman Amarasinghe 18 6.035 ©MIT Fall 19985 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • l loop i invari iant cod de moti ion • Induction Variable Recognition Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pip pelining g •Try y p p p ple iterations so that the to overla multi slots will be filled • Find Find the the steady steady-state state window window so so that: that: – all the instructions of the loop body is executed – but but from from d different ifferent iterations iterations Saman Amarasinghe 20 6.035 ©MIT Fall 1998