Enhancing performance with pipelining

define pipelining in computer architecture and how to calculate speedup of pipeline
Dr.AldenCutts Profile Pic
Dr.AldenCutts,United Kingdom,Teacher
Published Date:23-07-2017
Your Website URL(Optional)
Comment
Lecture 4: Introduction to Advanced Pipelining Prepared by: Professor David A. Patterson Computer Science 252, Fall 1996 Edited and presented by : Prof. Kurt Keutzer Computer Science 252, Spring 2000 KK SPR2000 1Review: Evaluating Branch Alternatives • Two part solution: – Determine branch taken or not sooner, AND – Compute taken branch address earlier Pipeline depth Pipeline speedup = 1 +Branch frequency ×Branch penalty Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline 3 1.42 3.5 1.0 Predict taken 1 1.14 4.4 1.26 Predict not taken 1 1.09 4.5 1.29 Delayed branch 0.5 1.07 4.6 1.31 KK SPR2000 2Review: Evaluating Branch Prediction • Two strategies 100000 – Backward branch predict taken, forward branch not 10000 taken 1000 – Profile-based prediction: record branch behavior, 100 predict branch based on prior run 10 • “Instructions between 1 mispredicted branches” a better metric than Profile-based Direction-based misprediction KK SPR2000 3 Instructions per mispredicted branch alvinn compress doduc espresso gcc hydro2d mdljsp2 ora swm256 tomcatvReview: Summary of Pipelining Basics • Hazards limit performance – Structural: need more HW resources – Data: need forwarding, compiler scheduling – Control: early evaluation & PC, delayed branch, prediction • Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency • Interrupts, Instruction Set, FP makes pipelining harder • Compilers reduce cost of data and control hazards – Load delay slots – Branch delay slots – Branch prediction • Today: Longer pipelines (R4000) = Better branch prediction, more instruction parallelism? KK SPR2000 4Case Study: MIPS R4000 (200 MHz) • 8 Stage Pipeline: – IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. – IS–second half of access to instruction cache. – RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. – EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. – DF–data fetch, first half of access to data cache. – DS–second half of access to data cache. – TC–tag check, determine whether the data cache access hit. – WB–write back for loads and register-register operations. • 8 Stages: What is impact on Load delay? Branch KK SPR2000 5 delay? Why?Case Study: MIPS R4000 IF IS RF EX DF DS TC WB TWO Cycle IF IS RF EX DF DS TC Load Latency IF IS RF EX DF DS IF IS RF EX DF IF IS RF EX IF IS RF IF IS IF IF IS RF EX DF DS TC WB THREE Cycle IF IS RF EX DF DS TC Branch Latency IF IS RF EX DF DS (conditions evaluated IF IS RF EX DF during EX phase) IF IS RF EX Delay slot plus two stalls IF IS RF Branch likely cancels delay slot if not taken IF IS IF KK SPR2000 6MIPS R4000 Floating Point • FP Adder, FP Multiplier, FP Divider • Last step of FP Multiplier/Divider uses FP Adder HW • 8 kinds of stages in FP units: Stage Functional unit Description A FP adder Mantissa ADD stage D FP divider Divide pipeline stage E FP multiplier Exception test stage M FP multiplier First stage of multiplier N FP multiplier Second stage of multiplier R FP adder Rounding stage S FP adder Operand shift stage U Unpack FP numbers KK SPR2000 7MIPS FP Pipe Stages FPInstr 1 2 345 6 7 8 … Add, Subtract U S+A A+R R+S Multiply U E+M M M M N N+A R 28 Divide U A R D … D+A D+R, D+R, D+A, D+R, A, R 108 Square root U E (A+R) …A R Negate U S Absolute value U S FP compare U A R Stages: A Mantissa ADD stage M First stage of multiplier D Divide pipeline stage N Second stage of multiplier E Exception test stage R Rounding stage S Operand shift stage U Unpack FP numbers KK SPR2000 8R4000 Performance • Not ideal CPI of 1: – Load stalls (1 or 2 clock cycles) – Branch stalls (2 cycles + unfilled slots) – FP result stalls: RAW data hazard (latency) – FP structural stalls: Not enough FP hardware (parallelism) 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 Base Load stalls Branch stalls FP result stalls FP structural stalls KK SPR2000 9 eqntott espresso gcc li doduc nasa7 ora spice2g6 su2cor tomcatvAdvanced Pipelining and Instruction Level Parallelism (ILP) • ILP: Overlap execution of unrelated instructions • gcc 17% control transfer – 5 instructions + 1 branch – Beyond single block to get more instruction level parallelism • Loop level parallelism one opportunity, SW and HW • Do examples and then explain nomenclature • DLX Floating Point as example – Measurements suggests R4000 performance FP execution has room for improvement KK SPR2000 10Instruction Latencies Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Integer op Integer op 0 KK SPR2000 11Simple Loop for (i=1; i=1000; i++) x(i) = x(i) + s; KK SPR2000 12Simple Loop and assembler for (i=1; i=1000; i++) x(i) = x(i) + s; Loop: LD F0,0(R1) ;F0=vector element ADDD F4,F0,F2 ;add scalar from F2 SD 0(R1),F4 ;store result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1=zero NOP ;delayed branch slot KK SPR2000 13FP Loop: Where are the Hazards? Loop: LD F0,0(R1) ;F0=vector element ADDD F4,F0,F2 ;add scalar from F2 SD 0(R1),F4 ;store result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1=zero NOP ;delayed branch slot Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Integer op Integer op 0 KK SPR2000 14 • Where are the stalls?FP Loop Hazards Loop: LD F0,0(R1) ;F0=vector element ADDD F4,F0,F2 ;add scalar in F2 SD 0(R1),F4 ;store result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1=zero NOP ;delayed branch slot Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Integer op Integer op 0 KK SPR2000 15FP Loop Showing Stalls 1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8 BNEZ R1,Loop ;branch R1=zero 9 stall ;delayed branch slot Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 KK SPR2000 16 • 9 clocks: Rewrite code to minimize stalls?Revised FP Loop Minimizing Stalls 1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4 ;altered when move past SUBI Swap BNEZ and SD by changing address of SD Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 6 clocks: Unroll loop 4 times code to make faster KK SPR ? 2000 17Unroll Loop Four Times (straightforward way) 1 Loop: LD F0,0(R1) Rewrite loop to 2 ADDD F4,F0,F2 minimize stalls? 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4LD F6,-8(R1) 5 ADDD F8,F6,F2 6SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9SD -16(R1),F12 ;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,32 ;alter to 48 14 BNEZ R1,LOOP 15 NOP 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration KK SPR2000 18 Assumes R1 is multiple of 4Unrolled Loop That Minimizes Stalls 1 Loop: LD F0,0(R1) • What assumptions 2 LD F6,-8(R1) made when moved 3 LD F10,-16(R1) code? 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 – OK to move store past 6 ADDD F8,F6,F2 SUBI even though changes 7 ADDD F12,F10,F2 register 8 ADDD F16,F14,F2 – OK to move loads before 9SD 0(R1),F4 stores: get right data? 10 SD -8(R1),F8 – When is it safe for 11 SD -16(R1),F12 compiler to do such 12 SUBI R1,R1,32 changes? 13 BNEZ R1,LOOP 14 SD 8(R1),F16 ; 8-32 = -24 14 clock cycles, or 3.5 per iteration When safe to move instructions? KK SPR2000 19Summary of Loop Unrolling Example • Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset. • Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code. • Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations. • Eliminate the extra tests and branches and adjust the loop maintenance code. • Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address. • Schedule the code, preserving any dependences needed to yield the same result as the original code. KK SPR2000 20

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.