Question? Leave a message!




Pipeline Hazards

Pipeline Hazards
1 Pipeline Hazards Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic6.823 L6- 2 Arvind Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower) It makes the following timing assumption valid t ≈ t ≈ t ≈ t ≈ t IM RF ALU DM RW A 5-stage pipelined Harvard architecture will be the focus of our detailed design September 28, 20056.823 L6- 3 Arvind 5-Stage Pipelined Execution 0x4 Add we rs1 rs2 rd1 we PC addr ws IR addr rdata ALU wd rd2 rdata GPRs Data Inst. Memory Imm Memory wdata Ext Write Decode, Reg. Fetch I-Fetch Execute Memory -Back (ID) (IF) (EX) (MA) (WB) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . instruction1 IF ID EX MA WB 1 1 1 1 1 instruction2 IF ID EX MA WB 2 2 2 2 2 instruction3 IF ID EX MA WB 3 3 3 3 3 instruction4 IF ID EX MA WB 4 4 4 4 4 instruction5 IF ID EX MA WB 5 5 5 5 5 September 28, 20056.823 L6- 4 Arvind 5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add we rs1 rs2 rd1 we PC addr ws IR addr rdata ALU wd rd2 rdata GPRs Data Inst. Memory Imm Memory wdata Ext Write Decode, Reg. Fetch I-Fetch Execute Memory -Back (ID) (IF) (EX) (MA) (WB) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I I I I I 1 2 3 4 5 ID I I I I I 1 2 3 4 5 EX I I I I I 1 2 3 4 5 MA I I I I I 1 2 3 4 5 WB I I I I I 1 2 3 4 5 September 28, 2005 Resources6.823 L6- 5 Arvind Pipelined Execution: ALU Instructions 0x4 IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 Not quite correct We need an Instruction Reg (IR) for each stage September 28, 20056.823 L6- 6 Arvind IR’s and Control points 0x4 IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 Are control points connected properly? - ALU instructions - Load/Store instructions -Write back September 28, 20056.823 L6- 7 Arvind Pipelined MIPS Datapath without jumps FD E M W IR IR IR 31 0x4 Add RegDst RegWrite we OpSel MemWrite rs1 WBSrc rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 ExtSel BSrc September 28, 20056.823 L6- 8 Arvind How Instructions can Interact with each other in a pipeline • An instruction in the pipeline may need a resource being used by another instruction in the pipeline – structural hazard • An instruction may produce data that is needed by a later instruction – data hazard • In the extreme case, an instruction may determine the next instruction to be executed – control hazard (branches, interrupts,...) September 28, 20056.823 L6- 9 Arvind Data Hazards r1 ← … r4 ← r1 … 0x4 IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 ... r1 ← r0 + 10 r4 ← r1 + 17 r1 is stale. Oops ... September 28, 20056.823 L6- 10 Arvind Resolving Data Hazards Freeze earlier pipeline stages until the data becomes available ⇒ interlocks If data is available somewhere in the datapath provide a bypass to get it to the right stage Speculate about the hazard resolution and kill the instruction later if the speculation is wrong. September 28, 20056.823 L6- 11 Arvind Feedback to Resolve Hazards FB FB FB FB 1 2 3 4 stage stage stage stage 3 4 1 2 • Detect a hazard and provide feedback to previous stages to stall or kill instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) September 28, 20056.823 L6- 12 Arvind Interlocks to resolve Data Hazards Stall Condition 0x4 nop IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext ... MD1 MD2 r1 ← r0 + 10 r4 ← r1 + 17 ... September 28, 20056.823 L6- 13 Arvind Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I ) r1 ← (r0) + 10 IF ID EX MA WB 1 1 1 1 1 1 (I ) r4 ← (r1) + 17 IF ID ID ID ID EX MA WB 2 2 2 2 2 2 2 2 2 (I)IF IF IF IF ID EX MA WB 3 3 3 3 3 3 3 3 3 (I)IF ID EX MA WB 4 4 4 4 4 4 stalled stages (I)IF ID EX MA WB 5 5 5 5 5 5 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I I I I I I I I 1 2 3 3 3 3 4 5 ID I I I I I I I I 1 2 2 2 2 3 4 5 Resource EX I nop nop nop I I I I 1 2 3 4 5 Usage MA I nop nop nop I I I I 1 2 3 4 5 WB I nop nop nop I I I I 1 2 3 4 5 nop ⇒ pipeline bubble September 28, 20056.823 L6- 14 Arvind Interlock Control Logic ws stall C stall rs ? rt 0x4 nop IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. September 28, 20056.823 L6- 15 Arvind Interlocks Control Logic ignoring jumps & branches ws stall we C stall rs ? rt ws ws we we C re1 re2 C dest dest C re 0x4 nop IR IR IR Add 31 C dest we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 Should we always stall if the rs field matches some rd? not every instruction writes a register ⇒ we not every instruction reads a register ⇒ re September 28, 20056.823 L6- 16 Arvind Source & Destination Registers R-type: op rs rt rd func I-type: op rs rt immediate16 J-type: op immediate26 source(s) destination ALU rd ← (rs) func (rt) rs, rt rd ALUi rt ← (rs) op imm rs rt LW rt ← M (rs) + imm rs rt SW M (rs) + imm ← (rt) rs, rt BZ cond (rs) true: PC ← (PC) + imm rs false: PC ← (PC) + 4 rs JPC ← (PC) + imm JAL r31 ← (PC), PC ← (PC) + imm 31 JR PC ← (rs) rs JALR r31 ← (PC), PC ← (rs) rs 31 September 28, 20056.823 L6- 17 Arvind Deriving the Stall Signal C C dest re ws = Case opcode re1 = Case opcode ALU ⇒ rd ALU, ALUi, ALUi, LW ⇒ rt LW, SW, BZ, JR, JALR JAL, JALR ⇒ R31 ⇒ on J, JAL ⇒ off we = Case opcode ALU, ALUi, LW ⇒(ws ≠ 0) re2 = Case opcode JAL, JALR ⇒ on ⇒ on ALU, SW ... ⇒ off ⇒ off ... C stall stall = ((rs =ws ).we + D E E (rs =ws ).we + D M M (rs =ws ).we ) . re1 + D W W D ((rt =ws ).we + D E E (rt =ws ).we + D M M (rt =ws ).we ) . re2 D W W D September 28, 2005 This is not the full story 6.823 L6- 18 Arvind Hazards due to Loads & Stores Stall Condition What if (r1)+7 = (r3)+5 ? 0x4 nop IR IR IR Add 31 we rs1 rs2 A addr rd1 we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 ... Is there any possible data hazard M(r1)+7 ← (r2) r4 ← M(r3)+5 in this instruction sequence? ... September 28, 20056.823 L6- 19 Arvind Load & Store Hazards ... M(r1)+7 ← (r2) (r1)+7 = (r3)+5 ⇒ data hazard r4 ← M(r3)+5 ... However, the hazard is avoided because our memory system completes writes in one cycle Load/Store hazards, even when they do exist, are often resolved in the memory system itself. More on this later in the course. September 28, 200520 Five-minute break to stretch your legs6.823 L6- 21 Arvind Complications due to Jumps PCSrc (pc+4 / jabs / rind/ br) stall Add M E 0x4 nop IR IR Add I Jump? 1 Note fetching the addr PC next instruction inst IR before decode is Inst 104 I speculation ⇒ kill Memory 2 I 096 ADD 1 A jump instruction kills (not stalls) I 100 J 200 2 the following instruction I 104 ADD kill 3 I 304 ADD How? 4 September 28, 20056.823 L6- 22 Arvind Pipelining Jumps PCSrc (pc+4 / jabs / rind/ br) stall To kill a fetched instruction Insert a mux before IR Add M E 0x4 nop IR IR Add II I Jump? 2 1 1 Any IRSrc D interaction addr PC nop inst IR between Inst 304 104 stall and nop I Memory 2 jump? IRSrc = Case opcode D D I 096 ADD 1 J, JAL ⇒ nop I 100 J 200 2 ... ⇒ IM I 104 ADD kill 3 I 304 ADD 4 September 28, 20056.823 L6- 23 Arvind Jump Pipeline Diagrams time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I ) 096: ADD IF ID EX MA WB 1 1 1 1 1 1 (I ) 100: J 200 IF ID EX MA WB 2 2 2 2 2 2 (I ) 104: ADD IF nop nop nop nop 3 3 (I ) 304: ADD IF ID EX MA WB 4 4 4 4 4 4 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I I I I I 1 2 3 4 5 ID I I nop I I 1 2 4 5 Resource EX I I nop I I 1 2 4 5 Usage MA I I nop I I 1 2 4 5 WB I I nop I I 1 2 4 5 nop ⇒ pipeline bubble September 28, 20056.823 L6- 24 Arvind Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall Add M E 0x4 nop IR IR Add I BEQZ? 1 zero? IRSrc D addr PC nop A inst IR ALU Y Inst 104 I Memory 2 Branch condition is not known until I 096 ADD 1 the execute stage I 100 BEQZ r1 200 2 what action should be taken in the I 104 ADD 3 decode stage ? I 304 ADD 4 September 28, 20056.823 L6- 25 Arvind Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall ? Add M E BEQZ? 0x4 nop IR IR Add I I 2 1 zero? IRSrc D addr PC nop A inst IR ALU Y Inst 108 I 3 Memory If the branch is taken I 096 ADD - kill the two following instructions 1 I 100 BEQZ r1 200 2 - the instruction at the decode stage I 104 ADD 3 is not valid I 304 ADD 4 ⇒ stall signal is not valid September 28, 20056.823 L6- 26 Arvind Pipelining Conditional Branches PCSrc (pc+4/jabs/rind/br) stall M E BEQZ? IRSrc E 0x4 nop IR IR Add I I Jump? 2 1 zero? PC IRSrc D addr PC nop A inst IR ALU Y Inst 108 I 3 Memory If the branch is taken I 096 ADD - kill the two following instructions 1 I 100 BEQZ r1 200 2 - the instruction at the decode stage I 104 ADD 3 is not valid I 304 ADD 4 ⇒ stall signal is not valid September 28, 2005 Add6.823 L6- 27 Arvind New Stall Signal stall = ( ((rs =ws ).we + (rs =ws ).we + (rs =ws ).we ).re1 D E E D M M D W W D + ((rt =ws ).we + (rt =ws ).we + (rt =ws ).we ).re2 D E E D M M D W W D ) . ((opcode =BEQZ).z + (opcode =BNEZ).z) E E Don’t stall if the branch is taken. Why? Instruction at the decode stage is invalid September 28, 20056.823 L6- 28 Arvind Control Equations for PC and IR Muxes PCSrc = Case opcode E BEQZ.z, BNEZ.z ⇒ br Give priority ... ⇒ Case opcode to the older D J, JAL ⇒ jabs instruction, JR, JALR ⇒ rind i.e., execute ... ⇒ pc+4 stage instruction IRSrc = Case opcode D E over decode BEQZ.z, BNEZ.z ⇒ nop stage instruction ... ⇒ Case opcode D J, JAL, JR, JALR ⇒ nop ... ⇒ IM IRSrc = Case opcode E E BEQZ.z, BNEZ.z ⇒ nop ... ⇒ stall.nop + stall.IR D September 28, 20056.823 L6- 29 Arvind Branch Pipeline Diagrams (resolved in execute stage) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I ) 096: ADD IF ID EX MA WB 1 1 1 1 1 1 (I ) 100: BEQZ 200 IF ID EX MA WB 2 2 2 2 2 2 (I ) 104: ADD IF ID nop nop nop 3 3 3 (I ) 108: IF nop nop nop nop 4 4 (I ) 304: ADD IF ID EX MA WB 5 5 5 5 5 5 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I I I I I 1 2 3 4 5 ID I I I nop I 1 2 3 5 Resource EX I I nop nop I 1 2 5 Usage MA I I nop nop I 1 2 5 WB I I nop nop I 1 2 5 nop ⇒ pipeline bubble September 28, 20056.823 L6- 30 Arvind Reducing Branch Penalty (resolve in decode stage) • One pipeline bubble can be removed if an extra comparator is used in the Decode stage PCSrc (pc+4 / jabs / rind/ br) E Add nop IR 0x4 Add Zero detect on register file output we rs1 rs2 nop addr rd1 PC ws inst IR wd rd2 GPRs Inst D Memory Pipeline diagram now same as for jumps September 28, 20056.823 L6- 31 Arvind Branch Delay Slots (expose control hazard to software) • Change the ISA semantics so that the instruction that follows a jump or branch is always executed – gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted. I 096 ADD 1 Delay slot instruction I 100 BEQZ r1 200 2 I 104 ADD executed regardless of 3 I 304 ADD branch outcome 4 • Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later September 28, 20056.823 L6- 32 Arvind Bypassing time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I )r1 ← r0 + 10 IF ID EX MA WB 1 1 1 1 1 1 (I ) r4 ← r1 + 17 IF ID ID ID ID EX MA WB 2 2 2 2 2 2 2 2 2 (I)IF IF IF IF ID EX MA 3 3 3 3 3 3 3 3 (I ) stalled stages IF ID EX 4 4 4 4 (I)IF ID 5 5 5 Each stall or kill introduces a bubble in the pipeline ⇒ CPI 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I ) r1 ← r0 + 10 IF ID EX MA WB 1 1 1 1 1 1 (I ) r4 ← r1 + 17 IF ID EX MA WB 2 2 2 2 2 2 (I)IF ID EX MA WB 3 3 3 3 3 3 (I)IF ID EX MA WB 4 4 4 4 4 4 (I)IF ID EX MA WB 5 5 5 5 5 5 September 28, 20056.823 L6- 33 Arvind Adding a Bypass stall r4 ← r1... r1 ←... EM W 0x4 nop IR IR IR Add 31 ASrc we rs1 rs2 A addr rd1 D we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext MD1 MD2 When does this bypass help? ... (I)r1 ← r0 + 10 r1 ← Mr0 + 10 JAL 500 1 (I)r4 ← r1 + 17 r4 ← r1 + 17 r4 ← r31 + 17 2 yes no no September 28, 20056.823 L6- 34 Arvind The Bypass Signal Deriving it from the Stall Signal stall = ( ((rs =ws ).we + (rs =ws ).we + (rs =ws ).we ).re1 D E E D M M D W W D +((rt =ws ).we + (rt =ws ).we + (rt =ws ).we ).re2 ) D E E D M M D W W D we = Case opcode ws = Case opcode ALU, ALUi, LW ⇒(ws ≠ 0) ALU ⇒ rd JAL, JALR ⇒ on ALUi, LW ⇒ rt ... ⇒ off JAL, JALR ⇒ R31 ASrc = (rs =ws ).we .re1 Is this correct? D E E D No because only ALU and ALUi instructions can benefit from this bypass Split we into two components: we-bypass, we-stall E September 28, 20056.823 L6- 35 Arvind Bypass and Stall Signals Split we into two components: we-bypass, we-stall E we-bypass = Case opcode we-stall = Case opcode E E E E ALU, ALUi ⇒ (ws ≠ 0) LW ⇒ (ws ≠ 0) ... ⇒ off JAL, JALR ⇒ on ... ⇒ off ASrc = (rs =ws ).we-bypass . re1 D E E D stall = ((rs =ws ).we-stall + D E E (rs =ws ).we + (rs =ws ).we ). re1 D M M D W W D +((rt = ws ).we + (rt = ws ).we + (rt = ws ).we ). re2 D E E D M M D W W D September 28, 20056.823 L6- 36 Arvind Fully Bypassed Datapath PC for JAL, ... stall EM W 0x4 nop IR IR IR Add 31 ASrc we rs1 rs2 A addr rd1 D we PC ws inst IR addr ALU Y wd rd2 Inst GPRs B rdata Memory Data R Memory Imm wdata wdata Ext BSrc MD1 MD2 Is there still a need for the stall signal ? stall = (rs =ws ). (opcode =LW ).(ws ≠0 ).re1 D E E E E D + (rt =ws ). (opcode =LW ).(ws ≠0 ).re2 D E E E E D September 28, 20056.823 L6- 37 Arvind Why an Instruction may not be dispatched every cycle (CPI1) • Full bypassing may be too expensive to implement – typically all frequently used paths are provided – some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI • Loads have two cycle latency – Instruction after load cannot use load result – MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II. • Conditional branches may cause bubbles – kill following instruction(s) if no delay slots Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. September 28, 200538 Thank you
Website URL
Comment