Speculative Execution

Branch Prediction and Speculative Execution and speculative execution and indirect branch prediction
Dr.ShaneMatts Profile Pic
Dr.ShaneMatts,United States,Teacher
Published Date:23-07-2017
Your Website URL(Optional)
Comment
1 Branch Prediction and Speculative Execution Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Krste Asanovic and Arvind 6.823 L13-2 Arvind Outline • Control transfer penalty • Branch prediction schemes • Branch misprediction recovery schemes October 26, 2005 6.823 L13-3 Arvind Phases of Instruction Execution PC Fetch: Instruction bits retrieved I-cache from cache. Fetch Buffer Decode: Instructions placed in appropriate issue (aka “dispatch”) stage buffer Issue Buffer Execute: Instructions and operands sent to Func. execution units . Units When execution completes, all results and exception flags are available. Result Buffer Commit: Instruction irrevocably updates architectural state (aka “graduation” or Arch. “completion”). State October 26, 2005 6.823 L13-4 Arvind Fetch Stage PC Instruction Cache Hit? Opcode Rd Rsrc1 Rsrc2/Imm Fetch Buffer Instructions To Decode Stage October 26, 2005 6.823 L13-5 Arvind Decode & Rename Stage Opcode Rd Rsrc1 Rsrc2/Imm (Renaming is shown only for Rsrc2, similar for Rsrc1) R31 R31 V Tag SignExt Committed R30 Rename R30 V Tag Architectural Table Regfile R0 R0 V Tag 1 X 0 1 0 1 0 1 1 ImmSel 0 1 0 1 Opcode U E P1 Tag1 Data1 P2 Tag2 Data2 Pd Rd Datad Cause t1 t2 ROB tn October 26, 2005 6.823 L13-6 Arvind Execute Stage • Arbiter selects one ready instruction (P1=1 AND P2=1) to execute • Instruction reads operands from ROB, executes, and broadcasts tag and result to waiting instructions in ROB. Also saves result and exception flags for commit. Opcode U E P1 Tag1 Data1 P2 Tag2 Data2 Pd Rd Datad Cause t1 t2 ROB Opcode U E P1 Tag1 Data1 P2 Tag2 Data2 Pd Rd Datad Cause tn Func. Unit October 26, 2005 6.823 L13-7 Arvind Commit Stage • When instruction at ptr2 (commit point) has completed, write back result to architectural state and check for exceptions • Check if rename table entry for architectural register written matches tag, if so, clear valid bit in rename table t1 ptr2 Opcode U E P1 Tag1 Data1 P2 Tag2 Data2 Pd Rd Datad Cause ROB tn Exception? R31 Rename V Tag R31 Committed R30 V Tag R30 Table Architectural Regfile R0 R0 V Tag =? Clear rename valid? October 26, 2005 6.823 L13-8 Arvind Branch Penalty PC Next fetch started Fetch I-cache Fetch Modern processors may Buffer Decode have 10 pipeline stages between next PC calculation Issue and branch resolution Buffer Execute Func. Units Result Branch executed Buffer Commit Arch. State October 26, 2005 6.823 L13-9 Arvind Average Run-Length between Branches Average dynamic instruction mix from SPEC92: SPECint92 SPECfp92 ALU 39 % 13 % FPU Add 20 % FPU Mult 13 % load 26 % 23 % store 9 % 9 % branch 16 % 8 % other 10 % 12 % SPECint92: compress, eqntott, espresso, gcc, li SPECfp92: doduc, ear, hydro2d, mdijdp2, su2cor What is the average run length between branches October 26, 2005 6.823 L13-10 Arvind Reducing Control Transfer Penalties Software solution • loop unrolling Increases the run length • instruction scheduling Compute the branch condition as early as possible (limited) Hardware solution • delay slots replaces pipeline bubbles with useful work (requires software cooperation) • branch prediction & speculative execution of instructions beyond the branch October 26, 2005 6.823 L13-11 Arvind MIPS Branches and Jumps Need to know (or guess) both target address and whether the branch/jump is taken or not Instruction Taken known? Target known? After Reg. Fetch After Inst. Fetch BEQZ/BNEZ Always Taken After Inst. Fetch J Always Taken After Reg. Fetch JR October 26, 2005 6.823 L13-12 Arvind Branch Penalties in Modern Pipelines UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000) A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 Branch Target B Branch Address Calc/Begin Decode Address I Complete Decode Known J Steer Instructions to Functional units Branch R Register File Read Direction & E Integer Execute Jump Register Remainder of execute pipeline Target (+ another 6 stages) Known October 26, 2005 6.823 L13-13 Arvind Outline • Control transfer penalty • Branch prediction schemes • Branch misprediction recovery schemes October 26, 2005 6.823 L13-14 Arvind Branch Prediction Motivation: branch penalties limit performance of deeply pipelined processors Modern branch predictors have high accuracy (95%) and can reduce branch penalties significantly Required hardware support: Prediction structures: branch history tables, branch target buffers, etc. Mispredict recovery mechanisms: • In-order machines: kill instructions following branch in pipeline • Out-of-order machines: shadow registers and memory buffers for each speculated branch October 26, 2005 6.823 L13-15 Arvind Static Branch Prediction Overall probability a branch is taken is 60-70% but: JZ backward forward 90% 50% JZ ISA can attach additional semantics to branches about preferred direction, e.g., Motorola MC88110 bne0 (preferred taken) beq0 (not taken) ISA can allow arbitrary choice of statically predicted direction (HP PA-RISC, Intel IA-64) October 26, 2005 6.823 L13-16 Arvind Dynamic Branch Prediction learning based on past behavior Temporal correlation The way a branch resolves may be a good predictor of the way it will resolve at the next execution Spatial correlation Several branches may resolve in a highly correlated manner (a preferred path of execution) October 26, 2005 6.823 L13-17 Arvind Branch Prediction Bits • Assume 2 BP bits per instruction • Change the prediction after two consecutive mistakes ¬take wrong ¬taken taken taken taken ¬take take ¬taken right right taken ¬taken ¬taken take wrong BP state: (predict take/¬take) x (last prediction right/wrong) October 26, 2005 6.823 L13-18 Arvind Branch History Table Fetch PC 00 k k 2 -entry BHT, I-Cache BHT Index 2 bits/entry Instruction Opcode offset + Branch? Taken/¬Taken? Target PC 4K-entry BHT, 2 bits/entry, 80-90% correct predictions October 26, 2005 6.823 L13-19 Arvind Two-Level Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (95% correct) 00 Fetch PC k 2-bit global branch history shift register Shift in Taken/¬Taken results of each branch Taken/¬Taken? October 26, 2005 6.823 L13-20 Arvind Exploiting Spatial Correlation Yeh and Patt, 1992 if (xi 7) then y += 1; if (xi 5) then c -= 4; If first condition false, second condition also false History bit: H records the direction of the last branch executed by the processor Two sets of BHT bits (BHT0 & BHT1) per branch instruction H = 0 (not taken) ⇒ consult BHT0 H = 1 (taken) ⇒ consult BHT1 October 26, 2005