Vector Computers

Dr.ShaneMatts,United States,Teacher
Published Date:23-07-2017
Joel Emer November 30, 2005 6.823, L22-1

Vector Computers
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Based on the material prepared by Krste Asanovic and Arvind

Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Supercomputer Applications
Typical application areas
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
• Bioinformatics
• Cryptography
All involve huge computations on large data sets
In 70s-80s, Supercomputer ≡ Vector Machine

Loop Unrolled Code Schedule
    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop
[Schedule figure: the unrolled loop placed into the issue slots Int1, Int2, M1, M2, FP+, FPx. Operations shown: ld f1, ld f2, ld f3, add r1, ld f4, fadd f5 to fadd f8, sd f5 to sd f8, add r2, bne.]

Vector Supercomputers
Epitomized by Cray-1, 1976:
• Scalar Unit
  – Load/Store Architecture
• Vector Extension
  – Vector Registers
  – Vector Instructions
• Implementation
  – Hardwired Control
  – Highly Pipelined Functional Units
  – Interleaved Memory System
  – No Data Caches
  – No Virtual Memory

Cray-1 (1976)
[Image: core unit of the Cray 1 computer. Image removed due to copyright restrictions. To view image, visit http://www.cray-]

Cray-1 (1976)
[Block diagram of the Cray-1 datapath.]
Block diagram contents:
• Vector registers V0 to V7, 64 elements each, with vector mask (V. Mask) and vector length (V. Length) registers
• Scalar registers S0 to S7, with 64 T registers
• Address registers A0 to A7, with 64 B registers
• Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
• Single-port memory: 16 banks of 64-bit words, plus 8-bit SECDED; addresses formed as ((Ah) + jkm)
• 80 MW/sec data load/store; 320 MW/sec instruction buffer refill
• 4 instruction buffers (64-bit x 16), with NIP, CIP, and LIP registers
• Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)

Vector Programming Model
• Scalar registers: r0 to r15
• Vector registers: v0 to v15, each holding elements 0 to VLRMAX-1
• Vector Length Register: VLR
• Vector arithmetic instructions: ADDV v3, v1, v2 adds elements 0 to VLR-1 of v1 and v2 into v3
• Vector load and store instructions: LV v1, r1, r2 loads v1 from memory with base r1 and stride r2

Vector Code Example
C code:
    for (i=0; i<64; i++)
      C[i] = A[i] + B[i];
Scalar code:
          LI     R4, 64
    loop: L.D    F0, 0(R1)
          L.D    F2, 0(R2)
          ADD.D  F4, F2, F0
          S.D    F4, 0(R3)
          DADDIU R1, 8
          DADDIU R2, 8
          DADDIU R3, 8
          DSUBIU R4, 1
          BNEZ   R4, loop
Vector code:
          LI     VLR, 64
          LV     V1, R1
          LV     V2, R2
          ADDV.D V3, V1, V2
          SV     V3, R3

Vector Instruction Set Advantages
• Compact
  – one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards)
[Figure: six stage multiply pipeline computing V3 <- v1 * v2, reading from vector registers V1 and V2 and writing V3.]
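The instruction counts behind the "compact" advantage can be checked against the Vector Code Example with a small model. The following is only a sketch: memory is modeled as a word-indexed Python list rather than byte addresses, and the function names and array layout are hypothetical choices made for illustration, not part of the lecture.

```python
# Sketch: count dynamic instructions for the scalar and vector versions of
# C[i] = A[i] + B[i]. Memory is a word-indexed list (an assumption of this
# model; the slide code uses byte addresses with 8-byte elements).

N = 64

def scalar_add(mem, a, b, c, n):
    """Mirror the scalar loop; return the number of instructions executed."""
    executed = 1                                  # LI R4, n
    for i in range(n):
        mem[c + i] = mem[a + i] + mem[b + i]      # L.D, L.D, ADD.D, S.D
        executed += 9                             # those 4, 3x DADDIU, DSUBIU, BNEZ
    return executed

def vector_add(mem, a, b, c, n):
    """Mirror the vector code; return the number of instructions executed."""
    executed = 1                                  # LI VLR, n
    v1 = mem[a:a + n]                             # LV V1, R1
    executed += 1
    v2 = mem[b:b + n]                             # LV V2, R2
    executed += 1
    v3 = [x + y for x, y in zip(v1, v2)]          # ADDV.D V3, V1, V2
    executed += 1
    mem[c:c + n] = v3                             # SV V3, R3
    executed += 1
    return executed

def make_memory(n):
    """Hypothetical layout: A at word 0, B at word n, C at word 2n."""
    a = [float(i) for i in range(n)]
    b = [float(2 * i) for i in range(n)]
    return a + b + [0.0] * n

mem_s = make_memory(N)
mem_v = make_memory(N)
scalar_count = scalar_add(mem_s, 0, N, 2 * N, N)
vector_count = vector_add(mem_v, 0, N, 2 * N, N)
print(scalar_count, vector_count)                 # 577 5
```

Both versions leave the same values in C. The scalar loop executes 1 + 64 x 9 = 577 instructions, while the vector code executes 5.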
Vector Instruction Execution
ADDV C,A,B
[Figure: two executions of the same instruction. With one pipelined functional unit, element pairs A[3] B[3] to A[6] B[6] wait to enter the pipeline while results C[0], C[1], C[2] emerge one at a time. With four pipelined functional units, element pairs A[12] B[12] to A[27] B[27] wait to enter four at a time, and results C[0] to C[11] emerge four at a time.]

Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: Cycles between accesses to same bank
[Figure: base and stride feed an address generator, which addresses memory banks 0 to F; the banks connect to the vector registers.]

Vector Unit Structure
[Figure: the functional unit and vector registers are divided into four lanes, all connected to the memory subsystem.]
• Lane 0: elements 0, 4, 8, …
• Lane 1: elements 1, 5, 9, …
• Lane 2: elements 2, 6, 10, …
• Lane 3: elements 3, 7, 11, …

T0 Vector Microprocessor (1995)
Vector register elements striped over lanes
[Figure: eight lanes holding elements 0 to 7, 8 to 15, 16 to 23, and 24 to 31 in successive rows.]
For more information, visit _____________________________________________

Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
[Figure: load, mul, and add instructions issue in turn and overlap in time on the Load Unit, Multiply Unit, and Add Unit.]
Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining
• Vector version of register bypassing
  – introduced with Cray-1
    LV   v1
    MULV v3, v1, v2
    ADDV v5, v3, v4
[Figure: memory feeds the Load Unit, which writes V1; V1 is chained into the Mult. unit along with V2, producing V3; V3 is chained into the Add unit along with V4, producing V5.]
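A rough timing model shows what chaining buys for the LV, MULV, ADDV sequence above. This is a sketch under stated assumptions, not the lecture's own model: each unit accepts one element per cycle, v2 and v4 are already in registers, dead time is ignored, and the latencies (12 cycles for the load, 6 for the multiply, 6 for the add) are illustrative values, with the function name `total_cycles` also hypothetical.

```python
# Sketch: cycles to finish a chain of dependent vector instructions, with and
# without chaining. Assumes one element per cycle per unit and no dead time.

def total_cycles(latencies, n, chaining):
    """Return the cycle by which the last instruction has written all n results.

    latencies: pipeline latency of each dependent instruction, in program order.
    chaining:  if True, a dependent instruction starts when the first result of
               its producer appears; otherwise it waits for all n results.
    """
    start = 0
    for latency in latencies[:-1]:
        if chaining:
            start += latency          # first element available after `latency`
        else:
            start += latency + n      # wait until the last element is written
    return start + latencies[-1] + n

LATENCIES = [12, 6, 6]                # LV, MULV, ADDV (illustrative values)
N = 64

unchained = total_cycles(LATENCIES, N, chaining=False)
chained = total_cycles(LATENCIES, N, chaining=True)
print(unchained, chained)             # 216 88
```

Under these assumptions the unchained sequence costs the sum of the latencies plus three vector lengths, while the chained sequence costs the sum of the latencies plus a single vector length.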
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be written before starting dependent instruction
[Timing diagram: Load, Mul, and Add execute one after another.]
• With chaining, can start dependent instruction as soon as first result appears
[Timing diagram: Load, Mul, and Add overlap in time.]

Vector Startup
Two components of vector startup penalty
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline)
[Pipeline diagram: each element passes through stages R, X, X, X, W. The functional unit latency is the time through the pipeline for the first vector instruction; the dead time is the gap before the second vector instruction can start.]

Dead Time and Short Vectors
Cray C90, two lanes:
• 4 cycle dead time, 64 cycles active
• Maximum efficiency 94% with 128 element vectors
T0, eight lanes:
• No dead time
• 100% efficiency with 8 element vectors
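The efficiency figures above follow from simple arithmetic: cycles spent active divided by active cycles plus dead time. The sketch below reproduces them. The function name `efficiency` is a hypothetical helper, and the last case (8-element vectors on the two-lane, 4-cycle-dead-time machine) is an added illustration rather than a number from the slides.

```python
import math

# Sketch: fraction of cycles a vector unit spends doing useful work when each
# vector instruction is followed by a fixed dead time.

def efficiency(vector_length, lanes, dead_time):
    """Active cycles / (active cycles + dead cycles) for one vector instruction."""
    active = math.ceil(vector_length / lanes)   # cycles the lanes are busy
    return active / (active + dead_time)

c90_long = efficiency(128, lanes=2, dead_time=4)   # 64 / 68
t0_short = efficiency(8, lanes=8, dead_time=0)     # 1 / 1
c90_short = efficiency(8, lanes=2, dead_time=4)    # 4 / 8 (added illustration)

print(f"{c90_long:.1%} {t0_short:.1%} {c90_short:.1%}")   # 94.1% 100.0% 50.0%
```

The short-vector case is the point of the comparison: with 4 cycles of dead time, efficiency falls quickly as vectors shrink, whereas a design with no dead time stays fully efficient even at 8 elements.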