Digital Signal Processors: Applications and Architectures

Lecture 9: Digital Signal Processors: Applications and Architectures
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson

Processor Applications
(Figure axes: increasing volume, increasing cost)
• General purpose, high performance
  • Pentiums, Alphas, SPARC
  • Used for general purpose software
  • Heavyweight OS: UNIX, NT
  • Workstations, PCs
• Embedded processors and processor cores
  • ARM, 486SX, Hitachi SH7000, NEC V800
  • Single program
  • Lightweight, often real-time OS
  • DSP support
  • Cellular phones, consumer electronics (e.g. CD players)
• Microcontrollers
  • Extremely cost sensitive
  • Small word size: 8 bit common
  • Highest volume processors by far
  • Automobiles, toasters, thermostats, ...

Processor Markets
(Pie chart, total 30B)
• 32-bit micro: 5.2B (17%)
• 32-bit DSP: 1.2B (4%)
• DSP: 10B (33%)
• 16-bit micro: 5.7B (19%)
• 8-bit micro: 9.3B (31%)

The Processor Design Space
(Figure axes: cost, performance)
• Microprocessors: performance is everything, and software rules
• Embedded processors: application specific architectures for performance
• Microcontrollers: cost is everything

Market for DSP Products
(Chart series: mixed-signal/analog, DSP)
• DSP is the fastest growing segment of the semiconductor market

DSP Applications
• Audio applications: MPEG audio, portable audio
• Digital cameras
• Wireless: cellular telephones, base station
• Networking: cable modems, ADSL, VDSL

Another Look at DSP Applications
(Figure axis: increasing volume)
• High end: wireless base station, cable modem gateways
• Mid end: cellular phone, fax/voice server
• Low end: storage products, digital camera, portable phones, wireless headsets, consumer audio, automobiles, toasters, thermostats, ...
• Parts named on the slide, from high end to low end: TMS320C6000, TMS320C540, TMS320C27, TMS320C5000
(Figure axis for the application range: increasing cost)

Serving a range of applications
(Figure not reproduced)

World's Cellular Subscribers
(Chart: subscribers in millions, 0 to 700, by year from 1993 to 2001; series: digital, analog. Source: Ericsson Radio Systems, Inc.)
• Will provide a ubiquitous infrastructure for wireless data as well as voice

CELLULAR TELEPHONE SYSTEM
(Block diagram: keypad and display showing 415-555-1212; controller; physical layer processing; RF modem; baseband converter; speech encode; speech decode; A/D; DAC)

HW/SW/IC PARTITIONING
(Same block diagram, partitioned across a microcontroller, an ASIC, a DSP, and an analog IC)

Mapping onto a system on a chip
(Block diagram: phone book; keypad interface; S/P; control; protocol; DMA; RAM; µC; speech quality enhancement; voice recognition; DSP core; ASIC logic; RPE-LTP speech decoder; de-interleaver and decoder; demodulator and synchronizer; Viterbi equalizer)

Example Wireless Phone Organization
(Figure: C540 and ARM7)

Multimedia I/O Architecture
(Block diagram: embedded processor; radio modem; sched; ECC; Pact interface; low power bus; FB; FIFOs; video decomp; SRAM; pen; graphics; audio; video; data flow)

Multimedia System on a Chip
• E.g. multimedia terminal electronics
(Figure I/O: graphics out, uplink radio, video I/O, downlink radio, voice I/O, pen in; on-chip blocks: µP, video unit, custom, memory, DSP, coms)
• Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O

Requirements of the Embedded Processors
• Optimized for a single program: code often in on-chip ROM or off-chip EPROM
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
  • Lowest possible area
  • Technology behind the leading edge
  • High level of integration of peripherals (reduces system cost)
• Fast time to market
  • Compatible architectures (e.g.
ARM) allow reusable code
  • Customizable core
• Low power if application requires portability

Area of processor cores = Cost
(Figure labels: Nintendo processor, cellular phones)

Another figure of merit: computation per unit area
(Figure labels: Nintendo processor, cellular phones)

Code size
• If a majority of the chip is the program stored in ROM, then code size is a critical issue
• The Piranha has 3 instruction sizes: basic 2 byte, and 2 byte plus 16 or 32 bit immediate

BENCHMARKS: DSPstone
• ZIVOJNOVIC, VELARDE, SCHLAGER: UNIVERSITY OF AACHEN
• APPLICATION BENCHMARKS: ADPCM TRANSCODER (CCITT G.721), REALUPDATE, COMPLEXUPDATES, DOTPRODUCT, MATRIX1X3, CONVOLUTION, FIR, FIR2DIM, HRONEBIQUAD, LMS, FFTINPUTSCALED

Evolution of GP and DSP
• General purpose microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)
• DSP evolved from analog signal processors (ASP), which use analog hardware to transform physical signals (classical electrical engineering)
• ASP gave way to DSP because DSP is insensitive to environment (e.g., same response in snow or desert, if it works at all)
• DSP performance is identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
• Different history and different applications led to different terms, different metrics, some new inventions
• Convergence of markets will lead to architectural showdown

Embedded Systems vs. General Purpose Computing (1)
• Embedded system
  • Runs a few applications, often known at design time
  • Not end-user programmable
  • Operates in fixed run-time constraints; additional performance may not be useful/valuable
• General purpose computing
  • Intended to run a fully general set of applications
  • End-user programmable
  • Faster is always better

Embedded Systems vs.
General Purpose Computing (2)
• Embedded system, differentiating features:
  • power
  • cost
  • speed (must be predictable)
• General purpose computing, differentiating features:
  • speed (need not be fully predictable)
  • speed
  • did we mention speed?
  • cost (largest component power)

DSP vs. General Purpose MPU
• DSPs tend to run one program, not many programs.
  • Hence OSes are much simpler; there is no virtual memory or protection, ...
• DSPs sometimes run hard real-time apps
  • You must account for anything that could happen in a time slot
  • All possible interrupts or exceptions must be accounted for, and their collective time subtracted from the time interval
  • Therefore, exceptions are BAD
• DSPs have an infinite continuous data stream

DSP vs. General Purpose MPU (continued)
• The "MIPS/MFLOPS" of DSPs is the speed of Multiply-Accumulate (MAC).
  • DSPs are judged by whether they can keep the multipliers busy 100% of the time.
• The "SPEC" of DSPs is 4 algorithms:
  • Infinite Impulse Response (IIR) filters
  • Finite Impulse Response (FIR) filters
  • FFT, and
  • convolvers
• In DSPs, algorithms are king
  • Binary compatibility is not an issue
• Software is not (yet) king in DSPs.
  • People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
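The hard real-time rule above (subtract the collective worst-case time of every possible interrupt from the time slot) can be sketched as a small cycle-budget calculation. This is an illustrative sketch only: the function names, the 40 MHz clock, the 8 kHz sample rate, and the interrupt costs are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Instruction cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz


def usable_cycles(clock_hz, sample_rate_hz, worst_case_interrupt_cycles):
    """Cycles left for signal processing after every interrupt that could
    fire in one sample period is charged at its worst-case cost."""
    budget = cycles_per_sample(clock_hz, sample_rate_hz)
    return budget - sum(worst_case_interrupt_cycles)


def max_macs_per_sample(clock_hz, sample_rate_hz, worst_case_interrupt_cycles,
                        cycles_per_mac=1):
    """Upper bound on MACs per sample, assuming the multiplier is kept busy
    for every remaining cycle (the ideal the slides describe)."""
    left = usable_cycles(clock_hz, sample_rate_hz, worst_case_interrupt_cycles)
    return max(left, 0) // cycles_per_mac


if __name__ == "__main__":
    # Hypothetical system: 40 MHz DSP, 8 kHz speech, two interrupt sources.
    print(usable_cycles(40_000_000, 8_000, [120, 80]))  # 4800
```

A negative result from usable_cycles means the time slot cannot be met even before any filtering is done, which is why unpredictable exceptions are treated as unacceptable in this setting.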
TYPES OF DSP PROCESSORS
• DSP multiprocessors on a die
  • TMS320C80
  • TMS320C6000
• 32-BIT FLOATING POINT
  • TI TMS320C4X
  • MOTOROLA 96000
  • AT&T DSP32C
  • ANALOG DEVICES ADSP21000
• 16-BIT FIXED POINT
  • TI TMS320C2X
  • MOTOROLA 56000
  • AT&T DSP16
  • ANALOG DEVICES ADSP2100

Note of Caution on DSP Architectures
• Successful DSP architectures have two aspects:
  • Key architectural and microarchitectural features that enabled product success in key parameters: speed, code density, low power
  • Architectural and microarchitectural features that are artifacts of the era in which they were designed
• We will focus on the former

Architectural Features of DSPs
• Data path configured for DSP
  • Fixed-point arithmetic
  • MAC: multiply-accumulate
• Multiple memory banks and buses
  • Harvard architecture
  • Multiple data memories
• Specialized addressing modes
  • Bit-reversed addressing
  • Circular buffers
• Specialized instruction set and execution control
  • Zero-overhead loops
  • Support for MAC
• Specialized peripherals for DSP
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN

DSP Data Path: Arithmetic
• DSPs dealing with numbers representing the real world => want "reals"/fractions
• DSPs dealing with numbers for addresses => want integers
• Support "fixed point" as well as integers
  • Fraction: $-1 \le x < 1$ (sign bit S, then the radix point)
  • Integer: $-2^{N-1} \le x < 2^{N-1}$ (sign bit S, radix point at the far right)

DSP Data Path: Precision
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating-point DSPs cost 2X to 4X vs.
fixed point, and are slower than fixed point
• DSP programmers will scale values inside code
  • SW libraries
  • Separate explicit exponent
• "Blocked floating point": single exponent for a group of fractions
• Floating point support simplifies development

DSP Data Path: Overflow
• DSPs are descended from analog: what should happen to the output when you "peg" an input (e.g., turn up the volume control knob on a stereo)?
  • Modulo arithmetic
  • Set to most positive ($2^{N-1}-1$) or most negative value ($-2^{N-1}$): "saturation"
• Many algorithms were developed in this model

DSP Data Path: Multiplier
• Specialized hardware performs all key arithmetic operations in 1 cycle
• 50% of instructions can involve the multiplier => single cycle latency multiplier
• Need to perform multiply-accumulate (MAC)
• n-bit multiplier => 2n-bit product

DSP Data Path: Accumulator
• Don't want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: "guard bits"
  • Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round product before adder
(Figure: two data paths. Option 1: multiplier, ALU, accumulator with guard bits G. Option 2: multiplier, shift, ALU, accumulator.)

DSP Data Path: Rounding
• Even with guard bits, will need to round when storing the accumulator into memory
• 3 DSP standard options:
  • Truncation: chop results => biases results downward (two's-complement truncation always rounds toward negative infinity)
  • Round to nearest: < 1/2 round down, >= 1/2 round up (more positive) => smaller bias
  • Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
    • IEEE 754 calls this round to nearest even

Data Path
• General-purpose processor
  • Multiplies often take > 1 cycle
  • Shifts often take > 1 cycle
  • Other operations (e.g., saturation, rounding) typically take multiple cycles
• DSP processor
  • Specialized hardware performs all key arithmetic operations in 1 cycle
  • Hardware support for managing numeric fidelity:
    Shifters, guard bits, saturation

320C54x DSP Functional Block Diagram
(Figure not reproduced)

FIR Filtering: A Motivating Problem
• M most recent samples in the delay line (Xi)
• New sample moves data down the delay line
• "Tap" is a multiply-add
• Each tap (M+1 taps total) nominally requires:
  • Two data fetches
  • Multiply
  • Accumulate
  • Memory write-back to update delay line
• Goal: 1 FIR tap / DSP instruction cycle

BENCHMARKS: FIR FILTER (FINITE-IMPULSE RESPONSE FILTER)
(Figure: chain of $Z^{-1}$ delay elements with tap coefficients $C_1, C_2, \ldots, C_{N-1}, C_N$)

Microarchitectural impact: MAC
• Element of the finite-impulse response filter computation:
  $y(n) = \sum_{m=0}^{N-1} h(m)\, x(n-m)$
(Figure: operands X and Y feed MPY, then ADD/SUB, then ACC REG)

Mapping of the filter onto a DSP execution unit
(Figure not reproduced: numbered data-flow steps through delay elements D, coefficients α and β, and the summing node)
• The critical hardware unit in a DSP is the multiplier; much of the architecture is organized around allowing use of the multiplier on every cycle
• This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback
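The FIR computation above, y(n) = sum over m of h(m) x(n-m), can be sketched in software to show what each tap costs. This is a plain reference model, not DSP code: the function name fir_step and the list-based delay line are assumptions made for illustration, and a real DSP would fold the delay-line update and both operand fetches into the single MAC instruction.

```python
def fir_step(delay_line, coeffs, new_sample):
    """Process one input sample through an N-tap FIR filter.

    delay_line holds the N most recent samples, newest first, and is
    updated in place. Returns y(n) = sum_m h(m) * x(n - m).
    """
    # Move data down the delay line: newest sample in, oldest sample out.
    delay_line.insert(0, new_sample)
    delay_line.pop()

    acc = 0
    for h, x in zip(coeffs, delay_line):
        acc += h * x  # one tap: two data fetches, one multiply, one accumulate
    return acc


def fir_filter(coeffs, samples):
    """Run a whole sample sequence through the filter, starting from silence."""
    delay_line = [0] * len(coeffs)
    return [fir_step(delay_line, coeffs, s) for s in samples]


if __name__ == "__main__":
    # Feeding a unit impulse returns the coefficients themselves.
    print(fir_filter([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

The inner loop body is exactly the work the slides want completed in one instruction cycle, which is what motivates the multiple buses and address units discussed above.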
320C54x DSP functional block diagram (MAC example; figure not reproduced)

DSP Memory
• FIR tap implies multiple memory accesses
• DSPs want multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  • Instruction repeat buffer: do 1 instruction 256 times
  • Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
  • Even then, may allow the programmer to "lock in" instructions into the cache
  • Option to turn cache into fast program memory
• No DSPs have data caches
• May have multiple data memories

Conventional "Von Neumann" memory
(Figure not reproduced)

HARVARD ARCHITECTURE in DSP
(Figure: program memory, X memory, and Y memory, connected by global, P data, X data, and Y data buses)

Memory Architecture
• General-purpose processor
  • Von Neumann architecture
  • Typically 1 access/cycle
  • May use caches
  (Figure: processor connected to a single memory)
• DSP processor
  • Harvard architecture
  • 2-4 memory accesses/cycle
  • No caches: on-chip SRAM
  (Figure: processor connected to separate program memory and data memory)

E.g. TMS320C3x MEMORY BLOCK DIAGRAM: Harvard Architecture
(Figure not reproduced)
Example: 320C62x/67x DSP
(Figure not reproduced)

DSP Addressing
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep the MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in the inner loop => complex addressing is good => don't use the datapath to calculate fancy addresses
• Autoincrement/autodecrement register indirect
  • lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  • Option to do it before addressing, positive or negative

DSP Addressing: FFT
• FFTs start or end with data in weird butterfly order:
  0 (000) => 0 (000)
  1 (001) => 4 (100)
  2 (010) => 2 (010)
  3 (011) => 6 (110)
  4 (100) => 1 (001)
  5 (101) => 5 (101)
  6 (110) => 3 (011)
  7 (111) => 7 (111)
• What can be done to avoid the overhead of address checking instructions for the FFT?
• Have an optional "bit reverse" addressing mode for use with autoincrement addressing
• Many DSPs have "bit reverse" addressing for radix-2 FFT

BIT REVERSED ADDRESSING
(Figure: data flow in the radix-2 decimation-in-time FFT algorithm. Inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) sit at addresses 000, 100, 010, 110, 001, 101, 011, 111; stages are four 2-point DFTs, two 4-point DFTs, one 8-point DFT; outputs are F(0) through F(7).)

DSP Addressing: Buffers
• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, the buffer is often organized as a circular buffer
• What can be done to avoid the overhead of address checking instructions for a circular buffer?
• Option 1: keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: keep a buffer length register, assuming the buffer starts on an aligned address; reset to start when the end is reached
• Every DSP has "modulo" or "circular" addressing

CIRCULAR BUFFERS
• Instructions accommodate three elements:
  • buffer address
  • buffer size
  • increment
• Allows for cycling through:
  • delay elements
  • coefficients in data memory

Addressing: DSP processor vs. general-purpose processor
• DSP processor: dedicated address
generation units; specialized addressing modes, e.g. autoincrement, modulo (circular), bit-reversed (for FFT); good immediate data support
• General-purpose processor: often no separate address generation unit; general-purpose addressing modes

Address calculation unit for DSP
• Supports modulo and bit reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle

DSP Instructions and Execution
• May specify multiple operations in a single instruction
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
  • Loop an instruction or sequence
  • A value of 0 in the register usually means loop the maximum number of times
  • If the loop count is calculated, make sure a computed 0 is not taken to mean zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches

ADSP 2100: ZERO-OVERHEAD LOOP
• "DO <addr> UNTIL <condition>"
(Figure: loop body running from the DO instruction to address X)
• Address generation:
  PCS = PC + 1
  if (PC = X && !condition) PC = PCS
  else PC = PC + 1
• Eliminates a few instructions in loops
• Important in loops with small bodies

Instruction Set
• DSP processor
  • Specialized, complex instructions
  • Multiple operations per instruction:
    mac x0,y0,a x:(r0)+,x0 y:(r4)+,y0
• General-purpose processor
  • General-purpose instructions
  • Typically only one operation per instruction:
    mov r0,x0
    mov r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,r2
    inc r0
    inc r1

Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• On-chip peripherals are often designed for "background" operation, even when the core is powered down.
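The modulo (circular) and bit-reversed addressing modes listed in the addressing comparison above can be modeled in a few lines. This is a software sketch of what the address generation unit computes for free in hardware; the function names and the example base address of 100 are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, e.g. 001 -> 100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # peel off the lsb, push it in
        index >>= 1
    return result


def circular_next(addr, step, base, length):
    """Post-modify an address inside a circular buffer.

    The buffer occupies [base, base + length); stepping past either end
    wraps around, as a DSP's modulo addressing mode would do in hardware.
    """
    return base + (addr - base + step) % length


if __name__ == "__main__":
    # Reproduces the 8-point ordering shown on the FFT addressing slide.
    print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]

    # Walk a 4-entry buffer at a hypothetical base address of 100.
    addr, visited = 100, []
    for _ in range(6):
        visited.append(addr)
        addr = circular_next(addr, 1, 100, 4)
    print(visited)  # [100, 101, 102, 103, 100, 101]
```

On a general-purpose processor each of these functions costs several instructions per access, which is the inner-loop overhead that dedicated address units are meant to remove.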
Specialized peripherals
(Figure not reproduced)

TMS320C203/LC203 BLOCK DIAGRAM: DSP Core Approach (1995)
(Figure not reproduced)

Summary of Architectural Features of DSPs
• Data path configured for DSP
  • Fixed-point arithmetic
  • MAC: multiply-accumulate
• Multiple memory banks and buses
  • Harvard architecture
  • Multiple data memories
• Specialized addressing modes
  • Bit-reversed addressing
  • Circular buffers
• Specialized instruction set and execution control
  • Zero-overhead loops
  • Support for MAC
• Specialized peripherals for DSP
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN
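The fixed-point data path items in the summary (fractional arithmetic, MAC into a wide accumulator, saturation, and convergent rounding on store) can be tied together in one sketch. The 16-bit Q15 format and all names here are assumptions for illustration. Python's unbounded integers stand in for an accumulator with guard bits, so the sum itself never overflows and saturation is applied only when the result is stored back as a 16-bit word.

```python
Q15_ONE = 1 << 15          # scale factor: 1.0 is represented as 2**15
Q15_MAX = (1 << 15) - 1    # most positive 16-bit value, just under +1.0
Q15_MIN = -(1 << 15)       # most negative 16-bit value, exactly -1.0


def saturate16(value):
    """Clamp to the 16-bit range instead of wrapping (modulo) on overflow."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a real number to a Q15 fraction, saturating out-of-range input."""
    return saturate16(int(round(x * Q15_ONE)))


def round_convergent(acc, shift=15):
    """Drop `shift` low bits; exact ties go to the value with an even lsb."""
    kept = acc >> shift                  # arithmetic shift: rounds toward -inf
    rem = acc & ((1 << shift) - 1)       # discarded bits, always non-negative
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
    return kept


def mac_q15(xs, hs):
    """Multiply-accumulate two Q15 vectors and store the result as Q15."""
    acc = 0                              # wide accumulator (guard bits)
    for x, h in zip(xs, hs):
        acc += x * h                     # Q15 * Q15 gives a Q30 product
    return saturate16(round_convergent(acc, 15))


if __name__ == "__main__":
    half = to_q15(0.5)
    print(mac_q15([half], [half]))       # 8192, i.e. 0.25 in Q15
```

Keeping the accumulator wide and rounding once at the end matches the "Option 1" accumulator design in the slides; rounding after every product would instead model the "Option 2" shift-and-round path.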