High Speed Router Design

Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute

Overview
- Introduction
- Evolution of high-speed routers
- High-speed router components: lookup algorithm, classification, switching

What do switches/routers look like?
- Access router: e.g., ISDN, ADSL
- Core router: e.g., OC-48c POS
- Core ATM switch

Dimensions, Power Consumption

               Cisco GSR 12416   Juniper M160
  Capacity     160 Gb/s          80 Gb/s
  Power        4.2 kW            2.6 kW
  Rack width   19"               19"
  Height       6 ft              3 ft
  Depth        2 ft              2.5 ft

Where high-performance packet switches are used
- The Internet core: carrier-class core router, ATM switch, Frame Relay switch
- Edge router
- Enterprise WAN access and enterprise campus switch

Where are routers? Answer: at Points of Presence (POPs)
[Figure: an ISP backbone with routers A-F spread across POP1-POP8.]

Why the need for big/fast/large routers?
- Interfaces: price ~$200k, power ~400 W
- Space, power, and interface-cost economics
- About 50-60% of interfaces are used for interconnection within the POP
- Industry trend is towards a large, single router per POP

Job of the router architect
- For a given set of features: maximize capacity C, subject to power P < 5 kW and volume V < 2 m^3

Performance metrics
1. Capacity: "maximize C, s.t. volume < 2 m^3 and power < 5 kW"
2. Throughput: maximize usage of expensive long-haul links; trivial with work-conserving output-queued routers
3. Controllable delay: some users would like predictable delay; feasible with output queueing plus weighted fair queueing (WFQ)

The problem
- Output-queued switches are impractical: with N input lines at rate R, "can't I just use N separate memory devices per output?"
[Figure: N inputs at rate R feeding one output's DRAM; the memory must absorb up to N·R of arriving traffic.]

Memory bandwidth: commercial DRAM
- Memory speed is not keeping up with Moore's law
[Figure: DRAM access time (ns) vs. year, 1980-2001: DRAM improves 1.1x / 18 months; Moore's law is 2x / 18 months; router capacity grows 2.2x / 18 months; line capacity grows 2x / 7 months.]

Packet processing is getting harder
[Figure: CPU instructions available per minimum-length packet since 1996, falling steadily on a log scale, 1996-2001.]

Basic Ideas

Forwarding functions: ATM switch
- Lookup the cell VCI/VPI in the VC table
- Replace the old VCI/VPI with the new one
- Forward the cell to the outgoing interface
- Transmit the cell onto the link

Functions: Ethernet (L2) switch
- Lookup the frame destination address (DA) in the forwarding table: if known, forward to the correct port; if unknown, broadcast to all ports
- Learn the source address (SA) of the incoming frame
- Forward the frame to the outgoing interface
- Transmit the frame onto the link

Functions: IP router
- Lookup the packet DA in the forwarding table: if known, forward to the correct port; if unknown, drop the packet
- Decrement the TTL, update the header checksum
- Forward the packet to the outgoing interface
- Transmit the packet onto the link
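The TTL/checksum step above does not require recomputing the checksum over the whole header: the incremental-update rule of RFC 1624 suffices. Below is a minimal C sketch (not from the slides); the offsets are the standard IPv4 header positions, and the function is illustrative rather than production code.

```c
#include <stdint.h>

/* Decrement the IPv4 TTL and incrementally update the header
 * checksum (RFC 1624).  Assumes a packed IPv4 header: TTL is
 * byte 8, the checksum is bytes 10-11, network byte order. */
void ip_decrement_ttl(uint8_t *hdr)
{
    uint16_t old = ((uint16_t)hdr[8] << 8) | hdr[9];  /* TTL/protocol word */
    hdr[8]--;                                         /* decrement TTL     */
    uint16_t new = ((uint16_t)hdr[8] << 8) | hdr[9];

    uint16_t csum = ((uint16_t)hdr[10] << 8) | hdr[11];
    /* HC' = ~( ~HC + ~m + m' )  -- RFC 1624, eqn. 3 */
    uint32_t sum = (uint16_t)~csum + (uint16_t)~old + new;
    sum = (sum & 0xffff) + (sum >> 16);               /* fold carries      */
    sum = (sum & 0xffff) + (sum >> 16);
    csum = ~(uint16_t)sum;

    hdr[10] = csum >> 8;
    hdr[11] = csum & 0xff;
}
```

The point of the incremental form is that the fast path touches only one header word plus the checksum, instead of summing all twenty header bytes per packet.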
Basic architectural components
- Control plane: routing, admission control, reservation, congestion control
- Datapath (per-packet processing): policing, switching, output scheduling

Basic architectural components (datapath)
1. Forwarding decision (per-port forwarding table)
2. Interconnect
3. Output scheduling

Generic router architecture
- Header processing: look up the IP destination address in a table of ~1M prefixes (off-chip DRAM) to get the next hop; update the header
- Queueing: a buffer manager assigns addresses in a packet buffer of ~1M packets (off-chip DRAM)

Generic router architecture (parallel form)
- Each line card has its own header-processing path, lookup table, buffer manager, and buffer memory

Simplest design: software router using PCs
- Idea: add special-purpose software to general-purpose hardware: cheap, but slow
- Measure of speed: aggregate data rate or aggregate packet rate
- Limits the number and type of interfaces, topologies, etc.
- E.g., a 400 Mb/s aggregate rate will allow four 100 Mb/s Ethernet interfaces, but no GbE
- E.g., MIT's Click router

Aggregate packet vs. bit rates
[Figure: aggregate packet rate vs. aggregate bit rate, for 64-byte and 1518-byte packets.]

Per-packet processing time budget
- MIT's Click router claims 435 Kpps with 64-byte packets; see http://www.pdos.lcs.mit.edu/click/
- That is enough for 100 Mb/s Ethernet (~149 Kpps of minimum-size frames) but not for GbE (~1.49 Mpps); a worked budget appears at the end of this subsection

Solution: decentralization/parallelism
- Fine-grained parallelism: instruction-level
- Symmetric coarse-grain parallelism: multiprocessors
- Asymmetric coarse-grain parallelism: multiprocessors
- Coprocessors (ASICs): operate under control of the CPU; move expensive operations to hardware
- NICs with onboard processing: attack the I/O bottleneck; move processing to the NIC (ASIC or embedded RISC); each NIC handles only one interface rather than the aggregate rate
- Smart NICs with onboard stacks
- Cell switching: design protocols to suit hardware speeds
- Data pipelines...

Optimizations (contd)
[Figure.]

Demultiplexing vs. classification
- Demultiplexing in a layered model provides the freedom to use arbitrary protocols without transmission overhead, but imposes sequential processing limitations
- Packet classification combines demuxing, a sequence of operations at multiple layers, into a single operation at one layer
- Overall goal: flow segregation...

Classification example
[Figure.]

Hardware optimization of classification
[Figure.]

Hybrid hardware/software classifier
[Figure.]

Conceptual bindings
- Connectionless network: [figure]
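Here is the time-budget arithmetic referred to above, as a small self-contained C program. The 84-byte wire footprint (64-byte frame + 8-byte preamble + 12-byte inter-frame gap) is standard Ethernet, not a number from the slides.

```c
#include <stdio.h>

/* Packets per second and per-packet budget for minimum-size
 * (64-byte) Ethernet frames at several line rates.  On the wire
 * each frame also occupies 8 bytes of preamble and a 12-byte
 * inter-frame gap, hence 84 bytes per frame. */
int main(void)
{
    const double rates_bps[] = {100e6, 1e9, 10e9};  /* Fast/Gig/10GigE */
    const double wire_bytes  = 64 + 8 + 12;         /* 84 B per frame  */
    for (int i = 0; i < 3; i++) {
        double pps = rates_bps[i] / (wire_bytes * 8);
        printf("%6.0f Mb/s: %8.0f kpps, %7.1f ns/packet\n",
               rates_bps[i] / 1e6, pps / 1e3, 1e9 / pps);
    }
    return 0;
}
```

This prints roughly 149 kpps / 6720 ns at 100 Mb/s and 1488 kpps / 672 ns at GbE, which is why Click's 435 Kpps clears the first bar but not the second.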
Second-generation network systems
[Figure.]

Switch fabric concept
- A data path (a.k.a. backplane) that provides parallelism
- Connects the NICs, which have onboard processing

Desired switch fabric properties
[Figure.]

Space-division fabric
- The asynchronous design arose from the multiprocessor context
- Data can be sent across the fabric at arbitrary times

Blocking and port contention
- Even if "internally" non-blocking (i.e., fully interconnected), port contention can occur. Why?
- Need blocking circuits at input and output ports

Crossbar: "switched" interconnections
- Use "switches" between each input and output instead of separate paths: active = data flows from input I to output O
- Total number of "paths" required = N+M
- Number of switching points = NxM

Crossbar: "switched" interconnections (contd)
- A centralized switch controller handles port contention
- Allows transfers in parallel (up to min(N,M) paths)
- Note: port hardware can operate much more slowly
- Issues: switches, switch controller
- Port contention still exists...

Queuing: input, output buffers
[Figure.]

Time-division switching fabrics
- A.k.a. bus (i.e., a single shared link)
- Low cost and low speed (used in computers)
- Need an arbitration mechanism: e.g., fixed timeslots or data blocks, fixed cells, variable packets

Time-division switching: telephony
- Key idea: when demultiplexing, position in the frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI)

Time-division: shared-memory fabrics
- Memory interface hardware is expensive, so many "ports" share fewer memory interfaces
- E.g., dual-ported memory
- Separate low-speed bus lines for the controller

Multi-stage fabrics
- A compromise between pure time-division and pure space-division; attempts to combine the advantages of each: lower cost from time-division, higher performance from space-division
- Technique: limited sharing; e.g., the Banyan switch
- Features: scalable; self-routing, i.e., no central controller; packet queues allowed, but not required
- Note: multistage switches share the "crosspoints," which have now become "expensive" resources...

Banyan switch fabric (contd)
- Basic building block = a 2x2 switch, with outputs labelled 0/1
- Can be synchronous or asynchronous (asynch = packets can arrive at arbitrary times)
- A synchronous banyan offers twice the effective throughput
- Worst case: all inputs receive packets with the same label

Banyan fabric
[Figure.] More on switching later...
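To make the self-routing idea concrete, here is a toy C simulation. It uses an Omega (shuffle-exchange) network rather than the banyan drawn on the slides; the two are in the same delta-network family and route the same way, one destination bit per stage. Everything here is an illustrative sketch, not any product's fabric.

```c
#include <stdio.h>
#include <string.h>

/* Self-routing through an 8x8 Omega (shuffle-exchange) fabric:
 * log2(N) stages of 2x2 elements, each preceded by a perfect
 * shuffle.  The element at stage s inspects bit s (MSB first) of
 * the packet's destination: 0 -> upper output, 1 -> lower output. */
enum { N = 8, BITS = 3 };

static int shuffle(int p)                /* rotate 3 address bits left */
{
    return ((p << 1) | (p >> (BITS - 1))) & (N - 1);
}

/* dest[i] = output port wanted by the packet entering input i, or
 * -1 for an idle input.  Returns how many packets were dropped
 * because two packets contended for the same element output. */
static int route(const int dest[N])
{
    int pos[N], dropped = 0;
    for (int i = 0; i < N; i++) pos[i] = dest[i] < 0 ? -1 : i;

    for (int s = 0; s < BITS; s++) {
        int taken[N];
        memset(taken, 0, sizeof taken);
        for (int i = 0; i < N; i++) {
            if (pos[i] < 0) continue;
            int bit  = (dest[i] >> (BITS - 1 - s)) & 1;
            int next = (shuffle(pos[i]) & ~1) | bit;
            if (taken[next]) { pos[i] = -1; dropped++; }  /* blocked */
            else             { taken[next] = 1; pos[i] = next; }
        }
    }
    return dropped;
}

int main(void)
{
    int identity[N] = {0, 1, 2, 3, 4, 5, 6, 7};  /* distinct, in order   */
    int hotspot[N]  = {5, 5, 5, 5, 5, 5, 5, 5};  /* all inputs -> port 5 */
    printf("identity traffic: %d dropped\n", route(identity));
    printf("hotspot traffic:  %d dropped\n", route(hotspot));
    return 0;
}
```

The hotspot case reproduces the slide's worst case: when every input carries the same label, packets are halved at each stage and only one of the eight survives.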
Forwarding a.k.a. Port Mapping

Basic architectural components: forwarding decision
[The three-part datapath diagram again, highlighting step 1: the per-port forwarding table and forwarding decision.]

ATM and MPLS switches: direct lookup
- The (port, VCI) pair directly indexes the VCI memory

Bridges and Ethernet switches: associative lookups
- Associative memory (CAM): search on the 48-bit address; the memory returns the associated data and a hit signal
- Advantages: simple
- Disadvantages: slow, high power, small, expensive

Bridges and Ethernet switches: hashing
- Hash the 48-bit address down to (say) 16 bits and use that to index a conventional memory holding the associated data

Lookups using hashing: an example
- Hashing function: e.g., CRC-16 over the 48-bit address
- Each memory location points to a linked list of the entries that hashed to it

Lookups using hashing: performance of the simple example

  ER = 1 + rho / (2 * (1 - (1 - 1/N)^M))

  where
    ER  = expected number of memory references
    M   = number of memory addresses in the table
    N   = number of linked lists
    rho = M/N

Lookups using hashing
- Advantages: simple; the expected lookup time can be small
- Disadvantages: non-deterministic lookup time; inefficient use of memory

Per-packet processing in an IP router
1. Accept a packet arriving on an incoming link.
2. Look up the packet destination address in the forwarding table to identify the outgoing port(s).
3. Manipulate the packet header: e.g., decrement the TTL, update the header checksum.
4. Send (switch) the packet to the outgoing port(s).
5. Classify and buffer the packet in the queue.
6. Transmit the packet onto the outgoing link.

Caching addresses
[Figure: slow path through the CPU and central buffer memory; fast path via DMA between line cards, each with local buffer memory and a MAC.]

Caching addresses (contd)
- LAN: the average flow is ~40 packets
- WAN: a huge number of flows
[Figure: cache hit rate when the cache is 10% of the full table.]

IP router lookup
- The forwarding engine takes the destination address from the header, runs a next-hop computation against the forwarding table, and returns the next hop
- IPv4 unicast: destination-address-based lookup

Lookup and forwarding engine
- The routing lookup data structure maps the destination address in the packet header to an outgoing port, e.g.:

  Dest network     Port
  65.0.0.0/8       3
  128.9.0.0/16     1
  149.12.0.0/19    7

Example forwarding table

  Destination IP prefix   Outgoing port
  65.0.0.0/8              3
  128.9.0.0/16            1
  142.12.0.0/19           7

- An IP prefix is 0-32 bits long; e.g., 128.9.16.14 falls inside 128.9.0.0/16

Prefixes can overlap
- E.g., 128.9.16.0/21, 128.9.172.0/21, and 128.9.176.0/24 all nest inside 128.9.0.0/16, so 128.9.16.14 matches more than one prefix
- Routing lookup: find the longest matching prefix (a.k.a. the most specific route) among all prefixes that match the destination address
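The textbook data structure for this longest-prefix match is the binary trie discussed shortly; here is a minimal C sketch of it (one bit per level, unoptimized, with error handling omitted for brevity).

```c
#include <stdlib.h>
#include <stdint.h>

/* Longest-prefix match with a 1-bit-at-a-time binary trie.
 * next_hop == -1 marks "no route stored at this node". */
struct trie {
    struct trie *child[2];
    int next_hop;
};

static struct trie *node_new(void)
{
    struct trie *t = calloc(1, sizeof *t);   /* sketch: no NULL check */
    t->next_hop = -1;
    return t;
}

/* Insert prefix/len -> next_hop, e.g. 128.9.0.0/16 -> port 1. */
void trie_insert(struct trie *root, uint32_t prefix, int len, int hop)
{
    struct trie *t = root;
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!t->child[bit]) t->child[bit] = node_new();
        t = t->child[bit];
    }
    t->next_hop = hop;
}

/* Walk the trie along the destination address, remembering the last
 * next_hop seen: that is the longest (most specific) match. */
int trie_lookup(const struct trie *root, uint32_t dst)
{
    int best = -1;
    const struct trie *t = root;
    for (int i = 0; i < 32 && t; i++) {
        if (t->next_hop >= 0) best = t->next_hop;
        t = t->child[(dst >> (31 - i)) & 1];
    }
    if (t && t->next_hop >= 0) best = t->next_hop;
    return best;
}
```

With 128.9.0.0/16 and 128.9.16.0/21 both inserted, a lookup of 128.9.16.14 returns the /21 entry, illustrating the overlap rule above.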
Difficulty of longest prefix match
- It is a 2-dimensional search: over prefix length and prefix value
[Figure: prefixes such as 65.0.0.0/8, 128.9.0.0/16, 142.12.0.0/19, 128.9.16.0/21, 128.9.172.0/21, and 128.9.176.0/24 plotted by value and length; 128.9.16.14 must be checked at every length.]

IP routers: metrics for lookups
- Lookup time
- Storage space
- Update time
- Preprocessing time

  Example table:   Prefix         Port
                   65/8           3
                   128.9/16       5
                   128.9.16/20    2
                   128.9.19/24    7
                   128.9.25/24    10
                   128.9.176/20   1
                   142.12/19      3

  (the longest match for 128.9.16.14 is 128.9.16/20, port 2)

Lookup rates required

  Year      Line      Line rate (Gb/s)   40B packets (Mpps)
  1998-99   OC-12c    0.622              1.94
  1999-00   OC-48c    2.5                7.81
  2000-01   OC-192c   10.0               31.25
  2002-03   OC-768c   40.0               125

- 31.25 Mpps is one lookup every ~33 ns; compare DRAM at 50-80 ns, SRAM at 5-10 ns

Update rates required
- Recent BGP studies show that updates can be:
  - Bursty: several hundred routes updated/withdrawn at once, i.e., bursts of insert/delete operations
  - Frequent: an average of 100+ updates per second
- Need the data structure to be efficient for lookups as well as updates (insert/delete)

Size of the forwarding table
[Figure: number of prefixes, 1995-2000, showing renewed exponential growth of roughly 10,000 prefixes/year.]
- Renewed growth is due to multihoming of enterprise networks
- Source: http://www.telstra.net/ops/bgptable.html

Potential hyper-exponential growth
[Figure: global routing table vs. Moore's law since 1999 (01/99-04/01): global prefixes outgrow the Moore's-law curve and track a doubling curve.]

Trees and tries
- Binary search tree: N entries, depth log2(N)
- Binary search trie: branch on one address bit per level (e.g., entries 010 and 111)

Trees and tries: multiway tries
- 16-ary search trie: each node holds 16 (entry, pointer) pairs, e.g., spanning 0000 1111 0000 ... 1111 1111 1111

Lookup: multiway-trie tradeoffs

  Degree of tree   Mem references   Nodes (x10^6)   Total memory (MB)   Fraction wasted (%)
  2                48               1.09            4.3                 49
  4                24               0.53            4.3                 73
  8                16               0.35            5.6                 86
  16               12               0.25            8.3                 93
  64               8                0.17            21                  98
  256              6                0.12            64                  99.5

  Table produced from 2^15 randomly generated 48-bit addresses.

Routing lookups in hardware
[Figure: distribution of prefix lengths.] Most prefixes are 24 bits or shorter.

Routing lookups in hardware: prefixes up to 24 bits
- 2^24 = 16M entries: index a table directly with the top 24 bits of the address
- E.g., for 142.19.6.14, the top 24 bits (142.19.6) select the entry holding the next hop

Routing lookups in hardware: prefixes above 24 bits
- For these, the first-level entry holds a pointer (flag bit set) into a second table
- The remaining 8 bits of the address are an offset from the pointer's base: e.g., for 128.3.72.44, the entry for 128.3.72 points at a base, and entry [base + 44] holds the next hop
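This two-level organization is essentially the DIR-24-8 scheme of Gupta, Lin, and McKeown. A compact C sketch follows; the table sizes (16M first-level entries, 256 overflow blocks of 256 entries) are illustrative, and the contents are assumed to have been populated by a control-plane routine not shown here.

```c
#include <stdint.h>

/* Two-level "24/8" lookup: one entry per 24-bit prefix of the
 * address space (16M x 16 bits = 32 MB), plus small second-level
 * blocks for the rare prefixes longer than /24. */
#define FLAG 0x8000u                      /* top bit: entry is a pointer */

static uint16_t tbl24[1 << 24];           /* next hop, or FLAG|block     */
static uint8_t  tbl_long[256 * 256];      /* 256 blocks of 256 next hops */

/* Next hop for IPv4 address a (host byte order): one memory
 * reference for addresses covered by /24s, two in the worst case. */
uint8_t lookup(uint32_t a)
{
    uint16_t e = tbl24[a >> 8];           /* index by top 24 bits  */
    if (!(e & FLAG))
        return (uint8_t)e;                /* short prefix: done    */
    uint32_t base = (uint32_t)(e & ~FLAG) << 8;
    return tbl_long[base + (a & 0xff)];   /* base + 8-bit offset   */
}
```

The design trades memory (32 MB of cheap DRAM) for a deterministic one-or-two-reference lookup, which is what makes the Mpps targets in the rate table above reachable in hardware.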
Switching a.k.a. Interconnect

Basic architectural components: interconnect
[The three-part datapath diagram again, highlighting step 2: the interconnect.]

First-generation IP routers
- A shared backplane: CPU and buffer memory, with DMA to the line interfaces and MACs
- Most Ethernet switches and cheap packet routers
- The bottleneck can be the CPU, the host adaptor, or the I/O bus
- What is costly? Bus? Memory? Interface? CPU?

Second-generation IP routers
- Port-mapping intelligence moves into the line cards, each with local buffer memory
- Higher hit rate in the local lookup cache
- What is costly? Bus? Memory? Interface? CPU?

Third-generation switches/routers
- A switched backplane: provides parallel paths (a fabric) between line cards
- What's costly? Memory, CPU

Fourth-generation switches/routers: clustering and multistage
[Figure: a 32-port system built by clustering line cards through a multistage core.]

Switching goals (telephony & data)

Circuit switch
- A switch that can handle N calls has N logical inputs and N logical outputs; N can be up to ~200,000
- Moves 8-bit samples from an input to an output port
- Recall that samples have no headers: the destination of a sample depends on the time at which it arrives at the switch
- In practice, input trunks are multiplexed; multiplexed trunks carry frames = sets of samples
- Goal: extract samples from the frame and, depending on position in the frame, switch each to its output
- Each incoming sample has to get to the right output line and the right slot in the output frame

Call blocking
- Blocking: can't find a path from input to output
- Internal blocking: a slot in the output frame exists, but there is no path to it
- Output blocking: no slot in the output frame is available
- Output blocking is reduced in transit switches: a sample only needs one of several slots going to the desired next hop

Multiplexors and demultiplexors
- Most trunks time-division multiplex voice samples
- At a central office, the trunk is demultiplexed and distributed to the active circuits; addressing is not required...
- Synchronous multiplexor: N input lines; the output runs N times as fast as each input

Switching: what does a switch do?
- Transfers data from an input to an output: many ports (density), high speeds
- E.g., a crossbar

Circuit switch
[Figure.]

Issue: call blocking
[Figure.]

Time-division switching
- Key idea: when demultiplexing, position in the frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI)

Scaling issues with TSI
[Figure.]
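A TSI is just a frame buffer written in arrival order and read in connection order. A minimal C sketch, assuming a hypothetical 24-slot frame (T1-like); the map would come from call setup:

```c
#include <stdint.h>
#include <string.h>

/* Time-slot interchange: sample k of the output frame is taken
 * from input slot map[k]. */
#define N_SLOTS 24

void tsi(const uint8_t in_frame[N_SLOTS],
         uint8_t out_frame[N_SLOTS],
         const int map[N_SLOTS])
{
    uint8_t buf[N_SLOTS];
    memcpy(buf, in_frame, N_SLOTS);          /* sequential writes */
    for (int k = 0; k < N_SLOTS; k++)
        out_frame[k] = buf[map[k]];          /* permuted reads    */
}
```

This also exposes the scaling issue on the slide above: every sample is written once and read once per frame time, so the TSI memory must run at twice the aggregate sample rate, and memory speed bounds the switch size.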
Space-division switching
- Each sample takes a different path through the switch, depending on its destination

Crossbar
- The simplest possible space-division switch
- Crosspoints can be turned on or off, long enough to transfer a packet from an input to an output
- Internally non-blocking, but needs N^2 crosspoints
- The time taken to set the crosspoints grows quadratically

Multistage crossbar
- In a crossbar, during each switching time only one crosspoint per row or column is active
- Can save crosspoints if a crosspoint can attach to more than one input line (why?)
- This is done in a multistage crossbar
- Need to rearrange connections every switching time

Multistage crossbar (contd)
- Can suffer internal blocking unless there is a sufficient number of second-level stages
- Number of crosspoints < N^2
- Finding a path from input to output requires a depth-first search
- Scales better than a crossbar, but still not too well: a 120,000-call switch needs ~250 million crosspoints

Time-space switching
[Figure.]

Time-space-time (TST) switching
- Telephone switches like the 5ESS use multiple space stages: e.g., TSSST etc.

Packet switches
- In a circuit switch, the path of a sample is determined at connection-establishment time; no sample header is needed, because position in the frame is used
- In a packet switch, packets carry a destination field or label, so the destination port must be looked up on the fly
- Datagram switches: lookup based on the entire destination address (longest-prefix match)
- Cell or label switches: lookup based on VCIs or labels

Blocking in packet switches
- Can have both internal and output blocking: internal = no path to the output; output = the output trunk is unavailable
- Unlike a circuit switch, we cannot predict whether packets will block (why?)
- If a packet is blocked, it must either be buffered or dropped

Dealing with blocking in packet switches
- Overprovisioning: internal links run much faster than the inputs
- Buffers: at the input or the output
- Backpressure: if the switch fabric doesn't have buffers, prevent a packet from entering until a path is available
- Parallel switch fabrics: increase the effective switching capacity

Switch fabrics: buffered crossbar
- What happens if packets at two inputs both want to go to the same output?
- Can defer one at an input buffer, or buffer at the crosspoints: a complex arbiter

Switch fabric element
- Goal: towards building "self-routing" fabrics
- Complicated fabrics can be built from a simple element
- Routing rule: if the routing bit is 0, send the packet to the upper output, else to the lower output; if both packets want the same output, buffer or drop one

Banyan
- The simplest self-routing recursive fabric
- What if two packets both want to go to the same output? Output blocking

Features of multistage switches
- Issue: output blocking — two packets want to go to the same output port...

Blocking in the banyan fabric
[Figure.]
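Internal blocking is distinct from output blocking: even a full permutation (all outputs distinct) can collide inside the fabric. Reusing route() from the omega-network sketch earlier (and replacing its main()), the snippet below contrasts a bit-reversal permutation, which blocks internally, with the same destination set presented in sorted order, which routes cleanly. That observation is exactly what the sorting approach on the next slides exploits.

```c
/* Append to the earlier omega-network sketch, replacing its main().
 * Both patterns use each output exactly once; only the arrival
 * order differs. */
int main(void)
{
    int bit_reversed[N] = {0, 4, 2, 6, 1, 5, 3, 7};  /* blocks internally */
    int sorted[N]       = {0, 1, 2, 3, 4, 5, 6, 7};  /* same set, sorted  */
    printf("bit-reversed order: %d blocked\n", route(bit_reversed));
    printf("sorted order:       %d blocked\n", route(sorted));
    return 0;
}
```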
Blocking in banyan switches: sorting
- Can avoid blocking by choosing the order in which packets appear at the input ports
- If we can present packets at the inputs sorted by output, remove duplicates, remove gaps, and precede the banyan with a perfect-shuffle stage, then there is no internal blocking
- For example, with inputs X, 010, 010, X, 011, X, X, X:
  - Sort: 010, 010, 011, X, X, X, X, X
  - Remove duplicates: 010, 011, X, X, X, X, X, X
  - Shuffle: 010, X, 011, X, X, X, X, X
- Need sort, shuffle, and trap networks

Sorting using merging
- Build sorters from merge networks
- Assume we can merge two sorted lists: sort pairwise, merge, recurse

Putting it together: Batcher-banyan
[Figure.]

Non-blocking Batcher-banyan
[Figure: an 8-input Batcher sorter ordering the requests 3, 7, 5, 2, 6, 0, 1, 4 stage by stage before they enter the self-routing banyan, which delivers them to outputs 000-111.]
- The fabric can be used as a scheduler
- The Batcher-banyan network is blocking for multicast

Queuing, Buffer Management, Classification

Basic architectural components: queuing, classification
[The three-part datapath diagram again, highlighting queuing and classification.]

Queuing: two basic techniques
- Input queueing: usually paired with a non-blocking switch fabric (e.g., a crossbar)
- Output queueing: usually paired with a fast bus

Queuing: output queueing
- Individual output queues: memory bandwidth = (N+1)·R per output
- Centralized shared memory: memory bandwidth = 2N·R

Input queuing
[Figure.]

Input queueing: head-of-line blocking
[Figure: delay vs. load.] With FIFO input queues and uniform random traffic, throughput saturates at 58.6% of the line rate.

Solution: input queueing with virtual output queues (VOQ)
[Figure: each input keeps a separate queue per output.]

Head-of-line (HOL) blocking in input queuing
[Figure.]

Input queues: virtual output queues
[Figure: delay vs. load.] With VOQs, the load can approach 100%.

Output queuing
[Figure.]

Packet classification
- The forwarding engine consults a classifier (policy database) of (predicate, action) rules and applies the action of the matching rule to each incoming packet

Multi-field packet classification

  Rule     Field 1             Field 2            ...   Field k   Action
  Rule 1   152.163.190.69/21   152.163.80.11/32   ...   UDP       A1
  Rule 2   152.168.3.0/24      152.163.0.0/16     ...   TCP       A2
  ...      ...                 ...                ...   ...       ...
  Rule N   152.168.0.0/16      152.0.0.0/8        ...   ANY       An

- Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet

Prefix matching: a 1-d range problem
- Prefixes such as 128.9/16, 128.9.16/20, 128.9.19/24, 128.9.25/24, and 128.9.176/20 are nested ranges on the number line from 0 to 2^32 - 1; 128.9.16.14 falls inside several of them
- The most specific route = the "longest matching prefix"

Classification: a 2-D geometry problem
- Each rule is a rectangle in (Field 1, Field 2) space, e.g., (144.24/16, 64/24); a packet header, e.g., (128.16.46.23, ...), is a point, and classification finds the highest-priority rectangle containing that point
[Figure: overlapping rectangles R1-R7 in the (Field 1, Field 2) plane.]
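The problem statement above has a trivial (if slow) solution worth writing down as a baseline: scan the rules in priority order and return the first match. A C sketch, with a two-field rule format invented for illustration; real classifiers use tries, TCAMs, or the geometric schemes this section alludes to.

```c
#include <stdint.h>
#include <stddef.h>

/* Linear-search multi-field classifier.  Each rule holds a prefix
 * (value/mask) per field plus a protocol that may be wildcarded.
 * Rules are stored highest-priority first, so first match wins. */
struct rule {
    uint32_t f1, f1_mask;     /* e.g. 152.168.3.0/24  */
    uint32_t f2, f2_mask;     /* e.g. 152.163.0.0/16  */
    uint8_t  proto;           /* 0 stands in for ANY  */
    int      action;
};

int classify(const struct rule *r, size_t n,
             uint32_t f1, uint32_t f2, uint8_t proto)
{
    for (size_t i = 0; i < n; i++)
        if ((f1 & r[i].f1_mask) == r[i].f1 &&
            (f2 & r[i].f2_mask) == r[i].f2 &&
            (r[i].proto == 0 || r[i].proto == proto))
            return r[i].action;           /* highest-priority match */
    return -1;                            /* no rule matched        */
}
```

At N rules this costs O(N) memory references per packet, which is exactly why the hardware and geometric approaches that follow exist.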
Network Processors: Building Blocks for Programmable Networks
(Slides from Raj Yavatkar, raj.yavatkar@intel.com)

Intel IXP network processors
- Microengines: RISC processors optimized for packet processing, with hardware support for multithreading; they implement the fast path
- Embedded StrongARM/XScale: runs the embedded OS and handles exception tasks; the slow path and control plane
[Figure: control processor, media/fabric interface, microengines ME 1..n, SRAM, DRAM.]

Various forms of processors
- Embedded processor (run-to-completion)
- Parallel architecture
- Pipelined architecture

Software architectures
[Figure.]

Division of functions
[Figure.]

Packet flow through the hierarchy
[Figure.]

Scaling network processors
[Figure.]

Memory scaling
[Figure.]

Memory scaling (contd)
[Figure.]

Memory types
[Figure.]

Memory caching and CAM
- Cache vs. content-addressable memory (CAM)
[Figure.]

CAM and ternary CAM
- CAM operation; ternary CAM (TCAM)
[Figure.]

Ternary CAMs
- An associative memory of (value, mask) pairs, with a priority encoder selecting among the matching entries:

  Value      Mask              Next hop
  10.0.0.0   255.0.0.0         R1
  10.1.0.0   255.255.0.0       R2
  10.1.1.0   255.255.255.0     R3
  10.1.3.0   255.255.255.0     R4
  10.1.3.1   255.255.255.255   R4

- TCAMs can be used for classification as well
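In software, a TCAM lookup reduces to a masked compare per entry; a hedged C model of the table above follows. In a real TCAM all entries are compared in parallel in one cycle, and the priority encoder picks the winning match; here a sequential loop with "first match wins" stands in for both, which assumes entries are stored in priority order (e.g., longest prefix first for route lookup).

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of a TCAM: entry i matches key when
 * (key & mask) == value.  Entries are assumed stored in priority
 * order, so the first hit plays the role of the priority encoder. */
struct tcam_entry { uint32_t value, mask; int next_hop; };

int tcam_lookup(const struct tcam_entry *t, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++)        /* parallel in real hardware */
        if ((key & t[i].mask) == t[i].value)
            return t[i].next_hop;
    return -1;                            /* no entry matched          */
}
```

Note the ordering requirement: for the slide's table to implement longest-prefix match, the 10.1.3.1/32 entry must be searched before 10.0.0.0/8.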
IXP: a building block for network systems
- Example: the IXP2800
  - 16 microengines + an XScale core
  - Up to 1.4 GHz microengine speed
  - 8 hardware threads per microengine; 4K control store per microengine
  - Multi-level memory hierarchy
  - Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
  - Reduced core complexity; no hardware caching
  - Simpler instructions, hence shallow pipelines
  - Multiple cores with hardware multithreading per chip
[Figure: multithreaded (x8) microengine array MEv2 1-16, RDRAM controller, media/switch fabric interface, Intel XScale core, PCI, scratch memory, hash unit, per-engine memory/CAM/signals, QDR SRAM controller, interconnect.]

IXP2400 block diagram
[Figure: DDR DRAM controller; microengines ME0-ME7; SHaC unit (Scratch, Hash unit, CSR unit); XScale core; PCI unit; MSF unit with TBUF/RBUF; QDR SRAM controller (2 channels).]

IXP2800 features
- Half-duplex OC-192 / 10 Gb/s Ethernet network processor
- XScale core: 700 MHz (half the ME clock); 32 KB instruction cache / 32 KB data cache
- Media/switch fabric interface: 2 x 16-bit LVDS transmit & receive; configured as CSIX-L2 or SPI-4
- PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels
- QDR interface (with parity): four 36-bit SRAM channels (QDR or coprocessor); Network Processor Forum LookAside-1 standard interface; using a "clamshell" topology, both memory and a coprocessor can be instantiated on the same channel
- RDR interface: 3 independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts

Hardware features to ease packet processing
- Ring buffers: for inter-block communication/synchronization (producer-consumer paradigm)
- Next-neighbor registers and signaling: allow a single-cycle transfer of context to the next logical microengine, dramatically improving performance; simple, easy transfer of state
- Distributed data caching within each microengine: allows all threads to keep processing even when multiple threads are accessing the same data

XScale core processor
- Compliant with the ARM V5TE architecture
- Supports ARM Thumb instructions and the DSP enhancements to the instruction set
- Intel's improvements to the internal pipeline improve the memory-latency-hiding abilities of the core
- Does not implement the floating-point instructions of the ARM V5 instruction set

Microengines: RISC processors
- The IXP2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- The ME instruction set is specifically tuned for processing network data
- 40-bit x 4K control store
- Six-stage pipeline: an instruction takes one cycle to execute on average
- Each ME has eight hardware-assisted threads of execution; it can be configured to use either all eight threads or only four
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order

MicroEngine v2 datapath
[Figure: two 128-entry GPR banks; 128-entry S and D transfer-in and transfer-out registers; next-neighbor registers; 640-word local memory; 4K-instruction control store; CRC unit, multiply, find-first-bit; 32-bit execution datapath (add, shift, logical); CAM with LRU and entry status; local CSRs, timers, timestamp; push/pull buses.]

Why multithreading?
[Figure: one microengine, threads 0-7; at any instant one thread is executing code while the others are waiting for a signal or ready to execute.]
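A back-of-envelope model makes the figure's point quantitative. Suppose each thread computes for C cycles and then stalls L cycles on a memory reference; with n threads the engine can run another thread during the stall. The cycle counts below are invented for illustration, not IXP measurements.

```c
#include <stdio.h>

/* Utilization of a multithreaded engine: each thread alternates
 * C compute cycles with an L-cycle memory stall.  With n threads,
 * offered work per (C + L) window is n*C cycles, capped at 100%. */
int main(void)
{
    const double C = 20, L = 140;         /* illustrative cycle counts */
    for (int n = 1; n <= 8; n *= 2) {
        double u = n * C / (C + L);
        if (u > 1.0) u = 1.0;             /* engine saturated          */
        printf("%d thread(s): %3.0f%% utilization\n", n, 100 * u);
    }
    return 0;
}
```

With these numbers utilization climbs 12.5% -> 25% -> 50% -> 100% as threads go 1 -> 8: memory latency is hidden rather than reduced, which is the whole design bet of the microengine.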
Packet processing using multithreading within a microengine
[Figure: threads 0-7 each take one packet (n through n+7) and interleave computation with memory waits; total execution time = 8 x Ta.]

Registers available to each ME
- Four types: general-purpose, SRAM transfer, DRAM transfer, and next-neighbor (NN)
- 256 32-bit GPRs: can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers: used to read/write all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers, divided equally into read-only and write-only: used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs: an ME can continue processing with its GPRs while other functional units read and write the transfer registers

Different types of memory

  Type of memory    Logical width (B)   Size (B)   Approx. unloaded latency (cycles)   Special notes
  Local to ME       4                   2560       3                                    Indexed addressing, post incr/decr
  On-chip scratch   4                   16K        60                                   Atomic ops; 16 scratch rings with atomic get/put
  SRAM              4                   256M       150                                  Atomic ops; 64-element queue array
  DRAM              8                   2G         300                                  Direct path to/from the MSF

IXA software framework
- External processors: control-plane protocol stacks, control-plane PDK
- XScale core: core components written in C/C++, on top of the core component library and resource manager library
- Microengines: a microengine pipeline of microblocks written in microengine C, on top of the microblock, protocol, and utility libraries and the hardware abstraction library

Microengine C compiler
- C language constructs: basic types, pointers, bit fields
- Inline assembly code support
- Aggregates: structs, unions, arrays

What is a microblock?
- Data-plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful; examples: 5-tuple classification, IPv4 forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
  - A microblock group has a dispatch loop that defines the dataflow for packets between the microblocks (a sketch follows below)
  - A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component
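Here is a sketch of the dispatch-loop idea just described: the loop owns the packet dataflow between the microblocks bound to a thread. All names and return codes are invented for illustration; this is not the IXA SDK's actual API, and the microblocks themselves are left as extern stubs.

```c
/* Hypothetical microblock group: RX source -> 5-tuple classifier
 * -> IPv4 forwarder -> TX queue sink, with a drop path. */
struct pkt;
enum verdict { TO_IPV4, TO_QUEUE, TO_DROP };

extern struct pkt *rx_get(void);                    /* source microblock   */
extern enum verdict classify_5tuple(struct pkt *);  /* classification      */
extern enum verdict ipv4_forward(struct pkt *);     /* LPM + header update */
extern void queue_tx(struct pkt *);                 /* sink microblock     */
extern void drop(struct pkt *);

void dispatch_loop(void)
{
    for (;;) {
        struct pkt *p = rx_get();                  /* wait for next packet */
        if (classify_5tuple(p) != TO_IPV4) { drop(p); continue; }
        if (ipv4_forward(p)   != TO_QUEUE) { drop(p); continue; }
        queue_tx(p);                               /* hand off to TX block */
    }
}
```

The same loop body runs on every thread of the group's microengines; the hardware thread arbiter interleaves the copies whenever one of them blocks on memory.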
Core components and microblocks
- XScale core: core components built on the core component library and resource manager library (Intel, third-party, and user-written)
- Microengines: microblocks built on the microblock library

Applications of network processors
- A fully programmable architecture: implement any packet-processing application
- Examples from customers: routing/switching, VPN, DSLAM, multi-service switch, storage, content processing, intrusion detection (IDS) and RMON
- Use as a research platform: experiment with new algorithms and protocols
- Use as a teaching tool: understand architectural issues; gain hands-on experience with networking systems

Technical and business challenges
- Technical challenges:
  - The shift from an ASIC-based paradigm to software-based applications
  - Challenges in programming an NPU
  - The tradeoff between power, board cost, and number of NPUs
  - How to add coprocessors for additional functions
- Business challenges:
  - Reliance on an outside supplier for the key component
  - Preserving intellectual-property advantages
  - Adding value and differentiation through software algorithms in data-plane, control-plane, and services-plane functionality
  - Must decrease time-to-market (TTM) to be competitive

Challenges in Modern Terabit-Class Switch Design

Goals
- Design of a terabit-class system: several Tb/s of aggregate throughput
- 2.5 Tb/s: 256x256 OC-192 or 64x64 OC-768; OEM
- Achieve wide coverage of the application spectrum
- Single-stage, electronic fabric

System architecture
[Figure: N line cards, each with an ingress buffer, network processor, and egress buffer between its OC-x input/output lines and the switch-fabric interface chips (iRT/eRT); ingress and egress flow control run between the line cards and the switch core.]

Power requirement
- Do not exceed the per-shelf (2 kW), per-board (150 W), and per-chip (20 W) budgets; forced-air cooling, avoid hotspots (a worked example follows below)
- More throughput at the same power: Gb/s/W density is increasing
- I/O uses an increasing fraction of power (~50%): electrical I/O technology has not kept pace with capacity demand, so low-power, high-density I/O technology is a must
- CMOS density increases faster than W/gate decreases: functionality per chip is constrained by power rather than density
- Power determines the number of chips and boards; the architecture must be able to be distributed accordingly

Packaging requirement
- NEBS compliance
- Constrained by standard form factors and the power budget at chip, card, and rack level
- Switch core: link, connector, and chip packaging technology; connector density (pins/inch); CMOS density doubles per generation but pin counts grow only 5-10%; this determines the maximum per-chip and per-card throughput
- Line cards: increasing port counts; the prevalent line-rate granularity is OC-192 (10 Gb/s); 1 adapter/card
- 1 Tb/s systems require multi-rack solutions: long cables instead of a backplane (30 to 100 m); interconnect accounts for a large part of system cost
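Plugging in the budgets from the power-requirement slide above gives a feel for why power, not logic density, sizes the system. Pure arithmetic on the slide's numbers; a real design would also budget for regulators, fans, and I/O, so these are upper bounds.

```c
#include <stdio.h>

/* Chip and board counts implied by the stated power budgets:
 * 2 kW per shelf, 150 W per board, 20 W per chip. */
int main(void)
{
    const int shelf_w = 2000, board_w = 150, chip_w = 20;
    printf("full-power chips per board (max): %d\n", board_w / chip_w);
    printf("boards per shelf (max):           %d\n", shelf_w / board_w);
    return 0;
}
```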
Packaging
[Figure: a 2.5 Tb/s system spread over 19" racks connected by up to 30 m / 100 ft of cabling: four line-card racks (cards 0-63, 64-127, 128-191, 192-255; 4 shelves of 16 x OC-192 each) around an active + backup switch core.]
- 2.5 Tb/s at 1.6x speedup over 2.5 Gb/s links with 8b/10b coding: 4000 links (differential pairs)

Switch-internal round-trip (RT)
- Physical system size: a direct consequence of packaging
- CMOS technology: clock speeds increase much more slowly than density, so more parallelism is required to increase throughput
- Shrinking packet cycle: line rates have gone up drastically (OC-3 through OC-768), while the minimum packet size has remained constant
- Result: a large round-trip (RT) in terms of minimum packet durations: it can be (many) tens of packets per port
- RT used to be only a node-to-node issue; now it also matters inside the node
- System-wide clocking and synchronization

  Evolution of RT:
  Line rate               OC-12       OC-48       OC-192     OC-768
  Interconnect distance   1 m         1 m         6 m        30 m
  Interconnect type       backplane   backplane   cable      fiber
  Packet duration         512 ns      128 ns      32 ns      8 ns
  Round trip              1 cell      1 cell      16 cells   64 cells

Switch-internal round-trip (contd)
- Consequences:
  - Performance impact: all buffers must be scaled by the RT
  - Fabric-internal flow control becomes an important issue

Speed-up requirement
- The "industry standard" is a 2x speedup, in three flavors:
  - Utilization: compensate for SAR overhead
  - Performance: compensate for scheduling inefficiencies
  - OQ speedup: memory access time
- Switch-core speedup S is very costly: bandwidth is a scarce resource (cost and power); core buffers must run S times faster; the core scheduler must run S times faster
- Is it really needed?
  - SAR overhead reduction: variable-length packet switching is hard to implement, but may be more cost-effective
  - Performance: does the gain in performance justify the increase in cost and power? Depends on the application; Internet utilization is low

Multicast requirement
- Full multicast support: many multicast groups, full link utilization, no blocking, QoS
- It complicates everything: buffering, queuing, scheduling, flow control, QoS
- Is sophisticated multicast support really needed? It is expensive, and often disabled in the field (complexity, billing, potential for abuse, etc.); again, it depends on the application

Packet-size requirement
- Support very short packets (32-64 B): a 40 B packet at OC-768 lasts 8 ns
- The short packet duration determines the speed of the control section (queues and schedulers), implies a longer RT (in packets), and forces wider data paths
- Do we have to switch short packets individually? Aggregation techniques: burst, envelope, or container switching, "packing"; single-stage multipath switches; the parallel packet switch
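The 8 ns figure and the RT table are straightforward to reproduce. The sketch below computes the 40-byte packet duration at each line rate and how many such packets fit in the round-trip of a 30 m fiber run; the ~5 ns/m propagation figure (about 2/3 c) is the usual assumption, not a slide number.

```c
#include <stdio.h>

/* 40-byte packet duration vs. line rate, and packets "in flight"
 * over a 30 m interconnect (5 ns/m each way). */
int main(void)
{
    const double rates[] = {0.622e9, 2.5e9, 10e9, 40e9};
    const char *names[]  = {"OC-12", "OC-48", "OC-192", "OC-768"};
    const double rt_ns   = 2 * 30 * 5.0;            /* 30 m round trip */
    for (int i = 0; i < 4; i++) {
        double cell_ns = 40 * 8 / (rates[i] / 1e9); /* 320 bits        */
        printf("%-7s %6.1f ns/packet, %5.1f packets in flight\n",
               names[i], cell_ns, rt_ns / cell_ns);
    }
    return 0;
}
```

At OC-768 this gives 8 ns packets and ~38 packets in the 300 ns of fiber propagation alone; the table's 64-cell RT presumably adds serialization and logic latency on top, which is why fabric-internal flow control and buffer sizing become first-order concerns.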
100 Tb/s optical router: a Stanford University research project
- Collaboration: 4 professors at Stanford (Mark Horowitz, Nick McKeown, David Miller and Olav Solgaard), and our groups
- Objective: determine the best way to incorporate optics into routers: photonics, electronics, system design
- Push the technology hard to expose new issues
- Motivating example: the design of a 100 Tb/s Internet router
  - Challenging but not impossible (100x current commercial systems)
  - It identifies some interesting research problems

100 Tb/s optical router
[Figure: 625 linecards at 160 Gb/s each (100 Tb/s total) around an optical switch; 40 Gb/s external lines, 160/320 Gb/s internal links, with request/grant arbitration.]

Research problems
- Linecard: the memory bottleneck: address lookup and packet buffering
- Architecture: arbitration: computation complexity
- Switch fabric:
  - Optics: fabric scalability and speed
  - Electronics: switch control and link electronics
  - Packaging: the three-surface problem

160 Gb/s linecard: packet buffering
- Problem: the packet buffer needs the density of DRAM (40 Gbits) and the speed of SRAM (2 ns per packet)
- Solution: a hybrid that uses on-chip SRAM and off-chip DRAM
  - Identified optimal algorithms that minimize the size of the SRAM (12 Mbits)
  - Precisely emulates the behavior of a 40 Gbit, 2 ns SRAM
[Figure: a queue manager with on-chip SRAM fronting several off-chip DRAMs, between the 160 Gb/s input and output.]

The arbitration problem
- A packet-switch fabric is reconfigured for every packet transfer; at 160 Gb/s, a new IP packet can arrive every 2 ns
- The configuration is picked to maximize throughput and not waste capacity
- Known algorithms are too slow

100 Tb/s router
[Figure: racks of 160 Gb/s linecards connected by optical links to an optical switch fabric.]

Racks with 160 Gb/s linecards
[Figure: each linecard combines a lookup engine with the SRAM/DRAM queue manager.]

Passive optical switching
[Figure: an integrated AWGR or diffraction-grating-based wavelength router connecting ingress linecards 1..n through midstage linecards to egress linecards 1..n, each link carrying wavelengths 1..n.]

Predictions: core Internet routers
- The need for more capacity for a given power and volume budget will mean:
  - Fewer functions in routers: little or no optimization for multicast; continued overprovisioning will lead to little or no support for QoS, DiffServ, ...
  - Fewer unnecessary requirements: mis-sequencing will be tolerated; latency requirements will be relaxed
  - Less programmability in routers, and hence no network processors (NPs used at the edge...)
  - Greater use of optics to reduce power in the switch

Likely events
- The need for capacity and reliability will mean widespread replacement of core routers with transport switching based on circuits:
  - Circuit switches have proved simpler, more reliable, lower power, higher capacity, and lower cost per Gb/s; eventually, this is going to matter
  - The Internet will evolve to become edge routers interconnected by a rich mesh of WDM circuit switches
Summary
- High-speed routers: lookup, switching, classification, buffer management
- Lookup: range matching, tries, multiway tries
- Switching: circuit switching, crossbars, Batcher-banyan
- Queuing: input/output queuing issues
- Classification: a multi-dimensional geometry problem
- The road ahead to 100 Tb/s routers...