Question? Leave a message!




Transport Protocol Design: UDP, TCP

Transport Protocol Design: UDP, TCP 37
Transport Protocol Design: UDP, TCP Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 1Overview UDP: connectionless, endtoend service UDP Servers TCP features, Header format Connection Establishment Connection Termination TCP Server Design Ref: Chap 11, 17,18; RFC 793, 1323 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 2Transport Protocols  Protocol implemented entirely at the ends  Fatesharing  Completeness/correctness of function implementations  UDP provides just integrity and demux  TCP adds…  Connectionoriented  Reliable  Ordered  Pointtopoint  Bytestream  Full duplex  Flow and congestion controlled Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 3UDP: User Datagram Protocol RFC 768  Minimal Transport Why is there a UDP Service:  No connection  “Best effort” service, UDP establishment (which can segments may be: add delay)  Lost  Simple: no connection state at sender, receiver  Delivered out of order  Small header to app  No congestion control: UDP  Connectionless: can blast away as fast as  No handshaking desired: dubious between UDP sender, receiver  Each UDP segment handled independently Shivkumar Kalyanaraman of others Rensselaer Polytechnic Institute 4Multiplexing / demultiplexing Recall: segment unit of data Demultiplexing: delivering exchanged between transport received segments to layer entities correct app layer processes  aka TPDU: transport protocol data unit receiver P3 P4 applicationlayer M M data application segment P1 P2 transport M header M network application application transport segment H M transport t network segment H network n Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 5Multiplexing / demultiplexing Multiplexing: gathering data from multiple 32 bits app processes, enveloping source port dest port data with header (later used for demultiplexing) other header fields multiplexing/demultiplexing:  based on sender, receiver port numbers, IP addresses application  source, dest port s in data each segment (message)  recall: wellknown port numbers for specific applications TCP/UDP segment format Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 6UDP, cont.  Often used for streaming 32 bits multimedia apps Source port Dest port  Loss tolerant Length, in bytes of UDP Checksum Length  Rate sensitive segment,  Other UDP uses (why): including header  DNS  SNMP Application  Reliable transfer over data UDP: add reliability at (message) application layer  Applicationspecific UDP segment format error recover Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 7UDP Checksum Goal: detect “errors” (e.g., flipped bits) in transmitted segment. Note: IP only has a header checksum. Receiver: Sender:  Treat segment contents  Compute checksum of as sequence of 16bit received segment integers  Check if computed checksum equals  Checksum: addition (1’s checksum field value: complement sum) of segment contents  NO error detected  Sender puts checksum  YES no error value into UDP detected. But maybe checksum field errors nonetheless Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 8Introduction to TCP  Communication abstraction: Reliable Ordered Pointtopoint Bytestream Full duplex Flow and congestion controlled  Protocol implemented entirely at the ends Fate sharing Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 9Evolution of TCP 1984 1975 Nagel’s algorithm Threeway handshake 1987 to reduce overhead Raymond Tomlinson Karn’s algorithm 1990 of small packets; In SIGCOMM 75 to better estimate 4.3BSD Reno predicts congestion roundtrip time collapse fast retransmit delayed ACK’s 1983 1986 1988 BSD Unix 4.2 Van Jacobson’s Congestion supports TCP/IP 1974 algorithms collapse TCP described by congestion avoidance observed Vint Cerf and Bob Kahn and congestion control 1982 In IEEE Trans Comm (most implemented in TCP IP RFC 793 791 4.3BSD Tahoe) 1990 1975 1980 1985 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 10TCP Through the 1990s 1994 1996 T/TCP SACK TCP (Braden) (Floyd et al) Transaction Selective TCP Acknowledgement 1996 1996 1993 1994 FACK TCP Hoe TCP Vegas ECN (Mathis et al) Improving TCP (Brakmo et al) (Floyd) startup extension to SACK real congestion Explicit avoidance Congestion Notification 1993 1994 1996 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 11TCP Header Source port Destination port Sequence number Flags: SYN Acknowledgement FIN RESET HdrLen Advertised window Flags 0 PUSH Checksum Urgent pointer URG ACK Options (variable) Data Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 12Principles of Reliable Data Transfer  Characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 13Reliability Models  Reliability = requires redundancy to recover from uncertain loss or other failure modes.  Two types of redundancy:  Spatial redundancy: independent backup copies  Forward error correction (FEC) codes  Problem: requires huge overhead, since the FEC is also part of the packet(s) it cannot recover from erasure of all packets  Temporal redundancy: retransmit if packets lost/error Lazy: trades off response time for reliability  Design of status reports and retransmission optimization important Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 14Temporal Redundancy Model Packets • Sequence Numbers • CRC or Checksum Timeout • ACKs Status Reports • NAKs, • SACKs • Bitmaps Retransmissions • Packets • FEC information Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 15Types of errors and effects  Forward channel biterrors (garbled packets)  Forward channel packeterrors (lost packets)  Reverse channel biterrors (garbled status reports)  Reverse channel biterrors (lost status reports)  Protocolinduced effects:  Duplicate packets  Duplicate status reports  Outoforder packets  Outoforder status reports  Outofrange packets/status reports (in windowbased transmissions) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 16Mechanisms …  Mechanisms:  Checksum in pkts: detects pkt corruption  ACK: “packet correctly received”  NAK: “packet incorrectly received”  aka: stopandwait Automatic Repeat reQuest (ARQ) protocols  Provides reliable transmission over:  An errorfree forward and reverse channel  A forward channel which has biterrors; reverse: ok  Cannot handle reversechannel biterrors; or packet losses in either direction. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 17More mechanisms …  Mechanisms:  Checksum: detects corruption in pkts acks  ACK: “packet correctly received”  NAK: “packet incorrectly received”  Sequence number: identifies packet or ack  1bit sequence number used only in forward channel aka: alternatingbit protocols  Provides reliable transmission over:  An errorfree channel  A forward reverse channel with biterrors  Detects duplicates of packets/acks/naks  Still needs NAKs, and cannot recover from packet errors… Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 18More Mechanisms …  Mechanisms:  Checksum: detects corruption in pkts acks  ACK: “packet correctly received”  Duplicate ACK: “packet incorrectly received”  Sequence number: identifies packet or ack 1bit sequence number used both in forward reverse channel  Provides reliable transmission over:  An errorfree channel  A forward reverse channel with biterrors  Detects duplicates of packets/acks  NAKs eliminated  Packet errors in either direction not handled… Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 19Reliability Mechanisms…  Mechanisms:  Checksum: detects corruption in pkts acks  ACK: “packet correctly received”  Duplicate ACK: “packet incorrectly received”  Sequence number: identifies packet or ack 1bit sequence number used both in forward reverse channel  Timeout only at sender  Provides reliable transmission over:  An errorfree channel  A forward reverse channel with biterrors  Detects duplicates of packets/acks  NAKs eliminated  A forward reverse channel with packeterrors (loss) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 20Example: ThreeWay Handshake  TCP connectionestablishment: 3wayhandshake necessary and sufficient for unambiguous setup/teardown even under conditions of loss, duplication, and delay Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 21TCP Connection Setup: FSM CLOSED active OPEN create TCB passive OPEN Snd SYN CLOSE create TCB delete TCB CLOSE LISTEN delete TCB SEND rcv SYN snd SYN ACK snd SYN SYN SYN rcv SYN RCVD SENT snd ACK Rcv SYN, ACK rcv ACK of SYN Snd ACK CLOSE ESTAB Send FIN Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 22More Connection Establishment  Socket: BSD term to denote an IP address + a port number. A connection is fully specified by a socket pair i.e. the source IP address, source port, destination IP address, destination port.  Initial Sequence Number (ISN): counter maintained in OS.  BSD increments it by 64000 every 500ms or new connection setup = time to wrap around 9.5 hours. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 23TCP Connection Teardown Sender Receiver FIN FINACK Data write Data ack FIN FINACK Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 24TCP Connection Teardown: FSM CLOSE ESTAB send FIN CLOSE rcv FIN send FIN send ACK CLOSE FIN WAIT WAIT1 rcv FIN CLOSE snd ACK snd FIN rcv FIN+ACK FIN WAIT2 CLOSING LASTACK snd ACK rcv ACK of FIN rcv ACK of FIN TIME WAIT CLOSED rcv FIN Timeout=2msl snd ACK delete TCB Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 25Time Wait Issues  Web servers not clients close connection first Established  FinWaits  TimeWait  Closed Why would this be a problem  TimeWait state lasts for 2 MSL MSL should be 120 seconds (is often 60s) Servers often have order of magnitude more connections in TimeWait Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 26StopandWait Efficiency t frame U= 2t +t prop frame t Data frame 1 = t prop 2 + 1 U Ack Data  Light in vacuum Ack = 300 m/s Light in fiber t Distance/Speed of Signal prop  = = = 200 m/s Frame size /Bit rate t frame Electricity Distance  Bit rate = 250 m/s = Frame size Speed of Signal Shivkumar Kalyanaraman Rensselaer Polytechnic Institute No loss or biterrors 27Sliding Window: Efficiency Sender Receiver Next expected Max acceptable Max ACK received Next seqnum … … … … Sender window Receiver window Sent Acked Sent Not Acked Received Acked Acceptable Packet OK to Send Not Usable Not Usable Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 28Sliding Window Protocols: Efficiency Nt frame U= 2t +t prop frame t frame Data N t prop 2+1 = 1 if N2+1 Ack Note: no loss or biterrors Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 29GoBackN Sender:  kbit seq in pkt header k Allows upto N = 2 – 1 packets inflight, unacked  “Window”: limit on of consecutive unacked pkts In GBN, window = N Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 30GoBackN  ACK(n): ACKs all pkts up to, including seq n “cumulative ACK” Sender may receive duplicate ACKs (see receiver) Robust to losses on the reverse channel Can pinpoint the first packet lost, but cannot identify blocks of lost packets in window  One timer for oldestinflight pkt  Timeout = retransmit pkt “base” and all higher seq pkts in window Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 31Selective Repeat: Sender, Receiver Windows Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 32Reliability Mechanisms: Summary  Checksum: detects corruption in pkts acks  ACK: “packet correctly received”  Duplicate ACK: “packet incorrectly received”  Cumulative ACK: acks all pkts upto incl. seq (GBN)  Selective ACK: acks pkt “n” only (selective repeat)  Sequence number: identifies packet or ack  1bit sequence number used both in forward reverse channels  kbit sequence number in both forward reverse channels. k Let N = 2 – 1 = sequence number space size Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 33Reliability Mechanisms: Summary  Timeout only at sender.  One timer for entire window (gobackN)  One timer per pkt (selective repeat)  Window: sender and receiver side.  Limits on what can be sent (or expected to be received).  Window size (W) upto N –1 (GobackN)  Window size (W) upto N/2 (Selective Repeat)  Buffering  Only at sender (GobackN)  Outoforder buffering at sender receiver (Selective Repeat) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 34Reliability capabilities: Summary  Provides reliable transmission over:  An errorfree channel  A forward reverse channel with biterrors  Detects duplicates of packets/acks  NAKs eliminated  A forward reverse channel with packeterrors (loss)  Pipelining efficiency:  GobackN: Entire outstanding window retransmitted if pkt loss/error  Selective Repeat: only lost packets retransmitted  performance penalty if ACKs lost (because acks noncumulative) more complexity Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 35What’s Different in TCP From Link Layers  Logical link vs. physical link  Must establish connection  Variable RTT  May vary within a connection = Timeout variable  Reordering  How long can packets livemax segment lifetime (MSL)  Can’t expect endpoints to exactly match link rate  Buffer space availability, flow control  Transmission rate  Don’t directly know transmission rate Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 36Sequence Number Space  Each byte in byte stream is numbered.  32 bit value  Wraps around  Initial values selected at start up time  TCP breaks up the byte stream in packets.  Packet size is limited to the Maximum Segment Size  Each packet has a sequence number.  Indicates where it fits in the byte stream 13450 14950 16050 17550 packet 8 packet 9 packet 10 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 37MSS  Maximum Segment Size (MSS)  Largest “chunk” sent between TCPs.  Default = 536 bytes. Not negotiated.  Announced in connection establishment.  Different MSS possible for forward/reverse paths.  Does not include TCP header  What all does this effect  Efficiency  Congestion control  Retransmission  Path MTU discovery  Why should MTU match MSS Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 38TCP Window Flow Control: Send Side window Sent and acked Sent but not acked Not yet sent Next to be sent Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 39Window Flow Control: Send Side Packet Received Packet Sent Source Port Dest. Port Source Port Dest. Port Sequence Number Sequence Number Acknowledgment Acknowledgment HL/Flags Window HL/Flags Window D. Checksum Urgent Pointer D. Checksum Urgent Pointer Options.. Options.. App write acknowledged sent to be sentoutside window Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 40Window Flow Control: Receive Side Receive buffer Acked but not Not yet delivered to user acked window Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 41Silly Window Syndrome  Problem: (Clark, 1982) If receiver advertises small increases in the receive window then the sender may waste time sending lots of small packets  Solution Receiver must not advertise small window increases Increase window by min(MSS,RecvBuffer/2) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 42Nagel’s Algorithm Delayed Acks  Small packet problem:  Don’t want to send a 41 byte packet for each keystroke  How long to wait for more data  Solution: Nagel’s algorithm  Allow only one outstanding small (not full sized) segment that has not yet been acknowledged  Batching acknowledgements:  Delayack timer: piggyback ack on reverse traffic if available  200 ms timer will trigger ack if no reverse traffic available Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 43Timeout and RTT Estimation  Problem: Unlike a physical link, the RTT of a logical link can vary, quite substantially How long should timeout be Too long = underutilization Too short = wasteful retransmissions  Solution: adaptive timeout: based on a good estimate of maximum current value of RTT Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 44How to estimate max RTT  RTT = prop + queuing delay Queuing delay highly variable So, different samples of RTTs will give different random values of queuing delay  Chebyshev’s Theorem: MaxRTT = Avg RTT + kDeviation Error probability is less than 1/(k2) Result true for ANY distribution of samples Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 45Round Trip Time and Timeout (II) Q: how to estimate RTT  SampleRTT: measured time from segment transmission until ACK receipt  SampleRTT will vary wildly  use several recent measurements, not just current SampleRTT to calculate “AverageRTT  AverageRTT = (1x)AverageRTT + xSampleRTT  Exponential weighted moving average (EWMA)  Influence of given sample decreases exponentially fast; x = 0.1 Setting the timeout Timeout = AverageRTT + 4Deviation Deviation = (1x)Deviation + xSampleRTT AverageRTT Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 46Timer Granularity  Many TCP implementations set RTO in multiples of 200,500,1000ms  Why  Avoid spurious timeouts – RTTs can vary quickly due to cross traffic  Delayedack timer can delay valid acks by upto 200ms  Make timers interrupts efficient  What happens for the first couple of packets  Pick a very conservative value (seconds)  Can lead to stall if early packet lost… Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 47Retransmission Ambiguity A B A B X RTO RTO Sample Sample RTT RTT Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 48Karn’s RTT Estimator  Accounts for retransmission ambiguity  If a segment has been retransmitted: Don’t update RTT estimators during retransmission. Timer backoff: If timeout, RTO = 2RTO exponential backoff Keep backed off timeout for next packet Reuse RTT estimate only after one successful packet transmission Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 49Timestamp Extension  Used to improve timeout mechanism by more accurate measurement of RTT  When sending a packet, insert current timestamp into option 4 bytes for seconds, 4 bytes for microseconds  Receiver echoes timestamp in ACK Actually will echo whatever is in timestamp  Removes retransmission ambiguity Can get RTT sample on any packet Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 50Recap: Stability of a Multiplexed System Average Input Rate Average Output Rate = system is unstable How to ensure stability 1. Reserve enough capacity so that demand is less than reserved capacity 2. Dynamically detect overload and adapt either the demand or capacity to resolve overload Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 51Congestion Problem in Packet Switching 10 Mbs statistical multiplexing C Ethernet A 1.5 Mbs B queue of packets 45 Mbs waiting for output link D E  Cost: selfdescriptive header perpacket, buffering and delays for applications.  Need to either reserve resources or dynamically detect/adapt to overload for stability Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 52Congestion: Tragedy of Commons 10 Mbps 1.5 Mbps 100 Mbps  Different sources compete for “common” or “shared” resources inside network.  Sources are unaware of current state of resource  Sources are unaware of each other  Source has selfinterest. Assumes that increasing rate by N will lead to N increase in throughput  Conflicts with collective interests: if all sources do this to drive the system to overload, throughput gain is NEGATIVE, and worsens rapidly with incremental overload = congestion collapse Shivkumar Kalyanaraman Rensselaer Polytechnic Institute  Need “enlightened” selfinterest 53Congestion: A Closeup View packet knee cliff loss  knee – point after which  throughput increases very congestion slowly collapse  delay increases fast  cliff – point after which  throughput starts to Load decrease very fast to zero (congestion collapse)  delay approaches infinity  Note (in an M/M/1 queue)  delay = 1/(1 – utilization) Load Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 54 Delay ThroughputCongestion Control vs. Congestion Avoidance  Congestion control goal stay left of cliff  Congestion avoidance goal stay left of knee knee cliff  Right of cliff: congestion Congestion collapse collapse Load Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 55 ThroughputCongestion Collapse  Definition: Increase in network load results in decrease of useful work done  Many possible causes  Spurious retransmissions of packets still in flight  Undelivered packets Packets consume resources and are dropped elsewhere in network  Fragments Mismatch of transmission and retransmission units  Control traffic Large percentage of traffic is for control  Stale or unwanted packets Packets that are delayed on long queues Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 56Solution Directions….  i  i  •Problem: demand outstrips available capacity  1 Demand Capacity  n  If information about  , and  is known in a central i location where control of  or  can be effected with i zero time delays, the congestion problem is solved  Capacity () cannot be provisioned very fast = demand must be managed  Perfect callback: Admit packets into the network from the user only when the network has capacity (bandwidth and buffers) to get the packet across. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 57Issues  If information about  , and  is known in a central i location where control of  or  can be effected with zero i time delays, the congestion problem is solved  Information/knowledge: Only incomplete information about the congestion situation is known (eg: loss indications, single bit, explicit rate field, measure of backlog etc)  Central vs distributed:a distributed solution is required  Demand vs capacity control: usually only the demand is controllable on small timescales. Capacity provisioning may be possible on larger timescales.  Measurement/control points: The congestion point, congestion detection/measurement point, and the control points may be different.  Timedelays: Between the various points, there may be timevarying and heterogeneous timedelays Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 58Static solutions…  Q: Will the “congestion” problem be solved when:  a) Memory becomes cheap (infinite memory) No buffer Too late  b) Links become cheap (high speed links) Replace with 1 Mb/s All links 19.2 kb/s S S S S S S S S File Transfer Time = 7 hours File Transfer time = 5 mins Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 59Static solutions (Continued)  c) Processors become cheap (fast routers switches) A C S B D Scenario: All links 1 Gb/s. A B send to C = “highspeed” congestion (lose more packets faster) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 60Two models of congestion control  1. Endtoend model:  Endsystems is ultimately the source of “demand”  Endsystem must robustly estimate the timing and degree of congestion and reduce its demand appropriately  Must trust other end hosts to do right thing  Intermediate nodes relied upon to send timely and appropriate penalty indications (eg: packet loss rate) during congestion Enhanced routers could send more accurate congestion signals, and help endsystem avoid other sideeffects in the control process (eg: early packet marks instead of late packet drops)  Key: trust and complexity resides at endsystems  Issue: What about misbehaving flows Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 61Two models of congestion control…  2. Networkbased model:  A) All endsystems cannot be trusted and/or  B) The network node has more control over isolation/scheduling of flows  Assumes network nodes can be trusted.  Each network node implements isolation and fairness mechanisms (eg: scheduling, buffer management)  A flow which is misbehaving hurts only itself  Problems:  Partial soln: if flows don’t back off, each flow has congestion collapse, i.e. lousy throughput during overload  Significant complexity in network nodes  If some routers do not support this complexity, congestion still exists  Classic justification of the endtoend principle Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 62Goals of Congestion Control  To guarantee stable operation of packet networks Subgoal: avoid congestion collapse  To keep networks working in an efficient status Eg: high throughput, low loss, low delay, and high utilization  To provide fair allocations of network bandwidth among competing flows in steady state For some value of “fair”  63 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 63What is stability  Equilibrium point(s) of a dynamic system  For packet networks  Each user will get an allocation of bandwidth  Changes of network or user parameters will move the equilibrium from one point, (hopefully) after a brief transient period, to a new one  System should not remain indefinitely away from equilibrium if there are no more external perturbations  Example of instability: unbounded queue growth 64 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 64What is fairness  one of the most overdefined (and probably overrated) concepts fairness index maxmin proportional … infinite number of notions  Fairness for besteffort service, roughly means that services are provided to selfish, competing 65 users in a predictable way Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 65Eg: maxmin fairness  if link not congested, then f max( x ) i  otherwise, if link congested min(x , f ) c  i i f = 4: x 1 8 10 min(8, 4) = 4 4 x 2 6 Allocations min(6, 4) = 4 4 2 min(2, 4) = 2 x 2 3 66 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 66Flow Control Optimization Model  Given a set S of flows, and a set L of links  Each flow s has utility U (x ) , x is its sending s s s rate  Each link l has capacity c l  Modeled as optimization (Eg: Kelly’98, Low’99) max U (x )  s s sS st. x  c , l  L  s l s S l where S = s flow s passes the link l l 67 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 67What is Fairness  Achieves (w,α) fairness if for any other feasible allocation Mo’00: x x x s s w 0  s  x sS s where w is the weight for flow s s  weighted maximum throughput fairness is (w,0)  weighted proportional fairness is (w,1)  weighted minimum potential delay fairness is (w,2)  weighted maxmin fairness is (w,∞)  “Weight” could be driven by economic considerations, or scheme dependencies on factors like RTT, loss rate etc 68 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 68What is fairness (contd)  fairness () axis α 0 1 2 ∞  α = 0 : maximum throughput fairness  α = 1 : proportional fairness  α = 2 : minimum delay fairness  ……  α = ∞ : maxmin fairness 69 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 69Proportional vs Maxmin Fairness  proportional fairness maxmin fairness  the more a flow  every flow has the same consumes critical right to all network network resources, the resources less allocation  network as a black box  network visible inside  network users’ view  network operators’ view  x = x = 0.5 0 19  x = 0.1, x = 0.9 0 19 c = 1 l x 0 l l l 1 2 9 x x x 1 2 9 70 70 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 70Equilibrium  Operate at equilibrium near the knee point  How to maintain equilibrium  Packetconservation: Don’t put a packet into network until another packet leaves.  Use ACK: send a new packet only after you receive and ACK. Why A.k.a “Selfclocking” or “Ackclocking” In steady state, keep packets in network constant  Problem: how do you know you are at the knee  Network capacity or competing demand may change:  Need to probe for knee by increasing demand  Need to reduce demand overshoot detected  Endresult: oscillate around knee Violate packetconservation each time you probe Shivkumar Kalyanaraman Rensselaer Polytechnic Institute by the degree of demand increase 71Selfclocking P r P b Sender Receiver A b A s A r  Implications of ackclocking:  More batching of acks = bursty traffic  Less batching leads to a large fraction of Internet traffic being just acks (overhead) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 72Basic Control Model  Let’s assume windowbased operation  Reduce window when congestion is perceived How is congestion signaled  Either mark or drop packets When is a router congested  Drop tail queues – when queue is full  Average queue length – at some threshold  Increase window otherwise Probe for available bandwidth – how Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 73Simple linear control  Many different possibilities for reaction to congestion and methods for probing Examine simple linear controls Window(t + 1) = a + b Window(t) Different a /b for increase and a /b for i i d d decrease  Supports various reaction to signals Increase/decrease additively Increased/decrease multiplicatively Which of the four combinations is optimal Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 74Phase plots  Simple way to visualize behavior of competing flows over time Fairness Line Overload User 2’s Allocation Optimal point x 2 Underutilization Efficiency Line User 1’s Allocation x 1  Caveat: assumes 2 flows, synchronized feedback, equal Shivkumar Kalyanaraman RTT, discrete “rounds” of operation Rensselaer Polytechnic Institute 75Additive Increase/Decrease  Both X and X increase/decrease by the same amount 1 2 over time  Additive increase improves fairness increases load  Additive decrease reduces fairness decreases load Fairness Line T 1 User 2’s Allocation T 0 x 2 Efficiency Line User 1’s Allocation x 1 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 76Multiplicative Increase/Decrease  Both X and X increase by the same factor over time 1 2  Fairness unaffected (constant), but load increases (MI) or decreases (MD) Fairness Line T 1 User 2’s Allocation x 2 T 0 Efficiency Line User 1’s Allocation x 1 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 77Additive Increase/Multiplicative Decrease (AIMD) Policy Fairness Line x 1 x 0 User 2’s Allocation x 2 x 2 Efficiency Line User 1’s Allocation x 1  Assumption: decrease policy must (at minimum) reverse the load increase overandabove efficiency line  Implication: decrease factor should be conservatively set to account for any congestion detection lags etc Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 78TCP Congestion Control  Maintains three variables: cwnd – congestion window rcvwin – receiver advertised window ssthresh – threshold size (used to update cwnd)  Rough estimate of knee point…  For sending use: win = min(rcvwin, cwnd) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 79TCP: Slow Start  Goal: initialize system and discover congestion quickly  How Quickly increase cwnd until network congested  get a rough estimate of the optimal cwnd  How do we know when network is congested  packet loss (TCP) over the cliff here  congestion control  congestion notification (eg: DEC Bit, ECN) over knee; before the cliffcongestion avoidance  Implications of using loss as congestion indicator  Late congestion detection if the buffer sizes larger  Higher speed links or large buffers = larger windows = higher probability of burst loss  Interactions with retransmission algorithm and timeouts Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 80TCP: Slow Start  Whenever starting traffic on a new connection, or whenever increasing traffic after congestion was experienced:  Set cwnd =1  Each time a segment is acknowledged increment cwnd by one (cwnd++).  Does Slow Start increment slowly Not really. In fact, the increase of cwnd is exponential Window increases to W in RTT log (W) 2 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 81Slow Start Example  The congestion window size grows cwnd = 1 very rapidly cwnd = 2 cwnd = 4  TCP slows down the increase of cwnd when cwnd = 8 cwnd = ssthresh Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 82Slow Start Example One RTT 0R 1 One pkt time 1 1R 2 3 2 3 2R 4 6 5 7 4 5 6 7 3R 8 10 12 14 9 11 13 15 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 83Slow Start Sequence Plot . . . Sequence No Window doubles every round Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 84Congestion Avoidance  Goal: maintain operating point at the left of the cliff:  How additive increase: starting from the rough estimate (ssthresh), slowly increase cwnd to probe for additional available bandwidth multiplicative decrease: cut congestion window size aggressively if a loss is detected. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 85Congestion Avoidance  Slow down “Slow Start”  If cwnd ssthresh then each time a segment is acknowledged increment cwnd by 1/cwnd i.e. (cwnd += 1/cwnd).  So cwnd is increased by one only if all segments have been acknowledged.  (more about ssthresh latter) Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 86Congestion Avoidance Sequence Plot Sequence No Window grows by 1 every round Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 87Slow Start/Congestion Avoidance Eg. cwnd = 1  Assume that cwnd = 2 ssthresh = 8 cwnd = 4 14 12 cwnd = 8 10 8 ssthresh 6 4 cwnd = 9 2 0 Roundtrip times cwnd = 10 Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 88 t=0 t=2 t=4 t=6 Cwnd (in segments)Putting Everything Together: TCP Pseudocode Initially: while (next unack + win) cwnd = 1; transmit next packet; ssthresh = infinite; New ack received: if (cwnd ssthresh) where win = min(cwnd, flowwin); / Slow Start/ cwnd = cwnd + 1; else unack next / Congestion Avoidance / seq cwnd = cwnd + 1/cwnd; Timeout: (loss detection) / Multiplicative decrease / win ssthresh = win/2; cwnd = 1; Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 89The big picture cwnd Timeout Congestion Avoidance Slow Start Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 90Packet Loss Detection: Timeout Avoidance  Wait for Retransmission Time Out (RTO)  What’s the problem with this  Because RTO is a performance killer  In BSD TCP implementation, RTO is usually more than 1 second  the granularity of RTT estimate is 500 ms  retransmission timeout is at least two times of RTT  Solution: Don’t wait for RTO to expire  Use alternate mechanism for loss detection  Fall back to RTO only if these alternate mechanisms fail. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 91Fast Retransmit  Resend a segment after 3 duplicate ACKs Recall: a duplicate cwnd = 1 ACK means that an outof sequence cwnd = 2 segment was received cwnd = 4  Notes: duplicate ACKs due packet reordering 3 duplicate ACKs if window is small don’t get duplicate ACKs Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 92Fast Recovery (Simplified)  After a fastretransmit set cwnd to ssthresh/2 i.e., don’t reset cwnd to 1  But when RTO expires still do cwnd = 1  Fast Retransmit and Fast Recovery  implemented by TCP Reno; most widely used version of TCP today Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 93Fast Retransmit and Fast Recovery cwnd Congestion Avoidance Slow Start Time  Retransmit after 3 duplicated acks  prevent expensive timeouts  No need to slow start again  At steady state, cwnd oscillates around the optimal window size. Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 94Fast Retransmit Retransmission X Duplicate Acks Sequence No Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 95Multiple Losses X X Now what X Retransmission X Duplicate Acks Sequence No Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 96TCP Versions: Tahoe X X X X Sequence No Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 97TCP Versions: Reno X X X Now what timeout X Sequence No Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 98NewReno  The ack that arrives after retransmission (partial ack) should indicate that a second loss occurred  When does NewReno timeout When there are fewer than three dupacks for first loss When partial ack is lost  How fast does it recover losses One per RTT Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 99NewReno X X X Now what – partial ack X recovery Sequence No Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 100SACK  Basic problem is that cumulative acks only provide little information  Alt: Selective Ack for just the packet received What if selective acks are lost  carry cumulative ack also  Implementation: Bitmask of packets received  Selective acknowledgement (SACK)  Only provided as an optimization for retransmission  Fall back to cumulative acks to guarantee correctness and window updates Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 101SACK X X X Now what – send X retransmissions as soon Sequence No as detected Time Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 102Asymmetric Behavior  Three important characteristics of a path  Loss  Delay  Bandwidth  Forward and reverse paths are often independent even when they traverse the same set of routers  Many link types are unidirectional and are used in pairs to create bidirectional link Internet 6Mbps (no congestion, A I B bandwidth 6Mbps) 32kbps Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 103Asymetric Loss  Loss Information in acks is very redundant Low levels of ack loss will not create problems TCP relies on ack clocking – will burst out packets when cumulative ack covers large amount of data Burstiness will in turn cause queue overflow/loss Max burst size for TCP and/or simple rate pacing Critical also during restart after idle Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 104Ack Compression  What if acks encounter queuing delay Smooth ack clocking is destroyed Basic assumption that acks are spaced due to packets traversing forward bottleneck is violated Sender receives a burst of acks at the same time and sends out corresponding burst of data Has been observed and does lead to slightly higher loss rate in subsequent window Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 105Bandwidth Asymmetry  Could congestion on the reverse path ever limit the throughput on the forward link  Let’s assume MSS = 1500bytes and delayed acks  For every 3000 bytes of data need 40 bytes of acks  75:1 ratio of bandwidth can be supported  Modem uplink (28.8Kbps) can support 2Mbps downlink  Many cable and satellite links are worse than this  Solutions: Header compression, linklevel support Internet 6Mbps (no congestion, A I B bandwidth 6Mbps) 32kbps Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 106TCP Congestion Control Summary  Sliding window limited by receiver window.  Dynamic windows: slow start (exponential rise), congestion avoidance (additive rise), multiplicative decrease.  Ack clocking  Adaptive timeout: need mean RTT deviation  Timer backoff and Karn’s algo during retransmission  GobackN or Selective retransmission  Cumulative and Selective acknowledgements  Timeout avoidance: Fast Retransmit Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 107
Website URL
Comment