Transport Protocols For High Performance

Aaron Falk, Ted Faber, Joseph Bannister, Andrew Chien, Robert Grossman, Jason Leigh


This is an early draft of the article: Transport Protocols for High Performance: Whither TCP?, Communications ACM, Volume 46, Issue 11, November, 2003, pages 42-49.


Introduction

The stability of today's Internet relies on the congestion control built into the Transmission Control Protocol (TCP). After the National Science Foundation's backbone network experienced repeated networkwide congestion collapses in the 1980s, Van Jacobson, then a researcher at Lawrence Berkeley National Laboratory, devised congestion-control algorithms for TCP [5] allowing thousands of users to share a bottleneck in a stable manner. TCP maintains a congestion window governing how many bytes of data a sender may send without acknowledgment from the receiver. When the sender infers network congestion - usually by failing to receive a timely acknowledgment of a group of bytes - the congestion window quickly closes to stanch the admission of any more data into the network; the congestion window slowly opens as congestion abates.
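
To make these dynamics concrete, the window logic can be sketched in a few lines of code. The following Python fragment is a minimal illustration of slow start plus additive-increase/multiplicative-decrease; the names and constants are our own choices, not those of any particular TCP implementation.

    MSS = 1460  # maximum segment size in bytes (a typical Ethernet value)

    class AimdWindow:
        """Minimal, illustrative sketch of Jacobson-style congestion control."""
        def __init__(self, ssthresh=64 * 1024):
            self.cwnd = MSS           # congestion window, in bytes
            self.ssthresh = ssthresh  # slow-start threshold, in bytes

        def on_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += MSS                     # slow start: doubles per round trip
            else:
                self.cwnd += MSS * MSS // self.cwnd  # avoidance: about +1 MSS per round trip

        def on_loss(self):
            # A missing acknowledgment is read as congestion: halve the window.
            self.ssthresh = max(self.cwnd // 2, 2 * MSS)
            self.cwnd = self.ssthresh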

Jacobson's congestion control also exhibits several shortcomings. His algorithms provide poor utilization of the bottleneck link when large bandwidth-delay product (BDP) flows pass through a bottleneck router; flows with long round-trip times are often unable to get a fair allocation of bottleneck bandwidth; and packet loss due to transmission errors is incorrectly interpreted by the algorithm as a sign of congestion. While these limitations represented an acceptable tradeoff for the sake of network stability in the 1980s, the Internet's subsequent growth and worldwide expansion have meant faster links and increased diversity in network access technologies. Today, supercomputer, satellite, and mobile-phone Internet users experience performance penalties because their expensive links interact poorly with TCP (see the article by Newman et al. in this section). TCP performs poorly in three scenarios:

Poor link utilization at high BDP. It is notably difficult to obtain high TCP throughput and, hence, good link utilization for high-BDP flows traversing end-to-end paths with high bandwidth, high delay, or both. As described in [2], a single TCP flow can saturate a 10Gbps link, incurring large, oscillatory queues. However, sustaining these high flow rates requires unrealistically low packet loss.

Unfair at long round-trip times. TCP flows with long round-trip times (in satellite and cellular networks) have difficulty obtaining their fair share of bandwidth on a bottleneck link. Once TCP enters congestion-avoidance mode, long round-trip-time flows open their congestion windows slowly. As a result, short round-trip-time flows obtain higher throughput.

Confused by lossy links. TCP uses packet loss as a binary indicator of congestion; there is no feedback on the intensity of congestion. Packet loss is a reasonable congestion indicator on links with negligible noncongestion-related packet loss. However, many wireless links are subject to, for example, uncorrectable bit errors; as a result, TCP treats them as congested networks and underutilizes them.

Here, we look at ways to improve performance in these environments, emphasizing the first two problems.
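
To see why the first two problems are fundamental, consider the well-known TCP response function, throughput ≈ (MSS/RTT) × (1.22/√p): the packet-loss rate p required to sustain a given rate falls with the square of that rate, and at a given loss rate throughput falls linearly with round-trip time. The short calculation below is our own illustration, using the 1,500-byte packets and 100ms round-trip time discussed in [2] and hypothetical round-trip times for the satellite comparison.

    # (1) Invert the TCP response function rate = (MSS/RTT) * (1.22/sqrt(p))
    # to find the loss rate p needed to sustain a target rate (scenario one).
    def loss_rate_needed(target_bps, mss_bytes=1500, rtt_s=0.100):
        segments_per_sec = target_bps / (8 * mss_bytes)
        return (1.22 / (segments_per_sec * rtt_s)) ** 2

    print(loss_rate_needed(10e9))  # ~2e-10: one loss per ~5 billion packets

    # (2) At equal loss rates, throughput scales as 1/RTT (scenario two): a
    # 500ms satellite flow gets ~1/25th the rate of a 20ms terrestrial flow.
    print(0.500 / 0.020)  # 25.0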

Explicit Control Protocol

The recently developed Explicit Control Protocol (XCP) [7] is a congestion-control system well suited for high-BDP networks. It delivers the highest possible application performance over a range of network infrastructure, including extremely high-speed and high-delay links not well served by TCP. It achieves maximum link utilization and reduces wasted bandwidth due to packet loss. It separates the efficiency and fairness policies of congestion control, enabling routers to quickly make use of available bandwidth while conservatively managing the allocation of bandwidth to flows. It is built on the principle of carrying per-flow congestion state in packets. And its packets each carry a small congestion header through which the sender requests a desired throughput. Routers compute a fair per-flow bandwidth allocation without maintaining any per-flow state. The sender thus learns of the bottleneck router's allocation in a single round trip (see Figure 1).
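
Concretely, the congestion header defined in [7] carries the sender's current congestion window, its round-trip-time estimate, and a feedback field that each router along the path may only reduce. The rendering below is our schematic of these semantics; the field names follow the paper, but the code is not the XCP wire format.

    from dataclasses import dataclass

    @dataclass
    class XcpCongestionHeader:
        h_cwnd: float      # sender's current congestion window (bytes)
        h_rtt: float       # sender's round-trip-time estimate (seconds)
        h_feedback: float  # requested window change; routers may only reduce it

    def router_update(hdr: XcpCongestionHeader, local_feedback: float) -> None:
        # Each router computes a per-packet allocation without per-flow state
        # and overwrites h_feedback only if its allocation is smaller; the
        # receiver echoes the final value back to the sender.
        hdr.h_feedback = min(hdr.h_feedback, local_feedback)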

XCP is provably stable and has been shown, through simulation, to scale with numbers of flows, rates of flows, and variance in flow rate and round-trip times. Simulations show that under these conditions XCP almost never drops packets.

This new congestion control is an opportunity to develop a more systematic framework for managing Internet resources, including link bandwidth, router memory, and processing power. It effectively decouples congestion control from bandwidth-allocation policy. The router has two separate controllers (see Figure 2): the Efficiency Controller for preventing congestion and maintaining high utilization and the Fairness Controller for splitting the aggregate bandwidth fairly among connections sharing the link.
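
In [7], the Efficiency Controller computes an aggregate feedback once per control interval (about one average round-trip time) from the link's spare bandwidth and persistent queue. The sketch below uses the paper's stability constants (alpha = 0.4, beta = 0.226); the variable names and unit conversions are ours.

    ALPHA, BETA = 0.4, 0.226  # stability constants from [7]

    def aggregate_feedback(avg_rtt_s, capacity_bps, input_bps, queue_bytes):
        # phi = alpha * d * S - beta * Q: close the gap between input rate and
        # capacity while draining the queue, expressed in bytes of window.
        spare_bytes_per_s = (capacity_bps - input_bps) / 8.0
        return ALPHA * avg_rtt_s * spare_bytes_per_s - BETA * queue_bytes

    # The Fairness Controller then apportions this aggregate among packets:
    # positive feedback is divided equally across flows, while negative
    # feedback is taken from each flow in proportion to its current rate --
    # in effect, AIMD applied at the router rather than at the endpoints.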

Routers in the core of the Internet typically carry many thousands of flows concurrently. As a result, algorithms requiring per-flow state have poor scaling properties; XCP provides fine-grain, per-flow congestion feedback without retaining per-flow state in the routers. The necessary per-flow state is carried in the packets. The routers execute a few multiplications and additions per packet. The result is an algorithm that is fast, scalable, and robust.

XCP is a good fit for high-performance computing, including petabyte-size file transfers and networks with very expensive bandwidth (see the articles by DeFanti et al. and by Smarr et al. in this section). It is fast, enabling rapid file transfers. It grabs idle bandwidth quickly, efficiently utilizing expensive, high-speed links. It provides congestion control, allowing network resources to be shared with applications that are not high performance and traffic that is not XCP. And it scales efficiently and fairly across a range of link and flow bandwidths, numbers of flows, and round-trip times, allowing high-performance applications to extend over the Internet.

However, because XCP requires all routers along the path to participate, deployment is a concern. XCP may be deployed on a cloud-by-cloud basis, providing some increased link utilization within each cloud. It may also be deployed end-to-end by upgrading all routers within specialized networks, a reasonable scenario for high-performance, scientific networks, including OptIPuter networks.

During an XCP deployment, TCP traffic is also present, thus raising the issue of XCP and TCP compatibility, or XCP's TCP-friendliness. Simulations have shown, and we expect experiments to validate, that XCP can be used on a network with TCP traffic without unfairly penalizing the TCP traffic. This coexistence is accomplished through separate queues partitioning bandwidth in XCP-enabled routers. The XCP algorithm then optimizes the utilization of a portion of the available bandwidth, allowing TCP to employ its probing mechanism to accumulate the remainder more slowly.
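
The partitioning idea can be pictured as weighted service of two queues; the sketch below is our illustration only, not a mechanism specified for XCP or for any particular router, and the 70/30 split is hypothetical.

    from collections import deque
    from itertools import cycle

    queues = {"xcp": deque(), "tcp": deque()}    # per-protocol queues
    schedule = cycle(["xcp"] * 7 + ["tcp"] * 3)  # hypothetical 70/30 split

    def next_packet():
        # Weighted round-robin: serve each queue in proportion to its share,
        # skipping a queue when it is empty.
        for _ in range(10):                      # one full schedule round
            q = next(schedule)
            if queues[q]:
                return queues[q].popleft()
        return None                              # both queues empty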

To evaluate XCP's performance in real networks, a number of research groups, including at the Massachusetts Institute of Technology, the Information Sciences Institute (ISI) at the University of Southern California, and Cisco Systems, are collaborating on implementing and testing XCP. The implementation will run on Linux hosts and Cisco routers. An initial implementation by ISI (see www.isi.edu/isi-xcp) will be used as a key component for providing high levels of end-to-end performance to OptIPuter applications.

Most congestion-control research emphasizes incremental changes to TCP, including efforts to address TCP's shortcomings. Sally Floyd [2] has introduced several changes to TCP to improve performance for high-speed networking. HighSpeed TCP (HS-TCP) involves a subtle change in the congestion-avoidance response function to allow high-BDP flows to capture available bandwidth more readily. Amit Jain, together with Floyd [6], has devised Quick-Start, a mechanism allowing an end system to request a higher initial sending rate from routers along the connection's path. Tom Kelly has developed a TCP modification called Scalable TCP, similar in nature to HS-TCP in that the congestion-window response function for large windows is modified to recover more quickly from loss events, reducing the penalty for probing for available bandwidth. Steven Low has developed the Fast Active-Queue-Management Scalable TCP (FAST) algorithm (based on a modification of TCP Vegas) and demonstrated its stability for high-speed, large-BDP flows.
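
The flavor of these modifications is easiest to see in the window-update rules themselves. The comparison below is our illustration; the Scalable TCP constants (increase of 0.01 per acknowledgment, decrease factor of 0.125 on loss) are Kelly's published ones, and standard TCP is the familiar additive increase with halving on loss.

    def standard_tcp(w, loss):
        # AIMD: about +1 segment per round trip; halve the window on loss.
        return w * 0.5 if loss else w + 1.0 / w

    def scalable_tcp(w, loss):
        # Multiplicative rules: recovery time after a loss is independent of
        # the window size, so large windows rebuild in a fixed number of RTTs.
        return w * (1 - 0.125) if loss else w + 0.01

    # HS-TCP [2] instead makes both the per-ACK increase and the loss-time
    # decrease functions of the current window: large windows grow faster and
    # back off by less than half, while small windows behave like standard TCP.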

XCP, however, differs significantly from these proposals. By introducing explicit, nonbinary feedback from the network to the endpoints, XCP achieves several important functional and performance advantages. First, it enables large-BDP flows to ramp up to broadband rates more quickly than current versions of TCP, in both start-up and steady-state operation, and obtains the maximum performance supported by the infrastructure under the greatest range of challenging conditions. Rather than being only a modification or tuning of TCP, it also introduces a novel (and general) framework for resource management. XCP's basic building blocks can be used by a range of protocols with different semantics, meeting the high-performance communication needs of most scientific and commercial applications.

By making bandwidth allocation explicit, it also provides fairness to "equal" flows—when TCP would not—and implements a nearly cost-free service-differentiation mechanism as needed. How this simple and powerful unified bandwidth-allocation strategy supports traditional quality-of-service concepts, including tiered service levels and bandwidth guarantees, is suggested in [7]. These advantages are not, however, without cost. But XCP's tradeoff - additional complexity in the network routers and switches - is minor and manageable, and the resulting advantages are profound.

TCP Striping

While XCP proposes changes to the network infrastructure by modifying router behavior, other systems address performance issues without modifying the infrastructure. Because they require no modifications inside the network, these systems can be prototyped more quickly.

Striped TCP. One approach to improving TCP performance for data-intensive applications is to adjust the TCP window size so it is equal to the BDP. This approach requires modifying and tuning the kernel of each of the operating systems participating in the networked application. Further improvements can be realized by reducing the per-byte overhead by using large packets.
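
At its simplest, this tuning amounts to sizing the socket buffers to the path's BDP so the TCP window can fill the pipe. The sketch below uses the standard sockets API; the 622Mbps/110ms path mirrors the experiment described later in this section, and the kernel may still cap the requested sizes, which is why kernel-level tuning is also needed.

    import socket

    # Size socket buffers to the path's bandwidth-delay product.
    bdp_bytes = int((622e6 / 8) * 0.110)  # ~8.5MB for 622Mbps at 110ms

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    # The kernel may silently cap these values at its configured maximums.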

Another approach to overcoming the limitations of TCP is to stripe data over several standard TCP connections; portions of individual files are sent in parallel over these connections. The aggregate window (the sum of the individual windows) opens quickly and is less affected than a single large window when congestion closes the window of any individual connection. The aggregate connection thus achieves high throughput more quickly than a single connection and is less sensitive to transient loss or congestion.
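
A minimal sketch of the idea follows, striping one buffer round-robin across several ordinary TCP connections. It is our illustration only; production systems such as PSockets [11] also tag chunks for reassembly and handle connection failures.

    import socket

    def striped_send(data: bytes, addrs, chunk=64 * 1024):
        # Open one TCP connection per (host, port) address and send successive
        # chunks round-robin across them. A real striped transport also labels
        # chunks with offsets so the receiver can reassemble them in order.
        socks = [socket.create_connection(a) for a in addrs]
        for i in range(0, len(data), chunk):
            socks[(i // chunk) % len(socks)].sendall(data[i:i + chunk])
        for s in socks:
            s.close()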

TCP striping can be done at the level of data middleware or the application. Striping can also be combined with tuning to achieve higher overall performance. Striped connections can use a single network interface card on a single host, multiple network interface cards on a single host, or several network interface cards on several hosts.

Striped TCP has been implemented many times; some early implementations targeted satellite communications [1]; more recently, the focus has shifted to improving the performance of distributed data-intensive applications, including GridFTP (see globus.org) and PSockets [11].

In experiments, striped TCP easily delivers more than 300Mbps on an OC-12 (622Mbps) transatlantic link with 110ms packet transit time using standard workstations and 1Gbps network interface cards [3]. However, achieving such high bandwidth can be problematic.

UDP data channel/TCP control channel. Researchers recently sought to develop protocols employing a UDP-based Data Channel (UDC) controlled by a TCP-based Control Channel (TCC), or UDC/TCC protocols. UDP has no flow control, rate control, or reliable transmission mechanisms; UDC/TCC protocols implement these control functions in a separate TCP control channel. The UDC/TCC protocols yield three main advantages:

Deployment. They can be deployed at the application level and require no changes to routers or other network infrastructure.

Bandwidth. In experiments, their use of available bandwidth is extremely efficient. For example, during the iGrid 2002 conference in Amsterdam, The Netherlands, implementations of both the Simple Available Bandwidth Utilization Library (SABUL) protocol and the Reliable Blast User Datagram Protocol (RBUDP) [9] consistently averaged approximately 920Mbps under the same conditions as for the striped TCP described earlier. A function provided in [4] allows applications to predict the performance of RBUDP on a given network.

Rate control. SABUL [10], Tsunami, an experimental high-speed network file transfer protocol (see www.anml.iu.edu/anmlresearch.html), and other implementations use a rate-control mechanism based not on windows but on inter-packet arrival time. This rate-based system provides a smoother sending pattern than TCP's sawtooth, and the fixed control interval is more effective on high-BDP networks. Such rate-control mechanisms are fair in the sense that they allow multiple streams to share network resources but enforce a different standard of fairness from traditional Internet protocols.
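
At bottom, such rate control reduces to pacing packet departures at a fixed interval derived from the target rate. The sketch below illustrates only this principle, not any of the named implementations, which add their own loss recovery and rate adaptation over the TCP control channel.

    import socket
    import time

    def paced_udp_send(data: bytes, dest, rate_bps: float, payload=1400):
        # Send data over UDP at a fixed rate by spacing packet departures.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        interval = payload * 8 / rate_bps        # seconds between packets
        next_send = time.monotonic()
        for i in range(0, len(data), payload):
            while time.monotonic() < next_send:  # spin for precise pacing
                pass
            sock.sendto(data[i:i + payload], dest)
            next_send += interval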

Cluster and Storage-Area-Network Protocols

Backplanes and networks are converging. The physical-link technologies of high-speed cluster networks, such as Myrinet (see www.myri.com) and VIA (see www.viarch.org), and of storage area networks (SANs), including FibreChannel (see www.fibrechannel.org) and InfiniBand (see www.infinibandta.org), are converging toward shared fiber and copper link technologies and speeds. Cluster networks support SAN-style access to storage devices, while SANs adopt the more general approaches of cluster networks.

Despite convergence at the physical layer, the protocols used in cluster and SAN environments differ dramatically from wide-area network protocols, including TCP/IP. SAN protocols depend on low latency, low loss rates, support for communication groups, and aggressive sending of data. The convergence of physical-link technologies has inspired a range of research experiments exploring synergies, including TCP/IP protocols and networks for accessing remote storage and cluster protocols for accessing across wide-area IP networks.

It seems likely that TCP and other application protocols incorporating SAN features will support high-speed storage access more directly. It also seems likely that as SANs grow larger they will need to confront the same congestion and addressing issues the Internet protocols have had to confront. Figure 3 shows how current systems express this convergence of function by combining protocols in their stacks.

Research on the Group Transport Protocol (GTP) at the University of California, San Diego, combines the information for multiple flows at a receiver to achieve rapid response and fairness. This approach is motivated by both the applications and the resulting communication structures. Many users may share a single storage device; alternatively, clusters may access one another's distributed data as if it were a single device. Supporting these interactions requires new protocol designs coordinating the management of multiple flows.
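
One way to picture receiver-driven flow management is a receiver dividing its inbound capacity among the flows it currently sees and returning each sender its share. The sketch below is entirely our illustration of that idea, not the GTP implementation; a real receiver-based scheme can also weight shares by demand and piggyback the allocations on acknowledgment traffic.

    def receiver_allocations(capacity_bps: float, flows: list) -> dict:
        # Split the receiver's inbound capacity evenly among active flows.
        return {f: capacity_bps / len(flows) for f in flows} if flows else {}

    # Three senders reading one storage device over a 10Gbps interface:
    print(receiver_allocations(10e9, ["a", "b", "c"]))  # ~3.3Gbps each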

We focus on high-speed SAN applications flexibly sharing storage or distributed data collections; such applications include distributed cooperative visualization and real-time group communication. Beyond that, the more expressive protocols enable new applications.

These applications differ from TCP-centric formulations of high-speed network communication in their need for multipoint, high-speed, real-time, secure communication. A number of them optimize for different performance metrics (such as time-to-bandwidth and time to transmit a terabyte).

Advantages of cluster and SAN protocols. Traditional cluster and SAN protocols operate in a high-bandwidth, highly reliable, low-latency cluster and LAN/MAN environment. They provide message-based remote-procedure-call-like interactions and even support one-sided high-speed block transfer with such mechanisms as remote direct memory access. To achieve good performance, the protocols stream data aggressively, manage packet drops as transient errors, and manage buffers to support multiple high-bandwidth flows at a receiver.

Aggressive streaming of data fills high-BDP pipes, delivering peak performance. Because endpoints view packet drops as transient errors, packet transmission errors are separated from congestion control. Supporting multiple flows at an endpoint enables efficient group communication at high speed, an essential prerequisite for the application structures this research aims to support.

These aggressive protocols are effective in cluster/SAN environments due to the presence of link or network-level flow control providing fairly reliable delivery. The presence of plentiful bandwidth is a critical factor. Because of their tight integration with operating systems, many cluster/SAN protocols support direct access to network storage devices. However, extending them to large partially shared networks, including WANs, in the context of the OptIPuter presents several challenges.

Cluster/SAN protocols in WAN environments. The principal differences between cluster/SAN environments and WAN-type networks include: long latencies; varied structure and performance (irregular topologies, shared network links); and higher loss rates (packet loss, no link/network flow control). For the protocol designer enhancing cluster/SAN protocols, these differences present a number of challenges.

Our work with extensions of cluster/SAN protocols aims to address these challenges and others. In particular, we are looking to devise receiver-based rate control to manage fast-start and group communication, as well as variable source coding to deal with loss. Early experiments suggest that the narrow domain of network structure and bountiful network bandwidth will admit simple solutions to these problems.

Conclusion

Several projects within the OptIPuter framework are pursuing research on transport protocols in networking environments where TCP and Internet protocols perform poorly. XCP addresses large-BDP network performance when routers might help in the congestion-control process; moreover, it scales well and reacts quickly. Several techniques for tuning existing protocols and controlling sending rates to provide high performance in cluster or Grid environments are also being employed in such protocols as UDC/TCC, SABUL, RBUDP, and Tsunami. They provide high bandwidth to applications in environments where tens, rather than thousands, of users share the network but performance requirements are more demanding. Finally, GTP is beginning to realize the integration of TCP and SAN protocols. In clusters, the needed communication paradigms are richer, thus requiring multipoint high-bandwidth solutions in a constrained environment.

Today, many of these transport systems are constrained to operate in networks or parts of a network where their operating requirements can be met. In order for user communities to expand into the wide area, wide-area transport protocols, including TCP, may well have to adopt some or all of these techniques.

References

1. Allman, M., Ostermann, S., and Kruse, H. Data transfer efficiency over satellite circuits using a multi-socket extension to the File Transfer Protocol (FTP). In Proceedings of the ACTS Results Conference (NASA Lewis Research Center, Cleveland, OH, Sept. 11–13, 1995).

2. Floyd, S. HighSpeed TCP for Large Congestion Windows. Internet draft, Feb. 2003; see www.icir.org/floyd/papers/draft-floyd-tcp-highspeed-02.txt.

3. Grossman, R., Gu, Y., Hanley, D., Hong, X., Lillethun, D., Levera, J., Mambretti, J., Mazzucco, M., and Weinberger, J. Experimental studies using photonic data services at iGrid 2002. In a special issue on iGrid 2002, C. de Laat, M. Brown, and T. DeFanti, Eds. J. Fut. Comput. Syst. 19, 6 (Aug. 2003).

4. He, E., Leigh, J., Yu, O., and DeFanti, T. Reliable Blast UDP: Predictable high-performance bulk data transfer. In Proceedings of IEEE International Conference on Cluster Computing 2002 (Chicago, IL, Sept. 23–26). IEEE Computer Society Press, 2002, 317–324.

5. Jacobson, V. Congestion avoidance and control. In Proceedings of ACM Sigcomm'88 (Palo Alto, CA, Aug. 16–18). ACM Press, New York, 1988, 314–329.

6. Jain, A. and Floyd, S. Quick-Start for TCP and IP. Internet draft, Oct. 2002; see www.icir.org/floyd/papers/draft-amit-quick-start-02.txt.

7. Katabi, D., Handley, M., and Rohrs, C. Internet congestion control for future high-bandwidth-delay product environments. In Proceedings of ACM Sigcomm'02 (Pittsburgh, PA, Aug. 19–21). ACM Press, New York, 2002, 89–102; see www.acm.org/sigcomm/sigcomm2002/papers/xcp.pdf.

8. Lauria, M., Pakin, S., and Chien, A. Efficient layering for high-speed communication: Fast Messages 2.x. In Proceedings of the 7th High-Performance Distributed Computing Conference (Chicago, July 28–31). IEEE Computer Society Press, 1998.

9. Leigh, J., Yu, O., Schonfeld, D., Ansari, R., et al. Adaptive networking for tele-immersion. In Proceedings of the Immersive Projection Technology/Eurographics Virtual Environments Workshop (Stuttgart, Germany, May 16–18). Eurographics, Aire-la-Ville, Switzerland, 2001, 199–208.

10. Sivakumar, H., Grossman, R., Mazzucco, M., Pan, Y., and Zhang, Q. Simple available bandwidth utilization library for high-speed wide-area networks. J. Supercomput. (2003).

11. Sivakumar, H., Bailey, S., and Grossman, R. PSockets: The case for application-level networking striping for data-intensive applications using high-speed wide area networks. In Proceedings of SC2000 (Dallas, TX, Nov. 4–10). ACM Press, New York, 2000.

12. von Eicken, T., Culler, D., Goldstein, S., and Schauser, K. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia, May 19–21). IEEE Computer Society Press, 1992, 256–266.