The Taub Faculty of Computer Science Events and Talks
Boris Pismenny (Ph.D. Thesis Seminar)
Thursday, 23.11.2023, 16:30
Advisors: Prof. Adam Morrison and Prof. Dan Tsafrir
The Internet services we enjoy in our day-to-day lives (search, social networking, online maps, video sharing, online shopping) run on Data Centers (DCs). DCs are warehouse-scale computers consisting of tens of thousands of machines interconnected via fast networks. Building and maintaining DCs is tremendously expensive: for example, Amazon's DC in Tel-Aviv spans over 100,000 square feet, and Amazon estimates that building each DC costs approximately 2.37 billion USD, of which only 280 million USD is designated for land and buildings; the rest goes to computing infrastructure. Technology companies seeking to maximize their return on investment (ROI) must utilize their DCs efficiently, which is particularly challenging because rapid changes in computer technology shift the bottleneck component in DCs. In our work, we observe that in recent years the growth in host Network Interface Controller (NIC) bandwidth has outpaced the growth in other DC host system resources, such as memory bandwidth and CPU processing capacity. At the same time, networking is the cheapest component in servers, whereas CPUs and memory are the most expensive; therefore, finding new ways to improve the utilization of NICs will improve overall DC utilization and ROI.
Our first paper, “Autonomous NIC Offloads”, tackles the problem of offloading CPU-intensive application-layer logic (e.g., encryption) onto NICs. We observe that the ideal place to perform data-intensive computations is the NIC, as network data flows through it in any case. But previous approaches to offloading data-intensive application-layer (layer-5) computations to NICs depend on offloading the underlying layer≤4 protocols (TCP, IP, routing, firewall, etc.), which encumbers innovation and imposes undesirable security and maintenance burdens. In contrast, autonomous NIC offloads accelerate data-intensive application logic without migrating the entire layer≤4 network stack to the NIC. The key challenge autonomous offloads address is coping with out-of-sequence TCP packets. On transmit, to process an out-of-sequence packet P, we leverage the software TCP retransmission buffer and the application itself to provide the NIC with the data needed to process P. On receive, out-of-sequence packets bypass the offloading logic when the NIC’s state is insufficient to perform the offload, and we use a software-hardware handshake to recover the state necessary to offload subsequent packets. We implement autonomous offloads for two protocols and computations: HTTPS encryption and authentication, and NVMe-TCP zero-copy and data digest. We also characterize the properties of protocols and computations that are autonomously offloadable, finding that most, but not all, are. Our evaluation shows that autonomous offloads increase throughput by up to 3.3x, reduce CPU utilization by up to 60%, and reduce latency by up to 30%. Software support for autonomous offloads is available in open-source projects, such as the Linux kernel and OpenSSL, and recent NVIDIA NICs support autonomous offloads in hardware.

Our second paper, “The Benefits of General-Purpose On-NIC Memory”, exposes the newly available memory on NICs (Nicmem) directly to applications.
We identify a class of applications that benefit from Nicmem, which we call “data movers”. Data-mover applications process incoming packets based on metadata, without accessing the packet data itself. We use two data-mover applications to demonstrate the benefits of Nicmem: key-value stores (KVS) and network functions (NFs). Popular NFs, such as network address translation, frequently operate on the headers, rather than the data, of incoming packets. For NFs, we introduce a packet processing architecture that splits packet headers from packet data, keeping the data on Nicmem when possible and thus reducing memory and PCIe bandwidth consumption. Our approach consequently shortens latency by up to 23% and increases throughput by up to 19%. Similarly, because KVS workloads are highly skewed, we introduce a cache of hot values that resides on Nicmem, which is closer to the wire. This design shortens skewed KVS workload latency by up to 43% and increases throughput by up to 80%.
Our third paper, “ShRing: Networking with Shared Receive Rings”, observes that today’s NIC interface for receiving packets requires provisioning enough per-core packet buffers to absorb packet bursts, but the combined size of all packet buffers, which are typically not shared between cores, can exceed the size of the last-level cache (LLC). As a result, packet processing slows down, degrading throughput and latency, because NIC and CPU memory accesses are frequently served from main memory rather than from the LLC. To alleviate this problem, we propose a new NIC interface for receiving packets, called “shRing”, which shares packet buffers between cores when memory bandwidth consumption is high. Inter-core sharing adds synchronization overhead, but this overhead is offset by the smaller memory footprint. Our experiments show that shRing increases NF throughput by up to 1.27x and reduces NF latency by up to 38x. The large latency improvement occurs when shRing reduces packet processing time below the packet interarrival time, thereby preventing CPU overload and queue buildup.