The rapid growth of data-intensive applications and high-speed I/O devices has led to increasing demands on I/O performance in both virtualized cloud environments and bare metal setups. However, existing systems struggle to fully exploit the potential of modern hardware due to inefficiencies at various layers of the I/O stack. This thesis presents three novel techniques that optimize I/O performance across virtualized and bare metal environments: IOctopus, cinterrupts, and Hermes.
IOctopus eliminates non-uniform DMA (NUDMA) effects in multi-CPU systems by connecting the I/O device to every CPU and abstracting its multiple PCIe endpoints into a single logical entity. By transforming all remote DMA operations into local ones, IOctopus improves I/O throughput and latency by as much as 2.7x and 1.28x, respectively.
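The routing idea behind IOctopus can be illustrated with a minimal sketch: the device exposes one PCIe endpoint per NUMA node, and the driver steers each request through the endpoint attached to the submitting CPU's node, so every DMA stays local. The topology constants and function names below are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Illustrative two-node topology with contiguously numbered CPUs;
 * real systems would query this from the platform. */
#define CPUS_PER_NODE 4

/* NUMA node of a CPU under the contiguous-numbering assumption. */
static int cpu_to_node(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/* Select the PCIe endpoint co-located with the submitting CPU, so
 * the resulting DMA never crosses the inter-CPU interconnect. */
static int pick_endpoint(int cpu)
{
    return cpu_to_node(cpu);
}
```

Because software always submits through the node-local endpoint, the remote-DMA case simply never arises, which is what lets IOctopus present the device as a single entity while keeping all traffic local.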
Cinterrupts enables fine-grained control over interrupt generation in modern high-speed storage devices by allowing software to indicate which I/O requests are latency sensitive. With this information, the device can “calibrate” its interrupts to the completions of latency-sensitive operations. While primarily designed for NVMe SSDs, the cinterrupts principle can be extended to other I/O technologies. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency even when interrupts are coalesced.
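The calibration idea can be sketched in a few lines: non-urgent completions are coalesced up to a threshold, while a completion that software marked as latency sensitive fires an interrupt immediately and flushes the pending batch. The struct fields and flag name below are illustrative, not the actual cinterrupts/NVMe interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-request hint set by software. */
struct io_request {
    int  id;
    bool urgent;      /* true for latency-sensitive I/O */
};

/* Simplified device-side coalescing state. */
struct device {
    int pending;      /* completions not yet signaled */
    int threshold;    /* coalescing threshold for non-urgent I/O */
    int interrupts;   /* interrupts raised so far */
};

/* Returns true if this completion raises an interrupt: urgent
 * completions fire immediately (flushing the batch), while
 * non-urgent ones are batched up to the threshold. */
static bool complete(struct device *d, const struct io_request *r)
{
    d->pending++;
    if (r->urgent || d->pending >= d->threshold) {
        d->pending = 0;
        d->interrupts++;
        return true;
    }
    return false;
}
```

Under this policy, throughput-oriented I/O still enjoys coalescing, while latency-sensitive requests never wait out a coalescing timer, which is how calibrated interrupts achieve lower latency even when interrupts are coalesced.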
In high-throughput virtualized setups, operators may choose to employ dedicated hypervisor cores – denoted “sidecores” – to process the network I/O of virtual machines (VMs). Sidecores reduce virtualization overheads by eliminating architectural exits (polling on virtual queues instead) and by reducing the number of context switches between virtual CPUs and their corresponding virtual I/O processing threads. The problem is that in existing systems, the decision of whether to use sidecores, and how many, is made statically, significantly limiting the applicability of this optimization. We solve this problem with Hermes, which adapts to changing workloads and matches the performance of optimally-tuned static configurations at any point in time. Hermes improves throughput by as much as 12x, reduces CPU consumption by up to 20%, and shortens tail latency by up to 63%.
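The contrast with static configuration can be sketched as a simple adaptation policy: the number of dedicated sidecores tracks the observed virtual-I/O load instead of being fixed at boot. The load metric, thresholds, and rounding below are illustrative placeholders, not Hermes' actual policy, which is described in the thesis body.

```c
#include <assert.h>

/* Hypothetical policy: given the measured I/O-processing load
 * (in units of fully busy cores), pick how many sidecores to
 * dedicate, capped by what the operator allows. */
static int choose_sidecores(double io_load_cores, int max_sidecores)
{
    /* Under light load, polling sidecores would waste CPU: fall
     * back to the interrupt-driven path with no dedicated cores. */
    if (io_load_cores < 0.5)
        return 0;

    /* Otherwise dedicate roughly one sidecore per core's worth of
     * I/O load, rounding to the nearest whole core. */
    int n = (int)(io_load_cores + 0.5);
    return n < max_sidecores ? n : max_sidecores;
}
```

Re-evaluating such a policy periodically is what lets a Hermes-like system match the best static configuration at any point in time: it shrinks to zero sidecores when I/O is idle and grows them back as load returns.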
Together, IOctopus, cinterrupts, and Hermes advance the state of I/O optimization, enabling both cloud providers and bare metal operators to better utilize modern high-speed I/O devices, including advanced storage systems, and to improve overall system performance.