The Taub Faculty of Computer Science Events and Talks

ceClub: The Technion Computer Engineering Club
Speaker: Sagi Shahar (EE, Technion)
Date: Tuesday, 14.06.2016, 14:30
Location: EE Meyer Building 1061

Modern discrete GPUs have been the processors of choice primarily for compute-intensive applications, but using them for large-scale data processing is extremely challenging. They do not provide important I/O abstractions that are long established in the CPU context, such as memory-mapped files, which shield programmers from the complexity of buffer and I/O device management. Implementing these abstractions on GPUs poses a problem, however: the limited GPU virtual memory hardware does not support page faults and cannot modify memory mappings for a running GPU kernel.

We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables fully functional memory-mapped files on commodity GPUs. To access a file mapped into GPU memory, developers use active pointers, which behave like regular pointers but, under the hood, access the GPU page cache and trigger page faults that are handled on the GPU. To make the implementation efficient, we design and evaluate a number of novel mechanisms, such as a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp.
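
As an illustration only, the idea of a pointer-like object backed by a software page cache can be sketched on the CPU in plain C++. This is not the ActivePointers API and not GPU code: every name below (`BackingStore`, `PageCache`, `ActivePtr`, `kPageSize`) is hypothetical, the "file" is an in-memory byte vector, pages are never evicted, and the warp-level translation aggregation described above is not modeled. The sketch only shows a dereference that first consults a per-pointer cached translation, then the page cache, and finally falls back to a software page fault that loads the page.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// All names here are hypothetical; this is a CPU-side sketch, not the real system.
constexpr std::size_t kPageSize = 4096;

// Stand-in for a file on disk.
struct BackingStore {
    std::vector<std::uint8_t> bytes;
};

// Software page cache: maps page numbers to resident page frames.
class PageCache {
public:
    explicit PageCache(const BackingStore& store) : store_(store) {}

    // Returns the frame holding `page`, loading it on a miss (the "page fault").
    const std::uint8_t* frame(std::size_t page) {
        auto it = frames_.find(page);
        if (it == frames_.end()) {
            ++faults_;
            const std::size_t size = store_.bytes.size();
            assert(page * kPageSize < size && "access past the end of the file");
            const std::size_t begin = std::min(page * kPageSize, size);
            const std::size_t end = std::min(begin + kPageSize, size);
            std::vector<std::uint8_t> data(kPageSize, 0);  // zero-fill a partial last page
            std::copy(store_.bytes.begin() + begin, store_.bytes.begin() + end, data.begin());
            it = frames_.emplace(page, std::move(data)).first;
        }
        // Frames are never evicted in this sketch, so the pointer stays valid.
        return it->second.data();
    }

    std::size_t faults() const { return faults_; }

private:
    const BackingStore& store_;
    std::unordered_map<std::size_t, std::vector<std::uint8_t>> frames_;
    std::size_t faults_ = 0;
};

// Pointer-like handle into the mapped file. It remembers its last translation,
// loosely analogous to keeping a translation cache close to the thread.
class ActivePtr {
public:
    ActivePtr(PageCache& cache, std::size_t offset) : cache_(&cache), offset_(offset) {}

    std::uint8_t operator*() { return at(offset_); }
    std::uint8_t operator[](std::size_t i) { return at(offset_ + i); }

    ActivePtr operator+(std::size_t n) const {
        ActivePtr p = *this;  // the copy keeps the cached translation
        p.offset_ += n;
        return p;
    }

private:
    std::uint8_t at(std::size_t off) {
        const std::size_t page = off / kPageSize;
        if (frame_ == nullptr || page != cached_page_) {  // translation miss
            frame_ = cache_->frame(page);                 // may trigger a page fault
            cached_page_ = page;
        }
        return frame_[off % kPageSize];
    }

    PageCache* cache_;
    std::size_t offset_;
    std::size_t cached_page_ = 0;
    const std::uint8_t* frame_ = nullptr;
};
```

In this sketch, code that holds an `ActivePtr` reads bytes with `*p` or `p[i]` exactly as it would through a raw pointer. Repeated accesses within one page take the fast path through the cached translation, and `PageCache::faults()` counts how many pages had to be loaded.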

We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40 GB file. The GPU implementation maps the whole file into GPU memory and accesses it via active pointers. The use of active pointers adds at most 1% to the application's runtime, while enabling speedups of up to 3x over a combined CPU+GPU implementation and 3.5x over a 12-core CPU-only run.