Events and Talks at the Henry and Marilyn Taub Faculty of Computer Science

Speaker: Sagi Shahar (Electrical Engineering, Technion)
Date: Tuesday, 14.06.2016, 14:30
Location: Room 1061, Meyer Building, Faculty of Electrical Engineering

Modern discrete GPUs have been the processors of choice primarily for compute-intensive applications, but using them for large-scale data processing remains extremely challenging. They do not provide important I/O abstractions that have long been established in the CPU context, such as memory-mapped files, which shield programmers from the complexity of buffer and I/O device management. Implementing these abstractions on GPUs poses a problem, however: the limited GPU virtual memory hardware does not support page faults and cannot modify memory mappings for a running GPU kernel.

We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs and enables the implementation of fully functional memory-mapped files on commodity GPUs. To access a file mapped into GPU memory, developers use active pointers, which behave like regular pointers but, under the hood, access the GPU page cache and trigger page faults that are handled on the GPU. To make the implementation efficient, we design and evaluate a number of novel mechanisms, such as a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling among threads in a single warp.
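
To make the idea of a pointer with software address translation concrete, the following is a minimal CPU-side sketch in C++. It is not the ActivePointers implementation and not GPU code: the names `PageCache`, `ActivePtr`, `kPageSize` and `sum_bytes` are hypothetical, the "file" is an in-memory byte vector, pages are never evicted, and accesses are read-only. It only illustrates the general pattern described above: a pointer-like object keeps the last page translation locally (loosely analogous to a translation cached in registers), consults a software page cache when it crosses into another page, and triggers a software "page fault" that loads the page on a miss.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed page size for this sketch only.
constexpr std::size_t kPageSize = 4096;

// Software page cache over an in-memory "file" (hypothetical, no eviction).
class PageCache {
public:
    explicit PageCache(std::vector<std::uint8_t> file) : file_(std::move(file)) {}

    // Returns the frame holding `page`; on a miss, handles the software
    // page fault by copying the page from the backing "file".
    const std::uint8_t* lookup(std::size_t page) {
        auto it = frames_.find(page);
        if (it == frames_.end()) {
            ++faults_;
            std::vector<std::uint8_t> frame(kPageSize, 0);
            const std::size_t start = page * kPageSize;
            if (start < file_.size()) {
                const std::size_t n = std::min(kPageSize, file_.size() - start);
                std::memcpy(frame.data(), file_.data() + start, n);
            }
            it = frames_.emplace(page, std::move(frame)).first;
        }
        return it->second.data();
    }

    std::size_t faults() const { return faults_; }

private:
    std::vector<std::uint8_t> file_;
    std::unordered_map<std::size_t, std::vector<std::uint8_t>> frames_;
    std::size_t faults_ = 0;
};

// Pointer-like handle addressed by file offset (hypothetical, read-only).
template <typename T>
class ActivePtr {
public:
    ActivePtr(PageCache& cache, std::size_t byte_offset)
        : cache_(&cache), offset_(byte_offset) {}

    T operator*() { return load(offset_); }
    T operator[](std::size_t i) { return load(offset_ + i * sizeof(T)); }
    ActivePtr& operator++() { offset_ += sizeof(T); return *this; }

private:
    T load(std::size_t off) {
        // Simplification: a value must not straddle a page boundary.
        assert(off % kPageSize + sizeof(T) <= kPageSize);
        const std::size_t page = off / kPageSize;
        // Fast path: reuse the locally cached translation for the same page.
        if (frame_ == nullptr || page != cached_page_) {
            frame_ = cache_->lookup(page);  // may trigger a software page fault
            cached_page_ = page;
        }
        T value;
        std::memcpy(&value, frame_ + off % kPageSize, sizeof(T));
        return value;
    }

    PageCache* cache_;
    std::size_t offset_;
    std::size_t cached_page_ = 0;
    const std::uint8_t* frame_ = nullptr;
};

// Usage: sum the first `size` bytes of the mapped "file" via an active pointer.
inline std::uint64_t sum_bytes(PageCache& cache, std::size_t size) {
    ActivePtr<std::uint8_t> p(cache, 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < size; ++i) total += p[i];
    return total;
}
```

In this sketch, a sequential scan such as `sum_bytes` consults the page cache once per page rather than once per access, which is the intuition behind keeping translations close to the thread. A real GPU design must additionally handle eviction, write-back, and coordination among the threads of a warp, none of which is modeled here.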

We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the whole file into GPU memory and accesses it via active pointers. The use of active pointers adds at most 1% to the application's runtime, while enabling speedups of up to 3x over a combined CPU+GPU implementation and 3.5x over a 12-core CPU-only run.