Skip to content (access key 's')
Logo of Technion
Logo of CS Department
Logo of CS4People
Events

The Taub Faculty of Computer Science Events and Talks

Keyword Search in Deduplicated Storage Systems
event speaker icon
Nadav Elias (M.Sc. Thesis Seminar)
event date icon
Monday, 12.07.2021, 12:00
event location icon
Zoom Lecture: 93832460026
event speaker icon
Advisor: Dr. G. Yadgar
Deduplication is a widely implemented technique in storagesystems to reduce overall storage costs, or effectively increaselogical capacity, by replacing redundant chunks of data with references. As deduplication becomes a core component of storage systems, there is an opportunity to rethink storage functions to leverage the properties of deduplication, that thelogical size of a storage system may be many multiples ofthe physical data size. Specifically, we focus on the common task of performing a search for a string and demonstrate how to re-implement search to be deduplication-aware for faster performance. We demonstrate that a physical scan over the storage spaceis more efficient than anaive searchover the logical storagespace (e.g. reading all of the files) since we can read databased on its physical layout and avoid reading data regions repeatedly if referenced multiple times. Deduplication works on chunks of data, and this introduces a new complexity as a query term may be split across two chunks. We address split terms by maintaining a table of suffix and prefix partial matches that may be completed during a logical phase that processes file representations (file recipes). The table of suffixand prefix matches can potentially grow to a large size, and we present optimizations that maintain portions of the table in memory and large portions of the table on SSD/HDD that are rarely accessed. We implemented a quick search algorithmthat supports multiple, arbitrary byte strings. Experiments with quick search demonstrate performance speedups relative to naive search that increase with the degree of deduplication. While the physical phase tends to dominatethe overall runtime, the logical phase shows interesting performance variations relative to the number of files as well as the properties of the query terms.