Deduplication is widely used in modern large-scale storage systems and provides an effective solution for both secondary and primary storage. As a result, there is a rising need for deduplicated storage to support advanced features such as data indexing for information retrieval. To our knowledge, no indexing solution for deduplicated storage exploits the deduplication itself, and current indexing methods still process duplicate data.
In this work, we propose IDEA, an Inverted Deduplication-Aware Index, which we use to explore the potential of exploiting deduplication in keyword indexing. IDEA is shown to be superior to the deduplication-oblivious approach in index creation, index size, and query retrieval time. IDEA is also shown to be extensible to advanced indexing features and orthogonal to the underlying index engine.