Setting Zigzag Straight - An erasure coding scheme and its evaluation in the cloud

Speaker:
Matan Liram, M.Sc. Thesis Seminar
Date:
Thursday, 30.3.2017, 11:00
Place:
Taub 601
Advisors:
Prof. E. Yaakobi, G. Yadgar, Prof. A. Schuster

Erasure codes protect data in large-scale data centers against multiple concurrent failures. However, in the frequent case of a single node failure, the amount of data that must be read for recovery can be an order of magnitude larger than the amount of data lost. Some existing codes successfully reduce these recovery costs but increase the storage overhead considerably. Others, which are theoretically optimal, minimize the amount of data required for recovery, but incur irregular I/O patterns that may actually increase overall recovery time and cost. Thus, while the theoretical results in this context continue to improve, many of them are inapplicable to realistic system settings, and their benefit remains theoretical as well. This gap between theory and practice has been observed in previous studies that applied theoretically optimal techniques to real systems. In this paper, we present a novel system-level approach to bridging this gap in the context of reducing recovery costs. We optimize the sequentiality of the data read, at the cost of a minor increase in its amount. We use Zigzag, a family of erasure codes with minimal overhead and optimal recovery, and trade its theoretical optimality for real performance gains. Our implementation of Zigzag and its optimizations in Ceph reduces recovery costs with two, three, and four parity nodes, for large and small objects alike. We were able to cut recovery time by up to 20% compared to that of Reed-Solomon, and to reduce the amount of data read and transferred by 18% to 37%.
