Boaz Moav (M.Sc. Thesis Seminar)
Wednesday, 31.05.2023, 16:30
Advisor: Prof. E. Yaakobi
The increasing demand for data storage has prompted the exploration of new techniques, with molecular data storage being a promising alternative. The stored information can be represented as a collection of two-dimensional arrays, such that each row represents a DNA strand. In this work, we present the results of our research into error-correcting codes for molecular data storage using this representation. Although both insertions and deletions have been observed to occur, the focus of our work is deletions, that can be caused by a failure of the bit addition chemistry. On top of this, cells can be lost either partially, which occurs when a defective bit prematurely terminates the chain, or the data within a cell can be corrupted completely. Initial experiments have reported error rates as high as 10%. Those errors can be considered as erasures in the last few symbols. Our study focuses on correcting those erasures and also deletions across rows. We present code constructions and explicit encoders that are shown to be nearly optimal in many scenarios, using bounds we derived. The first construction is based on permuting the columns of a chosen parity check matrix, such that the set of linearly independent columns match the pattern of possible erasures, that is at most a known parameter $d$ of tail-erasures. This construction is optimal for $d=2,3,4$ and nearly optimal for any $d=2t+1$. To design the above code properly, our work introduces a new distance metric, with properties similar to the Hamming distance, but which is better suited for the tail-erasure model. The second construction uses Tensor Product codes (TPC), that underlies Varshamov-Tenengolts (VT) codes to correct $t$ rows that suffer from one deletion each, where $t$ is a parameter. In the last construction we combine the two problems, to construct a code that can correct both tail-erasures and one deletion in $t$ rows. To conclude, as a concrete example, we mention Iridia, a San Diego-based startup, that has developed a new storage method using a two-dimensional arrays, for which our work suggests a suitable solutions. Our findings show that the new coding schemes are capable of effectively mitigating these errors, making Iridia's system a promising solution for DNA data storage.