Synthetic Biology Journal, 2021, Vol. 2, Issue 3: 428-443. DOI: 10.12211/2096-8280.2021-023

• Research Article •

Interleaved coding with multiple RS codes for data storage in large DNA fragments in cells

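Weigang CHEN1,2, Qi GE1, Panpan WANG1, Mingzhe HAN2,3, Jian GUO1

  1. School of Microelectronics, Tianjin University, Tianjin 300072
  2. Frontiers Science Center for Synthetic Biology, Ministry of Education, Tianjin University, Tianjin 300072
  3. School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072
  • Received: 2021-02-09 Revised: 2021-03-28 Online: 2021-06-30 Published: 2021-07-14
  • Corresponding author: Weigang CHEN
  • About the author: Weigang CHEN (b. 1980), male, PhD, associate professor. Research interests: DNA data storage, information theory and coding theory. E-mail: chenwg@tju.edu.cn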

Multiple interleaved RS codes for data storage using up to Mb-scale synthetic DNA in living cells

Weigang CHEN1,2, Qi GE1, Panpan WANG1, Mingzhe HAN2,3, Jian GUO1   

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  2. Frontiers Science Center for Synthetic Biology (MOE), Tianjin University, Tianjin 300072, China
  3. School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
  • Received: 2021-02-09 Revised: 2021-03-28 Online: 2021-06-30 Published: 2021-07-14
  • Contact: Weigang CHEN

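Abstract (Chinese edition):

Synthetic DNA, as a potential medium for storing digital information, offers high storage density and a long usable lifetime, and is expected to become an important option for future data storage. However, DNA synthesis and sequencing readout often introduce multiple types of base errors, so the reliability requirements of data storage are not met, while coding schemes that guarantee reliability tend to have low efficiency. To address this problem, a high-efficiency encoding method is proposed for data storage in large DNA fragments in Saccharomyces cerevisiae. For encoding, data DNA units are built by interleaving the codewords of multiple Reed-Solomon (RS) codes with very high code rates, and these units alternate with yeast autonomously replicating sequences (ARSs) to form a yeast artificial chromosome sequence. For readout, second-generation high-throughput sequencing is used, combining de novo read assembly, ARS-guided contig combination, and erasure/error correction. In a simulated 2.5 Mb ring chromosome example, the original data can be recovered without error from 20× second-generation sequencing data. The encoding method achieves reliable data storage with a logical density of 1.973 bit/bp for the data portion of the DNA; even when the overhead of the biological units is counted, the overall logical density still reaches 1.947 bit/bp. The design flow supports encoding DNA of lengths from Kb to Mb, providing flexible pre-experiment verification and evaluation for "wet" experiments on large-fragment DNA data storage.

Key words: DNA data storage, Reed-Solomon (RS) codes, interleaving, autonomously replicating sequence, contig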

Abstract:

Synthetic DNA is a potential medium for digital data storage: it offers high storage density and can be retained for a very long time, and it is expected to become an important option for future massive data storage. However, the synthesis, assembly and sequencing of DNA often introduce multiple types of base errors, so the raw medium does not satisfy the reliability requirements of data storage, while reliability-enhancing coding schemes usually sacrifice logical coding density by adding redundancy. To address this problem, we propose an encoding process for DNA data storage using large synthetic DNA fragments in Saccharomyces cerevisiae. For data writing, DNA data chunks are constructed by interleaving multiple codewords of Reed-Solomon (RS) codes with a very high code rate, and the chunks alternate with autonomously replicating sequences (ARSs) to form a yeast artificial chromosome. For data readout, high-throughput sequencing is combined with short-read assembly based on de Bruijn graphs, ARS-guided contig combination, and erasure/error correction to achieve reliable data recovery. The error correction capability is fully exploited in two ways: interleaving spreads large missing fragments into scattered erasures across all the RS codewords, and an RS code can correct more erasures than errors. We designed and simulated a 2.5 Mb ring chromosome and successfully recovered the original data from 20× high-throughput sequencing reads. The simulated sequencing data were generated with the ART simulation software, trained on real sequencing data from a 254 886 bp artificial chromosome previously constructed for data storage. All processes, including large DNA chunk assembly, DNA replication, extraction and high-throughput sequencing, are treated together as the DNA storage channel in the information-theoretic sense. Following the information theory paradigm, we provide an efficient encoding scheme that matches the codes to this DNA storage channel.
The logical density of the data DNA chunks was 1.973 bit/bp, and the overall logical density still reached 1.947 bit/bp when the biological units (ARSs and vector backbones) were included. The demonstrated design process supports DNA coding schemes for lengths from Kb up to Mb, providing flexible verification and support for wet experiments on the synthesis and sequencing of large DNA fragments for digital data storage.
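
The interleaving idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the interleaving depth, codeword length and burst position are toy values chosen for illustration, the function names are hypothetical, and no actual RS encoding or decoding is performed. The sketch only shows how a contiguous missing region in the interleaved stream turns into a few scattered erasures in each codeword, which is the regime where an RS code with r parity symbols can recover up to r erased symbols.

```python
from math import ceil


def interleave(codewords):
    """Column-wise interleave equal-length codewords into one symbol stream."""
    n = len(codewords[0])
    assert all(len(cw) == n for cw in codewords)
    # Stream position i * depth + k holds symbol i of codeword k.
    return [cw[i] for i in range(n) for cw in codewords]


def deinterleave(stream, depth):
    """Inverse of interleave: stream position j belongs to codeword j % depth."""
    return [stream[k::depth] for k in range(depth)]


def erasures_per_codeword(stream_len, depth, burst_start, burst_len):
    """Count the erased symbols each codeword sees after one contiguous loss."""
    counts = [0] * depth
    for j in range(burst_start, min(burst_start + burst_len, stream_len)):
        counts[j % depth] += 1
    return counts


# Toy parameters (assumptions, not the paper's): 8 codewords of 20 symbols.
depth, n = 8, 20
codewords = [[(k, i) for i in range(n)] for k in range(depth)]
stream = interleave(codewords)
assert deinterleave(stream, depth) == codewords  # round trip is lossless

# A contiguous loss of 30 symbols, e.g. a region no contig covers.
counts = erasures_per_codeword(len(stream), depth, burst_start=37, burst_len=30)
print(counts, "max per codeword:", max(counts), "bound:", ceil(30 / depth))
```

With depth D, a burst of B lost symbols costs any single codeword at most ceil(B / D) erasures, so a deeper interleave lets codes with very little redundancy tolerate longer missing regions.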

Key words: DNA data storage, Reed-Solomon (RS) codes, interleaving, autonomously replicating sequence, contig

CLC number: