Synthetic Biology Journal ›› 2021, Vol. 2 ›› Issue (3): 412-427. DOI: 10.12211/2096-8280.2020-083

• Research Article •

Chamaeleo: an extensible integration and systematic evaluation platform for base encoding and decoding algorithms in DNA storage

Zhi PING1,2,3, Haoling ZHANG1,2,3, Shihong CHEN4,5, Ming NI1,3, Xun XU1,2,3, Sha ZHU6,7, Yue SHEN1,2,3,5

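  1. BGI-Shenzhen, Shenzhen 518083, Guangdong, China
  2. Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
  3. Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  4. Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  5. China National GeneBank, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  6. Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
  7. TAICHI AI Ltd., London N1 7GU, United Kingdom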
  • Received: 2020-12-01 Revised: 2020-12-31 Online: 2021-06-30 Published: 2021-07-13
  • Corresponding authors: Sha ZHU, Yue SHEN
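  • About the authors: Zhi PING (born 1987), male, Ph.D., assistant research fellow. Research interests: synthetic biology, DNA storage, and bioinformatics analysis algorithms. E-mail: pingzhi@genomics.cn
    Haoling ZHANG (born 1996), male, assistant research fellow. Research interests: synthetic biology, computer science, and neural networks. E-mail: zhanghaoling@genomics.com
    Sha ZHU (born 1985), male, Ph.D., senior statistical analyst at Roche (UK). Research interests: probability theory related to biological genetic models. He has long worked on computational statistical methods and software development. E-mail: sha.joe.zhu@gmail.com
    Yue SHEN (born 1986), female, Ph.D., research fellow. Research interests: synthetic biology, synthetic genomics, and the development of DNA synthesis technologies and tools. E-mail: shenyue@genomics.cn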
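  • Funding:
    National Key Research and Development Program of China (2020YFA0712100); Guangdong Provincial Key Laboratory of Genome Read and Write (2017B030301011); Guangdong Provincial Academician Workstation of BGI Synthetic Genomics (2017B090904014)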

Chamaeleo: an integrated evaluation platform for DNA storage

Zhi PING1,2,3, Haoling ZHANG1,2,3, Shihong CHEN4,5, Ming NI1,3, Xun XU1,2,3, Sha ZHU6,7, Yue SHEN1,2,3,5   

  1. BGI-Shenzhen, Shenzhen 518083, Guangdong, China
  2. Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
  3. Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  4. Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  5. China National GeneBank, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
  6. Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
  7. TAICHI AI Ltd., London N1 7GU, United Kingdom
  • Received: 2020-12-01 Revised: 2020-12-31 Online: 2021-06-30 Published: 2021-07-13
  • Contact: Sha ZHU, Yue SHEN

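Abstract (translated from the Chinese):

In recent years, DNA storage has attracted wide attention for its advantages in data storage density and retention time. It is expected to serve as a new information storage approach alongside traditional media such as optical discs and hard drives, meeting the urgent need for massive data storage and for encrypted data storage in specialized applications. In the DNA storage workflow, the conversion between binary information and DNA base sequences (that is, encoding and decoding) is the core step that connects digital information technology with biotechnology. Although research on DNA storage encoding and decoding has made substantial progress, there is still no system for quantitatively comparing and evaluating these methods in terms of compatibility with existing upstream and downstream technologies, adaptability to different stored files, storage robustness, and data security. This study therefore developed Chamaeleo, an extensible integration and evaluation platform for DNA storage encoding and decoding methods. It integrates previously developed methods in a modular way, analyzes and evaluates them systematically and quantitatively, and can output the preferred encoding and decoding scheme for different file types. Chamaeleo is open source so that new encoding and decoding methods and evaluation metrics can be added over time, promoting open exchange in the field and its standardized, orderly development.

Key words: DNA storage, binary-base encoding and decoding method, evaluation system, compatibility, storage robustness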

Abstract:

The emerging field of DNA-based data storage has attracted considerable interest because DNA offers very high storage density and long-term durability as a medium. Compared with traditional magnetic, optical and electronic storage media, DNA is considered a promising solution to the global demand for storing rapidly growing volumes of data. DNA storage also adds an extra layer of protection for the stored information, because writing and reading the data rely on DNA synthesis and sequencing technologies, which are less widely used than conventional information and communication technologies. Transcoding between binary digital data and quaternary DNA sequences is the most important step in the whole DNA-based data storage process. Several coding methods have been developed in different programming languages over the past decades; however, differences in software architecture and parameters make it difficult to compare their overall performance. This makes it hard for researchers to develop the methods further and for users to compare them and choose one suited to their needs. In this study, we introduce an integrated evaluation platform, "Chamaeleo", to address these issues. One key feature of Chamaeleo is the integration of existing coding schemes and the modularization of functions, including data handling, transcoding, index operation and error correction, in a user-friendly design. The other key feature is the evaluation of a coding scheme in both a qualitative and a quantitative manner. A set of widely accepted indexes is used to evaluate compatibility with DNA writing and reading technologies, robustness with respect to introduced errors or data loss, and the complexity of the transcoding rules. Considering the rapid advancement of this field, Chamaeleo is designed as an open-source platform so that researchers can incorporate new coding schemes and evaluation indexes, encouraging the community to jointly shape the future of DNA-based data storage.
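
To make the idea of binary-to-quaternary transcoding concrete, the following is a minimal sketch in Python. It is not Chamaeleo's API and does not reproduce any of the coding schemes the platform integrates. The fixed two-bits-per-base mapping and the function names `encode`, `decode`, `gc_content` and `max_homopolymer` are assumptions made for illustration. GC content and homopolymer run length are shown only as examples of indicators commonly used to judge compatibility with DNA synthesis and sequencing. The abstract does not list the indexes Chamaeleo actually reports.

```python
# Illustrative sketch only: a naive 2-bit-per-base mapping (assumed, not from the paper).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a DNA string, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Invert encode(): four nucleotides give back one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def gc_content(dna: str) -> float:
    """Fraction of G and C bases (hypothetical compatibility indicator)."""
    if not dna:
        return 0.0
    return sum(base in "GC" for base in dna) / len(dna)


def max_homopolymer(dna: str) -> int:
    """Length of the longest run of identical bases."""
    if not dna:
        return 0
    longest = run = 1
    for prev, cur in zip(dna, dna[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    dna = encode(b"Hi")
    print(dna)                   # CAGACGGC
    print(decode(dna))           # b'Hi'
    print(gc_content(dna))       # 0.75
    print(max_homopolymer(dna))  # 2
```

A naive mapping like this stores two bits per nucleotide but gives no control over GC content or repeated bases, which is one reason coding schemes differ and why a common evaluation framework is useful.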

Key words: DNA digital storage, binary-nucleotide transcoding scheme, evaluation system, compatibility, storage robustness

CLC number: