Synthetic Biology Journal, 2022, Vol. 3, Issue 6: 1262-1276. DOI: 10.12211/2096-8280.2022-016

• Invited Review •

Identification of RiPPs precursor peptides and cleavage sites based on deep learning

Jingwei LYU1, Zixin DENG1, Qi ZHANG2, Wei DING1   

  1. State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
  2. Department of Chemistry, Fudan University, Shanghai 200243, China
  • Received: 2022-03-07 Revised: 2022-04-23 Online: 2023-01-17 Published: 2022-12-31
  • Contact: Wei DING

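Identification of RiPPs precursor peptides and cleavage sites based on deep learning

Jingwei LYU (吕靖伟)1, Zixin DENG (邓子新)1, Qi ZHANG (张琪)2, Wei DING (丁伟)1

  1. State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200030
  2. Department of Chemistry, Fudan University, Shanghai 200243
  • Corresponding author: Wei DING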
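  • About the authors: Jingwei LYU (born 1996), master's student. Research interest: mining of natural product biosynthetic genes. E-mail: jingwei_lv@sjtu.edu.cn
    Wei DING (born 1981), male, Ph.D., associate professor, doctoral supervisor. Research interests: microbial metabolism and synthetic biology. E-mail: weiding@sjtu.edu.cn
  • Funding:
    National Key Research and Development Program of China (2018YFA0900402)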

Abstract:

Rapid advances in DNA sequencing technology have driven explosive growth in genome sequencing data. Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a large class of peptide natural products that has attracted growing attention over the last decade. These compounds are widely distributed in nature, diverse in structure and bioactivity, and an important source of natural drugs. The discovery of RiPPs has relied mainly on low-throughput biological experiments, which are accurate but costly. Bioinformatics tools such as antiSMASH and RiPP-PRISM can greatly accelerate RiPPs mining. However, because these tools rely on gene homology, for example by searching for conserved biosynthetic enzymes, they still cannot effectively identify novel RiPPs with different biosynthetic mechanisms. Here, for the first time, we build on the natural language processing pre-training model BERT to propose four deep learning models that identify RiPPs from sequence data alone, without homology or genomic context information; all four are trained on the same RiPPs dataset. After validating and comparing these models, we select the best one, BERiPPs (bidirectional language model for enhancing the performance of identification of RiPPs precursor peptides), which performs well on RiPPs identification and is as accurate as homology-based methods. BERiPPs identifies RiPPs precursor peptides and RiPPs classes in an unbiased manner regardless of genomic background, extending the range of novel RiPPs captured by approximately 60% compared with homology-based approaches. Combining BERiPPs with a conditional random field assigns a label to each amino acid in the sequence, from which the cleavage site of the leader peptide is derived indirectly with high accuracy.
Deep learning based on a pre-training model thus makes high-throughput mining of novel RiPPs possible in a way that does not depend on gene context, and it reveals the underlying biological relationship between precursor peptides and modifying enzymes.
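
The step from per-residue labels to a cleavage site can be illustrated with a small sketch. The abstract does not describe the label set, the scores, or the decoding procedure used in BERiPPs, so everything below is an assumption made for illustration: two hypothetical labels ("leader" and "core"), hand-written transition scores that forbid a core residue from being followed by a leader residue, and made-up per-residue probabilities standing in for the output of the BERT-based model. The decoder is plain Viterbi decoding over a linear-chain model, and the cleavage site is read off as the position of the first core residue.

```python
import math

NEG_INF = float("-inf")
LABELS = ("leader", "core")  # hypothetical label set, not taken from the paper

# Hypothetical log-scores. A trained CRF would learn these values.
START = {"leader": 0.0, "core": NEG_INF}  # assume a precursor begins with its leader
TRANSITIONS = {
    ("leader", "leader"): 0.0,
    ("leader", "core"): 0.0,
    ("core", "core"): 0.0,
    ("core", "leader"): NEG_INF,  # once in the core, never return to the leader
}


def emissions_from_core_probs(core_probs):
    """Turn per-residue P(core) values (strictly between 0 and 1) into log-scores."""
    return [{"leader": math.log(1.0 - p), "core": math.log(p)} for p in core_probs]


def viterbi(emissions):
    """Return the highest-scoring label sequence under START and TRANSITIONS."""
    best = [{lab: START[lab] + emissions[0][lab] for lab in LABELS}]
    back = []  # back[i][lab] = best previous label for residue i + 1
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITIONS[(p, cur)])
            column[cur] = best[-1][prev] + TRANSITIONS[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best.append(column)
        back.append(pointers)
    last = max(LABELS, key=lambda lab: best[-1][lab])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def cleavage_index(path):
    """0-based index of the first core residue, or None if there is no boundary."""
    for i in range(1, len(path)):
        if path[i - 1] == "leader" and path[i] == "core":
            return i
    return None


# Made-up scores for a six-residue toy peptide; residue 2 is a noisy prediction.
toy_core_probs = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8]
toy_path = viterbi(emissions_from_core_probs(toy_core_probs))
print(toy_path)                  # four "leader" labels followed by two "core" labels
print(cleavage_index(toy_path))  # 4
```

Labeling each residue independently would call residue 2 "core" and residue 3 "leader", which leaves no single boundary. The transition constraint forces one leader-to-core switch, which is why a sequence-level decoder makes the cleavage site well defined.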

Key words: deep learning, RiPPs, precursor peptides, pre-training model, natural products mining

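Abstract (translated from the Chinese):

Thanks to the rapid development of gene sequencing technology, genome sequencing data have grown explosively. Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a large class of peptide natural products that has gradually come into view over the past decade. These compounds are extremely widely distributed in nature, show rich diversity in structure and bioactivity, and are an important source of natural drugs. The discovery of RiPPs relies mainly on low-throughput biological experiments; these traditional methods are accurate but costly. As computing technology has advanced, bioinformatics tools including antiSMASH and RiPP-PRISM have greatly accelerated RiPPs mining, but they still cannot overcome the limitation of homology-based methods (for example, searching for conserved biosynthetic enzymes): they cannot effectively identify novel RiPPs with different biosynthetic mechanisms. Here, for the first time, building on the natural language processing pre-training model BERT, we propose four deep learning models that identify RiPPs relying entirely on sequence data rather than on homology and genomic context information. By validating, analyzing, and comparing the models, we identify the best model, BERiPPs (bidirectional language model for enhancing the performance of identification of RiPPs precursor peptides), which performs excellently on the RiPPs identification task. BERiPPs can identify RiPPs precursor peptides in an unbiased manner regardless of genomic background, and can generate predictions of the leader peptide cleavage site through a conditional random field. It offers an approach to high-throughput mining of entirely new RiPPs and, to some extent, reveals the underlying biological relationship between precursor peptides and modifying enzymes.

Key words (translated from the Chinese): deep learning, RiPPs, precursor peptides, pre-training model, natural products mining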
