Synthetic Biology Journal ›› 2022, Vol. 3 ›› Issue (3): 429-444.DOI: 10.12211/2096-8280.2021-032

• Invited Review

Artificial intelligence-assisted protein engineering

Jiahao BIAN, Guangyu YANG   

  1. State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2021-03-16 Revised: 2021-05-24 Online: 2022-07-13 Published: 2022-06-30
  • Contact: Guangyu YANG

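  • About the authors: Jiahao BIAN (born 1997), male, master's student. Research interest: combinatorial methods for artificial intelligence-assisted directed evolution. E-mail: bjh2170@sjtu.edu.cn
    Guangyu YANG (born 1980), male, professor and doctoral supervisor. Research interests: directed evolution of enzymes, high-throughput enzyme screening, applications of enzyme technology, and in vitro synthetic biology. E-mail: yanggy@sjtu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (32030063)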

Abstract:

Protein engineering is an important research field in synthetic biology. However, de novo design of protein function by rational design remains challenging because our understanding of biological fundamentals, such as protein folding and the natural evolution of enzymes, is still limited. Directed evolution can optimize protein function effectively by mimicking natural evolution in the laboratory, without relying on structural or mechanistic information. However, it depends heavily on high-throughput screening, which limits its application to proteins for which no such screening method is available. In recent years, artificial intelligence has developed rapidly and has been integrated into many disciplines. In synthetic biology, artificial intelligence-assisted protein engineering has become an efficient strategy alongside rational design and directed evolution, and has shown unique advantages in predicting the structure, function, and solubility of proteins and enzymes. Artificial intelligence models learn the intrinsic properties and relationships in a given sequence-function data set and then predict the properties of virtual sequences. In this article, we review the applications of artificial intelligence-assisted protein engineering. After introducing the principle and workflow of the strategy, we analyze three key factors that affect the performance of predictive models: data, molecular descriptors, and artificial intelligence algorithms. To provide useful tools for researchers who want to adopt this strategy, we summarize the main public databases, as well as the toolkits and web servers for common molecular descriptors and artificial intelligence algorithms.
We also comment on the functions, applications, and websites of several artificial intelligence-assisted protein engineering platforms, with which a complete prediction task, including protein sequence representation, feature analysis, model construction, and output, can be completed easily. Finally, we discuss challenges that remain to be solved in artificial intelligence-assisted protein engineering, such as the lack of high-quality data, bias in data sets, and the lack of universal models. With the development of automated gene annotation, ultra-high-throughput screening technologies, and artificial intelligence algorithms, sufficient high-quality data and more appropriate algorithms are expected to become available, which should improve the performance of artificial intelligence-assisted protein engineering and thus facilitate the development of synthetic biology.

Key words: protein engineering, synthetic biology, artificial intelligence, predictive model, database, molecular descriptor

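Abstract (translated from the Chinese):

Protein engineering is one of the important research directions in synthetic biology. However, our understanding of fundamental biological questions such as protein folding and the natural evolution of enzymes is still very limited, so de novo design of protein function by rational design remains difficult. Directed evolution, which mimics the principle of natural evolution in the laboratory, can effectively optimize protein function without relying on structural or mechanistic information. However, directed evolution depends heavily on high-throughput screening methods, which limits its ability to engineer proteins that lack such methods. In recent years, artificial intelligence-assisted protein engineering has gradually developed into an efficient new strategy for protein molecular design. It has shown unique advantages in protein structure prediction, function prediction, solubility prediction, and guiding smart library design, and has become another wave of technological development after rational design and directed evolution. This article reviews recent progress in the application of artificial intelligence-assisted protein engineering and highlights representative work. After briefly introducing the principle and workflow of the strategy, we analyze three key factors that affect the performance of predictive models (data, molecular descriptors, and artificial intelligence algorithms), summarize the main databases and the mainstream toolkits and platforms for molecular descriptors and algorithms used in this strategy, and describe their functions, uses, and websites. We also discuss the shortcomings that the strategy still faces, such as insufficient high-quality data, bias in experimental data, and the lack of universal models. As automated gene function annotation, ultra-high-throughput screening, and artificial intelligence algorithms continue to develop, they will supply artificial intelligence-assisted protein engineering with sufficient high-quality data and more accurate algorithms, steadily improving its prediction accuracy and providing greater support for synthetic biology research.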

