Synthetic Biology Journal

   

Challenges and opportunities in text mining-based protein function annotation

Chengxin ZHANG1,2   

  1. CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
  2. Gilbert S Omenn Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, Michigan, USA
  • Received: 2025-01-02 Revised: 2025-03-04 Published: 2025-03-06
  • Contact: Chengxin ZHANG

基于文本数据挖掘的蛋白功能预测的机遇与挑战 (Opportunities and challenges in text mining-based protein function prediction)

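张成辛 (Chengxin ZHANG)1,2

  1. Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
  2. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, Michigan, USA
  • Corresponding author: Chengxin ZHANG
  • About the author: Chengxin ZHANG (born 1991), male, Ph.D., researcher; research interests: structure and function prediction of proteins and RNA. E-mail: cx.zhang2@siat.ac.cn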

Abstract:

Understanding the biological functions of proteins is crucial for advancing quantitative synthetic biology. Outside a small number of model organisms, most species contain many proteins whose functions have not been experimentally verified, which necessitates accurate, automated protein function annotation methods. Recent progress in protein bioinformatics, particularly in predicting protein structures and functions, has been driven largely by artificial intelligence (AI) algorithms, especially deep learning models. For instance, the top-ranked methods in recent Critical Assessment of Function Annotation (CAFA) challenges have used deep learning models, primarily large language models, to perform text mining-based protein function annotation. These methods predict Gene Ontology (GO) terms either directly from text features extracted from scientific literature or indirectly from template proteins with similar literature. Despite extensive work on increasingly powerful deep learning models for text mining-based protein function annotation, several major challenges in parsing scientific literature data have been overlooked. This manuscript reviews existing methods and challenges in protein function annotation. First, many text mining-based protein function predictors rely exclusively on the PubMed abstracts that UniProt curators have collected for the query protein, ignoring literature that has not been reviewed by biocurators. Consequently, the protein functions predicted by text mining overlap with those already available from manual curation in the UniProt Gene Ontology Annotation. Second, nearly all methods parse only PubMed abstracts, ignoring the more informative full-text documents that are often available in the PubMed Central and Europe PMC repositories. Third, few methods have been proposed to automatically differentiate between categories of literature, such as low-throughput experiments, high-throughput studies, and computational predictions. This manuscript also proposes promising approaches to enhance text mining-based protein function annotation using the latest developments in artificial intelligence. This work contributes to the development of next-generation text mining tools for more accurate function annotation.
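
The second strategy named in the abstract, predicting GO terms from template proteins with similar literature, can be illustrated with a minimal sketch. It is not the method of any CAFA participant: it swaps large language model features for a plain TF-IDF bag-of-words, and the function names, the toy abstracts, and the scoring rule (similarity-weighted voting over templates) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed IDF: tokens found in every document keep a small weight.
        vectors.append({
            tok: cnt * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_go_terms(query_text, templates):
    """Score GO terms for a query by similarity-weighted voting.

    templates: list of (abstract_text, set_of_go_terms) pairs.
    Returns {go_term: score}; an empty dict if no template shares a token.
    """
    texts = [query_text] + [text for text, _ in templates]
    vectors = tfidf_vectors(texts)
    query_vec, template_vecs = vectors[0], vectors[1:]
    sims = [cosine(query_vec, vec) for vec in template_vecs]
    total = sum(sims)
    scores = {}
    if total == 0.0:
        return scores
    for sim, (_, go_terms) in zip(sims, templates):
        for term in go_terms:
            # Each template contributes its share of the total similarity.
            scores[term] = scores.get(term, 0.0) + sim / total
    return scores


# Toy data, invented for illustration only (not real curated annotations).
TEMPLATES = [
    ("The kinase phosphorylates its substrate using ATP",
     {"GO:0016301", "GO:0005524"}),
    ("The transcription factor binds promoter DNA", {"GO:0003677"}),
]
QUERY = "A novel kinase that phosphorylates substrate proteins"

if __name__ == "__main__":
    ranked = sorted(transfer_go_terms(QUERY, TEMPLATES).items(),
                    key=lambda kv: -kv[1])
    for term, score in ranked:
        print(term, round(score, 3))
```

In this toy case the query shares tokens only with the first template, so the two GO terms attached to that template receive the full vote. A realistic predictor would replace the TF-IDF vectors with learned text embeddings and would need the literature-selection safeguards the abstract discusses.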

Key words: proteins, biological functions, text mining, Gene Ontology (GO) terms, deep learning

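Abstract (translated from the Chinese version):

Understanding the biological functions of proteins is a prerequisite for success in quantitative synthetic biology. However, apart from a few model organisms, most organisms contain many proteins whose functions have not yet been resolved experimentally. Developing automated, accurate protein function prediction algorithms is therefore especially important. In recent years, artificial intelligence algorithms, represented by deep learning, have become the mainstream of protein bioinformatics, and deep learning is particularly prominent in protein function prediction. For example, in the most recent rounds of the Critical Assessment of Function Annotation (CAFA) challenge, the top-ranked algorithms used deep learning models (mainly large language models) to perform text mining-based protein function prediction. Specifically, these methods either predict Gene Ontology (GO) terms directly from text features extracted from the scientific literature, or predict GO terms through template proteins with similar literature. Although much research has gone into developing more powerful deep learning models for text mining-based protein function annotation, these algorithms still have several long-overlooked problems in how they process scientific literature data. This article first reviews existing methods and challenges in protein function annotation. First, most text mining-based protein function predictors use only the PubMed abstracts manually collected for the target protein by UniProt curators, ignoring literature not yet included in UniProt. Second, almost all methods process only abstracts, ignoring the more detailed full-text articles available in databases such as PubMed Central and Europe PMC. Third, few studies can automatically distinguish between categories of scientific literature, such as low-throughput experiments, high-throughput studies, and computational predictions, which greatly increases the difficulty of text-based function annotation. In addition, this article proposes promising approaches that use the latest developments in artificial intelligence to improve text mining-based protein function annotation. This will help the development of next-generation text mining tools that specifically address the existing difficulties of text data processing and achieve more accurate function annotation.

Key words (translated from the Chinese version): proteins, biological functions, Gene Ontology, text data mining, deep learning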
