Challenges and opportunities in text mining-based protein function annotation

ZHANG Chengxin

Synthetic Biology Journal 2025, 6 (3): 603-616. DOI: 10.12211/2096-8280.2025-002

Understanding the biological function of proteins is crucial for advancing quantitative synthetic biology. Except in a small number of model organisms, most species contain many proteins whose functions have not been experimentally verified, which necessitates accurate, automated protein function annotation methods. Recent progress in protein bioinformatics, particularly in predicting protein structures and functions, has been driven largely by artificial intelligence (AI) algorithms, especially deep learning models. For instance, the top-ranked methods in the recent Critical Assessment of Function Annotation (CAFA) challenge used deep learning models, primarily large language models, to perform text mining-based protein function annotation. These methods predict Gene Ontology (GO) terms either directly from text features extracted from the scientific literature or from template proteins in databases. Despite extensive work on increasingly powerful deep learning models for text mining-based protein function annotation, several major challenges in parsing scientific literature data have been overlooked. This manuscript reviews existing methods for protein function annotation and the challenges they face. First, many text mining-based protein function predictors rely exclusively on PubMed abstracts collected by UniProt curators for the query protein, ignoring literature that has not been reviewed by biocurators. Consequently, protein functions predicted by text mining might overlap with those already obtained by manual curation in the UniProt Gene Ontology Annotation database. Second, nearly all methods parse only PubMed abstracts, ignoring the more informative full-text documents often available in the PubMed Central and Europe PMC repositories. Third, few methods have been proposed to automatically differentiate between categories of literature, such as low-throughput experiments, high-throughput experiments, and computational predictions. This manuscript also proposes promising approaches for enhancing text mining-based protein function annotation using the latest developments in AI, which are expected to contribute to next-generation text mining tools for more accurate function annotation.

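| Method | Information sources for function prediction (features) | Machine learning model |
| --- | --- | --- |
| GOtcha, Blast2GO, BAR+ | Homologous sequences from BLASTp search | None |
| ConFunc, PFP, GoFDR | Homologous sequences from PSI-BLAST search | None |
| HFSP | Homologous sequences from MMseqs2 search | None |
| ProFunc | Homologous sequences from BLASTp search; similar structures from SSM and Jess structure searches | None |
| COFACTOR | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | None |
| MetaGO | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | Logistic regression |
| StarFunc | Homologous sequences from BLASTp search; similar structures from Foldseek and TM-align structure searches; Pfam protein domain families; protein-protein interactions; target protein sequence (features extracted by the ESM protein language model) | Logistic regression, fully connected neural network, random forest |
| DeepFRI, Struct2Go | Residue contact maps extracted from 3D structures; target protein sequence (one-hot encoding) | Graph convolutional neural network |
| TALE-cmap | Residue contact maps extracted from 3D structures; multiple sequence alignment (features extracted by the ESM-MSA protein language model) | Transformer |
| CLEAN-Contact | Residue contact maps extracted from 3D structures; target protein sequence (features extracted by the ESM protein language model) | Convolutional neural network |
| MS-kNN | Homologous sequences; gene expression profiles; protein-protein interactions | k-nearest neighbors |
| INGA | Homologous sequences from BLASTp search; protein-protein interactions; Pfam protein domain families | None |
| GOLabeler | Homologous sequences from BLASTp search; InterPro protein domain families; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by the ProFET program) | Logistic regression, gradient boosted trees |
| NetGO | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by the ProFET program) | Logistic regression, gradient boosted trees |
| NetGO2.0 | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, one-hot encoding); PubMed abstracts | Logistic regression, bidirectional long short-term memory network, gradient boosted trees |
| DeepGO, DeepGOplus, ProteInfer, DeepEC, ECPICK | Target protein sequence (one-hot encoding) | Convolutional neural network |
| ATGO+ | Homologous sequences from BLASTp search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
| InterLabelGO+ | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
| DeepGO-SE | Target protein sequence (features extracted by the ESM protein language model); protein-protein interactions | Fully connected neural network, graph attention network |
| DeepECtransformer | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Attention network |
| CLEAN | Target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |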
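
Two of the simplest sequence-derived features listed above are one-hot encoding (as used by DeepGO, for example) and the frequencies of fragments of three consecutive residues (as used by GOLabeler and NetGO). The sketch below illustrates both in plain Python. It is not the code of any listed method: the function names `one_hot` and `trigram_frequencies`, the 20-letter alphabet, and the handling of non-standard residues (all-zero rows, skipped fragments) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids are encoded.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a protein sequence as an L x 20 matrix (list of lists).

    Residues outside AMINO_ACIDS (e.g. 'X') become all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        matrix.append(row)
    return matrix


def trigram_frequencies(sequence):
    """Return the relative frequency of each overlapping 3-residue fragment.

    Fragments containing a non-standard residue are skipped; only
    observed fragments appear in the returned dict.
    """
    seq = sequence.upper()
    fragments = [seq[i:i + 3] for i in range(len(seq) - 2)]
    valid = [f for f in fragments if all(c in AMINO_ACIDS for c in f)]
    if not valid:
        return {}
    counts = Counter(valid)
    return {frag: n / len(valid) for frag, n in counts.items()}
```

A fixed-length feature vector for a classifier such as logistic regression could then be obtained by looking up each of the 8,000 possible fragments in the returned dict and using 0 for those that are absent.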