Challenges and opportunities in text mining-based protein function annotation
ZHANG Chengxin
Synthetic Biology Journal    2025, 6 (3): 603-616.   DOI: 10.12211/2096-8280.2025-002
Abstract

Understanding the biological functions of proteins is crucial for advancing quantitative synthetic biology. Outside a small number of model organisms, most species contain many proteins whose functions have not been experimentally verified, which necessitates accurate, automated protein function annotation methods. Recent progress in protein bioinformatics, particularly in protein structure and function prediction, has been driven largely by artificial intelligence (AI) algorithms, especially deep learning models. For instance, the top-ranked methods in the recent Critical Assessment of Function Annotation (CAFA) challenge used deep learning models, primarily large language models, to perform text mining-based protein function annotation. These methods predict Gene Ontology (GO) terms either directly from text features extracted from the scientific literature or from template proteins in databases. Despite extensive work on increasingly powerful deep learning models for text mining-based protein function annotation, several major challenges in parsing scientific literature data have been overlooked. This manuscript reviews existing methods and challenges in protein function annotation. First, many text mining-based protein function predictors rely exclusively on the PubMed abstracts that UniProt curators have collected for the query protein, ignoring literature that has not been reviewed by biocurators. Consequently, the protein functions predicted by text mining may overlap with those already obtained by manual curation in the UniProt Gene Ontology Annotation. Second, nearly all methods parse only PubMed abstracts, ignoring the more informative full-text documents that are often available in the PubMed Central and Europe PMC repositories. Third, few methods have been proposed to automatically distinguish between categories of literature, such as low-throughput experiments, high-throughput experiments, and computational predictions. This manuscript also proposes promising approaches to enhance text mining-based protein function annotation using the latest developments in AI, which are expected to contribute to next-generation text mining tools for more accurate function annotation.
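The third challenge, distinguishing categories of literature, can be illustrated with a minimal sketch. It assumes that the evidence codes already attached to curated GO annotations could serve as category labels; this is an illustrative assumption and not a method described in the article. The code groups follow the Gene Ontology Consortium's evidence code classes, but calling the plain experimental codes "low throughput" and placing IEA under computational predictions are simplifications made here. The function names and the example records (including the placeholder PMIDs) are hypothetical.

```python
# Grouping of GO evidence codes into the three literature categories named in
# the abstract. The category names are assumptions made for this sketch.
EVIDENCE_CATEGORIES = {
    # Experimental codes, treated here as low-throughput experiments.
    "low_throughput_experiment": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    # High-throughput experimental codes.
    "high_throughput_experiment": {"HTP", "HDA", "HMP", "HGI", "HEP"},
    # Computational analysis codes, plus IEA (electronic annotation).
    "computational_prediction": {
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA", "IEA",
    },
}


def categorize_evidence(code):
    """Return the literature category for one GO evidence code."""
    normalized = code.strip().upper()
    for category, codes in EVIDENCE_CATEGORIES.items():
        if normalized in codes:
            return category
    # Author statements (TAS, NAS), curator statements (IC, ND), unknown codes.
    return "other"


def group_annotations(annotations):
    """Group (pmid, go_term, evidence_code) records by literature category."""
    grouped = {}
    for pmid, go_term, code in annotations:
        grouped.setdefault(categorize_evidence(code), []).append((pmid, go_term))
    return grouped


# Made-up example records; the PMIDs are placeholders, not real citations.
example = [
    ("PMID_A", "GO:0005524", "IDA"),
    ("PMID_B", "GO:0005515", "HDA"),
    ("PMID_C", "GO:0003677", "ISS"),
]
grouped_example = group_annotations(example)
```

Labels derived this way could, in principle, be used to train a classifier that assigns the same categories to uncurated papers, although the sketch stops at the grouping step.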


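| Method | Information sources for function prediction (features) | Machine learning model |
| --- | --- | --- |
| GOtcha, Blast2GO, BAR+ | Homologous sequences from BLASTp search | |
| ConFunc, PFP, GoFDR | Homologous sequences from PSI-BLAST search | |
| HFSP | Homologous sequences from MMseqs2 search | |
| ProFunc | Homologous sequences from BLASTp search; similar structures from SSM and Jess structure searches | |
| COFACTOR | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | |
| MetaGO | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | Logistic regression |
| StarFunc | Homologous sequences from BLASTp search; similar structures from Foldseek and TM-align structure searches; Pfam protein domain families; protein-protein interactions; target protein sequence (features extracted by the ESM protein language model) | Logistic regression, fully connected neural network, random forest |
| DeepFRI, Struct2Go | Residue contact maps extracted from three-dimensional structures; target protein sequence (one-hot encoding) | Graph convolutional neural network |
| TALE-cmap | Residue contact maps extracted from three-dimensional structures; multiple sequence alignments (features extracted by the ESM-MSA protein language model) | Transformer |
| CLEAN-Contact | Residue contact maps extracted from three-dimensional structures; target protein sequence (features extracted by the ESM protein language model) | Convolutional neural network |
| MS-kNN | Homologous sequences; gene expression profiles; protein-protein interactions | k-nearest neighbors |
| INGA | Homologous sequences from BLASTp search; protein-protein interactions; Pfam protein domain families | |
| GOLabeler | Homologous sequences from BLASTp search; InterPro protein domain families; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by the ProFET program) | Logistic regression, gradient boosted trees |
| NetGO | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by the ProFET program) | Logistic regression, gradient boosted trees |
| NetGO2.0 | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, one-hot encoding); PubMed abstracts | Logistic regression, bidirectional long short-term memory neural network, gradient boosted trees |
| DeepGO, DeepGOplus, ProteInfer, DeepEC, ECPICK | Target protein sequence (one-hot encoding) | Convolutional neural network |
| ATGO+ | Homologous sequences from BLASTp search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
| InterLabelGO+ | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
| DeepGO-SE | Target protein sequence (features extracted by the ESM protein language model); protein-protein interactions | Fully connected neural network, graph attention network |
| DeepECtransformer | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Attention network |
| CLEAN | Target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |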
Table 2 Existing methods for protein function prediction
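Two of the sequence features listed in the table, one-hot encoding and the frequencies of fragments of three consecutive residues, can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it uses the 20 standard amino acids, encodes any other character as an all-zero row, and skips fragments that contain such characters. The function names are hypothetical, and the listed tools may define these features differently.

```python
from collections import Counter

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 rows.

    Assumption: non-standard residues (e.g. X) become all-zero rows.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in AA_INDEX:
            row[AA_INDEX[residue]] = 1
        rows.append(row)
    return rows


def trigram_frequencies(sequence):
    """Return the relative frequency of each overlapping three-residue fragment.

    Assumption: fragments containing a non-standard residue are skipped.
    """
    seq = sequence.upper()
    fragments = [seq[i:i + 3] for i in range(len(seq) - 2)]
    valid = [f for f in fragments if all(c in AA_INDEX for c in f)]
    if not valid:
        return {}
    counts = Counter(valid)
    return {fragment: n / len(valid) for fragment, n in counts.items()}


encoded = one_hot_encode("ACX")
frequencies = trigram_frequencies("AAAA")
```

One-hot rows preserve residue order and suit convolutional models, whereas fragment frequencies give a fixed-length, order-free summary that fits simpler models such as logistic regression.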