Challenges and opportunities in text mining-based protein function annotation
ZHANG Chengxin
Synthetic Biology Journal    2025, 6 (3): 603-616.   DOI: 10.12211/2096-8280.2025-002
Abstract

Understanding the biological function of proteins is crucial for advancing quantitative synthetic biology. Except in a small number of model organisms, most species contain many proteins whose functions have not been experimentally verified, which calls for accurate, automated protein function annotation methods. Recent progress in protein bioinformatics, particularly in predicting protein structure and function, has been driven largely by artificial intelligence (AI) algorithms, especially deep learning models. For instance, the top-ranked methods in recent Critical Assessment of Function Annotation (CAFA) challenges have used deep learning models, primarily large language models, to perform text mining-based protein function annotation. These methods predict Gene Ontology (GO) terms either directly from text features extracted from the scientific literature or from template proteins in databases. Despite extensive work on increasingly powerful deep learning models for text mining-based protein function annotation, several major challenges in parsing scientific literature data have been overlooked. This manuscript reviews existing methods and challenges in protein function annotation. First, many text mining-based protein function predictors rely exclusively on PubMed abstracts collected by UniProt curators for the query protein, ignoring literature that has not been reviewed by biocurators. Consequently, protein functions predicted by text mining might overlap with those already obtained by manual curation in the UniProt Gene Ontology Annotation. Second, nearly all methods parse only PubMed abstracts, ignoring the more informative full-text documents that are often available in the PubMed Central and Europe PMC repositories. Third, few methods have been proposed to automatically differentiate between categories of literature, such as low-throughput experiments, high-throughput experiments, and computational predictions. This manuscript also proposes promising approaches to enhance text mining-based protein function annotation using the latest developments in AI, which are expected to contribute to next-generation text mining tools for more accurate function annotation.


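Evidence code: Detailed explanation
Inferred from Experiment (EXP): Experimentally verified biological function
Inferred from Direct Assay (IDA): Biological function verified by biochemical or cell biology experiments
Inferred from Physical Interaction (IPI): Experimentally verified protein-protein, protein-nucleic acid, or protein-small molecule ligand interaction
Inferred from Mutant Phenotype (IMP): Biological function inferred from the functional difference between two alleles of the same gene
Inferred from Genetic Interaction (IGI): Experimentally verified biological function involving sequence changes or expression-level changes in two or more genes
Inferred from Expression Pattern (IEP): Biological process inferred from the location or timing of gene expression
Inferred from High Throughput Experiment (HTP): Biological function verified by high-throughput experiments
Inferred from High Throughput Direct Assay (HDA): Biological function verified by high-throughput biochemical or high-throughput cell biology experiments
Inferred from High Throughput Mutant Phenotype (HMP): Biological function inferred from the functional difference between two alleles of a gene in high-throughput experiments
Inferred from High Throughput Genetic Interaction (HGI): Biological function verified by high-throughput experiments involving sequence changes or expression-level changes in two or more genes
Inferred from High Throughput Expression Pattern (HEP): Biological process inferred from the location or timing of gene expression in high-throughput experiments
Inferred from Sequence or Structural Similarity (ISS): Biological function predicted from sequence analysis or structural similarity and manually reviewed
Inferred from Sequence Orthology (ISO): Biological function predicted from orthology relationships and manually reviewed
Inferred from Sequence Alignment (ISA): Biological function predicted from sequence alignment; both the function prediction and the alignment itself are manually reviewed
Inferred from Sequence Model (ISM): Biological function predicted from statistical models of protein families, such as hidden Markov models (e.g., Pfam), and manually reviewed
Inferred from Genomic Context (IGC): Biological function predicted from other genetic elements neighboring the target gene in the genome and manually reviewed
Inferred from Reviewed Computational Analysis (RCA): Biological function predicted from large-scale experimental data (e.g., yeast two-hybrid, mass spectrometry, microarrays) or from a combination of multiple data types, and manually reviewed
Inferred from Biological Aspect of Ancestor (IBA): Biological function of a descendant gene inferred from the function of its ancestral gene in a phylogenetic tree
Inferred from Biological Aspect of Descendant (IBD): Biological function of an ancestral gene inferred from the function of its descendant genes in a phylogenetic tree
Inferred from Key Residues (IKR): Loss of biological function inferred from the absence of key amino acid residues
Inferred from Rapid Divergence (IRD): Loss of biological function inferred from the rapid evolutionary divergence of a descendant gene from its ancestral gene
Traceable Author Statement (TAS): Biological function summarized from references cited in review articles or in the introduction or discussion sections of experimental papers
Non-traceable Author Statement (NAS): Biological function summarized from statements in the literature that lack explicit experimental evidence or supporting citations
Inferred by Curator (IC): Related biological function inferred from a protein's existing function annotations; for example, a eukaryotic protein with the known function "RNA polymerase II activity" is inferred to have the annotation "nucleus"
Inferred from Electronic Annotation (IEA): Computationally predicted biological function without manual review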
Table 1 Evidence codes used for Gene Ontology annotation
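
As an illustration of how the evidence codes in Table 1 could be used to separate low-throughput experiments, high-throughput experiments, and computational predictions, the following minimal Python sketch groups the 25 codes into broad categories. The category names, the function names (`categorize`, `count_by_category`), and the example annotations are assumptions made for this illustration and are not taken from the article.

```python
from collections import Counter

# Grouping of the Table 1 evidence codes. The category labels are
# illustrative assumptions; only the codes themselves come from Table 1.
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "high_throughput": {"HTP", "HDA", "HMP", "HGI", "HEP"},
    "reviewed_computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "phylogenetic": {"IBA", "IBD", "IKR", "IRD"},
    "author_statement": {"TAS", "NAS"},
    "curator": {"IC"},
    "electronic": {"IEA"},
}

# Reverse lookup: evidence code -> category label.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def categorize(code: str) -> str:
    """Return the category of an evidence code, or 'unknown' if unlisted."""
    return CODE_TO_CATEGORY.get(code.strip().upper(), "unknown")


def count_by_category(annotations):
    """Count (go_term, evidence_code) pairs per evidence category."""
    return Counter(categorize(code) for _, code in annotations)


# Toy annotations for a hypothetical protein (not real curation data).
example = [
    ("GO:0005634", "IDA"),
    ("GO:0003677", "HDA"),
    ("GO:0006355", "IEA"),
]

if __name__ == "__main__":
    for category, n in count_by_category(example).items():
        print(f"{category}: {n}")
```

A lookup table is used here only because the code list is small and fixed; a text mining tool would still need a separate model to assign such a category to a publication that has no curated evidence code.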