• Invited Review •
Chengxin ZHANG1,2
Received: 2025-01-02
Revised: 2025-03-04
Online: 2025-03-06
Contact: Chengxin ZHANG
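Abstract:
Understanding the biological functions of proteins is a prerequisite for successful quantitative synthetic biology. Outside a handful of model organisms, however, the functions of many proteins in most species have not been characterized experimentally, which makes automated and accurate protein function prediction algorithms especially important. In recent years, artificial intelligence algorithms, with deep learning as the leading example, have become the mainstream of protein bioinformatics, and their impact on protein function prediction is particularly pronounced. For example, in recent rounds of the Critical Assessment of Function Annotation (CAFA) challenge, the top-ranked methods used deep learning models, mainly large language models, to predict protein function through text mining. Specifically, these methods either predict Gene Ontology (GO) terms directly from text features extracted from the scientific literature, or infer GO terms from template proteins associated with similar literature. Although much effort has gone into developing more powerful deep learning models for text mining-based protein function annotation, these algorithms still have several long-overlooked problems in how they handle scientific literature data. This article first reviews existing methods and challenges in protein function annotation and highlights three such problems. First, most text mining-based function predictors use only the PubMed abstracts manually collected for the target protein by UniProt curators, ignoring literature not yet included in UniProt. Second, almost all methods process abstracts only, ignoring the more detailed full texts available from databases such as PubMed Central and Europe PMC. Third, few studies automatically distinguish between different categories of publications, such as low-throughput experiments, high-throughput studies, and computational predictions, which greatly increases the difficulty of text-based function annotation. The article then proposes promising approaches that draw on recent advances in artificial intelligence to improve text mining-based protein function annotation. These should support the development of next-generation text mining tools that specifically address the current difficulties in text data processing and achieve more accurate function annotation.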
Chengxin ZHANG. Challenges and opportunities in text mining-based protein function annotation[J]. Synthetic Biology Journal, DOI: 10.12211/2096-8280.2025-002.
Fig. 1 Illustration of Gene Ontology (GO). (a) The three aspects of GO: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). (b) The chemical reaction catalyzed by adenylate kinase. (c) The directed acyclic graph (DAG) formed by the GO term for adenylate kinase activity and its parent GO terms.
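Because a GO term and its parent terms form a directed acyclic graph, an annotation with a specific term also implies all of its ancestor terms. The following is a minimal sketch of that propagation. The `PARENTS` map uses simplified placeholder term names and edges chosen for illustration; it is not the actual GO hierarchy, and the function names are hypothetical.

```python
from collections import deque

# Toy child -> parents map. Term names and edges are simplified placeholders
# for illustration only; they do not reproduce the real GO graph.
PARENTS = {
    "adenylate kinase activity": ["kinase activity", "phosphotransferase activity"],
    "kinase activity": ["transferase activity"],
    "phosphotransferase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand a set of annotated terms with all of their ancestor terms."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded
```

A breadth-first walk with a `seen` set is used so that a term with several parents, as in a DAG, is visited only once.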
Fig. 2 Growth in the number of protein entries in the UniProt and Swiss-Prot databases over the past 14 years. (The drop in the number of UniProt entries in 2015 was caused by the introduction of a redundancy-removal procedure aimed mainly at microbial proteins: if two almost identical proteins come from different strains or isolates of the same species, only one is kept.)
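Evidence code | Description |
---|---|
Inferred from Experiment (EXP) | Biological function verified by experiment |
Inferred from Direct Assay (IDA) | Biological function verified by a biochemical or cell biology experiment |
Inferred from Physical Interaction (IPI) | Experimentally verified protein-protein, protein-nucleic acid, or protein-small molecule ligand interaction |
Inferred from Mutant Phenotype (IMP) | Biological function inferred from the functional difference between two alleles of the same gene |
Inferred from Genetic Interaction (IGI) | Biological function verified by experiments involving sequence changes or expression-level changes in two or more genes |
Inferred from Expression Pattern (IEP) | Biological process inferred from the location or timing of gene expression |
Inferred from High Throughput Experiment (HTP) | Biological function verified by a high-throughput experiment |
Inferred from High Throughput Direct Assay (HDA) | Biological function verified by a high-throughput biochemical or cell biology experiment |
Inferred from High Throughput Mutant Phenotype (HMP) | Biological function inferred from the functional difference between two alleles of a gene in a high-throughput experiment |
Inferred from High Throughput Genetic Interaction (HGI) | Biological function verified by high-throughput experiments involving sequence changes or expression-level changes in two or more genes |
Inferred from High Throughput Expression Pattern (HEP) | Biological process inferred from the location or timing of gene expression in a high-throughput experiment |
Inferred from Sequence or structural Similarity (ISS) | Biological function predicted from sequence analysis or structural similarity and manually reviewed |
Inferred from Sequence Orthology (ISO) | Biological function predicted from orthology and manually reviewed |
Inferred from Sequence Alignment (ISA) | Biological function predicted from a sequence alignment; both the prediction and the alignment itself are manually reviewed |
Inferred from Sequence Model (ISM) | Biological function predicted from a statistical model of a protein family, such as a hidden Markov model (e.g., Pfam), and manually reviewed |
Inferred from Genomic Context (IGC) | Biological function predicted from other genetic elements neighboring the target gene in the genome and manually reviewed |
Inferred from Reviewed Computational Analysis (RCA) | Biological function predicted from large-scale experimental data (e.g., yeast two-hybrid, mass spectrometry, microarray) or from a combination of multiple data types, and manually reviewed |
Inferred from Biological aspect of Ancestor (IBA) | Biological function of a descendant gene inferred from the function of an ancestral gene in a phylogenetic tree |
Inferred from Biological aspect of Descendant (IBD) | Biological function of an ancestral gene inferred from the function of descendant genes in a phylogenetic tree |
Inferred from Key Residues (IKR) | Loss of biological function inferred from the absence of key amino acid residues |
Inferred from Rapid Divergence (IRD) | Loss of biological function inferred from the rapid evolutionary divergence of a descendant gene from its ancestral gene |
Traceable Author Statement (TAS) | Biological function summarized from the references cited in a review article or in the introduction or discussion section of an experimental paper |
Non-traceable Author Statement (NAS) | Biological function summarized from statements in the literature that lack explicit experimental evidence or supporting citations |
Inferred by Curator (IC) | Related biological function inferred from the existing annotations of a protein; for example, from the known function "RNA polymerase II activity" of a eukaryotic protein, the curator infers that the protein should also be annotated with "nucleus" |
Inferred from Electronic Annotation (IEA) | Biological function predicted computationally without manual review |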
Tab. 1 Evidence codes used for Gene Ontology annotation
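The evidence codes in Tab. 1 fall into broad groups (low-throughput experimental, high-throughput, reviewed computational, phylogenetic, author statement, curator, and unreviewed electronic), and such a grouping is useful when selecting which annotations to trust. The sketch below encodes one possible grouping as a lookup table. The group names, the function names, and the protein and term identifiers in the usage example are illustrative assumptions, not part of any GO tool.

```python
# Evidence codes from Tab. 1, grouped in the order the table lists them.
# The group labels are illustrative names chosen for this sketch.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "high_throughput": {"HTP", "HDA", "HMP", "HGI", "HEP"},
    "computational_reviewed": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "phylogenetic": {"IBA", "IBD", "IKR", "IRD"},
    "author_statement": {"TAS", "NAS"},
    "curator": {"IC"},
    "electronic": {"IEA"},
}


def group_of(code):
    """Return the group label for an evidence code, or None if unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


def filter_annotations(annotations, allowed_groups):
    """Keep (protein, term, code) tuples whose code is in an allowed group."""
    return [a for a in annotations if group_of(a[2]) in allowed_groups]


# Usage with made-up protein and term identifiers.
example = [
    ("protein_1", "term_A", "IDA"),
    ("protein_1", "term_B", "IEA"),
    ("protein_2", "term_A", "HDA"),
]
trusted = filter_annotations(example, {"experimental", "high_throughput"})
```

Keeping the high-throughput group separate from the low-throughput experimental group makes it straightforward to include or exclude it depending on how strict the selection needs to be.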
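Method | Information source for function prediction (features) | Machine learning model |
---|---|---|
GOtcha, Blast2GO, BAR+ | Homologous sequences from BLASTp search | None |
ConFunc, PFP, GoFDR | Homologous sequences from PSI-BLAST search | None |
HFSP | Homologous sequences from MMseqs2 search | None |
ProFunc | Homologous sequences from BLASTp search; similar structures from SSM and Jess structure searches | None |
COFACTOR | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | None |
MetaGO | Homologous sequences from BLASTp and PSI-BLAST searches; similar structures from TM-align structure search; protein-protein interactions | Logistic regression |
StarFunc | Homologous sequences from BLASTp search; similar structures from Foldseek and TM-align structure searches; Pfam protein domain families; protein-protein interactions; target protein sequence (features extracted by the ESM protein language model) | Logistic regression, fully connected neural network, random forest |
DeepFRI, Struct2Go | Residue contact map extracted from the 3D structure; target protein sequence (one-hot encoding) | Graph convolutional neural network |
TALE-cmap | Residue contact map extracted from the 3D structure; multiple sequence alignment (features extracted by the ESM-MSA protein language model) | Transformer |
CLEAN-Contact | Residue contact map extracted from the 3D structure; target protein sequence (features extracted by the ESM protein language model) | Convolutional neural network |
MS-kNN | Homologous sequences; gene expression profiles; protein-protein interactions | k-nearest neighbors |
INGA | Homologous sequences from BLASTp search; protein-protein interactions; Pfam protein domain families | None |
GOLabeler | Homologous sequences from BLASTp search; InterPro protein domain families; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by ProFET) | Logistic regression, gradient boosted trees |
NetGO | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, sequence features extracted by ProFET) | Logistic regression, gradient boosted trees |
NetGO2.0 | Homologous sequences from BLASTp search; InterPro protein domain families; protein-protein interactions; target protein sequence (frequencies of fragments of three consecutive amino acid residues, one-hot encoding); PubMed abstracts | Logistic regression, bidirectional long short-term memory network, gradient boosted trees |
DeepGO, DeepGOplus, ProteInfer, DeepEC, ECPICK | Target protein sequence (one-hot encoding) | Convolutional neural network |
ATGO+ | Homologous sequences from BLASTp search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
InterLabelGO+ | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |
DeepGO-SE | Target protein sequence (features extracted by the ESM protein language model); protein-protein interactions | Fully connected neural network, graph attention network |
DeepECtransformer | Homologous sequences from DIAMOND search; target protein sequence (features extracted by the ESM protein language model) | Attention network |
CLEAN | Target protein sequence (features extracted by the ESM protein language model) | Fully connected neural network |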
Tab. 2 Existing methods for protein function prediction
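Two of the simplest sequence features listed in Tab. 2 are one-hot encoding and the frequencies of fragments of three consecutive residues. The following sketch shows one way to compute both, assuming the 20 standard amino acids as the alphabet. The function names are hypothetical and the code is not taken from any of the listed methods.

```python
from collections import Counter

# The 20 standard amino acids, used as the one-hot alphabet in this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-dimensional 0/1 vector.

    Residues outside the standard alphabet (e.g. 'X') become all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows


def trimer_frequencies(sequence):
    """Return the relative frequency of each overlapping 3-residue fragment."""
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fragment: count / total for fragment, count in counts.items()}
```

One-hot encoding keeps the residue order and yields one row per residue, so it suits convolutional models, whereas the fragment frequencies give a fixed-size summary that does not depend on sequence length.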
Fig. 4 Text mining-based protein GO term prediction in NetGO2.0. (a) LR-Text, the text mining model in NetGO2.0. (b) The architecture of Doc2Vec, one of the two text feature generation methods used in LR-Text. In this example, the Doc2Vec neural network is trained to predict the masked word "jump" from its context in the sentence "The quick brown fox ___ over the lazy dog." The word "the" is excluded from the input sentence because it carries no meaningful information.
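The training setup described in the caption can be sketched as the construction of (document, context, target) examples: stop words are dropped, each remaining word is masked in turn, and the model receives the document identifier plus the surrounding words as input. The code below only builds these examples and trains nothing. The symmetric window, the one-word stop list, and the function name are assumptions made for illustration, and the token "jump" follows the caption.

```python
# Minimal stop list for this sketch; the caption only mentions "the".
STOP_WORDS = {"the"}


def masked_word_examples(doc_id, tokens, window=3):
    """Build one training example per word: the document identifier and the
    neighbouring words are the input, and the masked word is the target."""
    words = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    examples = []
    for i, target in enumerate(words):
        # Up to `window` words on each side of the masked position.
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append({"doc": doc_id, "context": context, "target": target})
    return examples


sentence = "The quick brown fox jump over the lazy dog".split()
examples = masked_word_examples("doc0", sentence)
jump_example = next(e for e in examples if e["target"] == "jump")
```

With a window of three, the context for "jump" covers all remaining words of the sentence, which corresponds to the situation shown in the figure.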
Fig. 5 Three text mining-based models for protein GO term prediction in GOcurator. (a) LR-MEM. (b) GOXML. (c) GORetriever.
Fig. 6 Year of publication for the 8935 PubMed citations newly added to UniProt in 2024. (The 549 papers published in or before 1999 are not shown.)