Synthetic Biology Journal

Contents in Chinese and English

2023, 4(3): 1.

Design of synthetic biology components based on artificial intelligence and computational biology

WANG Sheng, WANG Zechen, CHEN Weihua, CHEN Ke, PENG Xiangda, OU Fafen, ZHENG Liangzhen, SUN Jinyuan, SHEN Tao, ZHAO Guoping

2023, 4(3): 422-443. doi:10.12211/2096-8280.2023-004

The primary objective of synthetic biology is to conceptualize, engineer, and construct novel biological components, devices, and systems based on established principles and existing information, or to reconfigure existing natural biological systems. Its core concept encompasses the design, modification, reconstruction, or fabrication of biological components, reaction systems, metabolic pathways and processes, and even the creation of cells and organisms with functions or living characteristics. This burgeoning field offers innovative technologies to address challenges of sustainable development in areas such as the environment, resources, and energy. Synthetic biology has made significant progress in numerous fields, ranging from DNA recombination to gene circuit design, yet its full potential remains insufficiently explored; the emergence and application of artificial intelligence (AI) can facilitate its development for more applications. From a synthetic biology perspective, the essence of life is rooted in digitalization and designability. This article reviews current advances in computational biology, particularly AI, that make synthetic biology more efficient and effective, focusing on the development of biocatalysts, regulators, and sensors. De novo enzyme design has been successfully implemented using Rosetta software, and AI exhibits significant potential for generating innovative structures and protein sequences with diverse functions. The reprogramming of natural enzymes for specific purposes is also crucial for synthetic biology applications. By employing various force fields and sampling techniques, promiscuity and thermal stability can be modified to meet specific requirements rather than those of natural hosts. 
AI can be integrated into the life cycle of synthetic biology through an active learning paradigm, which enables alterations in enzyme specificity and shows potential for predicting mutation effects accurately and rapidly, surpassing force-field-based methods. The rapidly decreasing cost of sequencing has facilitated the high-throughput characterization of cis-regulators, primarily DNA and RNA. Concurrently, more trans-regulators have been identified in sequenced genomes. This expanding wealth of big data serves as a driving force for AI. AI models have successfully predicted the strength of promoters, ribosome binding sites (RBSs), and enhancers, and have generated artificial promoters and RBSs. Recent progress in RNA structure prediction is expected to aid the design of RNA elements. Sensors, vital for genetic circuits and other applications such as toxin detection, typically involve interactions among various molecules, including nucleic acids, proteins, small organic molecules, and metal ions. Consequently, sensor design necessitates the integration of diverse computational biology tools to balance accuracy and computational cost. As the pool of data keeps growing, we anticipate that AI will be increasingly applied to the design of more bio-parts.

Rational design for functional topology and its applications in synthetic biology

SUN Zhi, YANG Ning, LOU Chunbo, TANG Chao, YANG Xiaojing

2023, 4(3): 444-463. doi:10.12211/2096-8280.2023-003

Biological networks are capable of performing complex functions with accuracy, reliability, and robustness. The topology, dynamic properties, and function of a network are closely related. Depicting this relationship quantitatively and discovering design principles for complex and diverse biological networks are great challenges. In this review, we comment on progress in this regard and its applications in synthetic biology. Biological networks differ statistically from random ones and show a modular organization. There are recurrent network motifs linked to particular functions, such as temporally programmed expression, reliable cell decisions, and robust biological oscillations, suggesting that despite the apparent complexity of cellular networks, there may be only a limited number of network topologies that execute a particular biological function robustly. Indeed, by enumerating all possible network topologies with two or three nodes, systems biology studies have shown that only a handful of topologies can perform a given function. Here, we first summarize the high-frequency modules and their functions in natural biological systems, review progress in systems biology toward finding design principles for functional topology, including the two current methods (i.e., enumeration and optimization) for searching for functional topologies computationally, and highlight the typical functional network topologies that have been developed theoretically so far. Then, we focus on the functional topologies that have been constructed and used in synthetic biology, such as genetic circuits based on transcriptional regulation. Organized by the number of nodes, we show typical examples and applications of different functional network topologies, including single-node positive/negative feedback loops, two-node positive/negative feedback loops, multi-node negative feedback/feedforward loops, and combinational logic gates. 
Finally, we address frontiers in functional topology, including the automated design of integrated gene circuits, regulatory mechanisms beyond transcription, and the design of network robustness. We end by briefly discussing opportunities and challenges in designing complex functional network topologies.
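
As a rough illustration of the enumeration approach mentioned in the abstract above, the following minimal Python sketch lists every signed, directed topology on a small number of nodes. It assumes, for illustration only, that each ordered pair of nodes (self-loops included) carries activation (+1), repression (-1), or no link (0); the function names are hypothetical and are not taken from the reviewed article. The functional screening step (simulating the dynamics of each topology) is not shown.

```python
from itertools import product


def enumerate_topologies(n_nodes):
    """Yield every signed, directed topology on n_nodes nodes.

    Assumption for this sketch: each ordered node pair (i, j), including
    self-loops, is labelled -1 (repression), 0 (no link), or +1 (activation).
    """
    edges = [(i, j) for i in range(n_nodes) for j in range(n_nodes)]
    for signs in product((-1, 0, 1), repeat=len(edges)):
        yield dict(zip(edges, signs))


def count_topologies(n_nodes):
    """Count topologies by exhausting the generator (3 ** (n_nodes ** 2))."""
    return sum(1 for _ in enumerate_topologies(n_nodes))


if __name__ == "__main__":
    for n in (2, 3):
        print(n, "nodes:", count_topologies(n), "topologies")
```

Under these assumptions the search space grows as 3 to the power of the squared node count, which is one reason exhaustive enumeration is practical only for very small networks and optimization-based searches are used for larger ones.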

Research progress of artificial intelligence in designing protein structures

CHEN Zhihang, JI Menglin, QI Yifei

2023, 4(3): 464-487. doi:10.12211/2096-8280.2023-008

Proteins are essential to life as they carry out a great variety of biological functions. Protein sequences determine their three-dimensional structures, and therefore their physiological functions. Proteins with specific functions have important applications in many fields such as biomedicine, where they are utilized in drug design and delivery. In the past, protein engineering and directed evolution were commonly used to improve the activity and stability of proteins. These methods, however, are both complex and expensive, as they require a large number of biological experiments for validation. Computational protein design (CPD) allows the design of amino acid sequences based on desired protein functions and structures and, more intriguingly, the generation of proteins not found in nature. Conventional CPD uses energy functions and optimization algorithms to design protein sequences. In recent years, with the rapid development of artificial intelligence (AI) techniques, the accumulation of big data, and the development of high-speed computing, AI has made great progress in learning and has been successfully applied to CPD. In this review, based on the input constraints and the size of the sampling space, we present a systematic overview of recent applications of AI in protein design from three aspects: fixed-backbone design, flexible-backbone design, and sequence structure generation. We focus on algorithms and protein feature encoding, present the effect of dataset size and architectural improvements on model prediction performance, and showcase several enzymes, antibodies, and binding proteins that were successfully designed using these models. The advantages of AI compared with traditional CPD methods are also discussed. Finally, we highlight challenges in AI-aided protein design and propose some possible solutions.

Application of deep learning in protein function prediction

SONG Yidong, YUAN Qianmu, YANG Yuedong

2023, 4(3): 488-506. doi:10.12211/2096-8280.2022-078

Protein function prediction is essential for bioinformatics analysis and benefits a wide range of biological studies, such as understanding the functions of metagenomes, uncovering mechanisms underlying diseases, and finding new drug targets. With the rapid development of high-throughput sequencing technology, protein sequence data have increased rapidly, but the functions of most proteins have not yet been identified. Since traditional biochemical experiments to determine protein functions are usually expensive, time-consuming, and inefficient, developing more efficient and effective computational methods for protein function prediction is of great significance. Deep learning technology has made breakthroughs in many fields, including image recognition, natural language processing, genomic analysis, and drug discovery. In this review, we address applications of deep learning in protein function prediction, which can be divided into residue-level binding site prediction and protein-level gene ontology (GO) prediction. Protein binding sites are regions that bind specific ligands; they play an important role in signal transduction and metabolism, and identifying them helps reveal the molecular mechanisms underlying diseases and design new drugs. Gene ontology is a standard function classification system for genes, which provides a set of annotations to comprehensively describe the properties of genes and gene products. Firstly, we introduce commonly used large-scale protein structure and function databases. Secondly, discriminative protein sequence and structure features are described. 
Thirdly, we summarize the latest protein function prediction methods: for binding site prediction, we introduce the latest methods by ligand type, including protein, peptide, nucleic acid, small molecule, and ion ligands; for GO prediction, we highlight the latest sequence-based, structure-based, and protein interaction network-based methods. Finally, we comment on the advantages and disadvantages of current protein function prediction methods and discuss future developments in this field.

Prediction of protein complex structure: methods and progress

HUANG He, WU Tong, WANG Wenda, LI Jiashan, SUN Daiwen, YE Qiwei, GONG Xinqi

2023, 4(3): 507-523. doi:10.12211/2096-8280.2022-079

Protein complexes carry out a variety of biological functions, and obtaining the three-dimensional structure of protein complexes is critical for understanding their functions. In many cases, not only can two proteins interact to form a protein dimer, but multiple proteins can also interact to form a protein multimer. It is difficult and time-consuming to resolve the structure of protein complexes by experiments. Recently, there have been attempts to predict the structure of multimers based on structure prediction for the monomers. Several groups in the CASP14 competition submitted predictions for protein complex targets, mainly using template-based methods or protein docking. Later, on the basis of AlphaFold2, researchers developed end-to-end structure prediction methods for complexes, which has accelerated the study of protein complex structure prediction. However, the accuracy of protein complex structure prediction is still lower than that of monomeric protein structure prediction. This review surveys recent methods and advances in protein complex prediction, including inter-chain residue contact prediction, protein docking, and end-to-end protein complex structure prediction. Firstly, AI algorithms for protein structure prediction are briefly introduced, including coevolutionary analysis and protein contact prediction, deep learning methods and protein structure prediction, pretrained models, and protein representation learning. Secondly, basic methods for predicting interactions between protein complexes are systematically summarized, from the construction of multiple sequence alignments of the complexes to the prediction of inter-residue contacts between chains of homologous or heterologous complexes. 
Finally, basic methods and ideas for protein complex structure prediction are explored from the viewpoints of interaction sites guiding complex structure prediction, protein molecular docking algorithms, end-to-end complex structure prediction methods, etc. To better predict the structure of protein complexes, effort should be devoted to the following aspects: 1) constructing protein complex datasets for training and evaluating multimer structure prediction methods, 2) developing efficient algorithms to improve prediction accuracy, such as MSA pairing algorithms and template building for multi-chain protein complexes, and 3) enlarging protein sequence and structure databases to better model protein complexes with pretraining and self-supervised learning methods. In all, predicting protein complex structure remains a challenge, and new methods that improve accuracy will be helpful for analyzing protein functions, designing proteins, and discovering drugs.

Enzyme engineering in the age of artificial intelligence

KANG Liqi, TAN Pan, HONG Liang

2023, 4(3): 524-534. doi:10.12211/2096-8280.2023-009

Enzymes have garnered significant attention in both research and industry due to their unparalleled specificity and functionality, and opportunities remain for enhancing their physicochemical properties and fitness to improve catalytic performance. The primary objective of enzyme engineering is to optimize the fitness of targeted enzymes through various strategies for their modification or even redesign. This review provides a comprehensive overview of progress made in enzyme engineering, with a focus on artificial intelligence (AI)-guided design methodology. Several key strategies have been employed in enzyme engineering, including rational design, directed evolution, semi-rational design, and AI-guided design. Rational design relies on an extensive knowledge base encompassing protein structures and catalytic mechanisms, allowing for purposeful manipulation of enzyme properties. Directed evolution, on the other hand, involves the generation of a library of random variants for subsequent high-throughput screening to identify beneficial mutations. Semi-rational design combines rational design and directed evolution, resulting in a smaller yet more targeted library of variants, which mitigates the high cost associated with extensive screening of the large libraries developed through directed evolution. In recent years, AI technologies, particularly deep neural networks, have emerged as a promising approach for enzyme engineering. AI-guided methods leverage a vast amount of information on protein sequences, multiple sequence alignments, and protein structures to learn key features and their correlations. These learned features can then be applied to various downstream tasks in enzyme engineering, such as predicting mutations with beneficial effects, optimizing protein stability, and enhancing catalytic activity. 
Herein, we delve into advancements and successes in each of these strategies for enzyme engineering, highlighting the growing impact of AI-guided design on the process. By offering a detailed examination of the current state of enzyme engineering, we aim to provide valuable insight for researchers and engineers to further advance the development and optimization of enzymes for more applications.

Data-driven prediction and design for enzymatic reactions

ZENG Tao, WU Ruibo

2023, 4(3): 535-550. doi:10.12211/2096-8280.2022-066

Enzymes are efficient catalysts with substrate specificity and stereo- and regioselectivity, and they are widely used in producing chemicals, drugs, and materials. Enzymes are at the core of biocatalysis, and thus the prediction of their functions and the design of enzymatic reactions are driving forces for intelligent biomanufacturing through biocatalysis. So far, limited understanding of enzymatic catalysis has hindered the exploration of enzymatic reactions for industrial applications. For example, it is difficult to predict enzymatic activities on unreported substrates, to elucidate synthetic routes for newly found structures of enzymes, and to redesign enzymes for specific scenarios. In the era of big data, data-driven approaches have exhibited powerful capabilities for exploring enzymatic reactions by bridging the gap between the large corpora of enzymatic data and the limited understanding of enzyme functions. Recently, computational tools and platforms have greatly accelerated experimental research and improved the design-build-test-learn cycle. Herein we review progress in computational tools for enzymatic reaction prediction and design, focusing on the application of deep learning methods in this field. Databases related to the key elements of enzymatic reactions (substrate, product, and enzyme) are summarized. Then, data-driven approaches for the forward and backward prediction of enzymatic reaction routes and enzyme functions, for enzyme design, and for theoretical calculation of enzymatic catalysis are addressed. Finally, the status and prospects of data-driven approaches for enzymatic catalysis prediction and design, including data, models, algorithms, and platforms, are discussed.

Target structure based computational design of cyclic peptides

WANG Fanhao, LAI Luhua, ZHANG Changsheng

2023, 4(3): 551-570. doi:10.12211/2096-8280.2023-006

Cyclic peptides (macrocycles) possess head-to-tail cyclic or partially cyclized substructures and have recently received more and more attention in drug development, since they have unique advantages in regulating protein-protein interactions (PPIs). Compared with small-molecule compounds, cyclic peptides are easier to design to bind target sites with high affinity and specificity, because PPI interfaces are broad, flat, and large. Moreover, cyclic peptides are generally more rigid and more resistant to digestion by proteases than their linear counterparts, making them more stable than linear peptides or proteins. Meanwhile, cyclic peptides are easier to modify to increase transmembrane activity, so that they can target intracellular proteins through conformation adaptation or chemical modifications. 3D structure data and structure modeling techniques are the basis for designing structure-based cyclic peptide drugs. In this review, we assess the structures of cyclic peptides and target proteins available in the Protein Data Bank (PDB). Then, we review the algorithms for conformation generation or structure prediction of cyclic peptides, including homology modeling, secondary structure prediction and optimization, backbone torsion sampling, and the distance geometry method. We also summarize progress in target structure-based computational design of cyclic peptides, including structure-based virtual screening, molecular dynamics simulation-aided methods, de novo design algorithms, and transmembrane cyclic peptide design. However, more generalized structure-based de novo design algorithms remain to be explored, and methods to incorporate unnatural amino acids or chemical modifications also need to be developed. 
It is worth noting that, as 3D structure data for cyclic peptides increase, data-driven machine learning methods may provide a more promising solution for improving the efficiency and effectiveness of structure-based cyclic peptide de novo design and conformation generation for developing cyclic peptide drugs in the future.

Applications of foldability in intelligent enzyme engineering and design: take AlphaFold2 for example

MENG Qiaozhen, GUO Fei

2023, 4(3): 571-589. doi:10.12211/2096-8280.2023-011

Natural enzymes often have advantages such as environmental friendliness and high catalytic efficiency. However, under the inappropriate pH, temperature, and other conditions of industrial environments, the application of natural enzymes in industrial production is unsatisfactory owing to challenges such as protein misfolding and limited functions. Compared with traditional methods, enzyme design and engineering with the help of artificial intelligence (AI) have the advantages of high efficiency, high speed, and low cost, but most work does not consider 'foldability' in the process of enzyme engineering. A designed enzyme may fold into another state of minimum energy, so-called misfolding. Protein design is generally regarded as an inverse folding process. Can we utilize protein folding tools to constrain the foldability of the designed enzyme? In recent years, protein structure prediction tools represented by AlphaFold2 have, with the help of AI, achieved breakthroughs in accuracy at the atomic level, which enriches existing enzyme structure data for subsequent studies addressing the above question. Therefore, we discuss applying protein structure tools to enzyme design and engineering in order to increase the proportion of reliable designed enzymes and reduce the cost of experiments. Firstly, we review the application of AI in enzyme design and engineering from the perspectives of sequence and structure. Then, we summarize existing protein structure prediction tools into four types and introduce their methods and prediction abilities. Furthermore, taking AlphaFold2 as an example, we group the applications that improve the rationality of enzyme modification and the 'foldability' of designs into three categories: 1) Structure 'Analyzer', 2) Mutation 'Filter', and 3) Folding 'Monitor'. Finally, we highlight drawbacks of existing algorithms for further improvement. 
With the rapid development of AI and a growing understanding of protein function mechanisms, the precision of enzyme modification and design will increase.

Pathological aggregation and liquid-liquid phase separation of proteins associated with neurodegenerative diseases

TANG Yiming, YAO Yifei, YANG Zhongyuan, ZHOU Yun, WANG Zichao, WEI Guanghong

2023, 4(3): 590-610. doi:10.12211/2096-8280.2023-005

Protein misfolding and aggregation are closely related to the development of neurodegenerative diseases. The main pathological hallmark of these diseases is protein inclusion bodies, whose major components are amyloid fibrils formed by abnormal protein aggregation. For example, Alzheimer's disease is related to the amyloid plaques formed by β-amyloid proteins and the neurofibrillary tangles formed by tubulin-associated unit (tau) proteins. The pathological feature of Parkinson's disease is Lewy bodies formed by aggregation of α-synuclein. In addition, recent studies have shown that a majority of neurodegenerative disease-related proteins, including tau, α-synuclein, and TDP-43, can undergo liquid-liquid phase separation to form liquid condensates or membraneless organelles. These condensates are involved in a number of cellular physiological processes, such as regulating signal transduction. Pathological fibrillization and liquid-liquid phase separation are two forms of protein aggregation, and protein liquid-liquid phase separation may be a driving force for misaggregation and fibrillization. Disease-related mutations, post-translational modifications including truncations, acetylations, and phosphorylations, and microenvironmental factors such as pH, ionic strength, and temperature can promote or inhibit liquid-solid phase transitions and the formation of pathological fibrils. Uncovering the molecular mechanisms underlying pathological protein aggregation and liquid-liquid phase separation is crucial to understanding the pathogenic process and developing effective therapeutic drugs. This review focuses on recent progress in experimental and computational studies of the pathological aggregation and liquid-liquid phase separation of neurodegenerative disease-related proteins, including β-amyloid, α-synuclein, TDP-43, tau, and FUS proteins. 
We briefly introduce the application of experimental methods (nuclear magnetic resonance, X-ray diffraction, and cryo-electron microscopy) for studying protein aggregation and determining fibril structure, together with cutting-edge techniques (differential interference contrast and fluorescence recovery after photobleaching) for exploring protein phase separation. Advances in characterizing the conformational ensembles of proteins using enhanced sampling methods such as replica-exchange molecular dynamics simulations, and studies of the phase behavior of proteins using field-theoretic and multiscale simulations, are summarized. Machine learning for predicting protein phase separation ability is also addressed.
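
To make the replica-exchange idea named in the abstract above more concrete, the following minimal Python sketch computes the standard Metropolis acceptance probability for swapping the configurations of two replicas held at different temperatures. It is a generic textbook illustration, not the protocol of any study covered by the review; the function name, the energy units (kcal/mol), and the example values are assumptions chosen for illustration.

```python
import math

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption here.
K_B = 0.0019872041


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Metropolis probability of exchanging two replicas.

    Replica i sits at temperature temp_i with potential energy energy_i,
    replica j at temp_j with energy_j. The swap is accepted with
    probability min(1, exp[(beta_i - beta_j) * (energy_i - energy_j)]).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    if delta >= 0.0:
        return 1.0  # favorable swap; also avoids overflow in exp()
    return math.exp(delta)


if __name__ == "__main__":
    # Hypothetical energies (kcal/mol) for replicas at 300 K and 310 K.
    print(swap_probability(-120.0, -100.0, 300.0, 310.0))
```

A swap that moves the lower-energy configuration to the colder replica is always accepted, while the reverse swap is accepted only occasionally, which is what lets configurations trapped at low temperature escape through the hotter replicas.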

Microbiome-based biosynthetic gene cluster data mining techniques and application potentials

LAI Qilong, YAO Shuai, ZHA Yuguo, BAI Hong, NING Kang

2023, 4(3): 611-627. doi:10.12211/2096-8280.2022-075

A biosynthetic gene cluster (BGC) is an important type of gene set that is commonly found in the genomes of various organisms and plays important metabolic and regulatory roles. In terms of linear gene structure, the genes in a BGC are usually located in close proximity to each other in the genome; in terms of function, they usually work synergistically and are responsible for a class of pathways that generate specific small molecules. Therefore, BGCs are vital in synthetic biology research as a highly promising source of elements. However, current BGC databases and analytical platforms are limited by the number and types of experimentally validated BGCs, as well as by preliminary BGC data mining techniques. The establishment of data-driven systematic discovery of BGCs and their validation, as well as translational studies, is of great value in both fundamental research and practical applications. This article focuses on mining BGCs from microbiome big data for synthetic biology research. We start by discussing the definition and significance of BGC mining, and summarize current data resources and methods for BGC mining, including MIBiG, antiSMASH, and IMG-ABC, as well as artificial intelligence (AI)-enabled web services that accelerate BGC mining. Then, we compile a walk-through of how typical BGC data mining could be conducted, highlighting the history of BGC mining methods and the progression from traditional machine learning to deep learning. We also diagnose bottlenecks in BGC mining and propose possible solutions. Furthermore, drawing on several BGC mining and validation experiments, we demonstrate the profound diversity and breadth of application scenarios for BGC discovery, as well as the importance of combining dry and wet lab experiments for validating newly discovered BGCs. 
Finally, we envision that the combination of advanced BGC mining methods and synthetic biology could broaden and deepen current synthetic biology research.

Table of Content