AI/​alphafold directory


“State-of-the-Art Estimation of Protein Model Accuracy Using AlphaFold”, Roney & Ovchinnikov 2022

“State-of-the-Art Estimation of Protein Model Accuracy using AlphaFold”⁠, James P. Roney, Sergey Ovchinnikov (2022-03-12; similar):

The problem of predicting a protein’s 3D structure from its primary amino acid sequence is a longstanding challenge in structural biology. Recently, approaches like AlphaFold have achieved remarkable performance on this task by combining deep learning techniques with coevolutionary data from multiple sequence alignments of related protein sequences. The use of coevolutionary information is critical to these models’ accuracy, and without it their predictive performance drops considerably. In living cells, however, the 3D structure of a protein is fully determined by its primary sequence and the biophysical laws that cause it to fold into a low-energy configuration. Thus, it should be possible to predict a protein’s structure from only its primary sequence by learning a highly-accurate biophysical energy function.

We provide evidence that AlphaFold has learned such an energy function, and uses coevolution data to solve the global search problem of finding a low-energy conformation.

We demonstrate that AlphaFold’s learned potential function can be used to rank the quality of candidate protein structures with state-of-the-art accuracy, without using any coevolution data.

Finally, we propose a method for utilizing this potential function to predict protein structures without the need for MSAs.

“FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours”, Cheng et al 2022

“FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours”, Shenggan Cheng, Ruidong Wu, Zhongming Yu, Binrui Li, Xiwen Zhang, Jian Peng, Yang You (2022-03-02; similar):

Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of the special performance characteristics and huge memory consumption.

In this paper, we propose FastFold, a highly efficient implementation of protein structure prediction model for training and inference. FastFold includes a series of GPU optimizations based on a thorough analysis of AlphaFold’s performance. Meanwhile, with Dynamic Axial Parallelism and Duality Async Operation, FastFold achieves high model parallelism scaling efficiency, surpassing existing popular model parallelism techniques.

Experimental results show that FastFold reduces overall training time from 11 days to 67 hours [2.8 days] and achieves 7.5–9.5× speedup for long-sequence inference. Furthermore, we scaled FastFold to 512 GPUs and achieved an aggregate of 6.02 PetaFLOPs with 90.1% parallel efficiency.

The implementation can be found on GitHub.
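As a quick sanity check on the figures quoted in the abstract above, the arithmetic can be worked through in a few lines. This is only a sketch: the input numbers come from the abstract, while the variable names and the per-GPU average are illustrative and not taken from the paper.

```python
# Figures quoted in the FastFold abstract.
baseline_hours = 11 * 24      # 11 days of baseline AlphaFold training
fastfold_hours = 67           # reported FastFold training time
gpus = 512                    # GPUs in the scaling experiment
aggregate_pflops = 6.02       # reported aggregate throughput

# Derived quantities (illustrative, not reported in the abstract).
training_speedup = baseline_hours / fastfold_hours   # roughly 3.9x
fastfold_days = fastfold_hours / 24                  # roughly 2.8 days
per_gpu_tflops = aggregate_pflops * 1000 / gpus      # average TFLOPs per GPU

print(f"training speedup: {training_speedup:.2f}x")
print(f"FastFold training time: {fastfold_days:.1f} days")
print(f"average throughput per GPU: {per_gpu_tflops:.2f} TFLOPs")
```

Note that the training speedup (about 4×) is a separate figure from the 7.5–9.5× speedup the abstract reports for long-sequence inference.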

“AlphaFold Accelerates Artificial Intelligence Powered Drug Discovery: Efficient Discovery of a Novel Cyclin-dependent Kinase 20 (CDK20) Small Molecule Inhibitor”, Ren et al 2022

“AlphaFold Accelerates Artificial Intelligence Powered Drug Discovery: Efficient Discovery of a Novel Cyclin-dependent Kinase 20 (CDK20) Small Molecule Inhibitor”⁠, Feng Ren, Xiao Ding, Min Zheng, Mikhail Korzinkin, Xin Cai, Wei Zhu, Alexey Mantsyzov, Alex Aliper, Vladimir Aladinskiy et al (2022-01-21; similar):

The AlphaFold computer program predicted protein structures for the whole human genome, which has been considered as a remarkable breakthrough both in artificial intelligence (AI) application and structural biology. Despite the varying confidence level, these predicted structures still could significantly contribute to the structure-based drug design of novel targets, especially the ones with no or limited structural information.

In this work, we successfully applied AlphaFold in our end-to-end AI-powered drug discovery engines constituted of a biocomputational platform PandaOmics and a generative chemistry platform Chemistry42, to identify a first-in-class hit molecule of a novel target without an experimental structure starting from target selection towards hit identification in a cost-efficient and time-efficient manner. PandaOmics provided the targets of interest and Chemistry42 generated the molecules based on the AlphaFold predicted structure, and the selected molecules were synthesized and tested in biological assays.

Through this approach, we identified a small molecule hit compound for CDK20 with a Kd value of 8.9 ± 1.6 μM (n = 4) within 30 days from target selection and after only synthesizing 7 compounds.

To the best of our knowledge, this is the first reported small molecule targeting CDK20 and more importantly, this work is the first demonstration of AlphaFold application in the hit identification process in early drug discovery.

“The Accuracy of Protein Structures in Solution Determined by AlphaFold and NMR”, Fowler & Williamson 2022

“The accuracy of protein structures in solution determined by AlphaFold and NMR”⁠, Nicholas J. Fowler, Mike P. Williamson (2022-01-20; similar):

In the recent CASP (Critical Assessment of Structure Prediction) competition, AlphaFold performed outstandingly. Its worst predictions were for NMR structures, which has two alternative explanations: either the NMR structures were poor, implying that AlphaFold may be more accurate than NMR; or there is a genuine difference between crystal and solution structures. Here, we use the program ANSURR, which measures the accuracy of solution structures, and show that one of the NMR structures was indeed poor. We then compare AlphaFold predictions to NMR structures, and show that AlphaFold tends to be more accurate than NMR ensembles, in particular correctly more rigid in loops. There are however some cases where the NMR ensembles are more accurate. These tend to be dynamic structures where AlphaFold had low confidence. We suggest that AlphaFold could be used as the model for NMR structure refinements, and that AlphaFold structures validated by ANSURR require no further refinement.

“Protein Structure Predictions to Atomic Accuracy With AlphaFold”, Jumper & Hassabis 2022

2022-jumper.pdf: “Protein structure predictions to atomic accuracy with AlphaFold”⁠, John Jumper, Demis Hassabis (2022-01-11)

“Computed Structures of Core Eukaryotic Protein Complexes”, Humphreys et al 2021

2021-humphreys.pdf: “Computed structures of core eukaryotic protein complexes”⁠, Ian R. Humphreys, Jimin Pei, Minkyung Baek, Aditya Krishnakumar, Ivan Anishchenko, Sergey Ovchinnikov et al (2021-11-11; similar):

[Lowe commentary] Deep learning for protein interactions: The use of deep learning has revolutionized the field of protein modeling⁠. Humphreys et al 2021 combined this approach with proteome-wide, coevolution-guided protein interaction identification to conduct a large-scale screen of protein-protein interactions in yeast (see the Perspective by Pereira and Schwede). The authors generated predicted interactions and accurate structures for complexes spanning key biological processes in Saccharomyces cerevisiae⁠. The complexes include larger protein assemblies such as trimers⁠, tetramers⁠, and pentamers and provide insights into biological function.

Introduction: Protein-protein interactions play critical roles in biology, but the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions not yet identified. High-throughput experimental methods such as yeast 2-hybrid and affinity-purification mass spectrometry have been used to identify interactions in multiple organisms, but there are inconsistencies between different datasets, and the methods do not provide high-resolution structural information. Here, we use deep learning methods to systematically identify and build structures for the protein complexes that mediate key processes in eukaryotes.

Rationale: Interacting proteins often co-evolve, and in prokaryotes, evolutionary information can be used to identify interactions on the proteome scale at an accuracy higher than that of experimental screens. Extending this method to eukaryotes is complicated because there are fewer genome sequences available, resulting in weaker coevolutionary signals. The deep learning methods RoseTTAFold and AlphaFold have a rich understanding of protein sequence-structure relationships, and so could help overcome this limitation.

Results: We developed a coevolution-guided protein interaction identification pipeline that incorporates a rapidly computable version of RoseTTAFold with the slower but more accurate AlphaFold to systematically evaluate interactions between 8.3 million pairs of yeast proteins. RoseTTAFold alone has comparable performance in identifying protein-protein interactions to that of large-scale experimental methods; combination with AlphaFold increases identification accuracy. In total, we constructed models for 106 previously unidentified assemblies and 806 that were structurally uncharacterized.

These complexes provide rich insights into a range of biological processes from transcription, translation, and DNA repair to protein transport and modification. For example, Rad51 plays a pivotal role in DNA repair through homologous recombination⁠, and mutations are associated with Fanconi anemia and cancer in humans. Rad55 and Rad57 are positive regulators of Rad51 assembly on single-stranded DNA. Our Rad55–Rad57–Rad51 complex model suggests that Rad55–Rad57 can bind at the 5′ end of the Rad51 single-stranded DNA filament and may stabilize the filament conformation of Rad51. Glycosylphosphatidylinositol transamidase (GPI-T) is a pentameric enzyme complex that catalyzes the attachment of GPI anchors to the C terminus of proteins. GPI-T is structurally uncharacterized, and mutations in subunits of the complex have been implicated in neurodevelopmental disorders and cancer in humans. Our model of the 5-protein assembly shows that the previously identified catalytic dyad is positioned adjacent to a channel formed by 3 other subunits that could function in C-terminal GPI-T signal peptide recognition.

Conclusion: Our approach extends the range of large-scale deep learning-based structure modeling from monomeric proteins to protein assemblies. Following up on the many new interactions and complex structures should advance the understanding of a wide range of eukaryotic cellular processes and provide new targets for therapeutic intervention. Our results herald a new era of structural biology in which computation plays a fundamental role in both interaction discovery and structure determination.

“Towards a Structurally Resolved Human Protein Interaction Network”, Burke et al 2021

“Towards a structurally resolved human protein interaction network”⁠, David F. Burke, Patrick Bryant, Inigo Barrio-Hernandez, Danish Memon, Gabriele Pozzati, Aditi Shenoy et al (2021-11-09; similar):

All cellular functions are governed by complex molecular machines that assemble through protein-protein interactions. Their atomic details are critical to the study of their molecular mechanisms but fewer than 5% of hundreds of thousands of human interactions have been structurally characterized. Here, we test the potential and limitations of recent progress in deep-learning methods using AlphaFold to predict structures for 65,484 human interactions. We show that higher confidence models are enriched in interactions supported by affinity or structure based methods and can be orthogonally confirmed by spatial constraints defined by cross-link data. We identify 3,137 high confidence models, of which 1,371 have no homology to a known structure, from which we identify interface residues harbouring disease mutations, suggesting potential mechanisms for pathogenic variants. We find groups of interface phosphorylation sites that show patterns of co-regulation across conditions, suggestive of coordinated tuning of multiple interactions as signalling responses. Finally, we provide examples of how the predicted binary complexes can be used to build larger assemblies. Accurate prediction of protein complexes promises to greatly expand our understanding of the atomic details of human cell biology in health and disease.

“A Structural Biology Community Assessment of AlphaFold 2 Applications”, Akdel et al 2021

“A structural biology community assessment of AlphaFold 2 applications”⁠, Mehmet Akdel, Douglas E. V Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros et al (2021-09-26; similar):

Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods have led to protein structure predictions that have reached the accuracy of experimentally determined models. While this has been independently verified, the implementation of these methods across structural biology applications remains to be tested. Here, we evaluate the use of AlphaFold 2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modelling of interactions; and modelling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modelled when compared to homology modelling, identifying structural features rarely seen in the PDB. AF2-based predictions of protein disorder and protein complexes surpass state-of-the-art tools and AF2 models can be used across diverse applications equally well compared to experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life science research.

“Single-sequence Protein Structure Prediction Using Language Models from Deep Learning”, Chowdhury et al 2021

“Single-sequence protein structure prediction using language models from deep learning”⁠, Ratul Chowdhury, Nazim Bouatta, Surojit Biswas, Charlotte Rochereau, George M. Church, Peter K. Sorger et al (2021-08-04; similar):

AlphaFold and related systems use deep learning to predict protein structure from co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite dramatic, recent increases in accuracy, three challenges remain: (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated, (2) rapid exploration of designed structures, and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN) able to predict protein structure from single protein sequences without use of MSAs. This deep learning system has two novel elements: a protein language model (AminoBERT) that uses a Transformer to learn latent structural information from millions of unaligned proteins and a geometric module that compactly represents Cα backbone geometry. RGN2 outperforms AlphaFold2 and RoseTTAFold (as well as trRosetta) on orphan proteins and is competitive with designed sequences, while achieving up to a 10⁶-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.

“Can AlphaFold2 Predict Protein-peptide Complex Structures Accurately?”, Ko & Lee 2021

“Can AlphaFold2 predict protein-peptide complex structures accurately?”⁠, Junsu Ko, Juyong Lee (2021-07-28; similar):

In this preprint, we investigated whether AlphaFold2 (AF2) can predict protein-peptide complex structures with only sequence information. We modeled the structures of 203 protein-peptide complexes from the PepBDB DB and 183 from the PepSet. The structures were modeled with the sequences of receptors and peptides concatenated via a poly-glycine linker. We found that for more than half of the test cases, AF2 predicted the bound structures of peptides with good accuracy, Cα-RMSD of a peptide < 3.0 Å. For about 40% of cases, the peptide structures were modeled with an accuracy of Cα-RMSD < 2.0 Å. Our benchmark results clearly show that AF2 has a great potential to be applied to various higher-order structure prediction tasks.
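The two mechanical steps in this abstract, joining receptor and peptide through a poly-glycine linker and scoring the peptide by Cα-RMSD, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names and the default linker length are hypothetical, and the RMSD assumes both coordinate sets are already in a common frame (for example, after superposing the complexes on the receptor), with no fitting performed here.

```python
import math

def link_with_polyglycine(receptor_seq, peptide_seq, linker_length=30):
    """Concatenate receptor and peptide through a run of glycines.

    linker_length is a hypothetical default, not a value from the abstract.
    """
    return receptor_seq + "G" * linker_length + peptide_seq

def ca_rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) C-alpha coordinates, in the same units as the input.

    Assumes the two structures are already superposed; no alignment is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy example: a three-residue peptide displaced by 2 Å along y.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 2.0, 0.0), (3.8, 2.0, 0.0), (7.6, 2.0, 0.0)]
toy_rmsd = ca_rmsd(predicted, reference)
toy_input = link_with_polyglycine("MKT", "AY", linker_length=4)
print(toy_input, toy_rmsd)
```

With thresholds like those in the abstract, the toy prediction would count as a success at the 3.0 Å cutoff but sits exactly on the stricter 2.0 Å cutoff.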

“Accurate Prediction of Protein Structures and Interactions Using a 3-track Network”, Baek et al 2021

“Accurate prediction of protein structures and interactions using a 3-track network”⁠, Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang et al (2021-06-15; backlinks; similar):

DeepMind presented remarkably accurate protein structure predictions at the CASP14 conference. We explored network architectures incorporating related ideas and obtained the best performance with a 3-track network in which information at the 1D sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The 3-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables rapid solution of challenging X-ray crystallography and cryo-EM structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate models of protein-protein complexes from sequence information alone, short circuiting traditional approaches which require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.

One-Sentence Summary

Accurate protein structure modeling enables rapid solution of structure determination problems and provides insights into biological function.

“Deep Learning Methods in Protein Structure Prediction”, Torrisi et al 2020

“Deep learning methods in protein structure prediction”⁠, Mirko Torrisi, Gianluca Pollastri, Quan Le (2020; similar):

Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the ’60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
