Computed Structure Models and RCSB.org
Protein Data Bank and RCSB.org
The Protein Data Bank (PDB) archives experimentally determined three-dimensional (3D) structural data of biological macromolecules. In addition to providing access to 3D structural data, all members of the worldwide PDB (wwPDB) offer tools to query the archive and then organize, visualize, and analyze groups of structures to learn about any topic of interest. The website of the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), RCSB.org, integrates information about the properties and functions of proteins from a variety of publicly available bioinformatics data resources, e.g., information about gene sequences, mutations, disease correlations, and small-molecule (drug) binding affinities. Mapping information integrated from these resources onto 3D structural data can provide insights beyond what is available from the molecular structures alone. Users can access the structural and integrated information on RCSB.org to gain new perspectives on a topic of interest and to ask new questions.
Expanding the limits of PDB
Although the PDB archive continues to grow rapidly in size and complexity, there are many millions of proteins whose structures have not yet been solved. For the past few decades, researchers have been developing a variety of approaches for computing reliable structure models of proteins. In 2020, two different projects [AlphaFold2 (Jumper et al., 2021; Varadi et al., 2022) and RoseTTAFold (Baek et al., 2021)] used artificial intelligence (AI) and machine learning (ML) to successfully predict protein structures from their sequences. These approaches use knowledge of protein structures from the PDB, together with vast amounts of protein sequence data, to compute the models.
Access to reliable computed structure models (CSMs) has created new opportunities for molecular exploration and analysis. When experimental structures of the proteins or complexes being studied are not available, CSMs can provide a useful alternative and/or an initial model for data analysis and hypothesis development. To make it easier for users to query, organize, visualize, analyze, and compare experimental and predicted structures alongside each other, RCSB PDB has integrated CSMs from a few specific resources.
Experimental Structures and CSMs at RCSB.org
Access to both experimental structures and CSMs on a topic of interest offers users insights and choice. When exploring structure-function relationships, experimental structures of a protein or parts of it are likely to be more accurate and to have a higher level of confidence than CSMs. Yet CSMs can provide useful starting models for the millions of proteins and complexes whose structures have not been experimentally determined. Note that the quality of a CSM may not be uniform: parts of the CSM that are computed based on existing PDB structures and various other experimental data are likely to be more accurate and predicted with a higher level of confidence. Learn more about the quality and confidence of experimental structures and CSMs. As with experimental structures in the PDB, the accuracy and confidence of the 3D models should be considered when using CSMs in structure analysis and hypothesis development.
What CSMs are available?
The computed structure models from the following providers are integrated with RCSB.org:
- AlphaFold Protein Structure Database. These models are state-of-the-art protein structure predictions based on amino-acid sequences, computed using an AI system called AlphaFold2 (Jumper et al., 2021; Varadi et al., 2022). The first set of models included in RCSB.org is the pre-packaged collection of 999,255 models released on 01-Jun-2022 (V3, made public on 28-Jul-2022 at https://ftp.ebi.ac.uk/pub/databases/alphafold/), which encompasses four main groups of models (as listed on https://alphafold.ebi.ac.uk/download):
  - Model organism proteomes: 326,175 protein structures from 48 different model organisms, e.g., Arabidopsis, E. coli, fruit fly, human, soybean, and zebrafish.
  - Global health proteomes: 238,274 protein structures from various disease-causing organisms, e.g., H. pylori, K. pneumoniae, M. tuberculosis, and P. falciparum.
  - Swiss-Prot sequences: 542,380 protein structures, 430,961 of which are in addition to those already included in the model organism and global health proteome sets.
  - MANE (Matched Annotation from NCBI and EMBL-EBI) Select sequences: 17,334 protein structures, 3,844 of which are in addition to those already included in the three groups listed above.
- ModelArchive. This database hosts user-submitted predictions of protein structures that were generated using a variety of approaches, e.g., homology modeling, ab initio, and deep learning techniques. The first batch of structures to be included in RCSB.org comprises the computed structures of core eukaryotic protein complexes produced by the Baker lab (Humphreys et al., 2021) and released on 11-Nov-2021. Models in this set were screened through paired multiple sequence alignments and were computed using a combination of RoseTTAFold (Baek et al., 2021) and AlphaFold2. Note that RoseTTAFold is a software tool that uses deep learning to quickly and accurately predict protein structures based on limited information.
  - The set of complexes totals 1,106 models (https://modelarchive.org/doi/10.5452/ma-bak-cepc).
How can you access the CSMs?
The following approaches are available to identify and query for experimental structures and CSMs on RCSB.org.
Identifying Type of 3D Structure
Specific icons (a dark blue flask icon for experimental structures and a cyan-colored computer icon for CSMs) are now used throughout the website to quickly and easily identify the source of 3D models selected for visualization and/or analysis.
Querying for Structures in RCSB.org
1. Options are available for structure queries to include CSMs alongside experimental structures (from the PDB) in the search results. When the toggle switch in the top search box is turned 'on' (i.e., is cyan-colored, as shown in Figure 1), CSMs are included in the search. Learn more about including/excluding CSMs in basic search. The Advanced Search Query builder also has a similar toggle switch to include CSMs (Figure 2).
|Figure 2: Advanced Search Query Builder options available from the RCSB.org home page|
2. New structure attributes for CSMs have been added to search options so that specific queries based on source database and confidence level can be made (Figure 3). Learn more about the Computed Structure Model Attributes.
|Figure 3: Structure Attributes (properties) available to search for CSMs using the Advanced Search Query Builder.|
3. Query results include icons to indicate whether matched structure(s) are experimental models or CSMs (Figure 4).
|Figure 4: Part of the Search Results page showing an experimental structure and a CSM, each marked with their respective icons (highlighted with a red outlined box).|
Note: Each CSM is assigned a specific ID in its source database (e.g., AlphaFold DB or ModelArchive). However, to keep the IDs compatible with many of our services, including all of our APIs and visualization tools, we identify CSMs on RCSB.org using a modified version of the ID. This ID is used on the structure summary page, in searching for structures, on the search results page, and in various tools for 3D structure visualization and analysis. For example, for the AlphaFold structure AF-B3EWR1-F1, the RCSB.org-assigned CSM ID is AF_AFB3EWR1F1, and this ID is used on the query results page as shown in Figure 4.
4. The default order of the search results is based on a relevancy score for the query criteria. The Refinements menu on the left of the query results page offers options to view only experimental structures or only CSMs in the search results (see Figure 5A). Learn more about the query results page and refinements. The order of the search results may also be changed according to a few options, e.g., view CSMs in the results first or last, or order the CSMs by pLDDT scores (see Figure 5B).
|Figure 5: Options to refine search results. A. Check boxes to selectively exclude experimental structures or CSMs; B. Options to order the search results to prioritize experimental structures or CSMs.|
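The ID mapping described in the note above can be sketched in a few lines of Python. This is a minimal sketch that assumes the rule suggested by the single example given (AF-B3EWR1-F1 becomes AF_AFB3EWR1F1): remove the hyphens, upper-case the result, and prepend the source prefix followed by an underscore. The function name `to_rcsb_csm_id` is hypothetical, and applying the same rule to ModelArchive IDs is an assumption that is not confirmed by the text.

```python
def to_rcsb_csm_id(source_id: str) -> str:
    """Convert a source-database model ID to an RCSB.org-style CSM ID.

    Assumed rule, inferred from the one documented example:
    "AF-B3EWR1-F1" -> "AF_AFB3EWR1F1".
    """
    parts = [p for p in source_id.strip().split("-") if p]
    if len(parts) < 2:
        raise ValueError(f"unexpected source ID format: {source_id!r}")
    prefix = parts[0].upper()           # "AF" for AlphaFold DB IDs
    compact = "".join(parts).upper()    # hyphens removed, e.g. "AFB3EWR1F1"
    return f"{prefix}_{compact}"


if __name__ == "__main__":
    print(to_rcsb_csm_id("AF-B3EWR1-F1"))
```

A converted ID can be checked by pasting it into the RCSB.org search box; if a source database uses a different convention, only the rule inside the function would need to change.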
What can you do with the CSMs at RCSB.org?
You can search for, explore, visualize, analyze, and compare experimental structures and CSMs at RCSB.org.
- Search for 3D structures (experimental structures and CSMs) using a variety of search services, including searches based on attributes, sequences, sequence motifs, structures, and structure motifs.
- View individual CSM structure summary pages that provide a quick overview of quality based on confidence levels defined by pLDDT scores. Learn more here.
- Each CSM structure summary page has options to visualize and analyze its 3D structure in a manner identical to that provided for experimental structures in the PDB.
- Download the 3D coordinates of any specific CSM structure from either the query results page or structure summary page. Note that requests to download batches of multiple CSM structures should be directed to the relevant model source database.
- Group CSM structures together with experimental structures to generate group views for comparison and analysis.
- Perform comparative analyses of protein 3D structures to find similarities within a set of CSMs or between CSM and PDB structures using the pairwise structure alignment tool.
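As an illustration of how pLDDT scores translate into confidence levels, the sketch below bins a score into the four bands commonly used by the AlphaFold Protein Structure Database (above 90, 70 to 90, 50 to 70, and below 50). The band labels, the handling of the exact boundary values, and the function name `plddt_confidence` are assumptions made for illustration; the structure summary pages on RCSB.org remain the reference for how confidence is actually reported.

```python
def plddt_confidence(plddt: float) -> str:
    """Map a pLDDT score (0-100) to a confidence band.

    Thresholds follow the AlphaFold DB convention; how values that fall
    exactly on a boundary are assigned is an assumption of this sketch.
    """
    if not 0.0 <= plddt <= 100.0:
        raise ValueError(f"pLDDT must be between 0 and 100, got {plddt}")
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def fraction_at_least_confident(per_residue_plddt: list[float]) -> float:
    """Fraction of residues whose pLDDT is in the top two bands (> 70)."""
    if not per_residue_plddt:
        raise ValueError("no per-residue scores given")
    good = sum(1 for score in per_residue_plddt if score > 70)
    return good / len(per_residue_plddt)


if __name__ == "__main__":
    scores = [95.2, 88.0, 64.5, 41.3]  # made-up values for illustration
    print([plddt_confidence(s) for s in scores])
    print(fraction_at_least_confident(scores))
```

Because CSM quality may not be uniform along the chain, a per-residue summary such as the second function can be more informative than a single global score when deciding which regions of a model to rely on.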
Models and Assembly
The CSMs available from RCSB.org make no claims about predicting higher-order oligomeric assemblies. However, so that CSMs can be included in structure-based queries and analyses (e.g., Find similar assembly, Structure search, Structure motif search), the model coordinates of CSMs are also provided as Assemblies, i.e., for CSMs the Model and Assembly coordinates are identical.
Examples
- Query for proteins (including CSMs) from the Mediterranean mussel (Mytilus galloprovincialis).
- Structure summary page for the Histone-4 protein of Mytilus galloprovincialis (from AlphaFoldDB).
- Query for high-quality (pLDDT > 90) computed structure models of human proteins.
- Query for 3D structures of myoglobin grouped by 30% sequence identity and displayed as groups.
- Query for mouse CSMs that do not have a corresponding experimental structure.
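One of the example queries above (high-quality CSMs of human proteins, pLDDT > 90) could also be expressed programmatically. The sketch below only builds a JSON request body in the style of the RCSB PDB Search API and does not send it. The endpoint URL, the `results_content_type` option, and the two attribute names are assumptions based on the public Search API documentation and should be verified there before use; `build_csm_query` is a hypothetical helper name.

```python
import json

# Assumed Search API endpoint; verify against the current API documentation.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# Assumed attribute names for source organism and global pLDDT score.
ORGANISM_ATTR = "rcsb_entity_source_organism.ncbi_scientific_name"
PLDDT_ATTR = "rcsb_ma_qa_metric_global.ma_qa_metric_global.value"


def build_csm_query(organism: str, min_plddt: float,
                    include_experimental: bool = False) -> dict:
    """Build a request body for CSMs of one organism above a pLDDT cutoff."""
    # Mirrors the "include CSMs" toggle: choose which content types to search.
    content_types = ["computational"]
    if include_experimental:
        content_types.insert(0, "experimental")

    def terminal(attribute: str, operator: str, value) -> dict:
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {"attribute": attribute,
                           "operator": operator,
                           "value": value},
        }

    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                terminal(ORGANISM_ATTR, "exact_match", organism),
                terminal(PLDDT_ATTR, "greater", min_plddt),
            ],
        },
        "return_type": "entry",
        "request_options": {"results_content_type": content_types},
    }


if __name__ == "__main__":
    print(json.dumps(build_csm_query("Homo sapiens", 90), indent=2))
```

The resulting dictionary would typically be sent as the JSON body of a POST request to the search endpoint. Keeping payload construction separate from the network call makes the query easy to inspect and adjust, for example by switching the organism to "Mus musculus".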
References
- Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., Wang, J., Cong, Q., Kinch, L. N., Schaeffer, R. D., Millán, C., Park, H., Adams, C., Glassman, C. R., DeGiovanni, A., Pereira, J. H., Rodrigues, A. V., van Dijk, A. A., Ebrecht, A. C., Opperman, D. J., … Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871–876. https://doi.org/10.1126/science.abj8754
- Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. https://doi.org/10.1038/s41586-021-03819-2
- Humphreys, I. R., Pei, J., Baek, M., Krishnakumar, A., Anishchenko, I., Ovchinnikov, S., Zhang, J., Ness, T. J., Banjade, S., Bagde, S. R., Stancheva, V. G., Li, X. H., Liu, K., Zheng, Z., Barrero, D. J., Roy, U., Kuper, J., Fernández, I. S., Szakal, B., Branzei, D., … Baker, D. (2021). Computed structures of core eukaryotic protein complexes. Science, 374(6573), eabm4805. https://doi.org/10.1126/science.abm4805
- Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., Yuan, D., Stroe, O., Wood, G., Laydon, A., Žídek, A., Green, T., Tunyasuvunakool, K., Petersen, S., Jumper, J., Clancy, E., Green, R., Vora, A., Lutfi, M., Figurnov, M., Cowie, A., Hobbs, N., Kohli, P., Kleywegt, G., Birney, E., Hassabis, D., Velankar, S. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50, D439–D444. https://doi.org/10.1093/nar/gkab1061