Pairwise Structure Alignment
What is Structure Alignment?
Structure alignment attempts to establish residue-residue correspondence between two or more macromolecular structures based on the optimal superposition of their shape and three-dimensional conformation. Structure alignment requires no prior knowledge of equivalent pairs of residues, does not rely on the sequence alignment, and the type of residues is ignored when the correspondence is established.
This tool presents options for pairwise structure alignment of proteins. In pairwise alignment, structures are always compared two at a time, in contrast to multiple structure alignment (reviewed in Ma and Wang, 2014), which provides a global solution for three or more structures.
Different types of structural alignments and their rationales are described below.
Rigid Body Alignment
In a rigid body alignment, the relative orientations and positions of atoms within each structure remain fixed during the alignment process. In the resulting superposition, only the overall shapes of the structures are aligned. Rigid body alignments are well-suited for identification of structural equivalences between proteins that are closely evolutionarily related and thus have similar shapes.
Flexible Alignment
In a flexible structure alignment, relative mobility between domains or subdomains in each structure is accommodated. When superposition by rigid alignment alone does not yield meaningful results, introducing flexibility to structural alignment becomes useful for two main reasons:
- It helps compare two protein chains that have adopted different conformational states, e.g., due to post-translational modifications such as phosphorylation or interaction with other proteins/ligands.
- It also helps identify conserved regions in proteins that may have a distant evolutionary relationship. For example, one of these proteins may contain extra loops or truncations that alter the relative orientation of different domains in the structures.
Most structure alignment algorithms assume that the structural units of two similar proteins appear in the same order (in the N-terminal to C-terminal direction) within their sequences. However, this assumption may not always be true. There are many examples of natural and designed proteins where the spatial arrangement of secondary structural elements or protein domains is maintained but the protein backbone connections between these structural elements are different - i.e., the proteins have different topologies.
One such example is circular permutation, where the relative locations of structural elements (and the N- and C-termini) within two proteins are different, but their overall shape and structure (e.g., secondary structural elements and their relative orientations) are conserved.
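As a toy illustration of the idea, a circular permutation can be pictured as a rotation of the N-to-C order of structural elements. The sketch below is not how CE-CP detects permutations (it works on 3D coordinates). It only shows the ordering relationship, and the element labels are hypothetical.

```python
def is_circular_permutation(order_a, order_b):
    """Return True if order_b is a rotation of order_a.

    Each argument lists labels of structural elements from the
    N-terminus to the C-terminus (labels are illustrative only).
    """
    if len(order_a) != len(order_b):
        return False
    if not order_a:
        return True
    doubled = order_a + order_a
    n = len(order_a)
    # Every rotation of order_a appears as a window of length n in order_a + order_a.
    return any(doubled[i:i + n] == order_b for i in range(n))


# Hypothetical helix (H) and strand (S) labels, not taken from any real entry.
protein_1 = ["H1", "S1", "S2", "H2", "S3"]
protein_2 = ["H2", "S3", "H1", "S1", "S2"]  # same elements, termini moved

print(is_circular_permutation(protein_1, protein_2))  # True
print(is_circular_permutation(protein_1, ["S1", "H1", "S2", "H2", "S3"]))  # False
```

A sequence-order-dependent aligner would only be able to match one contiguous part of `protein_1` to `protein_2`, which is why a permutation-aware method is needed for such pairs.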
When is Structure Alignment useful?
Structure alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Pairwise protein structure comparison can be used for analysis of conformational changes on ligand binding, analysis of structural variation between proteins within an evolutionary family, and identification of common structural domains.
Structure Alignment Interface
The structure alignment tool provides a simple-to-use, web-accessible interface for performing a wide range of structural superpositions. The tool can be accessed from the “Analyze” section of the menu bar. The interface allows you to align one or more structures to a given reference structure. You can select up to 10 structures for comparison. The first structure will be used as the reference, and all other structures will be aligned to it in a pairwise manner.
|The user interface allows selecting protein structures for structure alignment|
You can choose one of the following options to specify structures:
- A - use Entry ID to select an existing database entry (e.g. 1AOB)
- B - use Web Link to specify the location of the structure file on the web
- C - use File Upload to upload your own structure file with coordinates. Accepted file formats include PDBx/mmCIF - must have .cif or .bcif extension, and PDB - must have .pdb or .ent extension.
The Chain ID input field must be populated. Selected chains must be at least 10 residues long and the structure must contain the coordinates of at least the C-alpha backbone atoms.
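The input rules above can be expressed as a small pre-flight check before uploading a file. This is only a sketch of the stated rules, not the tool's own validation: the function name `check_upload` is hypothetical, and treating file extensions case-insensitively is an assumption.

```python
import os

# Extensions and minimum length as stated in the text above.
ACCEPTED_EXTENSIONS = {".cif", ".bcif", ".pdb", ".ent"}
MIN_CHAIN_LENGTH = 10


def check_upload(filename, chain_id, n_residues_with_ca):
    """Return a list of problems; an empty list means the input looks acceptable.

    n_residues_with_ca is the number of residues in the selected chain
    that have C-alpha coordinates (counted by the caller).
    """
    problems = []
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not chain_id:
        problems.append("chain ID is required")
    if n_residues_with_ca < MIN_CHAIN_LENGTH:
        problems.append("chain has fewer than 10 residues with C-alpha coordinates")
    return problems


print(check_upload("my_model.cif", "A", 141))  # []
print(check_upload("my_model.xyz", "", 5))     # three problems reported
```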
When a valid Entry ID is provided, the selection of chain IDs will be available listing the proteins with sequences longer than 10 residues. For other options, chain ID must be typed in. Note that the chain IDs are case-sensitive.
When the structure is provided as a file in PDBx/mmCIF format, the chain ID should correspond to the _label_asym_id assigned for each chain during the deposition. See this documentation article for more information on PDB identifiers for macromolecular chains.
If only a part of the polymeric chain should be compared, segments of polymer chains can be chosen by specifying residue ranges using the PDB residue numbers (sequential numbers from 1 to N, given by _label_seq_id). Note that if you are matching residues based on author-specified residue numbers (e.g., those reported in manuscripts), you may first have to convert them to _label_seq_id values. If no range is specified, all residues of the chain are included in the alignment by default.
When at least 2 chains are selected, the Compare button becomes available to launch the structure alignment.
A number of algorithms are provided to perform pairwise structural alignments. Brief descriptions of these algorithms are included below:
|jFATCAT-rigid||The structure alignment algorithm Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) allows for flexible protein structure comparison (Ye and Godzik, 2003; Li et al., 2020). This tool offers use of the Java port of the original FATCAT. The rigid flavor of the algorithm is used to run a rigid-body superposition that only considers alignments with matching sequence order. For most structures the performance of this structure alignment is similar to that of CE.|
|jFATCAT-flexible||The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains. This is the case, for example, when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or have mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, requiring that results be carefully reviewed.|
|jCE||The original Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998) works by identifying segments of the two structures with similar local structure, and then combining those regions to align the maximum number of residues while keeping the root mean square deviation (RMSD) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent).|
|jCE-CP||Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of the loops connecting secondary structural elements in a domain is different. Combinatorial Extension with Circular Permutations (CE-CP, Bliven et al., 2015) allows the structural comparison of such circularly permuted proteins.|
|TM-align||Sequence-independent protein structure comparison TM-align is sensitive to global topology (Zhang and Skolnick, 2005). It uses dynamic programming iterations to generate sequence-independent residue-to-residue alignments between template and model structures.|
|Smith-Waterman 3D||Smith and Waterman's 1981 algorithm aligns similar sequence segments using the BLOSUM65 scoring matrix. Smith-Waterman 3D is based on this algorithm and superposes two structures using the residue pairs from the sequence alignment. Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues, can lead to a high RMSD in the resulting superposition.|
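To make the sequence-guided approach in the last row more concrete, the following is a minimal sketch of Smith-Waterman local alignment. It deliberately swaps in a simple match/mismatch score and a linear gap penalty in place of the BLOSUM65 matrix, so its scores are not comparable to the tool's. The function name and parameter values are illustrative assumptions.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, pairs) for the best local alignment.

    pairs is a list of (index_in_a, index_in_b) for aligned residues.
    Scoring is a simplified stand-in for a substitution matrix.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic programming table; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


print(smith_waterman("ACDEFG", "XXCDEFYY"))
# (8, [(1, 2), (2, 3), (3, 4), (4, 5)])
```

In a sequence-guided superposition, the returned residue pairs would be the equivalences whose C-alpha atoms are superposed, which is why a misplaced gap can inflate the RMSD.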
Below you see the pairwise structure alignment of hemoglobin subunit alpha (PDB ID 4HHB, chain A) and neuroglobin (PDB ID 1OJ6, chain A). The following information is reported about the superimposed molecules:
- Description lists the name of the molecule
- Organism lists scientific names of organisms associated with taxonomy codes aggregated by the NCBI Taxonomy Database
- Sequence Length is the total number of polymeric residues in the deposited sequence for a given chain
- Modeled Residues is the number of residues with coordinates that were used for structure alignment
Superposed structures are shown in 3D using the interactive molecular visualization tool, Mol*. The sequence alignment that results from structure alignment is shown above the Mol* viewer. Pairs of residues that are structurally equivalent are colored orange (the first structure) or blue (the second structure).
|The structure alignment results display: sequence alignment and superposed 3D structures|
The structure alignment methods report different scores to assess the quality of the alignment. Each score has its particular properties that may be of interest in specific cases:
- RMSD (root mean square deviation) is computed between aligned pairs of the backbone C-alpha atoms in superposed structures. The lower the RMSD, the better the structure alignment between the pair of structures. This is the most commonly reported metric when comparing two structures, but it is sensitive to local structural deviations: if a few residues in a loop are poorly superposed, the RMSD value is large even though the rest of the structure is well aligned
- TM-score (template modeling score) is a measure of topological similarity between the template and model structures (Xu and Zhang, 2010). The TM-score ranges between 0 and 1, where 1 indicates a perfect match and 0 indicates no match between the two structures. Scores below 0.2 usually indicate that the proteins are unrelated, while scores above 0.5 generally indicate that the proteins have the same fold (e.g., as classified by SCOP/CATH)
- Sequence Identity (sequence identity percentage) is the percent of paired residues in the alignment that are identical in sequence
- Equivalent Residues is the number of residue pairs that are structurally equivalent in the alignment
- Reference/Target Coverage is the fraction of residues matched by the superposition (related by spatial proximity) relative to the total number of modeled residues being aligned
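The scores listed above can be sketched in a few lines, assuming the structures are already superposed and the equivalent C-alpha atoms are given as paired coordinate lists. This is an illustration, not the tool's implementation: the function names are hypothetical, the TM-score uses the standard length-dependent scale d0 = 1.24(L - 15)^(1/3) - 1.8, and clamping d0 to 0.5 for short chains is an assumption made here to keep the formula defined. Gapped alignment strings with `-` for unpaired positions are also an assumed input format.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD over paired, already superposed C-alpha coordinates."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def tm_score(coords_a, coords_b, norm_length):
    """TM-score over paired coordinates, normalized by norm_length residues."""
    d0 = 1.24 * max(norm_length - 15, 0) ** (1 / 3) - 1.8
    d0 = max(d0, 0.5)  # assumption: clamp for short chains
    terms = (1 / (1 + (math.dist(p, q) / d0) ** 2) for p, q in zip(coords_a, coords_b))
    return sum(terms) / norm_length


def sequence_identity(aln_a, aln_b):
    """Percent of paired (non-gap) alignment columns with identical residues."""
    paired = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not paired:
        return 0.0
    return 100 * sum(x == y for x, y in paired) / len(paired)


def coverage(n_equivalent, n_modeled):
    """Fraction of modeled residues that are part of the alignment."""
    return n_equivalent / n_modeled


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(a, b))                           # sqrt(2), about 1.414
print(sequence_identity("ACDE-", "AGDEK"))  # 75.0
print(coverage(4, 5))                       # 0.8
```

Because each TM-score term is bounded by 1 and down-weights large distances, a few badly superposed residues lower the TM-score only slightly, whereas the same residues can dominate the RMSD.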
View and Download Options
Options in the pull-down menu "Selecting a View" can be used to change what is currently displayed in the interactive Mol* viewer. The options are shown in the figure and listed below:
|Select View options and the 3 different views of the aligned structures to see only the aligned residues, the protein chains, or the entire models as 3D representation of the alignment results|
- Aligned Residues: these are residues within a distance cutoff, defined for the alignment method. Note that the aligned regions of the two structures are shown in orange and blue
- Polymer Chains: show the full protein chains, including any parts of the polymer chain that are not aligned. Regions of the polymer chain that are not aligned are colored in lighter shades of orange and blue
- Full Structures: shows the full content of the deposited entry for the two structures being compared - including polymers, carbohydrates, ligands and water molecules. Regions of the polymer chain and other polymer entities that are not aligned are colored in lighter shades of orange and blue
The 3D View can be expanded to full screen to provide fine-grained control over the view. Mol* will create designated components for a given selection that can be toggled or removed. Built-in Mol* functionality is available to change coloring and representations. Using the Set Coloring menu option for any component shown in the Mol* full-screen view, the coloring can be changed as desired, or the original (structure alignment) coloring can be restored with the Superpose coloring option.
|Full screen view of the Selected View of the structure alignment in Mol*. The aligned residues are shown in orange and blue, while the parts that are not aligned are shown in lighter shades of the same colors.|
Options in the "Export" pull-down menu can be used to download coordinates, sequences, and matrices used for the alignment. The options are shown in the figure and listed below:
- Superposed Structures - allows downloading the transformed atomic coordinates in mmCIF format for both structures after superposition
- Sequence Alignment - allows downloading the aligned sequences in FASTA format from the selected structure alignment
- Transformation Matrices - allows downloading a JSON file with the 4x4 transformation matrix, in column-major (j * 4 + i indexing) format, used to superimpose the structures
Note that downloading the superposed structures will include only the coordinates of the structure that is currently loaded into the viewer (e.g. residues, chains or full structures). The superposed structures can also be downloaded from the Mol* user interface, under the Export panel.
In any structure alignment, the first structure (query) is assumed to be rigid. The second structure (target) is superposed on the query structure. The Transformation Matrices are the operations necessary to move the coordinates of the target structure to match the query structure. In rigid-body alignment, the transformation matrix of the single block is saved, while in flexible (and circular permutation) alignments, transformation matrices for each flexible region (block) are reported in the downloaded file.
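A downloaded matrix can be applied to target coordinates outside the viewer. The sketch below assumes the matrix has already been read from the JSON file as a flat list of 16 numbers (the layout of the surrounding JSON is not described here, so no field names are assumed) and that points are treated as column vectors with the translation in the last column. Both function names are hypothetical.

```python
def unpack_column_major(flat):
    """Turn a flat 16-element column-major list into a 4x4 row-indexed matrix.

    Element (row i, column j) is stored at flat[j * 4 + i].
    """
    if len(flat) != 16:
        raise ValueError("expected 16 values for a 4x4 matrix")
    return [[flat[j * 4 + i] for j in range(4)] for i in range(4)]


def transform_point(matrix, point):
    """Apply a 4x4 transformation to an (x, y, z) point."""
    x, y, z = point
    vec = (x, y, z, 1.0)  # homogeneous coordinates
    return tuple(sum(matrix[i][j] * vec[j] for j in range(4)) for i in range(3))


# Illustrative matrix: identity rotation with a translation of (10, 20, 30).
flat = [1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        10, 20, 30, 1]

matrix = unpack_column_major(flat)
print(transform_point(matrix, (1.0, 2.0, 3.0)))  # (11.0, 22.0, 33.0)
```

For flexible or circular permutation alignments, each block's matrix would be applied only to the residues belonging to that block.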
Share Alignment Results
The Copy Link option is available when both structures are selected by providing Entry IDs as input. After clicking the Copy Link button, the alignment results URL is copied to your clipboard and can be pasted into an e-mail, document, spreadsheet, notepad, or any other file or web page.
Align Multiple Structures to a Given Reference
You can overlay one or more proteins onto a common reference structure by structurally aligning them. Up to 10 structures can be selected. This can be useful to produce superpositions of different domains on a full protein. This example combines the AlphaFold model of human Hepatocyte nuclear factor 4-alpha (AF-P41235-F1, in orange) and 2 PDB structures: the crystal structure of the human HNF4α DNA binding domain in complex with its DNA target (3CBB C[auth A], in blue) and a complex of HNF-4α bound to a fatty acid ligand and the SRC-1 coactivator peptide (1PZL A, in green).
With the availability of Computed Structure Models (CSMs) from RCSB.org, this example can also be run using the RCSB.org assigned CSM ID for the AlphaFold structure (AF_AFP41235F1) instead of providing a web link to access it (https://alphafold.ebi.ac.uk/files/AF-P41235-F1-model_v2.cif). Learn more about CSMs and the RCSB.org.
1. Rigid Body Structure Alignment
Alignment of the mammalian tubulin (1TUB.A) with a close structural homolog within prokaryotes, the bacterial cell division protein FtsZ (1FSZ.A), shows that these proteins are structurally similar (with a reported RMSD of 3.02 Å) despite low sequence identity (13.5%).
|Structural alignment of the mammalian tubulin (1TUB.A, in orange) and the bacterial cell division protein FtsZ (1FSZ.A, in blue)|
2. Rigid Body vs Flexible Structure Alignment
The structures of calmodulin with and without calcium bound can be much better aligned using a flexible rather than a rigid-body alignment algorithm. Below is an example of two calmodulin structures, calcium-loaded (1CLL.A) and calcium-free (1QX5.A), aligned with the jFATCAT-flexible (left) and jFATCAT-rigid (right) algorithms.
3. Sequential vs Circular Permutation Structure Alignment
The proteins in this example, Concanavalin A (PDB ID 3cna, chain A or 3CNA.A) and peanut lectin (PDB ID 2pel, chain A or 2PEL.A), are related by a circular permutation. The 3D folds of the two proteins are highly similar, but the N- and C-termini are located at different positions. While the sequence-order-dependent jCE algorithm can only find part of the alignment, the jCE-CP algorithm can discover the full alignment.
References
- Bliven, S E, Bourne, P E, Prlić, A (2015) Detection of circular permutations within protein structures using CE-CP. Bioinformatics, 31(8): 1316–1318. doi: 10.1093/bioinformatics/btu823 (CE-CP)
- Li, Z, Jaroszewski, L, Iyer, M, Sedova, M, Godzik, A (2020) FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Research, 48(W1): W60–W64. doi: 10.1093/nar/gkaa443 (FATCAT 2.0)
- Ma, J, Wang, S (2014) Algorithms, Applications, and Challenges of Protein Structure Alignment. Advances in Protein Chemistry and Structural Biology, 121–175. doi: 10.1016/B978-0-12-800168-4.00005-6
- Shindyalov, I N, Bourne, P E (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, Design and Selection, 11(9): 739–747. doi: 10.1093/protein/11.9.739 (CE)
- Smith, T F, Waterman, M S (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147(1): 195–197. doi: 10.1016/0022-2836(81)90087-5 (for Smith-Waterman 3D)
- Ye, Y, Godzik, A (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19 Suppl 2: ii246–55. doi: 10.1093/bioinformatics/btg1086 (FATCAT)
- Zhang, Y, Skolnick, J (2005) TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33: 2302–2309. doi: 10.1093/nar/gki524 (TM-align)