Overview: Advanced Search
What is Advanced Search?
Besides the 3D coordinates of the atoms in a structure, every PDB entry includes a variety of metadata about the experiment, the polymer sequences, and the ligands present in the structure. Information and annotations from other data resources are also connected to each PDB entry.
Why use Advanced Search?
RCSB PDB Advanced Search options allow you to query all data in the coordinate files and their associated annotations to rapidly find structures, polymers, and ligands relevant to the topic of interest.
Advanced Search Query Builder Interface
The Advanced Search Query Builder interface features powerful options for constructing complex searches and managing results. Attribute queries can be seamlessly combined with sequence and structure similarity searches.
Queries: Using the Advanced Search Options
The Advanced Search options allow you to construct complex composite queries to quickly find structures and/or information related to the topic of interest by:
- defining the query either as specifically or as broadly as is appropriate for the search being performed.
- organizing search results in a manner that facilitates rapid identification of the matches of interest.
Criteria for Searching
RCSB PDB Advanced Search options allow you to query the archive in four distinct ways:
- Attribute Search: This option allows for three types of searches based on specific text-based or numerical properties of entries, assemblies, or ligands.
  - Full Text searches of PDB entries and their associated annotations.
  - Structural Attributes searches of textual and numerical properties of PDB structures, their associated experimental details, and their polymer molecules (e.g., name, identifiers).
  - Chemical Attributes searches of the names and classifications of small molecules (ligands, inhibitors, drugs, etc.) that are present in PDB structures.
- Sequence Based Search: This option allows for searches based on the polymer sequences present in PDB structures.
  - Sequence search uses the polymer sequences of all or a significant portion of the proteins and nucleic acids present in a structure.
  - Sequence Motif search uses a short polymer query sequence.
- Structure Based Search: This option enables searches based on 3D structural alignments.
  - Structure search is based on 3D shape.
  - Structure Motif search is based on the local arrangement of a selected set of structural building blocks in a given PDB structure.
- Chemical Search: This option allows for searches based on chemical information (e.g., chemical formula and descriptors like SMILES, InChI).
Composite and complex Boolean queries can be constructed using the 'AND', 'OR', and 'NOT' options available on the Advanced Search interface. These operators can be applied to specific attributes or groups of attributes and used as follows:
- AND: identify structures, polymers, ligands, or assemblies that meet all the specified criteria.
- OR: identify structures, polymers, ligands, or assemblies that meet any one of the specified criteria.
- NOT: exclude structures, polymers, ligands, or assemblies that meet the specified criteria.
Different search types (i.e., attribute-based, sequence-based, structure-based, and chemical information-based searches) may be combined to construct complex composite queries. As you assemble a composite query, you have the option of running subqueries to assess how many structures, polymers, etc. match those selected portions of the overall query.
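The Boolean logic described above can be sketched with ordinary set operations. This is only an illustration of how AND, OR, and NOT combine subquery results, not how the RCSB PDB search service is implemented. The `Term` and `Group` classes, the `evaluate` function, and the single-letter result IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

# Hypothetical universe of result IDs; a real search runs against the whole archive.
ALL_IDS = frozenset({"A", "B", "C", "D", "E"})


@dataclass(frozen=True)
class Term:
    """A leaf subquery, represented here by the set of IDs it matches."""
    label: str
    matches: FrozenSet[str]


@dataclass(frozen=True)
class Group:
    """A Boolean node: op is 'AND' or 'OR'; negate=True applies NOT to the result."""
    op: str
    children: Tuple["Node", ...]  # assumed to be non-empty
    negate: bool = False


Node = Union[Term, Group]


def evaluate(node: Node) -> FrozenSet[str]:
    """Return the set of IDs matched by a term or by a group of subqueries."""
    if isinstance(node, Term):
        return node.matches
    sets = [evaluate(child) for child in node.children]
    if node.op == "AND":
        result = frozenset.intersection(*sets)  # must meet all criteria
    elif node.op == "OR":
        result = frozenset.union(*sets)         # must meet any one criterion
    else:
        raise ValueError(f"unknown operator: {node.op}")
    return ALL_IDS - result if node.negate else result


# Made-up subquery results, for illustration only.
high_res = Term("resolution better than 2.0", frozenset({"A", "B", "C"}))
human = Term("source organism is human", frozenset({"B", "C", "D"}))
has_ligand = Term("contains a ligand", frozenset({"C", "E"}))

no_ligand = Group("OR", (has_ligand,), negate=True)
query = Group("AND", (high_res, human, no_ligand))

# Running a subquery on its own shows how many hits that part contributes.
subquery_count = len(evaluate(Group("AND", (high_res, human))))
final_hits = evaluate(query)
```

Evaluating a subtree separately mirrors the option of running a subquery to see how many results match that portion before running the whole composite query.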
Search Result Return Options
PDB archive-wide searches can return two kinds of results:
- Structural data: PDB entry, entity, and/or assembly information pertaining to one or more structures in the archive.
- Molecular definitions: chemical component dictionary and BIRD molecule matches to specific search criteria.
Before you launch a search, you can select the type of results you wish to see: PDB structures (entries), polymer entities (e.g., those that match a sequence), assemblies of biological macromolecules, or small-molecule ligands. Select the type from the Return Options pulldown menu in the lower left corner of the Advanced Search Query Builder.
Several options for the types of search results are available in the pulldown, with the default option being “Structures”.
- Structures - PDB entries, designated with a 4-character alphanumeric identifier, e.g. 1Q2W. See grouping options for Structures.
- Polymer Entities - distinct (chemically unique) polymeric molecules present in PDB entries, specifically proteins (polypeptides), DNA (polydeoxyribonucleotide), and RNA (polyribonucleotide). See grouping options for Polymer Entities.
- Non-polymer Entities - small chemicals (enzyme cofactors, ligands, ions, etc.) defined as non-polymers in the coordinate file. A non-polymer Entity ID is a combination of a PDB ID and entity ID, e.g., 4HHB_3. This option can be useful when searching for small chemicals within their macromolecular context, for instance covalently bound ligands such as ATP in 2CCH (entity 2CCH_4).
- Assemblies - the macromolecular quaternary structure believed to be the functional form of the molecule (also referred to as the “biological unit”).
- Molecular Definitions - includes standard and modified amino acids (e.g., ALA) and nucleotides (e.g., A), small molecule ligands (e.g., ATP, or HEM) as they are defined in the wwPDB Chemical Component Dictionary (CCD), and peptide-like molecules as they are defined in the Biologically Interesting molecule Reference Dictionary (BIRD), (e.g., PRD_000010).
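The identifier shapes quoted in the list above (1Q2W, 4HHB_3, ATP, PRD_000010) can be told apart with a few regular expressions. The sketch below is a heuristic inferred only from those examples, not an official identifier grammar. The `classify_identifier` function and its category labels are hypothetical.

```python
import re

# Patterns inferred from the examples in the text. They are assumptions:
# real identifiers may follow rules that these heuristics do not capture.
_PATTERNS = [
    ("BIRD molecule", re.compile(r"^PRD_\d{6}$")),              # e.g. PRD_000010
    ("entity", re.compile(r"^[A-Za-z0-9]{4}_\d+$")),            # e.g. 4HHB_3
    ("structure", re.compile(r"^[A-Za-z0-9]{4}$")),             # e.g. 1Q2W
    ("chemical component", re.compile(r"^[A-Za-z0-9]{1,3}$")),  # e.g. ATP, A
]


def classify_identifier(identifier: str) -> str:
    """Return the first category whose pattern matches, or 'unknown'."""
    for kind, pattern in _PATTERNS:
        if pattern.match(identifier):
            return kind
    return "unknown"


for example in ["1Q2W", "4HHB_3", "2CCH_4", "ATP", "PRD_000010"]:
    print(example, "->", classify_identifier(example))
```

An entity identifier alone does not say whether the entity is a polymer or a non-polymer; that distinction comes from the entry's own data, so the sketch reports both simply as "entity".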
The Results Page
Depending on the Return options selected, the search results page displays lists of structures or PDB entries, polymer entities, ligands, assemblies, etc.
- View - the search results can be viewed in a variety of formats:
  - Summary - every search hit is displayed with an image and summary information.
  - Gallery - every search hit is displayed with an image only.
  - Compact - every search hit is displayed with summary information only.
Options for Grouping Results
The PDB archive includes multiple structures of some proteins, providing snapshots of the structure, interactions, and functions of these particular proteins under different conditions. While this redundancy provides a deeper understanding of the biology of these macromolecules, it may present some challenges in bioinformatics analyses. When queries return many results that are the same protein or similar proteins, it is helpful to be able to remove redundancy and to group and organize search results, for three main reasons:
- Remove undesirable biases - which may be introduced if a result set has high numbers of similar and homologous proteins.
- Have smaller datasets of distinct representatives - which may be important as the size of the PDB continues to grow.
- Draw attention to the full range of query matches - by hiding away redundant results, relevant results lower in the result list become more prominent.
Redundancy occurs at many levels (such as the level of sequence or structure similarity), and a variety of different grouping methods can be applied to PDB data in order to provide a non-redundant view. Available grouping options can be selected from the “grouped by” dropdown at the bottom of the Advanced Search Query Builder. The options depend on the selected return type and are described below.
Grouping Structures
When “Return” is set to “Structures”, the search results can be grouped by Group Deposition ID. A Group Deposition ID is a common identifier assigned to a group of PDB entries deposited as a collection via the RCSB PDB Group Deposition server. For example, most entries in a group may have the same protein(s), but with different bound ligands.
- Selecting this option for the search will list one representative structure for each of the Grouped depositions, while all other structures with the same Group Deposition ID are hidden.
- Structures in the results that do not have any Group Deposition ID are not listed in the grouped results.
Once a query with a grouping option is run, the search results show one representative per group. Various criteria are available for selecting the representative structure:
- Resolution: Best - the best or highest resolution structure (lowest number in the experimental structure resolution) for the refined structural model. Structures with no resolution have lower ranking compared to structures having assigned resolution values.
- Entry All Residues: Most - the largest total count of monomers for all polymer entity instances reported per deposited structure model.
- Entry Modeled Residues: Most - the largest total count of monomers with reported coordinate data for all polymer entity instances reported per deposited structure model.
- Entry Chain Count: Most - the largest total count of polymer entity instances per deposited structure model.
- Score: Best - the most relevant for a given search query.
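Grouping by Group Deposition ID with the "Resolution: Best" criterion can be sketched as follows. This is a minimal illustration of the behavior described above, not the RCSB PDB implementation. The `Entry` class, the function names, and all IDs and resolution values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    pdb_id: str                  # placeholder ID, not a real PDB entry
    group_id: Optional[str]      # Group Deposition ID, None if not in a group
    resolution: Optional[float]  # lower number = better; None if not reported


def resolution_rank(entry: Entry) -> Tuple[bool, float]:
    """Sort key: reported resolutions first (lowest value first), missing ones last."""
    if entry.resolution is None:
        return (True, 0.0)
    return (False, entry.resolution)


def group_by_deposition(entries: List[Entry]) -> Dict[str, Entry]:
    """Return one representative per Group Deposition ID."""
    groups: Dict[str, List[Entry]] = {}
    for entry in entries:
        if entry.group_id is None:
            continue  # entries without a Group Deposition ID are not listed
        groups.setdefault(entry.group_id, []).append(entry)
    return {gid: min(members, key=resolution_rank) for gid, members in groups.items()}


entries = [
    Entry("ex01", "G_1", 1.8),
    Entry("ex02", "G_1", 1.5),
    Entry("ex03", "G_1", None),
    Entry("ex04", None, 1.0),   # dropped: no Group Deposition ID
    Entry("ex05", "G_2", None),
]
representatives = group_by_deposition(entries)
```

Here `ex02` represents `G_1` because it has the lowest resolution value, and `ex05` represents `G_2` even without a resolution because it is the only member of that group.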
Grouping Polymer Entities
When “Return” is set to “Polymer Entities”, the search results can be grouped in the following ways:
- By Sequence Identity - polymer entities in the result list can be matched by using specific sequence identity criteria (from 100% to 30%).
- By UniProt ID - polymer chains in the result set are matched by the UniProt Accession associated with the polymer sequence.
Once a query with a grouping option is run, only one representative per group is displayed in the search results; all others in the same group are hidden.
The representative within the group is selected based on a few available criteria in the corresponding drop-down menu:
- Resolution: Best - polymer from the best or highest resolution structure (lowest number in the experimental structure resolution) for the refined structural model. Structures with no resolution have lower ranking compared to structures having assigned resolution values.
- Entity Residue Count: Most - the longest residue (monomer) sequence of the experimental sample (includes modeled and unmodeled residues).
- Release Date: Newest - the most recent date of initial structure release.
- Score: Best - the most relevant for a given search query.
- Coverage: Largest (only for UniProt Accession groups) - the largest percent coverage of the entire UniProt sequence.
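The selection criteria above can be modeled as a table of ranking functions, one per menu option. The sketch below groups polymer entities by UniProt accession and picks the top-ranked member of each group. The `PolymerEntity` fields, the `CRITERIA` table, and all values are hypothetical. "Score: Best" is left out because the score depends on the query that was run.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class PolymerEntity(NamedTuple):
    entity_id: str               # placeholder ID, not a real entity
    uniprot: str                 # placeholder accession
    resolution: Optional[float]  # lower number = better; None if not reported
    residue_count: int           # modeled and unmodeled residues
    release_date: str            # ISO date, so string order is chronological
    coverage: float              # fraction of the UniProt sequence covered


# Each criterion maps to a key function; the entity with the LARGEST key wins.
CRITERIA: Dict[str, Callable[[PolymerEntity], object]] = {
    # Reported resolutions outrank missing ones; among reported, lower is better.
    "Resolution: Best": lambda e: (e.resolution is not None, -(e.resolution or 0.0)),
    "Entity Residue Count: Most": lambda e: e.residue_count,
    "Release Date: Newest": lambda e: e.release_date,
    "Coverage: Largest": lambda e: e.coverage,
}


def representatives(entities: List[PolymerEntity], criterion: str) -> Dict[str, PolymerEntity]:
    """Group entities by UniProt accession and keep the best one per group."""
    key = CRITERIA[criterion]
    groups: Dict[str, List[PolymerEntity]] = {}
    for entity in entities:
        groups.setdefault(entity.uniprot, []).append(entity)
    return {acc: max(members, key=key) for acc, members in groups.items()}


entities = [
    PolymerEntity("ent_a", "ACC_1", 2.1, 150, "2010-05-01", 0.60),
    PolymerEntity("ent_b", "ACC_1", 1.6, 120, "2018-03-15", 0.45),
    PolymerEntity("ent_c", "ACC_2", None, 300, "2021-01-20", 0.90),
]
by_resolution = representatives(entities, "Resolution: Best")
by_coverage = representatives(entities, "Coverage: Largest")
```

With this made-up data, the same group yields different representatives depending on the criterion: `ent_b` for "Resolution: Best" and `ent_a` for "Coverage: Largest".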