New Access to the PDB archive
RCSB PDB has expanded our data storage capacity through the Amazon Web Services (AWS) Open Data Sponsorship Program. The AWS program is providing more than 100 terabytes of storage for no-cost delivery of Protein Data Bank information to millions of scientists, educators, and students around the world working in fundamental biology, biomedicine, bioenergy, and bioengineering/biotechnology.
Access the PDB archive over AWS at https://s3.rcsb.org/. Data are organized following the PDB FTP tree structure and are updated weekly. Annual and milestone snapshots of the archive are also available on AWS.
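As an illustration of what "following the PDB FTP tree structure" can mean in practice, here is a minimal Python sketch that builds the download URL for a single entry. It is an assumption, not something stated in this announcement, that coordinate files sit under `pub/pdb/data/structures/divided/mmCIF/`, grouped into directories named after the two middle characters of the four-character PDB ID. The function name `mmcif_url` is hypothetical. Check the archive listing before relying on this layout.

```python
BASE_URL = "https://s3.rcsb.org"

# Assumed layout, mirroring the "divided" directories of the PDB FTP tree.
MMCIF_PREFIX = "pub/pdb/data/structures/divided/mmCIF"


def mmcif_url(pdb_id: str, base_url: str = BASE_URL) -> str:
    """Build the (assumed) URL of the gzipped mmCIF file for one entry."""
    entry = pdb_id.strip().lower()
    if len(entry) != 4 or not entry.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    # The directory name is taken from the two middle characters of the ID.
    shard = entry[1:3]
    return f"{base_url}/{MMCIF_PREFIX}/{shard}/{entry}.cif.gz"


if __name__ == "__main__":
    print(mmcif_url("4HHB"))
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlretrieve` from the Python standard library, to fetch the file.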
“For more than five decades, the global Protein Data Bank has enabled basic, translational, and clinical research by providing open access to three-dimensional (3D) biostructure information at the atomic level,” said Stephen K. Burley, M.D., D.Phil., director of the RCSB PDB, founding director of Rutgers Institute for Quantitative Biomedicine, University Professor and Henry Rutgers Chair at Rutgers University. “Open access to Protein Data Bank information is central to accelerating scientific discoveries for the benefit of all humanity.”
The AWS Open Data Sponsorship Program covers the cost of storage and egress for publicly available, high-value, cloud-optimized datasets to successful applicants. Working with data providers, Amazon aims to provide open access to data by making it available for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets.
“Access to open data sets is improving the way the scientific community can collaborate and accelerate life-changing discoveries,” said Andreia Pierce, Head of AWS Research. “The Protein Data Bank provides a vast and diverse repository for researchers in government, academia, and industry to use to develop diagnostics, vaccines, drugs, and other therapeutic treatments. AWS can help provide the Protein Data Bank the capacity to scale up to meet the increasing demand to continue to provide free and open access information and unlock the latest analytic capabilities.”
The Protein Data Bank archive currently houses nearly 190,000 experimentally determined 3D structures of proteins, DNA, and RNA that are freely available with no limitations on usage. The archive is jointly managed by the Worldwide Protein Data Bank partnership, involving data centers in the United States, Europe, and Asia. The U.S. data center is operated by the RCSB PDB at Rutgers, the University of California, San Diego/San Diego Supercomputer Center, and the University of California, San Francisco.
“The Protein Data Bank plays an important role in facilitating discovery and development of life-changing drugs,” added Burley, who also co-leads the Cancer Pharmacology Research Program at Rutgers Cancer Institute of New Jersey. “Freely available 3D biostructure data constitute a public good with far-reaching impacts on patients and their families.”
The RCSB PDB has been operating the United States data center for the global Protein Data Bank for more than 20 years. Burley is an expert in structural biology, molecular biophysics, computational biology, data science, structure-guided/fragment-based drug discovery, and clinical medicine/oncology.
Researchers using the structure data stored in the Protein Data Bank have published more than two million scientific papers, some of which have helped researchers and pharmaceutical companies tackle major health challenges, including heart disease, cancer, diabetes, Alzheimer’s disease, HIV/AIDS, and most recently, the COVID-19 pandemic.
“The sponsorship of the RCSB Protein Data Bank by the AWS Open Data Sponsorship Program is fantastic. It will provide key data storage and distribution support for one of the most valued and scientifically impactful data resources in the biological sciences,” said Dr. Rommie Amaro, Professor and Endowed Chair at the University of California San Diego.
“It will also open up new avenues for scientific collaboration based around cloud computing services within and outside AWS, making it easier for scientists to do better science faster and with fewer logistical hurdles, particularly in the areas of computational biology and biochemistry, molecular dynamics simulations, and artificial intelligence,” added Amaro.
Through the Open Data Sponsorship Program, AWS has sponsored access to petabytes of data, including satellite imagery, climate and weather data, genomic data, and data used for natural language processing. The full list of publicly available datasets is available on the Registry of Open Data on AWS.