NIH Cloud Platform Interoperability Effort

National Center for Biotechnology Information (NCBI)

https://www.ncbi.nlm.nih.gov/

NCBI hosts and manages the Database of Genotypes and Phenotypes (dbGaP) and NIH’s Sequence Read Archive (SRA). dbGaP provides and manages access to protected data related to human studies that have investigated the interaction of genotype and phenotype. In partnership with the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NCBI has made the entire corpus of SRA and computational tools accessible on the cloud (commercial and open access) in addition to NCBI’s local servers.

NCBI is actively developing technical solutions that modernize and streamline secure access to controlled-access data via the GA4GH Data Repository Service (DRS) and the NIH Researcher Auth Service (RAS).

The central goal is to create an equitable and interoperable ecosystem in which NIH-funded data are FAIR (findable, accessible, interoperable, and reusable). NCBI is also an engaged partner in developing community-driven solutions that provide secure access to protected data across the federated data access landscape.

Funder

This work was supported by the National Library of Medicine (NLM), NIH Office of Data Science Strategy (ODSS) and STRIDES.

PIs

NCBI

Kim Pruitt, Valerie Schneider, Kurt McDaniel, Ravinder Eskandary, Kurt Rodarmer Sr, Lon Phan, Mike Feolo, dbGaP Team, Rodney Brister, Yuriy Skripchenko, SRA team.

OER

Julia Slutsman’s team.

OSP

Taunton Paine’s team.

CIT

Jeff Erickson’s and Rebeka Rosen’s teams.

ODSS

Coordination support.

Institutions

NIH Institutes and Centers, NIH GDS Taskforce including representatives of IC Genomic Program Administrators and Data Access Committees.

Acknowledgement

We are grateful to submitting researchers for sharing their data in dbGaP and SRA, and to researchers who request access to these data to further scientific knowledge.

Datasets

dbGaP Data

| Data type | Count |
|---|---|
| Studies | 1,865 |
| Subjects | ~2.9 Million |
| Samples | ~3.4 Million |
| Phenotype: Variables | 370,825 |
| Phenotype: Values | ~2.5 Billion |
| Study Documents | 7,120 |
| Association Analyses | 7,883 |
| Genotype Assays (array) | ~2 Million |
| Genotype Assays (imputed) | 543,137 |
| Genotype Assays (seq derived) | 399,269 |
| Sequence (WGS SRA) | 178,288 |
| Sequence (WXS SRA) | 271,447 |
| Sequence (RNAseq SRA) | 86,879 |
| Epigenomic (SRA) | ~35,000 |

See dbGaP Summary Stats; numbers change daily.

SRA Data

Public Sequence Data (Number of Records)

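| Data Format | Public Dataset (Hot) | Commercial (Hot) | Commercial (Cold) | Open Data Program (Hot) | Commercial (Hot) | Commercial (Cold) |
|---|---|---|---|---|---|---|
| Source | 0 | 0 | 14.2M | 0 | 0 | 14.2M |
| SRA Normalized | 775,619 | 7.5M | 5.9M | 13.4M | 825,126 | 0 |
| SRA Lite | 0 | 8.0M | 0 | 0 | 0 | 0 |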

Controlled-Access Sequence Data (Number of Records)

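| Data Format | Public Dataset (Hot) | Commercial (Hot) | Commercial (Cold) | Open Data Program (Hot) | Commercial (Hot) | Commercial (Cold) |
|---|---|---|---|---|---|---|
| Source | 0 | 0 | 749,915 | 0 | 0 | 749,801 |
| SRA Normalized | 0 | 608,671 | 141,244 | 0 | 705,101 | 44,700 |
| SRA Lite | 0 | 1,751 | 0 | 0 | 0 | 0 |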

Services

  • Entrez/SOLR indexing of public metadata.
  • Cross-linking of accessions across the dbGaP, BioProject, BioSample and SRA databases.

Request

  • dbGaP Controlled Access system
  • PI Selection / Reporting
  • DAC Review

Data Access/Download

  • RAS Clearinghouse (CLR): Performs auditing and validation of RAS passport tokens and dbGaP permissions on behalf of RAS clients.
  • IDX: Performs object ID exchange between SRA INSDC-style accessions and DRS IDs.
  • DRS: Resolves DRS ids to object location and validates access authorization via Clearinghouse.
  • FHIR API: Provides access to dbGaP study level metadata.
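
As an illustration of the DRS entry above, the sketch below turns a hostname-based DRS URI into the HTTPS object URL defined by the GA4GH DRS v1 specification and prepares, without sending, a metadata request. The hostname `drs.example.org`, the object ID, and the helper names are hypothetical, and presenting the RAS passport token as a bearer token is an assumption; the header or request body that NCBI's DRS server actually expects may differ.

```python
from urllib.parse import quote, urlparse
from urllib.request import Request

# Path prefix for object lookups in the GA4GH DRS v1 specification.
DRS_OBJECTS_PATH = "/ga4gh/drs/v1/objects"


def drs_uri_to_url(drs_uri: str) -> str:
    """Convert a hostname-based DRS URI (drs://host/id) to its HTTPS object URL."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError("DRS URI has no object ID")
    return f"https://{parsed.netloc}{DRS_OBJECTS_PATH}/{quote(object_id, safe='')}"


def build_object_request(drs_uri: str, token: str) -> Request:
    """Build (but do not send) a GET request for a DRS object's metadata.

    Assumption: the token is accepted as an HTTP bearer token.
    """
    return Request(
        drs_uri_to_url(drs_uri),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


if __name__ == "__main__":
    # Hypothetical host, object ID, and token, for illustration only.
    request = build_object_request("drs://drs.example.org/abc123", "PLACEHOLDER_TOKEN")
    print(request.get_method(), request.full_url)
```

The request is only constructed here; sending it with `urllib.request.urlopen` would require a real DRS endpoint and a valid token.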

Tools

  • SRA Data Locator (SDL): Finds data in the appropriate storage locations.
  • SRA Toolkit: Supports retrieval of SRA data and conversion into the requested file format (FASTQ, …).
  • Cloud Data Delivery Service (CDDS): Delivers data held in cold storage to a researcher's cloud bucket on request.
  • ElasticBLAST: Handles large sequence-based queries. Cloud native; alpha versions are available on AWS and GCP.
  • BLAST+ in Docker: Cloud-based BLAST.

A comprehensive list of NCBI tools is available at All Resources - Site Guide - NCBI (nih.gov).
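
To make the SRA Toolkit entry in the list above concrete, here is a minimal sketch that assembles the two toolkit commands commonly used to download a run and convert it to FASTQ (`prefetch` and `fasterq-dump`). The helper names, the output directory, and the example accession are illustrative assumptions. The commands are only built, not executed, unless `run_commands` is called, and controlled-access runs need additional credentials that are not shown.

```python
import re
import shutil
import subprocess

# Simplified pattern for INSDC run accessions (SRR, ERR, DRR followed by digits).
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def toolkit_commands(accession: str, outdir: str = "fastq") -> list:
    """Return the argv lists for downloading a run and converting it to FASTQ."""
    if not RUN_ACCESSION.fullmatch(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return [
        ["prefetch", accession],                          # download the run
        ["fasterq-dump", accession, "--outdir", outdir],  # convert to FASTQ
    ]


def run_commands(commands: list) -> None:
    """Execute the commands in order; requires the SRA Toolkit on PATH."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example accession used purely for illustration.
    for cmd in toolkit_commands("SRR000001"):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes the accession check easy to test without the toolkit installed.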

Authentication

  • Prospective users must have an NIH eRA, NIH Login, or GSA login.gov account.
  • RAS IDs are used as an authentication mechanism for controlled-access data and to gain access to dbGaP authorizations.

Authorization

  • Authenticated users submit Data Access Requests through the dbGaP Controlled Access system.
  • Approvals are delivered to RAS as pre-authorizations to access data. These are encapsulated within RAS passport tokens.
  • DRS approves requests for data access after consulting the RAS Clearinghouse and checking the permissions within the RAS passport token.
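
The token handling described in the list above can be sketched as follows, assuming the passport follows the GA4GH Passport layout: a JWT whose `ga4gh_passport_v1` claim holds visa JWTs, each carrying a `ga4gh_visa_v1` claim. The demo token, its claim values, and the helper names are hypothetical, and the claims RAS actually issues may differ. The code deliberately skips signature verification, so it is suitable only for inspecting a token, never for making an access decision.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def decode_jwt_payload(token: str) -> dict:
    """Return a JWT's claims WITHOUT verifying its signature (inspection only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    return json.loads(_b64url_decode(parts[1]))


def visa_claims(passport_token: str) -> list:
    """List the ga4gh_visa_v1 claim of every visa embedded in a passport."""
    passport = decode_jwt_payload(passport_token)
    return [
        decode_jwt_payload(visa).get("ga4gh_visa_v1", {})
        for visa in passport.get("ga4gh_passport_v1", [])
    ]


def make_demo_token(claims: dict) -> str:
    """Build an unsigned, made-up token so the example runs offline."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{enc({'alg': 'none'})}.{enc(claims)}.demo-signature"


if __name__ == "__main__":
    # Hypothetical visa content, for illustration only.
    visa = make_demo_token({"ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "value": "https://example.org/datasets/demo",
    }})
    passport = make_demo_token({"ga4gh_passport_v1": [visa]})
    for claim in visa_claims(passport):
        print(claim["type"], claim["value"])
```

A real service would verify each signature against the issuer's published keys and check expiry before trusting any claim.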

Indexing

  • Data objects are assigned permanent, globally unique NCBI accessions that allow access or download across tools.
  • Data cross-links are maintained across dbGaP, BioProject, BioSample and SRA.
  • Datasets are identified through faceted search for public object-level metadata.
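
Because accessions are what link objects across the four databases in the list above, a tool often needs to tell which database an accession belongs to. The sketch below classifies an accession by its prefix. The patterns are simplified from commonly documented accession formats and are not an official NCBI grammar; the function name and the example accessions are illustrative.

```python
import re
from typing import Optional

# Simplified, illustrative patterns; real accession rules have more cases.
ACCESSION_PATTERNS = [
    ("dbGaP study", re.compile(r"phs\d{6}(\.v\d+(\.p\d+)?)?")),
    ("BioProject", re.compile(r"PRJ(NA|EB|DB)\d+")),
    ("BioSample", re.compile(r"SAM(N|EA|D)\d+")),
    ("SRA run", re.compile(r"[SED]RR\d+")),
    ("SRA experiment", re.compile(r"[SED]RX\d+")),
    ("SRA sample", re.compile(r"[SED]RS\d+")),
    ("SRA study", re.compile(r"[SED]RP\d+")),
]


def classify_accession(accession: str) -> Optional[str]:
    """Return the kind of record an accession names, or None if unrecognized."""
    candidate = accession.strip()
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return None


if __name__ == "__main__":
    # Made-up accessions used only to exercise each pattern.
    for acc in ["phs000123.v1.p1", "PRJNA123456", "SAMN00000001", "SRR000001"]:
        print(acc, "->", classify_accession(acc))
```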

Data Models

The data in dbGaP are organized as a hierarchical structure of studies. Accessioned objects within dbGaP include studies, phenotypes (as variables and datasets), molecular assay data (SNP and expression array, sequence, and epigenomic marks), analyses, and documents.

See: dbGaP Data Model

Architecture

NCBI Architecture
