NIH Cloud Platform Interoperability Effort

NHLBI BioData Catalyst (BDC)

https://biodatacatalyst.nhlbi.nih.gov

NHLBI BioData Catalyst is a cloud-based ecosystem providing tools, applications, and workflows in secure workspaces. The ecosystem is a dynamic resource that allows researchers to find, access, share, store, and compute on large scale datasets.

By increasing access to NHLBI datasets and innovative data analysis capabilities, BioData Catalyst accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics and prevention strategies for heart, lung, blood, and sleep disorders.

NHLBI BioData Catalyst is open to all researchers. Join the community at https://biodatacatalyst.nhlbi.nih.gov/contact/ecosystem

Funder

National Heart, Lung, and Blood Institute, Gary Gibbons (Director), Alastair Thomson (CIO), Jon Kaltman (Program Officer).

PIs

PIs: Ahalt, Avillach, Boyles, Bradford, Cox, Davis-Dusenbery, Krishnamurthy, Grossman, Manning, Paten, Philippakis.

Institutions

Institutions: The Broad Institute, Harvard Medical School, RTI International, Seven Bridges Genomics, University of California, Santa Cruz, University of Chicago, UNC-CH/RENCI, Vanderbilt University Medical Center.

Data

  • ~3.5 petabytes of data representing ~400K study participants
  • 75 TOPMed studies: includes multi-sample VCFs, CRAMs, and phenotype files. Freeze 8 data are available for 37 of those studies; new studies and additional Freeze 8 data are being added.
  • 23 parent studies
  • TOPMed Combined Exchange Area for Freeze 8 studies (~9.5 TB of data)
  • 1000 Genomes Project
  • BioLINCC Training Datasets
  • ORCHID

Coming soon

  • Additional TOPMed Freeze 8 studies
  • Pediatric Cardiac Genomics Consortium data
  • Additional COVID-19 data

For more detailed information, see About BioData Catalyst Datasets.

Tools

BioData Catalyst is being designed to support workflows for batch data analysis, notebooks for interactive analysis, and apps/services for web apps. Users can bring their own workflows and notebooks.

Workflows

BioData Catalyst supports workflows written in CWL and WDL. Highlights include workflows from the DCC and TOPMed alignment and variant calling.

Notebooks

RStudio and Jupyter Notebooks are supported with examples leveraging BioData Catalyst for image visualization, machine learning, and GPU acceleration.

Apps/Services

Access and search clinical and WGS annotated data via the PIC-SURE user interface and API.

Authentication/Authorization

  • eRA Commons IDs are used to access controlled-access data via Data Commons Framework Services (DCFS).

  • Integration with NIH RAS for authentication and authorization is expected through DCFS.

  • DCFS’ dbGaP integration is used to streamline access for those with completed dbGaP applications.
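As a rough illustration of the access model described above, the following Python sketch checks whether a user with a set of approved studies may read a dataset. It is a minimal sketch under stated assumptions, not how DCFS is implemented: the `User`, `Dataset`, and `can_access` names are hypothetical, and the identifiers are placeholders rather than real eRA Commons IDs or dbGaP accessions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    era_commons_id: str          # hypothetical login identifier
    approved_studies: frozenset  # studies approved through a completed application


@dataclass(frozen=True)
class Dataset:
    name: str
    required_study: Optional[str]  # None means open access


def can_access(user: User, dataset: Dataset) -> bool:
    """Open data is always readable; controlled data needs a matching approval."""
    if dataset.required_study is None:
        return True
    return dataset.required_study in user.approved_studies


# Placeholder values only; "phs-example-1" is not a real dbGaP accession.
alice = User("alice_example", frozenset({"phs-example-1"}))
open_data = Dataset("open-example", None)
controlled = Dataset("controlled-example", "phs-example-1")
other = Dataset("other-controlled-example", "phs-example-2")
```

In this toy model, `can_access(alice, controlled)` is true because the approval matches, while `can_access(alice, other)` is false.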

Indexing

  • Data objects are assigned permanent globally unique IDs (GUIDs) to allow for access across tools, without requiring copies be created and transferred.

  • Datasets can be located through text-based and faceted/tagged search. Semantic search is under development.
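The indexing idea above can be sketched in a few lines of Python: each data object gets one permanent ID that resolves to its storage location, so tools exchange the ID instead of copying files, and tags support faceted lookup. This is only a toy in-memory model. The `Index` class, its methods, the tag names, and the bucket URL are all hypothetical and do not reflect the actual BioData Catalyst indexing service.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Hypothetical index record: one GUID, its storage URLs, and search tags."""
    guid: str
    urls: list[str]
    tags: dict[str, str] = field(default_factory=dict)


class Index:
    """Toy in-memory index keyed by GUID."""

    def __init__(self) -> None:
        self._records: dict[str, DataObject] = {}

    def register(self, urls: list[str], tags: dict[str, str] | None = None) -> str:
        """Assign a new GUID to a data object and store its locations and tags."""
        guid = str(uuid.uuid4())
        self._records[guid] = DataObject(guid, list(urls), dict(tags or {}))
        return guid

    def resolve(self, guid: str) -> list[str]:
        """Return storage locations for a GUID; the data itself is never copied."""
        return self._records[guid].urls

    def search(self, **facets: str) -> list[str]:
        """Return GUIDs whose tags match every requested facet."""
        return [
            record.guid
            for record in self._records.values()
            if all(record.tags.get(key) == value for key, value in facets.items())
        ]


# Placeholder URL and tags, for illustration only.
index = Index()
guid = index.register(["s3://example-bucket/sample1.cram"], {"format": "CRAM"})
locations = index.resolve(guid)
matches = index.search(format="CRAM")
```

Any tool that holds `guid` can call `resolve` to find the object in place, which is the point of assigning permanent IDs rather than moving data between tools.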

Data Modeling

Work is under way toward data interoperability using standards from GA4GH and CD2H, with FHIR and BioLink as meta models.

Cloud Credits

NHLBI currently provides $500 in cloud credits to new users of BioData Catalyst on BioData Catalyst Powered by Seven Bridges or BioData Catalyst Powered by Terra. Users can also use their own AWS or GCP accounts or apply for additional credits via the NHLBI BioData Catalyst Cloud Credit Program.

Architecture

BDC Architecture
