NIH Cloud Platform Interoperability Effort

Overview of Participating Platforms

NHGRI AnVIL

https://anvilproject.org

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space, or AnVIL, is NHGRI's genomic data resource that leverages a cloud-based infrastructure for democratizing genomic data access, sharing, and computing across large genomic and genomic-related data sets.

In addition to downloading copies of data to local computers and servers, users will have the option to work with data in a secure cloud environment, where they can also use common bioinformatics tools and packages and develop and share their own software tools. Learn more about AnVIL.

NHLBI BioData Catalyst

https://biodatacatalyst.nhlbi.nih.gov

NHLBI BioData Catalyst is a cloud-based platform providing tools, applications, and workflows in secure workspaces. By increasing access to NHLBI datasets and innovative data analysis capabilities, BioData Catalyst accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.

Though the primary goal of the BioData Catalyst project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BioData Catalyst is also building a community of practice working collaboratively to solve technical and scientific challenges. Learn more about BioData Catalyst.

NCI Cancer Research Data Commons (CRDC)

https://datacommons.cancer.gov

The goal of the National Cancer Institute’s Cancer Research Data Commons (CRDC) is to empower researchers to accelerate data-driven scientific discovery by connecting diverse datasets with analytical tools in the cloud. The CRDC is built upon an expandable data science infrastructure that provides secure access to many different data types across scientific domains via the Data Commons Framework.

The CRDC enables users to search and aggregate data across repositories via the Cancer Data Aggregator using a common data model developed by the Center for Cancer Data Harmonization.

Users can access CRDC data using NCI Cloud Resources (Broad FireCloud, Seven Bridges Cancer Genomics Cloud, and Institute for Systems Biology Cancer Genomics Cloud) that bring data and computational power together to enable cancer research and discovery.

NCI Cloud Resources eliminate the need for researchers to download and store extremely large data sets by allowing them to bring analysis tools to the data in the cloud. The platforms also provide access to on-demand computational capacity to analyze these data.

The ability to combine diverse data types and perform cross-domain analysis of large cancer datasets can lead to new discoveries in cancer prevention, treatment, and diagnosis, further supporting the goals of precision medicine and the Cancer Moonshot℠.

The CRDC will encompass and connect multiple cloud-based data repositories and serve as a central location to support public data sharing for NCI-funded programs. Learn more about CRDC.

NIH Common Fund - Kids First Data Resource Center

https://kidsfirstdrc.org

The vision of the NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program (“Kids First”) is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.”

The program continues to generate and share whole-genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects.

In 2018, Kids First launched the Gabriella Miller Kids First Data Resource Center (Kids First DRC), charged with building a large-scale data platform supporting clinical and genetic data from these patients and their families in order to accelerate discovery and ultimately clinical impact.

The Kids First DRC enables scientists to rapidly explore shared genetic pathways and associated clinical datasets underlying diverse pediatric conditions occurring throughout development, empowering cross-disease discovery with the aim of improving preventative measures, diagnostics, and therapeutic interventions on behalf of affected children and their families.

Researchers can search, access, aggregate, and analyze these data through the Kids First Data Resource Portal. Additionally, by using cloud-based individual workspaces in CAVATICA, a data analysis and sharing computation platform, researchers can cross-analyze Kids First data with data from other efforts, such as NCI’s TARGET program and consortia-based datasets like the Children’s Brain Tumor Network (CBTN).

CAVATICA is a cloud-based infrastructure originally developed to support pediatric disease research, but it can support the analysis of all forms of controlled-access data in a cloud environment.

CAVATICA is powered by the Seven Bridges Platform, which meets or exceeds all NIH requirements for dbGaP or similarly controlled-access data on both Amazon Web Services (AWS) and Google Cloud Platform (GCP). Please see the Seven Bridges Compliance White Paper for a full description of CAVATICA's security and compliance features.

For NIH Kids First data, both the Kids First Data Resource Portal and CAVATICA support user authentication and authorization to controlled-access datasets via integration with the Gen3-powered Bionimbus Trusted Partnership for access and distribution (KFDRC Framework Services). Learn more about Kids First.

National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM)

https://www.ncbi.nlm.nih.gov/

NCBI hosts and manages the Database of Genotypes and Phenotypes (dbGaP) and NIH’s Sequence Read Archive (SRA). dbGaP provides and manages access to protected data related to human studies that have investigated the interaction of genotype and phenotype. SRA is the largest archive for public and controlled-access next-generation sequencing data.

In partnership with the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NCBI has made the entire corpus of SRA and computational tools accessible on the cloud (commercial and open access) in addition to NCBI’s local servers.

The central goal is to create an equitable and interoperable ecosystem where NIH-funded data is FAIR (findable, accessible, interoperable, and reusable). NCBI is also an engaged partner in the development of community-driven solutions to provide secure access to protected data in the federated data access landscape.

NCBI is actively engaged in efforts to define technical solutions that modernize and streamline secure access to controlled-access data via the GA4GH Data Repository Service (DRS) and the NIH Researcher Auth Service (RAS) initiative. Learn more about NCBI.
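As a rough illustration of what DRS-based access looks like in practice, the sketch below converts a hostname-based DRS URI into the HTTPS endpoint that the GA4GH DRS v1 specification defines for fetching object metadata. The function name and the hostname `drs.example.org` are hypothetical and chosen only for illustration. Compact-identifier URIs and the RAS token exchange needed for controlled-access data are not handled here.

```python
from urllib.parse import urlparse, quote


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI (drs://<host>/<id>) to its HTTPS object URL.

    Hypothetical helper: it follows the GA4GH DRS v1 path layout
    (/ga4gh/drs/v1/objects/<id>) and rejects anything it cannot parse.
    """
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")

    # The object ID is the path component without its leading slash.
    object_id = parsed.path.lstrip("/")
    if not object_id:
        # Compact-identifier URIs (drs://prefix:accession) end up here;
        # they need an external resolver and are out of scope for this sketch.
        raise ValueError(f"no object ID found in: {drs_uri!r}")

    # Percent-encode the ID so reserved characters survive as one path segment.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


if __name__ == "__main__":
    # drs.example.org is a placeholder host, not a real DRS server.
    print(drs_uri_to_url("drs://drs.example.org/abc123"))
```

A client would then issue an HTTP GET against the returned URL, supplying whatever credentials the hosting platform requires, to retrieve the object's metadata and access methods.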
