NIH Cloud Platform Interoperability Effort

NHGRI Analysis and Visualization Lab-space (AnVIL)

https://anvilproject.org/

The NHGRI AnVIL is a cloud-based environment that hosts high-value data sets and commonly used bioinformatics tools within a secure, FedRAMP-certified boundary, and that can scale to meet the computational needs of researchers.

The AnVIL will provide automated data access through the DUOS platform. AnVIL users have access to extensive training materials, ranging from general genomics to advanced analysis methods, delivered as modular, open-access Massive Open Online Courses (MOOCs).

Funder

National Human Genome Research Institute, Eric Green (director), Valentina Di Francesco (program officer), Ken Wiley (program officer).

PIs

Schatz, Philippakis, O’Connor, Grossman, Morgan, Paten, Nekrutenko, Carroll, Goecks, Hall (Ira), Hall (Jennifer), Tan, Hansen, Overby Taylor, Carey, Afgan, Leek, Ellrott, Waldron, Wang, Banks, Lawson, O’Donnel, Luria.

Institutions

The Broad Institute, Johns Hopkins University, University of Chicago, Penn State University, University of California, Santa Cruz, Oregon Health & Science University, Harvard University, Vanderbilt University, Roswell Park Comprehensive Cancer Center, Washington University, City University of New York, American Heart Association, Carnegie Institution for Science, Yale University.

Datasets

AnVIL hosts CCDG, CMG, GTEx, and 1000 Genomes. As data sets are added to these groups, they will become available on AnVIL. eMERGE will become available on AnVIL in 2022.

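| Source | Cohorts | Samples | Participants | Size (TB) |
|---|---|---|---|---|
| CCDG | 198 | 270,135 | 256,318 | 2,582 |
| CMG | 39 | 15,732 | 14,973 | 61 |
| 1000 Genomes | 1 | 3,202 | 3,202 | 73 |
| GTEx (v8) | 1 | 17,382 | 979 | 182 |
| Convergent Neuro | 2 | 304 | 304 | 5 |
| HPRC | 1 | 57 | 47 | 160 |
| PAGE | 4 | 690 | 690 | 17 |
| WGSPD | 15 | 1,504 | 9,943 | 176 |
| T2T | 1 | 0 | 3,219 | 503 |
| eMERGE* | Pending | Pending | Pending | Pending |
| All Datasets | 252 | 309,006 | 289,675 | 3,759 |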

Tools

  • AnVIL uses the Terra platform to launch and run tools on Google Cloud Platform within a FedRAMP-certified secure boundary.
  • Users can currently run batch analysis with WDL and interactive analysis with:
    • Bioconductor,
    • Galaxy,
    • seqr,
    • Jupyter Notebooks (Python and R) and RStudio.
  • The Dockstore workflow repository is integrated with Terra, providing access to hundreds of published workflows.
  • AnVIL enables users to bring their own tools to the platform.

Authentication

  • Google email accounts and NIH RAS IDs both serve as authentication mechanisms for controlled-access data.

Authorization

  • Consortium and developer whitelists are maintained to provide access to data.
  • General users submit data access requests (DARs) through dbGaP.
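
The two authorization paths above can be pictured as a single membership check. The following is a minimal sketch only, not AnVIL's implementation: the `AccessPolicy` class, its field names, and the example user IDs are all hypothetical, and it assumes a user is authorized for a dataset if they appear on a consortium or developer whitelist or hold an approved DAR.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Hypothetical access policy for one controlled-access dataset."""

    dataset: str
    # Consortium members and developers granted access via a whitelist (assumed structure).
    whitelist: set = field(default_factory=set)
    # General users whose data access request (DAR) has been approved (assumed structure).
    approved_dars: set = field(default_factory=set)

    def can_access(self, user_id: str) -> bool:
        """Return True if the user is whitelisted or holds an approved DAR."""
        return user_id in self.whitelist or user_id in self.approved_dars


# Illustrative usage with made-up user IDs.
policy = AccessPolicy(
    dataset="example-dataset",
    whitelist={"consortium.member@example.org"},
    approved_dars={"general.user@example.org"},
)
print(policy.can_access("consortium.member@example.org"))
print(policy.can_access("unknown.user@example.org"))
```

Keeping the whitelist and the DAR approvals as separate sets mirrors the two routes described in the list, so either can be updated without touching the other.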

Indexing

  • Data objects are assigned permanent globally unique IDs (GUIDs) so that they can be accessed across tools without copies having to be created and transferred.
  • Datasets are identified through faceted search over phenotypic data.
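
As a rough illustration of faceted search, the sketch below filters a small list of dataset records by facet values and counts the values available for a facet. It is not AnVIL's search implementation: the record fields (`guid`, `consortium`, `data_type`, `phenotype`), their values, and the function names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical dataset records; every field name and value is illustrative only.
DATASETS = [
    {"guid": "guid-0001", "consortium": "A", "data_type": "WGS", "phenotype": "cardiac"},
    {"guid": "guid-0002", "consortium": "A", "data_type": "WES", "phenotype": "neuro"},
    {"guid": "guid-0003", "consortium": "B", "data_type": "WGS", "phenotype": "neuro"},
]


def facet_search(records, **selected):
    """Return the records that match every selected facet value."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(record[facet] for record in records if facet in record)


# Narrow by two facets, then list the GUIDs of the matching data objects.
matches = facet_search(DATASETS, data_type="WGS", phenotype="neuro")
print([record["guid"] for record in matches])
print(facet_counts(DATASETS, "consortium"))
```

Returning GUIDs rather than file copies fits the point made in the list: a tool can be handed an identifier and resolve the data in place.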

Data Models

  • AnVIL has adopted the Terra Interoperability data model (TIM), an expandable data model to support multiple, diverse data models.

Security

  • FedRAMP Certified
  • FedRAMP 1 ATO

Architecture

AnVIL Architecture
