
Data Transfer — Datamade Genomics

This document describes how genomic data moves between a client lab and Datamade Genomics infrastructure — from initial sample upload through pipeline processing to results delivery.

Datamade Genomics never ships hardware, installs software at client sites, or requires special credentials. Everything moves over standard internet protocols into Google Cloud Storage (GCS), which serves as the secure staging layer between client labs and our DGX compute infrastructure.

Status (2026-05-11)

| Section | Shipped today | Forthcoming |
| --- | --- | --- |
| Method A (Signed URL) | Live — wired via genomics-upload helper + per-prefix signed PUT; receipt callback at /api/upload-complete | OIDC workforce-federated tokens for PHI (compliance redline-fix) |
| Method B (gsutil / BYOB) | Live — per-prefix conditional IAM Terraform module at integrations/byob/ | None |
| Method C (Globus) | Operator runbook at integrations/globus/OPERATOR_RUNBOOK.md; endpoint provisioned on first Globus customer | Globus webhook → upload-complete callback |
| Auto-pipeline trigger on manifest.json arrival | Not yet — Pages Function callback fires today on customer's explicit completion ping | GCS Object Finalize → Pub/Sub → Cloud Function → POST /jobs |
| AI report draft generation | Not built | Per-customer roadmap item |
| Somatic / RNA pipelines | Not built — only DEEPVARIANT_GERMLINE is locked today (see PIPELINE_LOCK.md) | Each is a separate validation matrix |

Architecture Summary

CLIENT LAB                    DATAMADE GENOMICS
──────────                    ─────────────────

FASTQ files                   GCS Intake Bucket
(raw sequencing output)  →    gs://datamade-genomics-delivery/
                                customers/<sha(email)[:16]>/uploads/
                                      ↓
                              DGX Spark / GB300
                              (Parabricks + RAPIDS + LLM)
                                      ↓
                              GCS Results Bucket (same)
                                customers/<sha(email)[:16]>/jobs/<job_id>/outputs/
                                      ↓
Annotated VCF            ←    Pre-signed Download URL
+ CRAM file                   (delivered via portal email)
+ AI Report Draft

All data is encrypted in transit (TLS 1.3) and at rest (AES-256). Google's HIPAA BAA covers all GCS operations. A signed BAA between Datamade Genomics and the client lab is required before any patient data is transferred.

What the Lab Provides

For each sequencing job, the lab provides:

1. FASTQ files — raw sequencing output from the instrument
   - Paired-end: sample_R1.fq.gz and sample_R2.fq.gz
   - Typically 100–400 GB per whole-genome sample at 100x coverage

2. Sample manifest — a small JSON metadata file
   - Filled in by the lab, uploaded alongside the FASTQ files
   - Tells the pipeline what to run and where to send results
   - Template provided by Datamade Genomics

{
  "job_id": "DMGX-UKGC-20260317-001",
  "client_id": "ukgc_001",
  "sample_id": "SAMPLE_XYZ",
  "pipeline": "germline",
  "coverage": "30x",
  "reference": "GRCh38",
  "files": {
    "r1": "sample_R1.fq.gz",
    "r2": "sample_R2.fq.gz"
  },
  "notify_email": "genomics@uky.edu",
  "report_type": "vcf_and_report"
}
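A pre-flight check can catch manifest problems before a multi-hundred-GB upload starts. The sketch below validates the fields shown in the example above; the key list mirrors that example and is not an authoritative schema.

```python
import json

# Required top-level keys, taken from the example manifest above (illustrative,
# not an authoritative schema).
REQUIRED = {"job_id", "client_id", "sample_id", "pipeline",
            "coverage", "reference", "files", "notify_email", "report_type"}

def validate_manifest(text: str) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks well-formed."""
    try:
        m = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - m.keys())]
    files = m.get("files", {})
    for end in ("r1", "r2"):
        if end not in files:
            problems.append(f"files.{end} missing")
    return problems
```

Running this against the template before upload turns a failed job into a ten-second local fix.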

What Datamade Genomics Provides

For each job, Datamade Genomics provides:

1. Signed upload URLs — time-limited, single-use URLs generated per job
   - One URL per file (R1, R2, manifest)
   - Valid for up to 12 hours (impersonation cap; HMAC keys unlock 7 days)
   - No credentials required to use
   - Delivered via email to the lab contact

2. Job confirmation — automated email on manifest receipt
   - Confirms files received with a SHA-256 equality assertion
   - Provides estimated completion time
   - Includes a countersigned ingestion_id for chain-of-custody records

3. Results notification — automated email on pipeline completion
   - Contains a signed download URL to the portal page (valid 12 hours)
   - Lists the output files included
   - Notes any QC flags
   - Available via email or Globus push (academic labs)

Three Upload Methods

The lab can upload using whichever method is most convenient. All three land in the same GCS intake bucket. We route customers to the fastest path for their stack at intake — see the current_data_location question on the contact form.

Method A — GCS Signed URL (default for on-prem / NAS / small labs)

Best for: Labs without GCP exposure. No software setup required beyond curl or our genomics-upload helper.

Datamade Genomics emails a manifest with signed URLs. The lab runs the genomics-upload helper, which picks the most robust transfer tool installed (gcloud storage cp if present — resumable, parallel, CRC32C; falls back to curl --retry).

Lab runs:

# 1. download the helper (single bash file, no install)
curl -fsSL https://genomics.datamade.ai/genomics-upload.sh -o genomics-upload && chmod +x genomics-upload

# 2. save the manifest from your welcome email as manifest.json next to your FASTQs

# 3. run
./genomics-upload manifest.json

The helper computes SHA-256 locally, uploads, then pings our /api/upload-complete Pages Function with an HMAC-verified callback so the customer receives a countersigned ingestion receipt within seconds.
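The HMAC handshake on that callback can be sketched with Python's stdlib hmac module. The payload shape and secret below are illustrative assumptions, not the helper's actual wire format:

```python
import hashlib
import hmac
import json

# Hypothetical shared secret -- the real value lives in the genomics-upload
# helper and the Pages Function config, never in client code.
SHARED_SECRET = b"example-secret"

def sign_callback(payload: dict) -> tuple[bytes, str]:
    """Client side: serialize the callback body deterministically and sign it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return body, sig

def verify_callback(body: bytes, sig: str) -> bool:
    """Server side: recompute and compare in constant time before countersigning."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

The constant-time comparison matters: a plain `==` on signatures leaks timing information to a forger.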

PHI note: today's signed-URL pattern is appropriate for GIAB / non-PHI pilot data. For paid PHI workflows, the compliance roadmap is OIDC workforce-federated upload tokens (customer's IdP → GCP Workload Identity Federation → 1h scoped credential bound to the authenticated user) — this is in flight and will replace bearer URLs for PHI.
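For orientation, the WIF flow bottoms out in a Google STS token-exchange call. The sketch below only builds the request and sends nothing; the project number, pool, and provider names are placeholders provisioned per customer:

```python
from urllib.parse import urlencode

# Illustrative workload-identity-pool audience -- real values are per customer.
AUDIENCE = ("//iam.googleapis.com/projects/000000000000/locations/global/"
            "workloadIdentityPools/customer-pool/providers/customer-idp")

def sts_exchange_request(idp_token: str) -> tuple[str, str]:
    """Build (but do not send) the STS exchange that trades a customer IdP
    token for a short-lived federated GCP credential."""
    body = urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": AUDIENCE,
        "subject_token": idp_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:id_token",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "scope": "https://www.googleapis.com/auth/cloud-platform",
    })
    return "https://sts.googleapis.com/v1/token", body
```

The returned credential is bound to the authenticated user and expires in about an hour, which is what lets it replace long-lived bearer URLs for PHI.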

Method B — gsutil Direct Upload (BYOB)

Best for: Labs already using GCP or familiar with Google Cloud tools.

Datamade Genomics provides a per-prefix conditional IAM template the lab applies in their own Terraform tree. We grant our worker SA scoped read on a specific prefix in their bucket — they keep the data, we just read it. Zero data movement, zero egress when their bucket is in the same GCS region as our compute.

Lab runs (Terraform — module at integrations/byob/datamade_byob.tf):

module "datamade_byob" {
  source            = "./datamade_byob"
  customer_bucket   = "acme-genomics-clinical"
  ingress_prefix    = "datamade-ingress/"
  access_expires_at = "2026-11-11T00:00:00Z"
}
terraform init && terraform plan && terraform apply

Forward the terraform output (worker SA, ingress URI, expiry, binding etag) to the Datamade onboarding thread.

What the lab needs:
- Terraform 1.5+ (or the gsutil equivalent in integrations/byob/README.md)
- A GCS bucket holding the FASTQs

The IAM binding is per-prefix and time-bounded with a resource.name.startsWith(...) condition and a request.time < timestamp(...) expiration — passes CAP least-privilege review. We never grant write; only GET on the explicit object paths the lab sends us.
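Assuming the module emits a standard GCS IAM condition, the CEL expression would look like the sketch below; the bucket, prefix, and expiry values are illustrative:

```python
# Sketch of the CEL condition the BYOB module is described as emitting.
# GCS object resources are named "projects/_/buckets/<bucket>/objects/<object>".
def byob_condition(bucket: str, prefix: str, expires_at: str) -> str:
    """Scope read access to one prefix and time-bound it."""
    return (
        f'resource.name.startsWith("projects/_/buckets/{bucket}/objects/{prefix}") '
        f'&& request.time < timestamp("{expires_at}")'
    )
```

Once `expires_at` passes, the binding is dead even if nobody remembers to revoke it, which is the point of the time bound.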

Method C — Globus Transfer

Best for: Academic labs and research institutions with an existing Globus endpoint (most universities).

The lab uses their existing Globus endpoint to push files directly into the Datamade Genomics GCS bucket via the Globus Google Cloud Storage connector. The lab initiates the transfer from their familiar Globus web interface — no new tools required.

What the lab needs:
- An existing Globus endpoint at their institution (most universities already have one)
- The Datamade Genomics Globus collection name (provided on setup)

What Datamade Genomics needs:
- One-time Globus GCS connector configuration (operator runbook at integrations/globus/OPERATOR_RUNBOOK.md; ~30–45 min to provision when the first Globus customer arrives)
- Note: transfer of patient-identifiable (PHI) data requires the lab's institution to have a paid Globus Commercial/Plus subscription with BAA-equivalent controls (most major UK universities and NHS Genomic Laboratory Hubs already have this in place as a campus-wide agreement).

Lab workflow:
1. Log into app.globus.org with institutional credentials
2. Select their endpoint as the source
3. Enter the Datamade Genomics collection as the destination
4. Select files and click Start
5. Globus handles the rest — resumable, fault-tolerant, audited

Automated Pipeline Trigger

The target flow: once the manifest.json file arrives in the intake bucket, the pipeline starts automatically with no human intervention. (Per the status table above, this automatic trigger is forthcoming; today the callback fires on the customer's explicit completion ping.)

manifest.json arrives in GCS intake bucket
              ↓
GCS event notification fires
              ↓
Cloud Run job triggered
              ↓
Files pulled from GCS to DGX via VPC
              ↓
Parabricks pipeline runs
  └── BWA-MEM alignment
  └── Duplicate marking
  └── BQSR
  └── HaplotypeCaller / DeepVariant
              ↓
RAPIDS annotation
  └── ClinVar join
  └── OMIM join
  └── gnomAD frequency lookup
  └── ACMG classification
              ↓
LLM report draft generation
              ↓
Results written to GCS results bucket
              ↓
Signed download URL generated (12h expiry)
              ↓
Completion email sent to notify_email in manifest
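The forthcoming trigger step can be sketched as an Object Finalize handler that ignores FASTQ arrivals and reacts only to the manifest. The /jobs host shown is an assumption, and nothing is actually POSTed here:

```python
# Hedged sketch of the forthcoming GCS Object Finalize handler.
# GCS finalize events carry at least the bucket and object name.
JOBS_ENDPOINT = "https://genomics.datamade.ai/jobs"  # illustrative host

def handle_finalize(event: dict):
    """Return the POST /jobs request to make, or None if the event is not a
    manifest arrival (FASTQ objects finalize too and must be ignored)."""
    name = event.get("name", "")
    if not name.endswith("/manifest.json"):
        return None
    return {
        "url": JOBS_ENDPOINT,
        "json": {"bucket": event["bucket"], "manifest": name},
    }
```

Keying the trigger on the manifest rather than the FASTQs means the lab controls job start simply by uploading the manifest last.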

Wall-clock times below are for compute only on dedicated DGX/GB300 hardware and exclude data transfer time. Total end-to-end time (upload → results email) is typically 2–6 hours depending on file size and network.

Estimated turnaround times:

| Pipeline | Coverage | Wall Clock |
| --- | --- | --- |
| Germline WGS | 30x | ~10–15 min |
| Germline WGS | 100x | ~25–30 min |
| Somatic tumor/normal | 50x/30x | ~35–45 min |
| RNA-seq | Standard | ~15–20 min |
| + AI report draft | Any | +3–5 min |

Results Delivery

When the pipeline completes, the lab receives an email with a signed download URL valid for 12 hours. Academic labs that prefer Globus can also request results pushed back via their existing Globus endpoint — same checksums, resumable, no signed-URL handling.

Output package contents:

| File | Description | Size |
| --- | --- | --- |
| [job_id]_annotated.vcf.gz | Fully annotated variant calls | ~4 GB |
| [job_id].cram | Compressed aligned reads | ~100 GB |
| [job_id]_qc_report.html | Coverage, duplication, alignment metrics | ~1 MB |
| [job_id]_report_draft.pdf | AI-assisted clinical report draft* | ~500 KB |
| [job_id]_manifest_echo.json | Confirmed job parameters | ~1 KB |

*AI report draft requires geneticist review and sign-off before clinical use. Datamade Genomics generates the draft only.

Download command:

# Download full results package
gsutil -m cp -r gs://datamade-genomics-delivery/customers/<sha(email)[:16]>/jobs/[job_id]/outputs/ ./results/

# Or use the signed URL from the email directly in a browser
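Labs that want to check the download against the SHA-256 values in the countersigned ingestion receipt can stream-hash locally; a minimal sketch:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a ~100 GB CRAM never sits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Comparing this digest against the receipt closes the chain of custody end to end: the bytes the lab downloaded are the bytes the pipeline produced.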

Data Retention Policy

| Data Type | Retention | Action After |
| --- | --- | --- |
| Raw FASTQ (intake) | 30 days | Cryptographic deletion |
| CRAM (aligned) | 90 days | Cryptographic deletion |
| VCF + annotations | 90 days | Cryptographic deletion |
| QC reports | 90 days | Cryptographic deletion |
| Audit logs | 365 days | Archived to cold storage |
| Job metadata | 7 years | Retained per HIPAA |
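One way those windows could map onto bucket policy is a GCS object lifecycle config. The prefixes below are illustrative, and cryptographic deletion itself (key destruction) happens outside lifecycle rules:

```python
import json

# Retention windows from the table above; prefixes are assumptions, not the
# actual bucket layout guarantee.
RETENTION_DAYS = {
    "uploads/": 30,   # raw FASTQ intake
    "jobs/": 90,      # CRAM, VCF, QC outputs
}

def lifecycle_config() -> str:
    """Render a GCS lifecycle config that deletes objects on the schedule above."""
    rules = [
        {"action": {"type": "Delete"},
         "condition": {"age": days, "matchesPrefix": [prefix]}}
        for prefix, days in RETENTION_DAYS.items()
    ]
    return json.dumps({"rule": rules}, indent=2)
```

A config like this makes the retention table enforceable by the bucket itself rather than by a cron job someone has to remember.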

Labs are responsible for downloading and storing their own results before the retention window expires. Datamade Genomics does not archive patient data beyond the retention period.

Security and Compliance

UK / NHS-specific note: For UK-based labs operating under the NHS Genomic Medicine Service or Genomics England frameworks, all transfers also comply with UK GDPR, NHS Digital Data Security and Protection Toolkit (DSPT), and ISO 27001 standards. The Google Cloud BAA covers the US HIPAA elements, while our client BAA and technical controls address UK requirements.

| Control | Implementation |
| --- | --- |
| Encryption in transit | TLS 1.3 — enforced on all GCS endpoints |
| Encryption at rest | AES-256 — GCS default |
| HIPAA BAA | Google Cloud BAA covers all GCS operations |
| Client BAA | Signed per client before any data transfer |
| Signed URL expiry | 12hr upload / 12hr download (HMAC keys unlock 7-day TTL) |
| Access scope | Per-prefix conditional IAM — no standing access |
| Audit logging | All GCS access logged to BigQuery |
| Network isolation | DGX compute on private VPC — no public internet exposure |
| PHI handling | Minimum necessary — job ID used as identifier, not patient name |
| Chain of custody | Countersigned ingestion_id receipt within seconds of upload-complete |

Onboarding Checklist

Before the first sample transfer, the following must be confirmed:

Support

For transfer issues, job status, or questions:

Datamade Genomics
Email: hello@datamade.ai
Emergency: Available per BAA agreement

For upload failures, include:
- Job ID from the manifest
- Upload method used (A, B, or C)
- Error message received
- File sizes and names attempted

Cross-references