
Data Transfer — Datamade Genomics

This document describes how genomic data moves between a client lab and Datamade Genomics infrastructure — from initial sample upload through pipeline processing to results delivery.

Datamade Genomics never ships hardware, installs software at client sites, or requires special credentials. Everything moves over standard internet protocols into Google Cloud Storage (GCS), which serves as the secure staging layer between client labs and our DGX compute infrastructure.

Status (2026-05-11)

| Section | Shipped today | Forthcoming |
| --- | --- | --- |
| Method A (Signed URL) | Live — wired via genomics-upload helper + per-prefix signed PUT; receipt callback at /api/upload-complete | OIDC workforce-federated tokens for PHI (compliance redline-fix) |
| Method B (gsutil / BYOB) | Live — per-prefix conditional IAM Terraform module at integrations/byob/ | None |
| Method C (Globus) | Operator runbook at integrations/globus/OPERATOR_RUNBOOK.md; endpoint provisioned on first Globus customer | Globus webhook → upload-complete callback |
| Auto-pipeline trigger on manifest.json arrival | Not yet — Pages Function callback fires today on customer's explicit completion ping | GCS Object Finalize → Pub/Sub → Cloud Function → POST /jobs |
| AI report draft generation | Not built | Per-customer roadmap item |
| Somatic / RNA pipelines | Not built — only DEEPVARIANT_GERMLINE is locked today (see PIPELINE_LOCK.md) | Each is a separate validation matrix |

Architecture Summary

CLIENT LAB                    DATAMADE GENOMICS
──────────                    ─────────────────

FASTQ files                   GCS Intake Bucket
(raw sequencing output)  →    gs://datamade-genomics-delivery/
                                customers/<sha(email)[:16]>/uploads/
                                      ↓
                              DGX Spark / GB300
                              (Parabricks + RAPIDS + LLM)
                                      ↓
                              GCS Results Bucket (same)
                                customers/<sha(email)[:16]>/jobs/<job_id>/outputs/
                                      ↓
Annotated VCF            ←    Pre-signed Download URL
+ CRAM file                   (delivered via portal email)
+ AI Report Draft

All data is encrypted in transit (TLS 1.3) and at rest (AES-256). Google's HIPAA BAA covers all GCS operations. A signed BAA between Datamade Genomics and the client lab is required before any patient data is transferred.

What the Lab Provides

For each sequencing job, the lab provides:

1. FASTQ files — raw sequencing output from the instrument
   - Paired-end: sample_R1.fq.gz and sample_R2.fq.gz
   - Typically 100–400 GB per whole-genome sample at 100x coverage

2. Sample manifest — a small JSON metadata file
   - Filled in by the lab, uploaded alongside the FASTQ files
   - Tells the pipeline what to run and where to send results
   - Template provided by Datamade Genomics

{
  "job_id": "DMGX-UKGC-20260317-001",
  "client_id": "ukgc_001",
  "sample_id": "SAMPLE_XYZ",
  "pipeline": "germline",
  "coverage": "30x",
  "reference": "GRCh38",
  "files": {
    "r1": "sample_R1.fq.gz",
    "r2": "sample_R2.fq.gz"
  },
  "notify_email": "genomics@uky.edu",
  "report_type": "vcf_and_report"
}
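A pre-flight check can catch manifest problems before a multi-hundred-GB upload starts. The sketch below validates the fields shown in the example above; the key list mirrors that example and is not an authoritative schema.

```python
import json

# Required top-level keys, taken from the example manifest above (illustrative,
# not an authoritative schema).
REQUIRED = {"job_id", "client_id", "sample_id", "pipeline",
            "coverage", "reference", "files", "notify_email", "report_type"}

def validate_manifest(text: str) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks well-formed."""
    try:
        m = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - m.keys())]
    files = m.get("files", {})
    for end in ("r1", "r2"):
        if end not in files:
            problems.append(f"files.{end} missing")
    return problems
```

Running this against the template before upload turns a failed job into a ten-second local fix.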

What Datamade Genomics Provides

For each job, Datamade Genomics provides:

1. Signed upload URLs — time-limited, single-use URLs generated per job
   - One URL per file (R1, R2, manifest)
   - Valid for up to 12 hours (impersonation cap; HMAC keys unlock 7 days)
   - No credentials required to use
   - Delivered via email to the lab contact

2. Job confirmation — automated email on manifest receipt
   - Confirms files received with a SHA-256 equality assertion
   - Provides estimated completion time
   - Includes a countersigned ingestion_id for chain-of-custody records

3. Results notification — automated email on pipeline completion
   - Contains a signed download URL to the portal page (valid 12 hours)
   - Lists the output files included
   - Notes any QC flags
   - Available via email or Globus push (academic labs)

Three Upload Methods

The lab can upload using whichever method is most convenient. All three land in the same GCS intake bucket. We route customers to the fastest path for their stack at intake — see the current_data_location question on the contact form.

Method A — GCS Signed URL (default for on-prem / NAS / small labs)

Best for: Labs without GCP exposure. No software setup required beyond curl or our genomics-upload helper.

Datamade Genomics emails a manifest with signed URLs. The lab runs the genomics-upload helper, which picks the most robust transfer tool installed (gcloud storage cp if present — resumable, parallel, CRC32C; falls back to curl --retry).

Lab runs:

# 1. download the helper (single bash file, no install)
curl -fsSL https://genomics.datamade.ai/genomics-upload.sh -o genomics-upload && chmod +x genomics-upload

# 2. save the manifest from your welcome email as manifest.json next to your FASTQs

# 3. run
./genomics-upload manifest.json

The helper computes SHA-256 locally, uploads, then pings our /api/upload-complete Pages Function with an HMAC-verified callback so the customer receives a countersigned ingestion receipt within seconds.
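The HMAC handshake on that callback can be sketched with Python's stdlib hmac module. The payload shape and secret below are illustrative assumptions, not the helper's actual wire format:

```python
import hashlib
import hmac
import json

# Hypothetical shared secret -- the real value lives in the genomics-upload
# helper and the Pages Function config, never in client code.
SHARED_SECRET = b"example-secret"

def sign_callback(payload: dict) -> tuple[bytes, str]:
    """Client side: serialize the callback body deterministically and sign it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return body, sig

def verify_callback(body: bytes, sig: str) -> bool:
    """Server side: recompute and compare in constant time before countersigning."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

The constant-time comparison matters: a plain `==` on signatures leaks timing information to a forger.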

PHI note: today's signed-URL pattern is appropriate for GIAB / non-PHI pilot data. For paid PHI workflows, the compliance roadmap is OIDC workforce-federated upload tokens (customer's IdP → GCP Workload Identity Federation → 1h scoped credential bound to the authenticated user) — this is in flight and will replace bearer URLs for PHI.
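For orientation, the WIF flow bottoms out in a Google STS token-exchange call. The sketch below only builds the request and sends nothing; the project number, pool, and provider names are placeholders provisioned per customer:

```python
from urllib.parse import urlencode

# Illustrative workload-identity-pool audience -- real values are per customer.
AUDIENCE = ("//iam.googleapis.com/projects/000000000000/locations/global/"
            "workloadIdentityPools/customer-pool/providers/customer-idp")

def sts_exchange_request(idp_token: str) -> tuple[str, str]:
    """Build (but do not send) the STS exchange that trades a customer IdP
    token for a short-lived federated GCP credential."""
    body = urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": AUDIENCE,
        "subject_token": idp_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:id_token",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "scope": "https://www.googleapis.com/auth/cloud-platform",
    })
    return "https://sts.googleapis.com/v1/token", body
```

The returned credential is bound to the authenticated user and expires in about an hour, which is what lets it replace long-lived bearer URLs for PHI.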

Method B — gsutil Direct Upload (BYOB)

Best for: Labs already using GCP or familiar with Google Cloud tools.

Datamade Genomics provides a per-prefix conditional IAM template the lab applies in their own Terraform tree. We grant our worker SA scoped read on a specific prefix in their bucket — they keep the data, we just read it. Zero data movement, zero egress when their bucket is in the same GCS region as our compute.

Lab runs (Terraform — module at integrations/byob/datamade_byob.tf):

module "datamade_byob" {
  source            = "./datamade_byob"
  customer_bucket   = "acme-genomics-clinical"
  ingress_prefix    = "datamade-ingress/"
  access_expires_at = "2026-11-11T00:00:00Z"
}
terraform init && terraform plan && terraform apply

Forward the terraform output (worker SA, ingress URI, expiry, binding etag) to the Datamade onboarding thread.

What the lab needs:
- Terraform 1.5+ (or the gsutil equivalent in integrations/byob/README.md)
- A GCS bucket holding the FASTQs

The IAM binding is per-prefix and time-bounded with a resource.name.startsWith(...) condition and a request.time < timestamp(...) expiration — passes CAP least-privilege review. We never grant write; only GET on the explicit object paths the lab sends us.
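Assuming the module emits a standard GCS IAM condition, the CEL expression would look like the sketch below; the bucket, prefix, and expiry values are illustrative:

```python
# Sketch of the CEL condition the BYOB module is described as emitting.
# GCS object resources are named "projects/_/buckets/<bucket>/objects/<object>".
def byob_condition(bucket: str, prefix: str, expires_at: str) -> str:
    """Scope read access to one prefix and time-bound it."""
    return (
        f'resource.name.startsWith("projects/_/buckets/{bucket}/objects/{prefix}") '
        f'&& request.time < timestamp("{expires_at}")'
    )
```

Once `expires_at` passes, the binding is dead even if nobody remembers to revoke it, which is the point of the time bound.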

Method C — Globus Transfer

Best for: Academic labs and research institutions with an existing Globus endpoint (most universities).

The lab uses their existing Globus endpoint to push files directly into the Datamade Genomics GCS bucket via the Globus Google Cloud Storage connector. The lab initiates the transfer from their familiar Globus web interface — no new tools required.

What the lab needs:
- An existing Globus endpoint at their institution (most universities already have one)
- The Datamade Genomics Globus collection name (provided on setup)

What Datamade Genomics needs:
- One-time Globus GCS connector configuration (operator runbook at integrations/globus/OPERATOR_RUNBOOK.md; ~30–45 min to provision when the first Globus customer arrives)
- Note: transfer of patient-identifiable (PHI) data requires the lab's institution to have a paid Globus Commercial/Plus subscription with BAA-equivalent controls (most major UK universities and NHS Genomic Laboratory Hubs already have this in place as a campus-wide agreement).

Lab workflow:
1. Log into app.globus.org with institutional credentials
2. Select their endpoint as the source
3. Enter the Datamade Genomics collection as the destination
4. Select files and click Start
5. Globus handles the rest — resumable, fault-tolerant, audited

Automated Pipeline Trigger

The target flow: once the manifest.json file arrives in the intake bucket, the pipeline starts automatically with no human intervention. (Per the status table above, this automatic trigger is forthcoming; today the callback fires on the customer's explicit completion ping.)

manifest.json arrives in GCS intake bucket
              ↓
GCS event notification fires
              ↓
Cloud Run job triggered
              ↓
Files pulled from GCS to DGX via VPC
              ↓
Parabricks pipeline runs
  └── BWA-MEM alignment
  └── Duplicate marking
  └── BQSR
  └── HaplotypeCaller / DeepVariant
              ↓
RAPIDS annotation
  └── ClinVar join
  └── OMIM join
  └── gnomAD frequency lookup
  └── ACMG classification
              ↓
LLM report draft generation
              ↓
Results written to GCS results bucket
              ↓
Signed download URL generated (12h expiry)
              ↓
Completion email sent to notify_email in manifest
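The forthcoming trigger step can be sketched as an Object Finalize handler that ignores FASTQ arrivals and reacts only to the manifest. The /jobs host shown is an assumption, and nothing is actually POSTed here:

```python
# Hedged sketch of the forthcoming GCS Object Finalize handler.
# GCS finalize events carry at least the bucket and object name.
JOBS_ENDPOINT = "https://genomics.datamade.ai/jobs"  # illustrative host

def handle_finalize(event: dict):
    """Return the POST /jobs request to make, or None if the event is not a
    manifest arrival (FASTQ objects finalize too and must be ignored)."""
    name = event.get("name", "")
    if not name.endswith("/manifest.json"):
        return None
    return {
        "url": JOBS_ENDPOINT,
        "json": {"bucket": event["bucket"], "manifest": name},
    }
```

Keying the trigger on the manifest rather than the FASTQs means the lab controls job start simply by uploading the manifest last.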

Wall-clock times below are for compute only on dedicated DGX/GB300 hardware and exclude data transfer time. Total end-to-end time (upload → results email) is typically 2–6 hours depending on file size and network.

Estimated turnaround times:

| Pipeline | Coverage | Wall Clock |
| --- | --- | --- |
| Germline WGS | 30x | ~10–15 min |
| Germline WGS | 100x | ~25–30 min |
| Somatic tumor/normal | 50x/30x | ~35–45 min |
| RNA-seq | Standard | ~15–20 min |
| + AI report draft | Any | +3–5 min |

Results Delivery

When the pipeline completes, the lab receives an email with a signed download URL valid for 12 hours. Academic labs that prefer Globus can also request results pushed back via their existing Globus endpoint — same checksums, resumable, no signed-URL handling.

Output package contents:

| File | Description | Size |
| --- | --- | --- |
| [job_id]_annotated.vcf.gz | Fully annotated variant calls | ~4 GB |
| [job_id].cram | Compressed aligned reads | ~100 GB |
| [job_id]_qc_report.html | Coverage, duplication, alignment metrics | ~1 MB |
| [job_id]_report_draft.pdf | AI-assisted clinical report draft* | ~500 KB |
| [job_id]_manifest_echo.json | Confirmed job parameters | ~1 KB |

*AI report draft requires geneticist review and sign-off before clinical use. Datamade Genomics generates the draft only.

Download command:

# Download full results package
gsutil -m cp -r gs://datamade-genomics-delivery/customers/<sha(email)[:16]>/jobs/[job_id]/outputs/ ./results/

# Or use the signed URL from the email directly in a browser
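Labs that want to check the download against the SHA-256 values in the countersigned ingestion receipt can stream-hash locally; a minimal sketch:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a ~100 GB CRAM never sits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Comparing this digest against the receipt closes the chain of custody end to end: the bytes the lab downloaded are the bytes the pipeline produced.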

Data Retention Policy

| Data Type | Retention | Action After |
| --- | --- | --- |
| Raw FASTQ (intake) | 30 days | Cryptographic deletion |
| CRAM (aligned) | 90 days | Cryptographic deletion |
| VCF + annotations | 90 days | Cryptographic deletion |
| QC reports | 90 days | Cryptographic deletion |
| Audit logs | 365 days | Archived to cold storage |
| Job metadata | 7 years | Retained per HIPAA |
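One way those windows could map onto bucket policy is a GCS object lifecycle config. The prefixes below are illustrative, and cryptographic deletion itself (key destruction) happens outside lifecycle rules:

```python
import json

# Retention windows from the table above; prefixes are assumptions, not the
# actual bucket layout guarantee.
RETENTION_DAYS = {
    "uploads/": 30,   # raw FASTQ intake
    "jobs/": 90,      # CRAM, VCF, QC outputs
}

def lifecycle_config() -> str:
    """Render a GCS lifecycle config that deletes objects on the schedule above."""
    rules = [
        {"action": {"type": "Delete"},
         "condition": {"age": days, "matchesPrefix": [prefix]}}
        for prefix, days in RETENTION_DAYS.items()
    ]
    return json.dumps({"rule": rules}, indent=2)
```

A config like this makes the retention table enforceable by the bucket itself rather than by a cron job someone has to remember.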

Labs are responsible for downloading and storing their own results before the retention window expires. Datamade Genomics does not archive patient data beyond the retention period.

Security and Compliance

UK / NHS-specific note: For UK-based labs operating under the NHS Genomic Medicine Service or Genomics England frameworks, all transfers also comply with UK GDPR, NHS Digital Data Security and Protection Toolkit (DSPT), and ISO 27001 standards. The Google Cloud BAA covers the US HIPAA elements, while our client BAA and technical controls address UK requirements.

| Control | Implementation |
| --- | --- |
| Encryption in transit | TLS 1.3 — enforced on all GCS endpoints |
| Encryption at rest | AES-256 — GCS default |
| HIPAA BAA | Google Cloud BAA covers all GCS operations |
| Client BAA | Signed per client before any data transfer |
| Signed URL expiry | 12hr upload / 12hr download (HMAC keys unlock 7-day TTL) |
| Access scope | Per-prefix conditional IAM — no standing access |
| Audit logging | All GCS access logged to BigQuery |
| Network isolation | DGX compute on private VPC — no public internet exposure |
| PHI handling | Minimum necessary — job ID used as identifier, not patient name |
| Chain of custody | Countersigned ingestion_id receipt within seconds of upload-complete |

Onboarding Checklist

Before the first sample transfer, the following must be confirmed:

Support

For transfer issues, job status, or questions:

Datamade Genomics
Email: hello@datamade.ai
Emergency: Available per BAA agreement

For upload failures, include:
- Job ID from the manifest
- Upload method used (A, B, or C)
- Error message received
- File sizes and names attempted

Cross-references