This document describes how genomic data moves between a client lab and Datamade Genomics infrastructure — from initial sample upload through pipeline processing to results delivery.
Datamade Genomics never ships hardware, installs software at client sites, or requires special credentials. Everything moves over standard internet protocols into Google Cloud Storage (GCS), which serves as the secure staging layer between client labs and our DGX compute infrastructure.
| Section | Shipped today | Forthcoming |
|---|---|---|
| Method A (Signed URL) | Live — wired via genomics-upload helper + per-prefix signed PUT; receipt callback at /api/upload-complete | OIDC workforce-federated tokens for PHI (compliance redline-fix) |
| Method B (gsutil / BYOB) | Live — per-prefix conditional IAM Terraform module at integrations/byob/ | None |
| Method C (Globus) | Operator runbook at integrations/globus/OPERATOR_RUNBOOK.md; endpoint provisioned on first Globus customer | Globus webhook → upload-complete callback |
| Auto-pipeline trigger on manifest.json arrival | Not yet — Pages Function callback fires today on customer's explicit completion ping | GCS Object Finalize → Pub/Sub → Cloud Function → POST /jobs |
| AI report draft generation | Not built | Per-customer roadmap item |
| Somatic / RNA pipelines | Not built — only DEEPVARIANT_GERMLINE is locked today (see PIPELINE_LOCK.md) | Each is a separate validation matrix |
```
CLIENT LAB                          DATAMADE GENOMICS
──────────                          ─────────────────
FASTQ files                         GCS Intake Bucket
(raw sequencing output)    →        gs://datamade-genomics-delivery/
                                      customers/<sha(email)[:16]>/uploads/
                                                ↓
                                    DGX Spark / GB300
                                    (Parabricks + RAPIDS + LLM)
                                                ↓
                                    GCS Results Bucket (same)
                                      customers/<sha(email)[:16]>/jobs/<job_id>/outputs/
                                                ↓
Annotated VCF              ←        Pre-signed Download URL
+ CRAM file                         (delivered via portal email)
+ AI Report Draft
```
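The per-customer prefix in the diagram can be derived along these lines. This is an illustrative sketch only: the document writes `<sha(email)[:16]>` without naming the hash function or the email normalization, so both are assumptions here.

```python
import hashlib

def customer_prefix(email: str) -> str:
    # Assumption: SHA-256 over the trimmed, lowercased contact email,
    # keeping the first 16 hex characters as the customer directory.
    digest = hashlib.sha256(email.strip().lower().encode()).hexdigest()
    return f"customers/{digest[:16]}/uploads/"
```

A lab's uploads would then land under `gs://datamade-genomics-delivery/` + `customer_prefix(email)`.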
All data is encrypted in transit (TLS 1.3) and at rest (AES-256). Google's HIPAA BAA covers all GCS operations. A signed BAA between Datamade Genomics and the client lab is required before any patient data is transferred.
For each sequencing job, the lab provides:
1. FASTQ files — raw sequencing output from the instrument
- Paired-end: sample_R1.fq.gz and sample_R2.fq.gz
- Typically 100–400 GB per whole genome sample at 100x coverage
2. Sample manifest — a small JSON metadata file
   - Filled in by the lab, uploaded alongside FASTQ files
   - Tells the pipeline what to run and where to send results
   - Template provided by Datamade Genomics
```json
{
  "job_id": "DMGX-UKGC-20260317-001",
  "client_id": "ukgc_001",
  "sample_id": "SAMPLE_XYZ",
  "pipeline": "germline",
  "coverage": "30x",
  "reference": "GRCh38",
  "files": {
    "r1": "sample_R1.fq.gz",
    "r2": "sample_R2.fq.gz"
  },
  "notify_email": "genomics@uky.edu",
  "report_type": "vcf_and_report"
}
```
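A manifest like the one above can be sanity-checked before upload. The sketch below is hypothetical: the key list mirrors the example manifest, but the authoritative server-side schema may differ.

```python
import json

# Top-level keys taken from the sample manifest above (illustrative,
# not the service's official schema).
REQUIRED = {"job_id", "client_id", "sample_id", "pipeline",
            "reference", "files", "notify_email"}

def validate_manifest(raw: str) -> list[str]:
    """Return a list of problems; empty means the manifest looks usable."""
    m = json.loads(raw)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - m.keys())]
    # Paired-end runs need both read files declared.
    for end in ("r1", "r2"):
        if end not in m.get("files", {}):
            problems.append(f"missing files.{end}")
    return problems
```

Running this locally before `genomics-upload` saves a round trip when a field was left out of the template.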
For each job, Datamade Genomics provides:
1. Signed upload URLs — time-limited, single-use URLs generated per job
   - One URL per file (R1, R2, manifest)
   - Valid for up to 12 hours (impersonation cap; HMAC keys unlock 7 days)
   - No credentials required to use
   - Delivered via email to the lab contact
2. Job confirmation — automated email on manifest receipt
- Confirms files received with SHA-256 equality assertion
- Provides estimated completion time
- Includes countersigned ingestion_id for chain-of-custody records
3. Results notification — automated email on pipeline completion
   - Contains signed download URL to portal page (valid 12 hours)
   - Lists output files included
   - Notes any QC flags
   - Available via email or Globus push (academic labs)
The lab can upload using whichever method is most convenient. All three land in the same
GCS intake bucket. We route customers to the fastest path for their stack at intake — see
the current_data_location question on the contact form.
Best for: Labs without GCP exposure. No software setup required beyond curl or our
genomics-upload helper.
Datamade Genomics emails a manifest with signed URLs. The lab runs the genomics-upload
helper, which picks the most robust transfer tool installed (gcloud storage cp if
present — resumable, parallel, CRC32C; falls back to curl --retry).
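The tool-selection rule reduces to a one-line check. Sketched in Python for illustration (the shipped helper is a bash script, and the curl retry count here is an assumption):

```python
import shutil

def pick_uploader() -> str:
    # Prefer gcloud storage cp: resumable, parallel, CRC32C-checked.
    # Fall back to curl with retries when gcloud is not installed.
    if shutil.which("gcloud"):
        return "gcloud storage cp"
    return "curl --retry 5"  # retry count illustrative
```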
Lab runs:

```shell
# 1. download the helper (single bash file, no install)
curl -fsSL https://genomics.datamade.ai/genomics-upload.sh -o genomics-upload && chmod +x genomics-upload

# 2. save the manifest from your welcome email as manifest.json next to your FASTQs

# 3. run
./genomics-upload manifest.json
```
The helper computes SHA-256 locally, uploads, then pings our /api/upload-complete Pages
Function with an HMAC-verified callback so the customer receives a countersigned ingestion
receipt within seconds.
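An HMAC-verified callback of this kind can be sketched with the standard library. The field names and signature scheme below are assumptions for illustration, not the shipped /api/upload-complete contract:

```python
import hashlib
import hmac
import json

def sign_callback(shared_secret: bytes, payload: dict) -> str:
    # Sign a canonical JSON rendering of the payload so both sides
    # serialize identically before comparing MACs.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(shared_secret, body.encode(), hashlib.sha256).hexdigest()

def verify_callback(shared_secret: bytes, payload: dict, signature: str) -> bool:
    # Constant-time comparison on the receiving side.
    return hmac.compare_digest(sign_callback(shared_secret, payload), signature)
```

The countersigned ingestion receipt would then be issued only after `verify_callback` passes.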
PHI note: today's signed-URL pattern is appropriate for GIAB / non-PHI pilot data. For paid PHI workflows, the compliance roadmap is OIDC workforce-federated upload tokens (customer's IdP → GCP Workload Identity Federation → 1h scoped credential bound to the authenticated user) — this is in flight and will replace bearer URLs for PHI.
Best for: Labs already using GCP or familiar with Google Cloud tools.
Datamade Genomics provides a per-prefix conditional IAM template the lab applies in their own Terraform tree. We grant our worker SA scoped read on a specific prefix in their bucket — they keep the data, we just read it. Zero data movement, zero egress when their bucket is in the same GCS region as our compute.
Lab runs (Terraform — module at integrations/byob/datamade_byob.tf):

```hcl
module "datamade_byob" {
  source            = "./datamade_byob"
  customer_bucket   = "acme-genomics-clinical"
  ingress_prefix    = "datamade-ingress/"
  access_expires_at = "2026-11-11T00:00:00Z"
}
```

```shell
terraform init && terraform plan && terraform apply
```
Forward the terraform output (worker SA, ingress URI, expiry, binding etag) to the
Datamade onboarding thread.
What the lab needs:
- Terraform 1.5+ (or gsutil equivalent in integrations/byob/README.md)
- A GCS bucket holding the FASTQs
The IAM binding is per-prefix and time-bounded with a resource.name.startsWith(...)
condition and a request.time < timestamp(...) expiration — passes CAP least-privilege
review. We never grant write; only GET on the explicit object paths the lab sends us.
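For illustration, the condition's two clauses can be rendered in Python (bucket, prefix, and expiry are the example module inputs above; GCS object resource names take the form `projects/_/buckets/<bucket>/objects/<object>`):

```python
from datetime import datetime, timezone

# Example values from the module block above (illustrative only).
BUCKET_PREFIX = "projects/_/buckets/acme-genomics-clinical/objects/datamade-ingress/"
EXPIRES = datetime(2026, 11, 11, tzinfo=timezone.utc)

def binding_allows(resource_name: str, request_time: datetime) -> bool:
    # Mirrors the CEL condition:
    #   resource.name.startsWith(<prefix>) && request.time < timestamp(<expiry>)
    return resource_name.startswith(BUCKET_PREFIX) and request_time < EXPIRES
```

Both clauses must hold, so access outside the prefix is denied immediately and all access ends at the expiry without anyone revoking anything.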
Best for: Academic labs and research institutions with an existing Globus endpoint (most universities).
The lab uses their existing Globus endpoint to push files directly into the Datamade Genomics GCS bucket via the Globus Google Cloud Storage connector. The lab initiates the transfer from their familiar Globus web interface — no new tools required.
What the lab needs:
- An existing Globus endpoint at their institution (most universities already have this)
- The Datamade Genomics Globus collection name (provided on setup)
What Datamade Genomics needs:
- One-time Globus GCS connector configuration (operator runbook at
integrations/globus/OPERATOR_RUNBOOK.md;
~30–45 min to provision when first Globus customer arrives)
- Note: Transfer of patient-identifiable (PHI) data requires the lab's institution to
have a paid Globus Commercial/Plus subscription with BAA-equivalent controls (most
major UK universities and NHS Genomic Laboratory Hubs already have this in place as a
campus-wide agreement).
Lab workflow:
1. Log into app.globus.org with institutional credentials
2. Select their endpoint as source
3. Enter Datamade Genomics collection as destination
4. Select files and click Start
5. Globus handles the rest — resumable, fault-tolerant, audited
Once the manifest.json file arrives in the intake bucket, the pipeline kicks off. (Today the kickoff fires on the customer's explicit completion ping; the fully automatic GCS-event trigger shown in the flow below is forthcoming, per the status table above.)
```
manifest.json arrives in GCS intake bucket
        ↓
GCS event notification fires
        ↓
Cloud Run job triggered
        ↓
Files pulled from GCS to DGX via VPC
        ↓
Parabricks pipeline runs
  └── BWA-MEM alignment
  └── Duplicate marking
  └── BQSR
  └── HaplotypeCaller / DeepVariant
        ↓
RAPIDS annotation
  └── ClinVar join
  └── OMIM join
  └── gnomAD frequency lookup
  └── ACMG classification
        ↓
LLM report draft generation
        ↓
Results written to GCS results bucket
        ↓
Signed download URL generated (12h expiry)
        ↓
Completion email sent to notify_email in manifest
```
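The event filter at the top of this flow can be sketched as a predicate over the GCS Object Finalize notification (the `name` field is the object path per that payload; the filter logic itself is an assumption about the forthcoming trigger, which is not live today):

```python
def should_trigger(event: dict) -> bool:
    # Fire the pipeline only when the finalized object is a manifest.json
    # under a customer's uploads/ prefix; ignore FASTQs and results objects.
    name = event.get("name", "")
    return "/uploads/" in name and name.endswith("/manifest.json")
```

Uploading the FASTQs first and the manifest last then gives a natural "all files present" signal.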
Wall-clock times below are for compute only on dedicated DGX/GB300 hardware and exclude data transfer time. Total end-to-end time (upload → results email) is typically 2–6 hours depending on file size and network.
Estimated turnaround times:
| Pipeline | Coverage | Wall Clock |
|---|---|---|
| Germline WGS | 30x | ~10–15 min |
| Germline WGS | 100x | ~25–30 min |
| Somatic tumor/normal | 50x/30x | ~35–45 min |
| RNA-seq | Standard | ~15–20 min |
| + AI report draft | Any | +3–5 min |
When the pipeline completes, the lab receives an email with a signed download URL valid for 12 hours. Academic labs that prefer Globus can also request results pushed back via their existing Globus endpoint — same checksums, resumable, no signed-URL handling.
Output package contents:
| File | Description | Size |
|---|---|---|
| [job_id]_annotated.vcf.gz | Fully annotated variant calls | ~4 GB |
| [job_id].cram | Compressed aligned reads | ~100 GB |
| [job_id]_qc_report.html | Coverage, duplication, alignment metrics | ~1 MB |
| [job_id]_report_draft.pdf | AI-assisted clinical report draft* | ~500 KB |
| [job_id]_manifest_echo.json | Confirmed job parameters | ~1 KB |
*AI report draft requires geneticist review and sign-off before clinical use. Datamade Genomics generates the draft only.
Download command:
```shell
# Download full results package
gsutil -m cp -r gs://datamade-genomics-delivery/[client_id]/[job_id]/ ./results/

# Or use the signed URL from the email directly in a browser
```
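After downloading, a quick completeness check against the package table is straightforward. A hypothetical sketch, with file-name suffixes taken from the table above:

```python
import pathlib

# Expected package contents, keyed off the job ID (from the table above).
EXPECTED_SUFFIXES = [
    "_annotated.vcf.gz",
    ".cram",
    "_qc_report.html",
    "_report_draft.pdf",
    "_manifest_echo.json",
]

def missing_outputs(results_dir: str, job_id: str) -> list[str]:
    """Return any expected output files that did not land locally."""
    present = {p.name for p in pathlib.Path(results_dir).iterdir()}
    return [job_id + s for s in EXPECTED_SUFFIXES if job_id + s not in present]
```

An empty return means the full package arrived; anything listed should be re-pulled before the signed URL expires.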
| Data Type | Retention | Action After |
|---|---|---|
| Raw FASTQ (intake) | 30 days | Cryptographic deletion |
| CRAM (aligned) | 90 days | Cryptographic deletion |
| VCF + annotations | 90 days | Cryptographic deletion |
| QC reports | 90 days | Cryptographic deletion |
| Audit logs | 365 days | Archived to cold storage |
| Job metadata | 7 years | Retained per HIPAA |
Labs are responsible for downloading and storing their own results before the retention window expires. Datamade Genomics does not archive patient data beyond the retention period.
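The download deadlines implied by the retention table can be computed directly. A sketch, assuming the clock starts at job completion (the table does not specify the start of the window):

```python
from datetime import date, timedelta

# Days before cryptographic deletion, per the retention table.
RETENTION_DAYS = {"fastq_intake": 30, "cram": 90, "vcf_annotations": 90, "qc_reports": 90}

def deletion_dates(job_completed: date) -> dict:
    """Earliest date each artifact class may be deleted."""
    return {k: job_completed + timedelta(days=v) for k, v in RETENTION_DAYS.items()}
```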
UK / NHS-specific note: For UK-based labs operating under the NHS Genomic Medicine Service or Genomics England frameworks, all transfers also comply with UK GDPR, NHS Digital Data Security and Protection Toolkit (DSPT), and ISO 27001 standards. The Google Cloud BAA covers the US HIPAA elements, while our client BAA and technical controls address UK requirements.
| Control | Implementation |
|---|---|
| Encryption in transit | TLS 1.3 — enforced on all GCS endpoints |
| Encryption at rest | AES-256 — GCS default |
| HIPAA BAA | Google Cloud BAA covers all GCS operations |
| Client BAA | Signed per client before any data transfer |
| Signed URL expiry | 12hr upload / 12hr download (HMAC keys unlock 7-day TTL) |
| Access scope | Per-prefix conditional IAM — no standing access |
| Audit logging | All GCS access logged to BigQuery |
| Network isolation | DGX compute on private VPC — no public internet exposure |
| PHI handling | Minimum necessary — job ID used as identifier, not patient name |
| Chain of custody | Countersigned ingestion_id receipt within seconds of upload-complete |
Before the first sample transfer, the following must be confirmed:
- gs://datamade-genomics-delivery/customers/<hash>/ prefix provisioned
- notify_email in manifest verified
- terraform output forwarded

For transfer issues, job status, or questions:

Datamade Genomics
Email: hello@datamade.ai
Emergency: Available per BAA agreement
For upload failures, include:
- Job ID from manifest
- Upload method used (A, B, or C)
- Error message received
- File sizes and names attempted
- integrations/byob/ — per-prefix BYOB Terraform module
- integrations/globus/OPERATOR_RUNBOOK.md — Globus endpoint provisioning
- integrations/terra/parabricks_deepvariant_germline.wdl — Terra/Dockstore WDL (in-flight for the Terra-resident segment)
- scripts/customer_onboard.py — operator-side onboarding-pack generator
- scripts/genomics_upload.sh — customer-side upload helper