# Data Transfer — Datamade Genomics

This document describes how genomic data moves between a client lab and Datamade Genomics
infrastructure — from initial sample upload through pipeline processing to results delivery.

Datamade Genomics never ships hardware, installs software at client sites, or requires
special credentials. Everything moves over standard internet protocols into Google Cloud
Storage (GCS), which serves as the secure staging layer between client labs and our DGX
compute infrastructure.

## Status (2026-05-11)

| Section | Shipped today | Forthcoming |
|---|---|---|
| **Method A (Signed URL)** | Live — wired via `genomics-upload` helper + per-prefix signed PUT; receipt callback at `/api/upload-complete` | OIDC workforce-federated tokens for PHI (compliance redline-fix) |
| **Method B (`gsutil` / BYOB)** | Live — per-prefix conditional IAM Terraform module at [`integrations/byob/`](https://github.com/datamade/genomics-analysis/tree/main/integrations/byob/) | None |
| **Method C (Globus)** | Operator runbook at [`integrations/globus/OPERATOR_RUNBOOK.md`](https://github.com/datamade/genomics-analysis/tree/main/integrations/globus/OPERATOR_RUNBOOK.md); endpoint provisioned on first Globus customer | Globus webhook → upload-complete callback |
| **Auto-pipeline trigger on `manifest.json` arrival** | Not yet — Pages Function callback fires today on customer's explicit completion ping | GCS Object Finalize → Pub/Sub → Cloud Function → `POST /jobs` |
| **AI report draft generation** | Not built | Per-customer roadmap item |
| **Somatic / RNA pipelines** | Not built — only `DEEPVARIANT_GERMLINE` is locked today (see [`PIPELINE_LOCK.md`](https://github.com/datamade/genomics-analysis/blob/main/docs/clinical-readiness/technical/PIPELINE_LOCK.md)) | Each is a separate validation matrix |

## Architecture Summary

```
CLIENT LAB                    DATAMADE GENOMICS
──────────                    ─────────────────

FASTQ files                   GCS Intake Bucket
(raw sequencing output)  →    gs://datamade-genomics-delivery/
                                customers/<sha(email)[:16]>/uploads/
                                      ↓
                              DGX Spark / GB300
                              (Parabricks + RAPIDS + LLM)
                                      ↓
                              GCS Results Bucket (same)
                                customers/<sha(email)[:16]>/jobs/<job_id>/outputs/
                                      ↓
Annotated VCF            ←    Pre-signed Download URL
+ CRAM file                   (delivered via portal email)
+ AI Report Draft
```

All data is encrypted in transit (TLS 1.3) and at rest (AES-256). Google's HIPAA BAA covers
all GCS operations. A signed BAA between Datamade Genomics and the client lab is required
before any patient data is transferred.
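The per-tenant prefix in the diagram (`customers/<sha(email)[:16]>/`) is deterministic, so either side can recompute it from the lab contact address. A minimal sketch, assuming `sha` means SHA-256 over the trimmed, lowercased address and `[:16]` the first 16 hex characters (the document does not pin the exact hash or normalization):

```python
import hashlib

def tenant_prefix(email: str) -> str:
    """Derive the per-tenant GCS prefix for a lab contact address."""
    # Assumption: SHA-256 over the trimmed, lowercased address,
    # truncated to the first 16 hex characters of the digest.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"customers/{digest[:16]}/"

print(tenant_prefix("genomics@uky.edu"))
```

Because the prefix is a pure function of the email, no lookup table is needed to route a tenant's uploads and results to the same place.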

## What the Lab Provides

For each sequencing job, the lab provides:

**1. FASTQ files** — raw sequencing output from the instrument
- Paired-end: `sample_R1.fq.gz` and `sample_R2.fq.gz`
- Typically 100–400 GB per whole genome sample at 100x coverage

**2. Sample manifest** — a small JSON metadata file
- Filled in by the lab, uploaded alongside FASTQ files
- Tells the pipeline what to run and where to send results
- Template provided by Datamade Genomics

```json
{
  "job_id": "DMGX-UKGC-20260317-001",
  "client_id": "ukgc_001",
  "sample_id": "SAMPLE_XYZ",
  "pipeline": "germline",
  "coverage": "30x",
  "reference": "GRCh38",
  "files": {
    "r1": "sample_R1.fq.gz",
    "r2": "sample_R2.fq.gz"
  },
  "notify_email": "genomics@uky.edu",
  "report_type": "vcf_and_report"
}
```
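A quick pre-flight check of the manifest catches most rejected jobs before a multi-hundred-gigabyte upload starts. A minimal sketch; the required-field list is inferred from the example above, and `validate_manifest` is a hypothetical helper, not part of the shipped tooling:

```python
import json

# Fields inferred from the example manifest above; treat as illustrative.
REQUIRED = {"job_id", "client_id", "sample_id", "pipeline",
            "reference", "files", "notify_email"}

def validate_manifest(text: str) -> list:
    """Return a list of problems; an empty list means the manifest parses
    and carries every field needed to route the job."""
    manifest = json.loads(text)
    problems = ["missing field: %s" % k for k in sorted(REQUIRED - manifest.keys())]
    for key in ("r1", "r2"):
        if key not in manifest.get("files", {}):
            problems.append("files.%s not listed" % key)
    return problems
```

Running the check locally before upload means a typo in `notify_email` or a missing `r2` entry fails in seconds rather than after the transfer completes.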

## What Datamade Genomics Provides

For each job, Datamade Genomics provides:

**1. Signed upload URLs** — time-limited, single-use URLs generated per job
- One URL per file (R1, R2, manifest)
- Valid for up to 12 hours (impersonation cap; HMAC keys unlock 7 days)
- No credentials required to use
- Delivered via email to the lab contact

**2. Job confirmation** — automated email on manifest receipt
- Confirms the files received, with a SHA-256 equality check
- Provides an estimated completion time
- Includes a countersigned `ingestion_id` for chain-of-custody records

**3. Results notification** — automated email on pipeline completion
- Contains a signed download URL to the portal page (valid 12 hours)
- Lists the output files included
- Notes any QC flags
- Available via email or Globus push (academic labs)
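GCS V4 signed URLs carry their validity window in the query string: `X-Goog-Date` (start of validity, ISO basic format) and `X-Goog-Expires` (lifetime in seconds). A lab can therefore check how long a link has left before starting a large transfer. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

def signed_url_expiry(url: str) -> datetime:
    """Return the UTC instant at which a GCS V4 signed URL stops working."""
    params = parse_qs(urlparse(url).query)
    issued = datetime.strptime(
        params["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return issued + timedelta(seconds=int(params["X-Goog-Expires"][0]))
```

If the remaining window is shorter than the expected transfer time, reply to the job thread for a fresh URL rather than risking a mid-upload expiry.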

## Three Upload Methods

The lab can upload using whichever method is most convenient. All three land in the same
GCS intake bucket. We route customers to the fastest path for their stack at intake — see
the `current_data_location` question on the contact form.

### Method A — GCS Signed URL (default for on-prem / NAS / small labs)

**Best for:** Labs without GCP exposure. No software setup required beyond curl or our
`genomics-upload` helper.

Datamade Genomics emails a manifest with signed URLs. The lab runs the `genomics-upload`
helper, which picks the most robust transfer tool installed: `gcloud storage cp` if
present (resumable, parallel, CRC32C validation), otherwise `curl --retry`.

**Lab runs:**

```bash
# 1. download the helper (single bash file, no install)
curl -fsSL https://genomics.datamade.ai/genomics-upload.sh -o genomics-upload && chmod +x genomics-upload

# 2. save the manifest from your welcome email as manifest.json next to your FASTQs

# 3. run
./genomics-upload manifest.json
```

The helper computes SHA-256 locally, uploads, then pings our `/api/upload-complete` Pages
Function with an HMAC-verified callback so the customer receives a countersigned ingestion
receipt within seconds.
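The two cryptographic steps the helper performs can be sketched in a few lines. This is an illustration, not the shipped helper: the callback payload shape and the shared-secret handling are assumptions.

```python
import hashlib
import hmac
import json

def chunked_sha256(path: str) -> str:
    """Hash a (possibly very large) FASTQ without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def sign_callback(payload: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the callback body.
    Payload shape is hypothetical."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_callback(payload: dict, secret: bytes, signature: str) -> bool:
    """Constant-time check on the receiving (Pages Function) side."""
    return hmac.compare_digest(sign_callback(payload, secret), signature)
```

Canonical JSON (sorted keys, no whitespace) matters here: both sides must serialize the payload identically or the HMAC comparison fails even for honest callers.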

> **PHI note**: today's signed-URL pattern is appropriate for GIAB / non-PHI pilot data.
> For paid PHI workflows, the compliance roadmap is OIDC workforce-federated upload tokens
> (customer's IdP → GCP Workload Identity Federation → 1h scoped credential bound to the
> authenticated user) — this is in flight and will replace bearer URLs for PHI.

### Method B — gsutil Direct Upload (BYOB)

**Best for:** Labs already using GCP or familiar with Google Cloud tools.

Datamade Genomics provides a per-prefix conditional IAM template the lab applies in their
own Terraform tree. We grant our worker SA scoped read on a specific prefix in *their*
bucket — they keep the data, we just read it. **Zero data movement, zero egress** when
their bucket is in the same GCS region as our compute.

**Lab runs** (Terraform — module at [`integrations/byob/datamade_byob.tf`](https://github.com/datamade/genomics-analysis/tree/main/integrations/byob/datamade_byob.tf)):

```hcl
module "datamade_byob" {
  source            = "./datamade_byob"
  customer_bucket   = "acme-genomics-clinical"
  ingress_prefix    = "datamade-ingress/"
  access_expires_at = "2026-11-11T00:00:00Z"
}
```

```bash
terraform init && terraform plan && terraform apply
```

Forward the `terraform output` (worker SA, ingress URI, expiry, binding etag) to the
Datamade onboarding thread.

**What the lab needs:**
- Terraform 1.5+ (or gsutil equivalent in `integrations/byob/README.md`)
- A GCS bucket holding the FASTQs

The IAM binding is per-prefix and time-bounded with a `resource.name.startsWith(...)`
condition and a `request.time < timestamp(...)` expiration — passes CAP least-privilege
review. We never grant write; only GET on the explicit object paths the lab sends us.
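What the binding enforces can be emulated outside CEL for review purposes. A minimal sketch; `byob_condition_allows` is a hypothetical checker mirroring the two condition clauses (real enforcement happens in GCS IAM), and GCS object resource names take the form `projects/_/buckets/<bucket>/objects/<path>`:

```python
from datetime import datetime, timezone

def byob_condition_allows(resource_name: str, request_time: datetime,
                          bucket: str, ingress_prefix: str,
                          expires_at: datetime) -> bool:
    """Mirror of: resource.name.startsWith(<prefix>)
                  && request.time < timestamp(<expiry>)."""
    allowed = f"projects/_/buckets/{bucket}/objects/{ingress_prefix}"
    return resource_name.startswith(allowed) and request_time < expires_at
```

Both clauses must hold: an object outside the ingress prefix is denied even before the expiry, and every object is denied after it.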

### Method C — Globus Transfer

**Best for:** Academic labs and research institutions with an existing Globus endpoint
(most universities).

The lab uses their existing Globus endpoint to push files directly into the Datamade
Genomics GCS bucket via the Globus Google Cloud Storage connector. The lab initiates the
transfer from their familiar Globus web interface — no new tools required.

**What the lab needs:**
- An existing Globus endpoint at their institution (most universities already have this)
- The Datamade Genomics Globus collection name (provided on setup)

**What Datamade Genomics needs:**
- One-time Globus GCS connector configuration (operator runbook at
  [`integrations/globus/OPERATOR_RUNBOOK.md`](https://github.com/datamade/genomics-analysis/tree/main/integrations/globus/OPERATOR_RUNBOOK.md);
  ~30–45 min to provision when first Globus customer arrives)
- Note: Transfer of patient-identifiable (PHI) data requires the lab's institution to
  have a paid Globus Commercial/Plus subscription with BAA-equivalent controls (most
  major UK universities and NHS Genomic Laboratory Hubs already have this in place as a
  campus-wide agreement).

**Lab workflow:**
1. Log into app.globus.org with institutional credentials
2. Select their endpoint as source
3. Enter Datamade Genomics collection as destination
4. Select files and click Start
5. Globus handles the rest — resumable, fault-tolerant, audited

## Automated Pipeline Trigger

Today the pipeline is started by the customer's explicit completion ping to
`/api/upload-complete` (see the Status table above). The target design is fully
automatic: as soon as `manifest.json` arrives in the intake bucket, the pipeline starts
with no human intervention.

```
manifest.json arrives in GCS intake bucket
              ↓
GCS Object Finalize event → Pub/Sub
              ↓
Cloud Function triggered → POST /jobs
              ↓
Files pulled from GCS to DGX via VPC
              ↓
Parabricks pipeline runs
  └── BWA-MEM alignment
  └── Duplicate marking
  └── BQSR
  └── HaplotypeCaller / DeepVariant
              ↓
RAPIDS annotation
  └── ClinVar join
  └── OMIM join
  └── gnomAD frequency lookup
  └── ACMG classification
              ↓
LLM report draft generation
              ↓
Results written to GCS results bucket
              ↓
Signed download URL generated (12h expiry)
              ↓
Completion email sent to notify_email in manifest
```
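The forthcoming trigger (GCS Object Finalize → Pub/Sub → Cloud Function → `POST /jobs`, per the Status table) can be sketched as below. The routing predicate, handler shape, and jobs endpoint URL are assumptions; nothing in this block is deployed today.

```python
import json
from urllib import request

def should_trigger(object_name: str) -> bool:
    """Fire only when a tenant's manifest lands, not on every FASTQ part."""
    parts = object_name.split("/")
    return (len(parts) == 4 and parts[0] == "customers"
            and parts[2] == "uploads" and parts[3] == "manifest.json")

def handle_finalize(event: dict,
                    jobs_url: str = "https://genomics.datamade.ai/jobs"):
    """Hypothetical Cloud Function entry point for an Object Finalize event.
    GCS events carry the object path in the "name" field."""
    name = event.get("name", "")
    if not should_trigger(name):
        return None
    body = json.dumps({"manifest_object": name}).encode()
    req = request.Request(jobs_url, data=body,
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)  # POST /jobs kicks off the pipeline
```

Filtering on the exact `customers/<hash>/uploads/manifest.json` shape keeps the function idempotent with respect to the FASTQ objects, which arrive as separate Finalize events.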

> _**Wall-clock times below are for compute only on dedicated DGX/GB300 hardware and
> exclude data transfer time. Total end-to-end time (upload → results email) is typically
> 2–6 hours depending on file size and network.**_

**Estimated turnaround times:**

| Pipeline | Coverage | Wall Clock |
|----------|----------|------------|
| Germline WGS | 30x | ~10–15 min |
| Germline WGS | 100x | ~25–30 min |
| Somatic tumor/normal | 50x/30x | ~35–45 min |
| RNA-seq | Standard | ~15–20 min |
| + AI report draft | Any | +3–5 min |

## Results Delivery

When the pipeline completes, the lab receives an email with a signed download URL valid
for 12 hours. **Academic labs that prefer Globus can also request results pushed back via
their existing Globus endpoint** — same checksums, resumable, no signed-URL handling.

**Output package contents:**

| File | Description | Size |
|------|-------------|------|
| `[job_id]_annotated.vcf.gz` | Fully annotated variant calls | ~4 GB |
| `[job_id].cram` | Compressed aligned reads | ~100 GB |
| `[job_id]_qc_report.html` | Coverage, duplication, alignment metrics | ~1 MB |
| `[job_id]_report_draft.pdf` | AI-assisted clinical report draft* | ~500 KB |
| `[job_id]_manifest_echo.json` | Confirmed job parameters | ~1 KB |

*AI report draft requires geneticist review and sign-off before clinical use. Datamade
Genomics generates the draft only.

**Download command:**

```bash
# Download full results package
gsutil -m cp -r gs://datamade-genomics-delivery/customers/<hash>/jobs/[job_id]/outputs/ ./results/

# Or use the signed URL from the email directly in a browser
```

## Data Retention Policy

| Data Type | Retention | Action After |
|-----------|-----------|--------------|
| Raw FASTQ (intake) | 30 days | Cryptographic deletion |
| CRAM (aligned) | 90 days | Cryptographic deletion |
| VCF + annotations | 90 days | Cryptographic deletion |
| QC reports | 90 days | Cryptographic deletion |
| Audit logs | 365 days | Archived to cold storage |
| Job metadata | 7 years | Retained per HIPAA |

Labs are responsible for downloading and storing their own results before the retention
window expires. Datamade Genomics does not archive patient data beyond the retention
period.

## Security and Compliance

> **UK / NHS-specific note**: For UK-based labs operating under the NHS Genomic Medicine
> Service or Genomics England frameworks, all transfers also comply with UK GDPR, NHS
> Digital Data Security and Protection Toolkit (DSPT), and ISO 27001 standards. The
> Google Cloud BAA covers the US HIPAA elements, while our client BAA and technical
> controls address UK requirements.

| Control | Implementation |
|---------|---------------|
| Encryption in transit | TLS 1.3 — enforced on all GCS endpoints |
| Encryption at rest | AES-256 — GCS default |
| HIPAA BAA | Google Cloud BAA covers all GCS operations |
| Client BAA | Signed per client before any data transfer |
| Signed URL expiry | 12hr upload / 12hr download (HMAC keys unlock 7-day TTL) |
| Access scope | Per-prefix conditional IAM — no standing access |
| Audit logging | All GCS access logged to BigQuery |
| Network isolation | DGX compute on private VPC — no public internet exposure |
| PHI handling | Minimum necessary — job ID used as identifier, not patient name |
| Chain of custody | Countersigned `ingestion_id` receipt within seconds of upload-complete |

## Onboarding Checklist

Before the first sample transfer, the following must be confirmed:

- [ ] Client BAA executed (Datamade ↔ Lab)
- [ ] Per-tenant prefix provisioned in `gs://datamade-genomics-delivery/customers/<hash>/`
- [ ] `notify_email` in manifest verified
- [ ] Upload method confirmed (A, B, or C)
- [ ] For Method B: BYOB Terraform module applied; `terraform output` forwarded
- [ ] For Method C: Globus collection shared and tested
- [ ] Test transfer completed with non-PHI file (GIAB sample preferred)
- [ ] Pilot samples agreed (5 samples, no charge)
- [ ] UK GDPR / NHS DSPT readiness confirmed (for UK clients)

## Support

For transfer issues, job status, or questions:

**Datamade Genomics**
Email: hello@datamade.ai
Emergency: Available per BAA agreement

For upload failures, include:
- Job ID from manifest
- Upload method used (A, B, or C)
- Error message received
- File sizes and names attempted

## Cross-references

- [`integrations/byob/`](https://github.com/datamade/genomics-analysis/tree/main/integrations/byob/) — per-prefix BYOB Terraform module
- [`integrations/globus/OPERATOR_RUNBOOK.md`](https://github.com/datamade/genomics-analysis/tree/main/integrations/globus/OPERATOR_RUNBOOK.md) — Globus endpoint provisioning
- [`integrations/terra/parabricks_deepvariant_germline.wdl`](https://github.com/datamade/genomics-analysis/tree/main/integrations/terra/parabricks_deepvariant_germline.wdl) — Terra/Dockstore WDL (in-flight for the Terra-resident segment)
- [`scripts/customer_onboard.py`](https://github.com/datamade/genomics-analysis/tree/main/scripts/customer_onboard.py) — operator-side onboarding-pack generator
- [`scripts/genomics_upload.sh`](https://github.com/datamade/genomics-analysis/tree/main/scripts/genomics_upload.sh) — customer-side upload helper
