Data Pipeline — Technical Reference
End-to-end flow (top-down, with teams)
CI/CD: GitHub Actions (ci.yml, cd.yml) → Terraform. Alerting: CloudWatch 5% / 100 rows; Step failure → SNS. Dotted lines: owns / reviews / consumes. More: https://ohpen.stephenadei.nl
Organization name in bucket examples is illustrative; patterns are reusable across institutions.
Teams: Platform Domain Finance Quality
Technical Implementation Details
Bucket names in the table below are example deployment names; the same patterns apply for any institution.
| Component | Technical Implementation Details |
|---|---|
| Architecture Overview | Flow: S3 (Bronze: raw CSV) → EventBridge → Step Functions → Glue/Lambda → S3 (Silver/Gold/Quarantine: Parquet) → Athena. Separate buckets: ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine, ohpen-artifacts. |
| 1. Data Ingestion | Source: S3 ohpen-bronze, e.g. bronze/mortgages/transactions/ (ingest_date, run_id).Format: CSV. Trigger: S3 object-created → EventBridge → Step Functions → ETL; or schedule (e.g. daily 02:00 UTC). |
| 2. Transformation Logic | Language: Python (Pandas/PyArrow) or Glue PySpark. Validation: Null checks on required columns; Currency vs ISO-4217. Partitioning: Parquet by year/month from TransactionTimestamp (tx_date). |
| 3. Data Lake Structure | ohpen-bronze (/raw): Immutable CSVs. ohpen-silver (/processed): Validated Parquet (year/month, schema_v, run_id). ohpen-gold: Business aggregates (account_balances, monthly_reports). ohpen-quarantine: Quarantine (review/retry) + Condemned (audit; human approval). ohpen-artifacts: ETL scripts, Athena results, manifests. |
| 4. Schema Evolution | Strategy: Schema-on-Read. Catalog: Terraform-defined tables (no Crawler). Handling: Additive-only; versioned paths (schema_v=v1, v2); catalog via IaC/runbook. Parquet: missing columns → null. |
| 5. CI/CD & Monitoring | Pipeline: GitHub Actions (ci.yml: lint, tests, build; cd.yml: deploy, Terraform apply). Infra: Terraform (S3, IAM, Glue, Step Functions, EventBridge, Lambda, Athena, CloudWatch, SNS). Alerting: Quarantine rate >5%, quarantined rows >100; Step failure → SNS. |