
Handout

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Core narrative (use this line everywhere)

The design assumes an append-only raw layer and deterministic transformations: curated outputs can be reproduced or recomputed for historical periods when necessary, while raw inputs stay immutable.
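As a toy illustration of that narrative (not the actual pipeline code), a deterministic transformation is a pure function of its raw input — no clocks, no randomness, no external state — so replaying it over the same immutable raw data always reproduces the same curated output. The field names below are made up for the sketch:

```python
import hashlib
import json

def transform(raw_rows):
    """Toy deterministic transformation: normalize currency codes, sort by id.

    Pure function of its input, so replaying it over the same immutable
    raw data reproduces the curated output byte-for-byte.
    """
    rows = [{**r, "currency": r["currency"].upper()} for r in raw_rows]
    return sorted(rows, key=lambda r: r["transaction_id"])

def output_fingerprint(rows):
    """Stable hash of the curated output, for comparing reruns."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

raw = [
    {"transaction_id": "t2", "currency": "eur", "amount": "10.00"},
    {"transaction_id": "t1", "currency": "usd", "amount": "5.00"},
]
# Two independent runs over the same raw input yield identical output,
# regardless of input ordering.
assert output_fingerprint(transform(raw)) == output_fingerprint(transform(list(reversed(raw))))
```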

High-level pipeline diagram

Solution Components

  • ETL Pipeline: ETL Implementation (available in code package)
    • Reads CSV from AWS S3.
    • Validates required fields, a currency allowlist, and timestamp parsing.
    • Writes Parquet partitioned by year/month and isolates each run under run_id=....
    • Writes invalid records to quarantine with a validation_error.
    • Writes _SUCCESS with run metrics.
  • Data Lake Architecture: Data Lake Architecture + diagram (excluded from documentation)
    • Bronze / Silver / Gold + Quarantine layout.
    • Schema evolution strategy: Parquet-only (Iceberg is a future enhancement).
  • Analytical Querying & SQL: Balance History Query
    • Month-end "as-of" balances for Jan–Mar 2024.
    • Generates an account×month spine and left-joins balances onto it, so months without data stay NULL (matching the example output).
  • DevOps & Automation: CI/CD Workflow
    • CI workflow: GitHub Actions (available in code package)
    • Terraform infrastructure: Infrastructure as Code (available in code package)
  • Technical Documentation: Executive Summary, Design Decisions Summary.
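A minimal sketch of the validation step described above. The field names, allowlist, and error strings are illustrative assumptions, not the code package's actual schema:

```python
from datetime import datetime
from typing import Optional

# Illustrative assumptions; the real ETL package defines its own schema.
REQUIRED_FIELDS = ("transaction_id", "account_id", "amount", "currency", "timestamp")
CURRENCY_ALLOWLIST = {"EUR", "USD", "GBP"}

def validate(record: dict) -> Optional[str]:
    """Return a validation_error string for quarantine, or None if valid."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        return f"Missing required fields: {', '.join(missing)}"
    if record["currency"] not in CURRENCY_ALLOWLIST:
        return f"Invalid Currency: {record['currency']}"
    try:
        datetime.fromisoformat(record["timestamp"])
    except ValueError:
        return f"Invalid Timestamp: {record['timestamp']}"
    return None
```

Valid records flow on to the Silver layer; any record with a non-None result is written to quarantine with the error attached as validation_error.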

How to demo quickly (local)

From projects/ohpen-case-2026/:

Example bucket names used below (deployment-specific): ohpen-bronze, ohpen-silver, ohpen-quarantine.

  1. Set up AWS credentials and run ETL:
source .venv/bin/activate
export AWS_ACCESS_KEY_ID=<your-aws-access-key>
export AWS_SECRET_ACCESS_KEY=<your-aws-secret-key>
export AWS_REGION=us-east-1

cd tasks/data_ingestion_transformation && PYTHONPATH=src python3 -m etl.ingest_transactions \
  --input-bucket ohpen-bronze \
  --input-key transactions/transactions.csv \
  --output-bucket ohpen-silver \
  --output-prefix transactions \
  --quarantine-bucket ohpen-quarantine \
  --quarantine-prefix transactions

Expected: valid rows written to the Silver bucket under transactions/run_id=.../year=YYYY/month=MM/ and invalid rows to the Quarantine bucket under quarantine/.../run_id=.../.
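The partition layout described above can be sketched as a small key builder; the run_id format and file name here are illustrative assumptions, not the package's actual naming:

```python
from datetime import datetime

def output_key(prefix: str, run_id: str, ts: datetime, filename: str) -> str:
    """Build a Silver object key partitioned by year/month, isolated per run_id."""
    return (
        f"{prefix}/run_id={run_id}/"
        f"year={ts.year:04d}/month={ts.month:02d}/{filename}"
    )

key = output_key("transactions", "20240131T120000Z", datetime(2024, 1, 15), "part-0000.parquet")
# -> "transactions/run_id=20240131T120000Z/year=2024/month=01/part-0000.parquet"
```

Isolating each run under its own run_id= prefix keeps reruns and backfills from clobbering earlier outputs.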


Monitoring & Observability

ETL pipeline

  • Metrics to emit (structured logs + CloudWatch/Datadog):
    • run_id, input_rows, valid_rows, quarantined_rows
    • duration_seconds, bytes_read, bytes_written
    • quarantine_by_reason (e.g., Invalid Currency, Missing required fields, Invalid Timestamp)
  • Alerts:
    • Job failure (non-zero exit, or no _SUCCESS written)
    • Quarantine rate spike (e.g., quarantined_rows / input_rows > 5%)
    • Volume anomaly (too few or too many rows vs the baseline)
  • Run completeness:
    • The _SUCCESS marker with metrics JSON is the "commit" signal for consumers.
    • Optional (describe-only): a manifest.json listing files + counts + schema hash.
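A sketch of the run-metrics payload and the quarantine-rate alert. The metric names mirror the list above; the payload shape and the 5% threshold are the example values, not a fixed contract:

```python
import time

def build_success_payload(run_id, input_rows, valid_rows, quarantined_rows,
                          started_at, bytes_read, bytes_written,
                          quarantine_by_reason):
    """Metrics JSON written with the _SUCCESS marker (consumers' commit signal)."""
    return {
        "run_id": run_id,
        "input_rows": input_rows,
        "valid_rows": valid_rows,
        "quarantined_rows": quarantined_rows,
        "duration_seconds": round(time.time() - started_at, 3),
        "bytes_read": bytes_read,
        "bytes_written": bytes_written,
        "quarantine_by_reason": quarantine_by_reason,
    }

def quarantine_rate_alert(metrics, threshold=0.05):
    """True when the quarantine-rate spike alert should fire."""
    if metrics["input_rows"] == 0:
        return True  # zero input is itself a volume anomaly worth alerting on
    return metrics["quarantined_rows"] / metrics["input_rows"] > threshold
```

Writing the payload only after all Parquet files land means consumers that key off _SUCCESS never see a half-written run.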

Data Lake health

  • Storage & access:
    • Bucket versioning and lifecycle policies (Bronze retention vs Silver retention).
    • S3 4xx/5xx error rates, request latency, throttling.
  • Data governance checks (automated):
    • "No public access" enforcement.
    • Tagging compliance (env, owner, cost-center).
    • Glue Catalog drift (expected partitions present, table schema matches expectations).
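The tagging-compliance check above reduces to a simple rule. A sketch, assuming the three tag keys listed; fetching the actual tag sets from the AWS tagging APIs is left out:

```python
# Required tag keys, per the governance checks above (assumed policy).
REQUIRED_TAGS = ("env", "owner", "cost-center")

def tag_violations(resource_tags: dict) -> list:
    """Return the required tag keys that are missing or empty on a resource."""
    return [k for k in REQUIRED_TAGS if not resource_tags.get(k)]

assert tag_violations({"env": "prod", "owner": "data-eng", "cost-center": "fin-001"}) == []
assert tag_violations({"env": "prod"}) == ["owner", "cost-center"]
```

Run over every bucket and table on a schedule, a non-empty result feeds the same alerting path as the ETL checks.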

SQL / Athena usage

  • Cost & performance:
    • Track Athena "bytes scanned" per query (FinOps).
    • Track query runtime and failure rate (operational reliability).
  • Quality signals:
    • Monitor for a sudden increase in NULL month-ends (may indicate missing ingestion for a period).
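The spine-and-left-join pattern behind the balance query (and behind the NULL month-end signal) can be sketched in plain Python; the accounts and balances here are made up:

```python
from itertools import product

def month_end_balances(accounts, months, balances):
    """Build an account×month spine and left-join balances onto it.

    `balances` maps (account_id, month) -> balance. Spine cells with no
    matching balance stay None (NULL in SQL) — exactly the quality signal
    to monitor for missing ingestion.
    """
    return {
        (acct, month): balances.get((acct, month))
        for acct, month in product(accounts, months)
    }

spine = month_end_balances(
    accounts=["A1", "A2"],
    months=["2024-01", "2024-02", "2024-03"],
    balances={("A1", "2024-01"): 100.0, ("A1", "2024-02"): 80.0},
)
# A2 has no rows at all and A1 has no March row: those month-ends stay None.
missing = [k for k, v in spine.items() if v is None]
```

A rising count of None cells relative to the spine size is the alertable signal.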

CI/CD + deployments

  • CI health:
    • Validation pass rate, time-to-green, lint failures.
    • PR lead time / deployment frequency (DORA-style).
  • Release safety:
    • Artifact versioning (git SHA tag) so backfills can run with a known version.
    • Terraform plan/apply drift detection (fail the pipeline on unexpected infra drift).

Documentation & Stakeholder Updates

  • Operational cadence:
    • Scheduled stakeholder updates for key pipelines (monthly/weekly).
    • Post-incident comms template for failures or data corrections.
  • Doc freshness:
    • Treat the technical reference as living documentation; update it on schema changes or new consumers.

This platform is designed as a reusable reference; the same controls and patterns apply to other institutions.

