Skip to main content

Executive Summary (Ohpen Data Engineer Case Study)

Overview

Summary

This solution implements a finance-grade, history-safe data pipeline for transaction analytics using a cloud-native architecture built on AWS object storage (S3) and serverless compute (AWS Glue, Amazon Athena):

  • Task 1: A Python ETL reads CSV from S3, validates the data, writes partitioned Parquet, and quarantines invalid rows with audit metadata. See assumptions and edge cases for design details.
  • Task 2: A lake architecture (Raw/Processed/Aggregates + Quarantine) designed for auditability and scalable Athena analytics, with Parquet-only schema evolution (Iceberg is a future enhancement).
  • Task 3: A performant SQL query for month-end balance history (Q1 2024) designed for large tables (100M rows) with partition pruning in mind. See SQL breakdown for design rationale.
  • Task 4: CI/CD design and artifacts to validate changes, provision infra (Terraform), and deploy safely with backfill support and monitoring.
  • Task 5: A stakeholder-facing status email and a one-page technical summary.

Assumptions

  • Batch-first ingestion: Streaming is not required by the case; it can be added upstream if latency requirements change.
  • History matters: Raw data is treated as immutable; curated outputs are reproducible via run isolation (run_id) and deterministic transforms.
  • Negative amounts are allowed: withdrawals/refunds are valid; fraud/outlier detection is out of scope.
  • Time semantics: ingestion time (ingest_date) is used for raw landing; business/event time (TransactionTimestamp) drives processed partitions.

Trade-offs (intentional)

  • Fast validation over full platform realism: SQL correctness is validated with DuckDB tests for speed; production would add engine-specific integration tests.
  • Simple ETL engine: Python/pandas is sufficient for the case; production scaling would move execution to AWS Glue Spark jobs while keeping the same contracts.
  • Template config: We ship a config.yaml template to demonstrate code/config separation, but keep runtime wiring out of scope for the case.

Scope boundaries (what we intentionally did not build)

  • No real-time trading / millisecond SLAs.
  • No full deduplication strategy without a defined business key and idempotency contract.
  • No FX conversion or valuation framework (requires rates, cutoffs, rounding policy).
  • No full governance stack implementation (Lake Formation policies, lineage tooling), but the design supports it.

Where to find deliverables

© 2026 Stephen AdeiCC BY 4.0