Executive Summary (Ohpen Data Engineer Case Study)
Summary
This solution implements a finance-grade, history-safe data pipeline for transaction analytics using a cloud-native architecture built on AWS object storage (S3) and serverless compute (AWS Glue, Amazon Athena):
- Task 1: A Python ETL reads CSV from S3, validates the data, writes partitioned Parquet, and quarantines invalid rows with audit metadata (a minimal sketch follows this list). See assumptions and edge cases for design details.
- Task 2: A lake architecture (Raw/Processed/Aggregates + Quarantine) designed for auditability and scalable Athena analytics, with Parquet-only schema evolution (Iceberg is a future enhancement).
- Task 3: A performant SQL query for month-end balance history (Q1 2024) designed for large tables (100M rows) with partition pruning in mind (illustrated after this list). See the SQL breakdown for design rationale.
- Task 4: CI/CD design and artifacts to validate changes, provision infra (Terraform), and deploy safely with backfill support and monitoring.
- Task 5: A stakeholder-facing status email and a one-page technical summary.
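To make the Task 1 flow concrete, here is a minimal sketch. It assumes pandas/pyarrow, hypothetical bucket prefixes, and illustrative column names (TransactionID, AccountID, TransactionTimestamp, Amount); the ETL deliverable defines the real contract.

```python
# Minimal sketch of the Task 1 flow (illustrative prefixes and column names;
# see the ETL deliverable for the real contract).
import uuid
from datetime import datetime, timezone

import pandas as pd

RUN_ID = uuid.uuid4().hex  # run isolation: each execution writes under its own run_id
REQUIRED = ["TransactionID", "AccountID", "TransactionTimestamp", "Amount"]


def split_valid_invalid(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Validate rows: required fields present, timestamp and amount parseable."""
    df = df.copy()
    df["TransactionTimestamp"] = pd.to_datetime(df["TransactionTimestamp"], errors="coerce", utc=True)
    df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")  # negative amounts remain valid
    bad = df[REQUIRED].isna().any(axis=1)
    return df[~bad], df[bad]


def run(raw_uri: str, processed_uri: str, quarantine_uri: str) -> None:
    # e.g. s3://<raw-bucket>/transactions/ingest_date=<date>/batch.csv (via s3fs)
    df = pd.read_csv(raw_uri)
    valid, invalid = split_valid_invalid(df)

    # Processed zone: Parquet partitioned on business/event time.
    valid = valid.assign(event_date=valid["TransactionTimestamp"].dt.date.astype(str))
    valid.to_parquet(f"{processed_uri}/run_id={RUN_ID}/", partition_cols=["event_date"], index=False)

    # Quarantine zone: invalid rows plus audit metadata for review and replay.
    invalid = invalid.assign(run_id=RUN_ID, quarantined_at=datetime.now(timezone.utc).isoformat())
    invalid.to_parquet(f"{quarantine_uri}/run_id={RUN_ID}/", index=False)
```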
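The shape of the Task 3 query can be sketched in the same spirit. Table, database, partition-column, and bucket names below are illustrative assumptions (including a date-typed event_date partition); the real query and its rationale live in the SQL deliverable.

```python
# Illustrative shape of the Task 3 month-end balance query (hypothetical table/database names).
# Pruning comes from filtering the partition column (event_date) before aggregation.
import boto3

MONTH_END_BALANCES = """
WITH monthly AS (
    SELECT AccountID,
           date_trunc('month', TransactionTimestamp) AS month,
           SUM(Amount) AS net_change
    FROM processed.transactions
    WHERE event_date <= DATE '2024-03-31'   -- partition pruning: ignore partitions after Q1 2024
    GROUP BY 1, 2
),
running AS (
    SELECT AccountID, month,
           SUM(net_change) OVER (PARTITION BY AccountID ORDER BY month) AS month_end_balance
    FROM monthly
)
SELECT AccountID, month, month_end_balance
FROM running
WHERE month >= TIMESTAMP '2024-01-01'        -- report only Q1 2024 month-ends
ORDER BY AccountID, month
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=MONTH_END_BALANCES,
    QueryExecutionContext={"Database": "analytics"},        # assumed database name
    ResultConfiguration={"OutputLocation": "s3://<results-bucket>/athena/"},
)
```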
Assumptions
- Batch-first ingestion: Streaming is not required by the case; it can be added upstream if latency requirements change.
- History matters: Raw data is treated as immutable; curated outputs are reproducible via run isolation (run_id) and deterministic transforms.
- Negative amounts are allowed: withdrawals/refunds are valid; fraud/outlier detection is out of scope.
- Time semantics: ingestion time (ingest_date) is used for raw landing; business/event time (TransactionTimestamp) drives processed partitions (sketched below).
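As a concrete illustration of the two time axes (key prefixes and names here are assumptions, not taken verbatim from the deliverable):

```python
# Illustration of the two time axes (hypothetical prefixes and names).
from datetime import datetime, timezone


def raw_key(ingested_at: datetime, filename: str) -> str:
    """Raw zone is immutable and keyed on ingestion time."""
    return f"raw/transactions/ingest_date={ingested_at:%Y-%m-%d}/{filename}"


def processed_key(run_id: str, event_ts: datetime) -> str:
    """Processed zone partitions follow the business/event timestamp."""
    return f"processed/transactions/run_id={run_id}/event_date={event_ts:%Y-%m-%d}/part.parquet"


print(raw_key(datetime.now(timezone.utc), "batch_001.csv"))
print(processed_key("20240401T0200Z", datetime(2024, 3, 31, 23, 59, tzinfo=timezone.utc)))
```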
Trade-offs (intentional)
- Fast validation over full platform realism: SQL correctness is validated with DuckDB tests for speed (a sketch follows this list); production would add engine-specific integration tests.
- Simple ETL engine: Python/pandas is sufficient for the case; production scaling would move execution to AWS Glue Spark jobs while keeping the same contracts.
- Template config: We ship a config.yaml template to demonstrate code/config separation, but keep runtime wiring out of scope for the case.
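A hedged sketch of what a DuckDB-based correctness check looks like (the real tests are in the testing guide; the fixture and expected values below are illustrative assumptions):

```python
# Minimal sketch of a DuckDB correctness check for the balance query (illustrative fixture).
import duckdb
import pandas as pd

fixture = pd.DataFrame({
    "AccountID": ["A1", "A1", "A1"],
    "TransactionTimestamp": pd.to_datetime(["2024-01-10", "2024-02-05", "2024-02-20"]),
    "Amount": [100.0, -30.0, 50.0],
})

con = duckdb.connect()
con.register("transactions", fixture)

result = con.execute("""
    WITH monthly AS (
        SELECT AccountID,
               date_trunc('month', TransactionTimestamp) AS month,
               SUM(Amount) AS net_change
        FROM transactions
        GROUP BY 1, 2
    )
    SELECT AccountID, month,
           SUM(net_change) OVER (PARTITION BY AccountID ORDER BY month) AS month_end_balance
    FROM monthly
    ORDER BY month
""").fetchdf()

# January ends at 100.0; February at 100.0 - 30.0 + 50.0 = 120.0
assert result["month_end_balance"].tolist() == [100.0, 120.0]
```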
Scope boundaries (what we intentionally did not build)
- No real-time trading / millisecond SLAs.
- No full deduplication strategy without a defined business key and idempotency contract.
- No FX conversion or valuation framework (requires rates, cutoffs, rounding policy).
- No full governance stack implementation (Lake Formation policies, lineage tooling), but the design supports it.
Where to find deliverables
- Task 1 ETL: ETL Implementation
- Task 1 assumptions/edge cases: Assumptions and Edge Cases
- Task 2 architecture: Data Lake Architecture
- Task 3 SQL: Balance History Query
- Task 4 CI/CD: CI/CD Workflow (+ GitHub Actions workflow)
- Task 5 docs: Communication Documentation
- Interview handout: Handout
- Testing guide: Testing Documentation