Design Decisions Summary
© 2026 Stephen Adei. All rights reserved.
Code standards: Python follows PEP 8, PEP 257, and PEP 484/526; architecture aligns with AWS Well-Architected Framework and arc42. All content on this site (ohpen.stephenadei.nl) is the intellectual property of Stephen Adei. This documentation is protected by copyright law. See License for terms. Unauthorized copying or use is strictly prohibited.
Overview
This document explains why key architectural decisions were made, what alternatives were considered, what risks were identified with mitigations, and how the solution aligns with AWS Well-Architected Framework principles.
For what was built (deliverables), see the Executive Summary. Scope: OLAP analytics only; Ohpen core banking (OLTP) is upstream and out of scope (Scope & Assumptions). Extended context: Architecture Decision Records, Executive Summary. Scripts repository: see Repository & scripts (Gitea).
Key Decisions & Trade-offs
The table below shows major architectural choices, their rationale, and the trade-offs accepted:
| Decision | Why | Trade-off |
|---|---|---|
| Parquet-only (not Iceberg/Delta) | Case requirements met with Parquet; Iceberg adds complexity without immediate benefit for batch OLAP workload | Simpler schema evolution vs advanced features (time travel, ACID, partition evolution) |
| Year/month partitioning | Aligns with month-end reporting queries; enables 95% scan reduction for Q1 queries on 5-year retention scenario | Read optimization (95% less data scanned) vs write amplification (small files if low volume) |
| Serverless (Glue/Athena) over always-on (EMR/Redshift) | Batch workload, not real-time; pay-per-query matches usage pattern; no idle costs | Cost efficiency ($27/month baseline) vs sub-second latency (not required for month-end reports) |
| Quarantine + Condemned layers | FinTech audit requires perpetual retention of invalid data with retry tracking; circuit breaker prevents infinite loops | Operational complexity (2 extra layers) vs compliance/auditability (mandatory for financial data) |
| Pandas + PySpark implementations | Pandas for < 10M records (fast iteration, simpler debugging), PySpark for 10M+ (horizontal scalability, distributed processing) | Development time (2 implementations) vs performance headroom (10x-100x speedup at scale) |
| run_id isolation (not overwrites) | Safe backfills, parallel runs, audit trail; idempotent reruns without data loss; each run writes to unique path | Storage cost (multiple versions per partition) vs operational safety (no partial overwrites, replayable) |
| Explicit lifecycle rules (not Intelligent-Tiering) | Predictable access patterns (recent data hot, old data cold); low object counts (~50 files/month/bucket); 90-day → Glacier transition | Manual tuning (explicit rules) vs set-and-forget (Intelligent-Tiering adds $0.0025/1000 objects, not cost-effective at current scale) |
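Three of the decisions above (year/month partitioning, run_id isolation, versioned schema paths) meet in the S3 key layout. The sketch below illustrates how such a layout could be composed; the exact prefix names (`silver/transactions`, `schema_v=`, `run_id=`) are illustrative assumptions, not the repository's actual paths.

```python
from datetime import date

def silver_key(run_id: str, d: date, schema_version: str = "v1") -> str:
    """Build a run-isolated, partitioned S3 key for a Silver-layer Parquet file.

    Layout assumptions (hypothetical, for illustration):
    - Hive-style year=/month= partitions, aligned with month-end reporting queries
    - a schema_v= path segment so additive schema changes never break old readers
    - a run_id= segment so backfills and reruns write to a unique path,
      making reprocessing idempotent with no partial overwrites
    """
    return (
        f"silver/transactions/"
        f"schema_v={schema_version}/"
        f"year={d.year}/month={d.month:02d}/"
        f"run_id={run_id}/part-0000.parquet"
    )

key = silver_key("r1", date(2024, 1, 31))
# Each rerun gets a fresh run_id, so the previous run's objects stay intact
# for audit; the storage cost of keeping multiple versions is the trade-off
# accepted in the table above.
```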
GenAI is used in quarantine, reports, catalog, and SQL docs; additional opportunities (validation hints, runbooks, NL-to-SQL, cost narratives) are listed in GenAI in the Ohpen Case & Opportunities.
Risks & Mitigations
The table below shows identified risks, their impact, mitigations, and where mitigations are implemented:
| Risk | Impact | Mitigation | Implementation |
|---|---|---|---|
| Data quality degradation | Invalid transactions in Silver layer, downstream analytics polluted | Validation + quarantine layer with error codes; promotion gate checks run summary before Silver publish | validator.py (validation rules), Lambda read_run_summary (promotion gate) |
| Late-arriving data | Month-end reports incomplete if data arrives after processing | Bronze immutable (append-only); backfill playbook; run_id isolation enables safe reprocessing without overwriting | run_id design pattern, backfill procedure in runbook |
| Schema evolution breaks queries | Athena queries fail on new columns or type changes | Additive-only schema changes (new nullable columns); versioned paths (schema_v=v1, schema_v=v2); backward-compatible Parquet | Architecture doc schema evolution strategy, schema_v in S3 paths |
| Cost spikes (runaway queries) | Athena full-table scans from missing WHERE clause | Partition pruning enforced (year/month columns); CloudWatch alarms for scan > 200GB/day; query cost monitoring | Athena alarm (Terraform), partition design, SQL best practices documentation |
| Deployment failures | Infrastructure downtime from bad Terraform apply | Automated rollback on smoke test failures; Terraform state versioning in S3; manual rollback playbook | CD workflow with smoke tests, rollback script, state versioning |
| Quarantine growth | Storage costs escalate if validation errors not resolved | Condemned layer for permanent failures (no retry loop); lifecycle policy to Glacier after 5 years; circuit breaker at 100 errors | Loop prevention logic, lifecycle policy in Terraform, circuit breaker threshold |
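The quarantine-growth mitigation combines two independent guards: a per-record retry cap that routes permanent failures to the condemned layer, and a per-run circuit breaker that halts processing before an error loop can escalate storage and compute costs. A minimal sketch of that routing logic, assuming a `retry_count` field on quarantined records (the field name and the retry cap of 3 are illustrative; the 100-error circuit breaker threshold is from the table above):

```python
MAX_RETRIES = 3        # illustrative per-record retry cap before condemnation
CIRCUIT_BREAKER = 100  # per-run error budget from the mitigation above

def route_invalid_record(record: dict, errors_this_run: int) -> str:
    """Decide where a failed record goes: back to quarantine for retry,
    or to the condemned layer permanently.

    Raises RuntimeError when the run's error budget is exhausted, so a
    systemic upstream problem halts the run instead of looping forever.
    """
    if errors_this_run >= CIRCUIT_BREAKER:
        raise RuntimeError("error budget exhausted; halting run")
    if record.get("retry_count", 0) >= MAX_RETRIES:
        # Permanent failure: parked in condemned, retained for audit,
        # never re-entered into the retry loop.
        return "condemned"
    return "quarantine"
```

The two thresholds address different failure shapes: the retry cap handles individually bad records, while the circuit breaker catches a batch-wide problem (e.g. a broken upstream feed) that would otherwise condemn records one hundred at a time.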
Production Deployment Roadmap
For the complete production readiness roadmap, see Production Readiness Roadmap.
Key next steps: SLAs & monitoring, data contracts, Lake Formation governance, incremental processing, Iceberg evaluation, cost optimization.
AWS Well-Architected Framework Mapping
This solution addresses all six AWS Well-Architected Framework pillars. The table below shows how key implementations map to each pillar:
| Pillar | Key Implementation | Location |
|---|---|---|
| Security | KMS encryption (gold/quarantine buckets), IAM least privilege (prefix-scoped policies), CloudTrail audit trail (management + data events), S3 versioning (accidental deletion protection), OIDC authentication (no static keys in CI/CD) | main.tf lines 131-144 (encryption), 262-358 (IAM), 1031-1064 (CloudTrail); .github/workflows/cd.yml line 36 (OIDC) |
| Cost Optimization | Partition pruning (95% scan reduction for Q1 queries on 5-year retention), lifecycle policies (Glacier after 90 days), serverless architecture (no idle cost), Parquet compression (10x reduction vs CSV), cost anomaly alarms | balance_history_2024_q1.sql (partition pruning), main.tf lines 184-258 (lifecycle policies), CloudWatch alarms for Athena scan cost |
| Performance Efficiency | PySpark optimizations (vectorized operations, broadcast joins, adaptive execution), Parquet columnar format (predicate pushdown), ~128MB file sizing (Spark/Athena sweet spot), year/month partitioning (query-aligned) | ingest_transactions_spark.py lines 101-103 (adaptive execution), s3_operations_spark.py lines 122-128 (file sizing), partition design |
| Reliability | Quarantine + condemned layers (error isolation), SQS DLQ (3 retries then dead-letter queue), run_id isolation (idempotency, safe reruns), Glue auto-retry (exponential backoff), automated rollback (smoke tests + rollback script), S3 versioning | validator.py (quarantine logic), main.tf lines 731-737 (Glue retry), 1088-1107 (SQS DLQ), CD workflow (rollback) |
| Operational Excellence | CI/CD automation with comprehensive validation, Infrastructure-as-code (Terraform), structured logging (run_id correlation across services), CloudWatch metrics/alarms (with RunId and ExecutionArn dimensions), traceability design doc (AWS-native identifiers) | .github/workflows/ci.yml (CI), .github/workflows/cd.yml (CD), main.tf (IaC), TRACEABILITY_DESIGN.md (observability patterns) |
| Sustainability | Lifecycle policies (auto-Glacier transition reduces power consumption), Parquet compression (storage efficiency, less data transferred), partition pruning (compute efficiency, 95% less processing), serverless architecture (no idle resources, scale to zero) | main.tf lifecycle rules (lines 184-258), Parquet compression (all writes), partition pruning (query patterns), serverless design (Glue, Athena, Lambda) |
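The 95% scan-reduction figure cited in the Cost Optimization and Sustainability rows follows directly from the partition arithmetic: with year/month partitioning, 5 years of retention yields 60 monthly partitions, and a Q1 report that prunes on year and month touches only 3 of them. A small worked check of that claim:

```python
def scan_fraction(months_retained: int, months_queried: int) -> float:
    """Fraction of monthly partitions a query must scan when Athena
    prunes on the year/month partition columns."""
    return months_queried / months_retained

# 5-year retention = 60 monthly partitions; a Q1 query reads 3 of them.
fraction = scan_fraction(60, 3)   # 0.05 of the data scanned
reduction = 1 - fraction          # the 95% scan reduction cited above
```

The same arithmetic explains why the savings depend on queries actually filtering on the partition columns; a query missing the `year`/`month` predicates falls back to a full scan, which is what the 200GB/day CloudWatch alarm in the risks table is there to catch.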
Traceability and Auditability: Non-Negotiable
Traceability and auditability have top priority. Changes that would deteriorate them are not adopted (e.g. replacing Step Functions execution identity with Glue-only run_id, or introducing a second state store such as DynamoDB for loop prevention). Alternatives considered—including DynamoDB for quarantine state, Glue-derived run identity, and additive options such as execution start time or Glue lineage—are documented in Traceability Design under Design Boundaries: Traceability and Auditability First. The decision was to change nothing for run identity, enrichment, and loop prevention; the current design remains the single source of truth.
Framework Philosophy
Pillars inform trade-offs, but business requirements drive architecture.
For example, the quarantine layer choice prioritizes Reliability (audit trail, error isolation, retry tracking) and Security (compliance, perpetual retention) over Cost Optimization (storage overhead for invalid data). This is the correct trade-off for FinTech, where auditability and compliance are non-negotiable.
Similarly, choosing serverless over always-on compute prioritizes Cost Optimization (pay-per-query, no idle cost) and Sustainability (scale to zero) over Performance Efficiency (sub-second latency). This aligns with the batch OLAP workload where month-end reports can tolerate 30-second query times.
For implementation details and code, see the Executive Summary deliverables section.
Implementation Details
Each design decision is implemented across multiple documents:
Data Lake Architecture:
- Data Lake Architecture - Complete medallion structure, error handling layers, schema evolution
- Governance Diagrams - Ownership boundaries and approval workflows
- Traceability Design - Run identity propagation and observability patterns
- Tooling & Controls - AWS services analysis and selection rationale
ETL Pipeline:
- ETL Flow - Pandas and PySpark implementations
- PySpark Implementation Summary - Performance optimizations
SQL Analytics:
- SQL Query Breakdown - Partition pruning and query optimization
CI/CD & Security:
- CI/CD Workflow - OIDC authentication, two-role separation, automated rollback
- IAM Security Design - Least privilege policies
Observability:
- Traceability Design - Run identity propagation and CloudWatch patterns
See also
- Executive Summary - Business case and solution overview
- Data Lake Architecture - Implementation of these design decisions
- ETL Flow - How design decisions are implemented in ETL pipeline
- CI/CD Workflow - Security and deployment decisions in practice
- Architecture Decision Records - Individual ADRs documenting specific decisions
Deep link targets
Stable anchors for cross-document links (see table above for content):