
Design Decisions Summary

© 2026 Stephen Adei. All rights reserved.
Code standards: Python follows PEP 8, PEP 257, and PEP 484/526; architecture aligns with AWS Well-Architected Framework and arc42. All content on this site (ohpen.stephenadei.nl) is the intellectual property of Stephen Adei. This documentation is protected by copyright law. See License for terms. Unauthorized copying or use is strictly prohibited.

Overview

This document explains why key architectural decisions were made, what alternatives were considered, what risks were identified with mitigations, and how the solution aligns with AWS Well-Architected Framework principles.

For what was built (deliverables), see the Executive Summary. Scope: OLAP analytics only; Ohpen core banking (OLTP) is upstream and out of scope (Scope & Assumptions). Extended context: Architecture Decision Records, Executive Summary. Scripts repository: see Repository & scripts (Gitea).


Key Decisions & Trade-offs

The table below shows major architectural choices, their rationale, and the trade-offs accepted:

| Decision | Why | Trade-off |
| --- | --- | --- |
| Parquet-only (not Iceberg/Delta) | Case requirements met with Parquet; Iceberg adds complexity without immediate benefit for batch OLAP workload | Simpler schema evolution vs advanced features (time travel, ACID, partition evolution) |
| Year/month partitioning | Aligns with month-end reporting queries; enables 95% scan reduction for Q1 queries on 5-year retention scenario | Read optimization (95% less data scanned) vs write amplification (small files if low volume) |
| Serverless (Glue/Athena) over always-on (EMR/Redshift) | Batch workload, not real-time; pay-per-query matches usage pattern; no idle costs | Cost efficiency ($27/month baseline) vs sub-second latency (not required for month-end reports) |
| Quarantine + Condemned layers | FinTech audit requires perpetual retention of invalid data with retry tracking; circuit breaker prevents infinite loops | Operational complexity (2 extra layers) vs compliance/auditability (mandatory for financial data) |
| Pandas + PySpark implementations | Pandas for < 10M records (fast iteration, simpler debugging), PySpark for 10M+ (horizontal scalability, distributed processing) | Development time (2 implementations) vs performance headroom (10x-100x speedup at scale) |
| run_id isolation (not overwrites) | Safe backfills, parallel runs, audit trail; idempotent reruns without data loss; each run writes to a unique path | Storage cost (multiple versions per partition) vs operational safety (no partial overwrites, replayable) |
| Explicit lifecycle rules (not Intelligent-Tiering) | Predictable access patterns (recent data hot, old data cold); low object counts (~50 files/month/bucket); 90-day → Glacier transition | Manual tuning (explicit rules) vs set-and-forget (Intelligent-Tiering adds $0.0025/1000 objects, not cost-effective at current scale) |
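The year/month partitioning and run_id isolation decisions combine into a deterministic S3 key layout: each run writes under its own `run_id=` prefix inside the partition, so reruns land beside, not on top of, earlier output. A minimal sketch of that layout — the `build_partition_key` helper and the `silver` layer name are illustrative, not code from the repository:

```python
from datetime import date


def build_partition_key(layer: str, run_id: str, txn_date: date,
                        schema_version: str = "v1") -> str:
    """Build an S3 key prefix: each run writes to a unique path under its
    year/month partition, so reruns never overwrite earlier output."""
    return (
        f"{layer}/schema_v={schema_version}"
        f"/year={txn_date.year}/month={txn_date.month:02d}"
        f"/run_id={run_id}/"
    )


# A backfill rerun of the same partition lands beside the original run's
# files instead of overwriting them:
first = build_partition_key("silver", "exec-001", date(2024, 3, 31))
rerun = build_partition_key("silver", "exec-002", date(2024, 3, 31))
assert first != rerun  # same partition, distinct paths
```

The `schema_v=` path segment reflects the versioned-path schema evolution strategy described under Risks & Mitigations.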

GenAI: Generative AI is used in the quarantine, reports, catalog, and SQL docs; additional opportunities (validation hints, runbooks, NL-to-SQL, cost narratives) are listed in GenAI in the Ohpen Case & Opportunities.


Risks & Mitigations

The table below shows identified risks, their impact, mitigations, and where mitigations are implemented:

| Risk | Impact | Mitigation | Implementation |
| --- | --- | --- | --- |
| Data quality degradation | Invalid transactions in Silver layer, downstream analytics polluted | Validation + quarantine layer with error codes; promotion gate checks run summary before Silver publish | validator.py (validation rules), Lambda read_run_summary (promotion gate) |
| Late-arriving data | Month-end reports incomplete if data arrives after processing | Bronze immutable (append-only); backfill playbook; run_id isolation enables safe reprocessing without overwriting | run_id design pattern, backfill procedure in runbook |
| Schema evolution breaks queries | Athena queries fail on new columns or type changes | Additive-only schema changes (new nullable columns); versioned paths (schema_v=v1, schema_v=v2); backward-compatible Parquet | Architecture doc schema evolution strategy, schema_v in S3 paths |
| Cost spikes (runaway queries) | Athena full-table scans from missing WHERE clause | Partition pruning enforced (year/month columns); CloudWatch alarms for scan > 200GB/day; query cost monitoring | Athena alarm (Terraform), partition design, SQL best practices documentation |
| Deployment failures | Infrastructure downtime from bad Terraform apply | Automated rollback on smoke test failures; Terraform state versioning in S3; manual rollback playbook | CD workflow with smoke tests, rollback script, state versioning |
| Quarantine growth | Storage costs escalate if validation errors not resolved | Condemned layer for permanent failures (no retry loop); lifecycle policy to Glacier after 5 years; circuit breaker at 100 errors | Loop prevention logic, lifecycle policy in Terraform, circuit breaker threshold |
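The quarantine-growth mitigation can be sketched as a small routing decision. The circuit-breaker threshold of 100 errors comes from the table above, and the per-record retry budget of 3 mirrors the SQS DLQ setting; the function and parameter names here are illustrative, not the actual loop-prevention code:

```python
CIRCUIT_BREAKER_THRESHOLD = 100  # run-level error count from the table above
RETRY_LIMIT = 3                  # per-record retry budget (matches SQS DLQ)


def route_failed_record(retry_count: int, run_error_count: int) -> str:
    """Decide where a failed record goes: quarantine keeps retryable
    failures with retry tracking; condemned holds permanent failures so
    no infinite retry loop can form."""
    if run_error_count >= CIRCUIT_BREAKER_THRESHOLD:
        return "condemned"   # circuit breaker tripped: stop retrying this run
    if retry_count >= RETRY_LIMIT:
        return "condemned"   # retries exhausted: permanent failure
    return "quarantine"      # retryable: keep for the next attempt


assert route_failed_record(retry_count=1, run_error_count=5) == "quarantine"
assert route_failed_record(retry_count=3, run_error_count=5) == "condemned"
assert route_failed_record(retry_count=0, run_error_count=100) == "condemned"
```

Both outcomes remain on S3 (condemned transitions to Glacier via lifecycle policy), preserving the perpetual audit trail while capping retry compute.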

Production Deployment Roadmap

For the complete production readiness roadmap, see Production Readiness Roadmap.

Key next steps: SLAs & monitoring, data contracts, Lake Formation governance, incremental processing, Iceberg evaluation, cost optimization.


AWS Well-Architected Framework Mapping

This solution addresses all six AWS Well-Architected Framework pillars. The table below shows how key implementations map to each pillar:

| Pillar | Key Implementation | Location |
| --- | --- | --- |
| Security | KMS encryption (gold/quarantine buckets), IAM least privilege (prefix-scoped policies), CloudTrail audit trail (management + data events), S3 versioning (accidental deletion protection), OIDC authentication (no static keys in CI/CD) | main.tf lines 131-144 (encryption), 262-358 (IAM), 1031-1064 (CloudTrail); .github/workflows/cd.yml line 36 (OIDC) |
| Cost Optimization | Partition pruning (95% scan reduction for Q1 queries on 5-year retention), lifecycle policies (Glacier after 90 days), serverless architecture (no idle cost), Parquet compression (10x reduction vs CSV), cost anomaly alarms | balance_history_2024_q1.sql (partition pruning), main.tf lines 184-258 (lifecycle policies), CloudWatch alarms for Athena scan cost |
| Performance Efficiency | PySpark optimizations (vectorized operations, broadcast joins, adaptive execution), Parquet columnar format (predicate pushdown), ~128MB file sizing (Spark/Athena sweet spot), year/month partitioning (query-aligned) | ingest_transactions_spark.py lines 101-103 (adaptive execution), s3_operations_spark.py lines 122-128 (file sizing), partition design |
| Reliability | Quarantine + condemned layers (error isolation), SQS DLQ (3 retries then dead-letter queue), run_id isolation (idempotency, safe reruns), Glue auto-retry (exponential backoff), automated rollback (smoke tests + rollback script), S3 versioning | validator.py (quarantine logic), main.tf lines 731-737 (Glue retry), 1088-1107 (SQS DLQ), CD workflow (rollback) |
| Operational Excellence | CI/CD automation with comprehensive validation, Infrastructure-as-Code (Terraform), structured logging (run_id correlation across services), CloudWatch metrics/alarms (with RunId and ExecutionArn dimensions), traceability design doc (AWS-native identifiers) | .github/workflows/ci.yml (CI), .github/workflows/cd.yml (CD), main.tf (IaC), TRACEABILITY_DESIGN.md (observability patterns) |
| Sustainability | Lifecycle policies (auto-Glacier transition reduces power consumption), Parquet compression (storage efficiency, less data transferred), partition pruning (compute efficiency, 95% less processing), serverless architecture (no idle resources, scale to zero) | main.tf lifecycle rules (lines 184-258), Parquet compression (all writes), partition pruning (query patterns), serverless design (Glue, Athena, Lambda) |
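The ~128MB file-sizing target under Performance Efficiency reduces to a simple coalesce calculation: pick the number of output files so each lands near the Spark/Athena sweet spot. A minimal sketch — the `target_file_count` helper is illustrative, not the actual s3_operations_spark.py code:

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB Spark/Athena sweet spot


def target_file_count(partition_bytes: int) -> int:
    """Number of output files so each lands near the ~128 MB target
    (ceiling division, minimum one file per partition)."""
    return max(1, -(-partition_bytes // TARGET_FILE_BYTES))


# A 1 GiB partition coalesces to 8 files; a tiny partition still gets one
# file rather than many Athena-hostile small files.
assert target_file_count(1024 * 1024 * 1024) == 8
assert target_file_count(10 * 1024 * 1024) == 1
```

In PySpark this count would typically feed a `coalesce(n)` before the Parquet write.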

Traceability and Auditability: Non-Negotiable

Traceability and auditability have top priority. Changes that would deteriorate them are not adopted (e.g. replacing Step Functions execution identity with Glue-only run_id, or introducing a second state store such as DynamoDB for loop prevention). Alternatives considered—including DynamoDB for quarantine state, Glue-derived run identity, and additive options such as execution start time or Glue lineage—are documented in Traceability Design under Design Boundaries: Traceability and Auditability First. The decision was to change nothing for run identity, enrichment, and loop prevention; the current design remains the single source of truth.

Framework Philosophy

Pillars inform trade-offs, but business requirements drive architecture.

For example, the quarantine layer choice prioritizes Reliability (audit trail, error isolation, retry tracking) and Security (compliance, perpetual retention) over Cost Optimization (storage overhead for invalid data). This is the correct trade-off for FinTech, where auditability and compliance are non-negotiable.

Similarly, choosing serverless over always-on compute prioritizes Cost Optimization (pay-per-query, no idle cost) and Sustainability (scale to zero) over Performance Efficiency (sub-second latency). This aligns with the batch OLAP workload where month-end reports can tolerate 30-second query times.
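The serverless and partition-pruning trade-offs can be made concrete with a back-of-envelope cost sketch. Assumptions: Athena's published pay-per-scan price of roughly $5 per TB, and the 200 GB figure (taken from the scan alarm threshold above) as an illustrative unpruned scan size:

```python
ATHENA_USD_PER_TB = 5.0  # assumed Athena list price per TB scanned


def query_cost_usd(scanned_gb: float) -> float:
    """Pay-per-query cost for a given scan size, in USD."""
    return round(scanned_gb / 1024 * ATHENA_USD_PER_TB, 4)


# With year/month pruning, a Q1 query over 5 years of retention reads
# 3 of 60 monthly partitions -- the 95% scan reduction cited above.
full_scan_gb = 200.0                      # illustrative unpruned scan
pruned_gb = full_scan_gb * (3 / 60)       # 10 GB after pruning
assert pruned_gb == 10.0
assert query_cost_usd(pruned_gb) < query_cost_usd(full_scan_gb)
```

With no queries running, the serverless stack incurs no compute cost at all, which is why pay-per-query fits the month-end batch pattern.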


For implementation details and code, see the Executive Summary deliverables section.


Implementation Details

Each design decision is implemented across multiple documents:

Data Lake Architecture:

ETL Pipeline:

SQL Analytics:

CI/CD & Security:

Observability:


See also


Stable anchors for cross-document links (see table above for content):

Parquet only (not Iceberg/Delta)

Year/month partitioning

Pandas + PySpark implementations

Quarantine + Condemned layers

Serverless (Glue/Athena) over always-on (EMR/Redshift)

run_id isolation (Step Functions execution ARN)

Explicit lifecycle rules (not Intelligent-Tiering)

OIDC authentication

Serverless over always-on

© 2026 Stephen Adei · CC BY 4.0