
Design Decisions Summary

© 2026 Stephen Adei. All rights reserved.
Code standards: Python follows PEP 8, PEP 257, and PEP 484/526; architecture aligns with AWS Well-Architected Framework and arc42. All content on this site (ohpen.stephenadei.nl) is the intellectual property of Stephen Adei. This documentation is protected by copyright law. See License for terms. Unauthorized copying or use is strictly prohibited.

Overview

This document explains why key architectural decisions were made, what alternatives were considered, what risks were identified with mitigations, and how the solution aligns with AWS Well-Architected Framework principles.

For what was built (deliverables), see the Executive Summary. Scope: OLAP analytics only; Ohpen core banking (OLTP) is upstream and out of scope (Scope & Assumptions). Extended context: Architecture Decision Records, Executive Summary. Scripts repository: see Repository & scripts (Gitea).


Key Decisions & Trade-offs

The table below shows major architectural choices, their rationale, and the trade-offs accepted:

| Decision | Why | Trade-off |
| --- | --- | --- |
| Parquet-only (not Iceberg/Delta) | Case requirements met with Parquet; Iceberg adds complexity without immediate benefit for batch OLAP workload | Simpler schema evolution vs advanced features (time travel, ACID, partition evolution) |
| Year/month partitioning | Aligns with month-end reporting queries; enables 95% scan reduction for Q1 queries on 5-year retention scenario | Read optimization (95% less data scanned) vs write amplification (small files if low volume) |
| Serverless (Glue/Athena) over always-on (EMR/Redshift) | Batch workload, not real-time; pay-per-query matches usage pattern; no idle costs | Cost efficiency ($27/month baseline) vs sub-second latency (not required for month-end reports) |
| Quarantine + Condemned layers | FinTech audit requires perpetual retention of invalid data with retry tracking; circuit breaker prevents infinite loops | Operational complexity (2 extra layers) vs compliance/auditability (mandatory for financial data) |
| Pandas + PySpark implementations | Pandas for < 10M records (fast iteration, simpler debugging), PySpark for 10M+ (horizontal scalability, distributed processing) | Development time (2 implementations) vs performance headroom (10x-100x speedup at scale) |
| run_id isolation (not overwrites) | Safe backfills, parallel runs, audit trail; idempotent reruns without data loss; each run writes to a unique path | Storage cost (multiple versions per partition) vs operational safety (no partial overwrites, replayable) |
| Explicit lifecycle rules (not Intelligent-Tiering) | Predictable access patterns (recent data hot, old data cold); low object counts (~50 files/month/bucket); 90-day → Glacier transition | Manual tuning (explicit rules) vs set-and-forget (Intelligent-Tiering adds $0.0025/1000 objects, not cost-effective at current scale) |
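The year/month partitioning and run_id isolation decisions combine into a deterministic S3 key layout: each run writes under its own `run_id=` prefix inside the partition, so reruns land beside, not on top of, earlier output. A minimal sketch of that layout — the `build_partition_key` helper and the `silver` layer name are illustrative, not code from the repository:

```python
from datetime import date


def build_partition_key(layer: str, run_id: str, txn_date: date,
                        schema_version: str = "v1") -> str:
    """Build an S3 key prefix: each run writes to a unique path under its
    year/month partition, so reruns never overwrite earlier output."""
    return (
        f"{layer}/schema_v={schema_version}"
        f"/year={txn_date.year}/month={txn_date.month:02d}"
        f"/run_id={run_id}/"
    )


# A backfill rerun of the same partition lands beside the original run's
# files instead of overwriting them:
first = build_partition_key("silver", "exec-001", date(2024, 3, 31))
rerun = build_partition_key("silver", "exec-002", date(2024, 3, 31))
assert first != rerun  # same partition, distinct paths
```

The `schema_v=` path segment reflects the versioned-path schema evolution strategy described under Risks & Mitigations.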

GenAI: Generative AI is used in the quarantine, reports, catalog, and SQL docs; additional opportunities (validation hints, runbooks, NL-to-SQL, cost narratives) are listed in GenAI in the Ohpen Case & Opportunities.


Risks & Mitigations

The table below shows identified risks, their impact, mitigations, and where mitigations are implemented:

| Risk | Impact | Mitigation | Implementation |
| --- | --- | --- | --- |
| Data quality degradation | Invalid transactions in Silver layer, downstream analytics polluted | Validation + quarantine layer with error codes; promotion gate checks run summary before Silver publish | validator.py (validation rules), Lambda read_run_summary (promotion gate) |
| Late-arriving data | Month-end reports incomplete if data arrives after processing | Bronze immutable (append-only); backfill playbook; run_id isolation enables safe reprocessing without overwriting | run_id design pattern, backfill procedure in runbook |
| Schema evolution breaks queries | Athena queries fail on new columns or type changes | Additive-only schema changes (new nullable columns); versioned paths (schema_v=v1, schema_v=v2); backward-compatible Parquet | Architecture doc schema evolution strategy, schema_v in S3 paths |
| Cost spikes (runaway queries) | Athena full-table scans from missing WHERE clause | Partition pruning enforced (year/month columns); CloudWatch alarms for scan > 200GB/day; query cost monitoring | Athena alarm (Terraform), partition design, SQL best practices documentation |
| Deployment failures | Infrastructure downtime from bad Terraform apply | Automated rollback on smoke test failures; Terraform state versioning in S3; manual rollback playbook | CD workflow with smoke tests, rollback script, state versioning |
| Quarantine growth | Storage costs escalate if validation errors not resolved | Condemned layer for permanent failures (no retry loop); lifecycle policy to Glacier after 5 years; circuit breaker at 100 errors | Loop prevention logic, lifecycle policy in Terraform, circuit breaker threshold |
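The quarantine-growth mitigation can be sketched as a small routing decision. The circuit-breaker threshold of 100 errors comes from the table above, and the per-record retry budget of 3 mirrors the SQS DLQ setting; the function and parameter names here are illustrative, not the actual loop-prevention code:

```python
CIRCUIT_BREAKER_THRESHOLD = 100  # run-level error count from the table above
RETRY_LIMIT = 3                  # per-record retry budget (matches SQS DLQ)


def route_failed_record(retry_count: int, run_error_count: int) -> str:
    """Decide where a failed record goes: quarantine keeps retryable
    failures with retry tracking; condemned holds permanent failures so
    no infinite retry loop can form."""
    if run_error_count >= CIRCUIT_BREAKER_THRESHOLD:
        return "condemned"   # circuit breaker tripped: stop retrying this run
    if retry_count >= RETRY_LIMIT:
        return "condemned"   # retries exhausted: permanent failure
    return "quarantine"      # retryable: keep for the next attempt


assert route_failed_record(retry_count=1, run_error_count=5) == "quarantine"
assert route_failed_record(retry_count=3, run_error_count=5) == "condemned"
assert route_failed_record(retry_count=0, run_error_count=100) == "condemned"
```

Both outcomes remain on S3 (condemned transitions to Glacier via lifecycle policy), preserving the perpetual audit trail while capping retry compute.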

Production Deployment Roadmap

For the complete production readiness roadmap, see Production Readiness Roadmap.

Key next steps: SLAs & monitoring, data contracts, Lake Formation governance, incremental processing, Iceberg evaluation, cost optimization.


AWS Well-Architected Framework Mapping

This solution addresses all six AWS Well-Architected Framework pillars. The table below shows how key implementations map to each pillar:

| Pillar | Key Implementation | Location |
| --- | --- | --- |
| Security | KMS encryption (gold/quarantine buckets), IAM least privilege (prefix-scoped policies), CloudTrail audit trail (management + data events), S3 versioning (accidental deletion protection), OIDC authentication (no static keys in CI/CD) | main.tf lines 131-144 (encryption), 262-358 (IAM), 1031-1064 (CloudTrail); .github/workflows/cd.yml line 36 (OIDC) |
| Cost Optimization | Partition pruning (95% scan reduction for Q1 queries on 5-year retention), lifecycle policies (Glacier after 90 days), serverless architecture (no idle cost), Parquet compression (10x reduction vs CSV), cost anomaly alarms | balance_history_2024_q1.sql (partition pruning), main.tf lines 184-258 (lifecycle policies), CloudWatch alarms for Athena scan cost |
| Performance Efficiency | PySpark optimizations (vectorized operations, broadcast joins, adaptive execution), Parquet columnar format (predicate pushdown), ~128MB file sizing (Spark/Athena sweet spot), year/month partitioning (query-aligned) | ingest_transactions_spark.py lines 101-103 (adaptive execution), s3_operations_spark.py lines 122-128 (file sizing), partition design |
| Reliability | Quarantine + condemned layers (error isolation), SQS DLQ (3 retries then dead-letter queue), run_id isolation (idempotency, safe reruns), Glue auto-retry (exponential backoff), automated rollback (smoke tests + rollback script), S3 versioning | validator.py (quarantine logic), main.tf lines 731-737 (Glue retry), 1088-1107 (SQS DLQ), CD workflow (rollback) |
| Operational Excellence | CI/CD automation with comprehensive validation, Infrastructure-as-Code (Terraform), structured logging (run_id correlation across services), CloudWatch metrics/alarms (with RunId and ExecutionArn dimensions), traceability design doc (AWS-native identifiers) | .github/workflows/ci.yml (CI), .github/workflows/cd.yml (CD), main.tf (IaC), TRACEABILITY_DESIGN.md (observability patterns) |
| Sustainability | Lifecycle policies (auto-Glacier transition reduces power consumption), Parquet compression (storage efficiency, less data transferred), partition pruning (compute efficiency, 95% less processing), serverless architecture (no idle resources, scale to zero) | main.tf lifecycle rules (lines 184-258), Parquet compression (all writes), partition pruning (query patterns), serverless design (Glue, Athena, Lambda) |
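The ~128MB file-sizing target under Performance Efficiency reduces to a simple coalesce calculation: pick the number of output files so each lands near the Spark/Athena sweet spot. A minimal sketch — the `target_file_count` helper is illustrative, not the actual s3_operations_spark.py code:

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB Spark/Athena sweet spot


def target_file_count(partition_bytes: int) -> int:
    """Number of output files so each lands near the ~128 MB target
    (ceiling division, minimum one file per partition)."""
    return max(1, -(-partition_bytes // TARGET_FILE_BYTES))


# A 1 GiB partition coalesces to 8 files; a tiny partition still gets one
# file rather than many Athena-hostile small files.
assert target_file_count(1024 * 1024 * 1024) == 8
assert target_file_count(10 * 1024 * 1024) == 1
```

In PySpark this count would typically feed a `coalesce(n)` before the Parquet write.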

Traceability and Auditability: Non-Negotiable

Traceability and auditability have top priority. Changes that would deteriorate them are not adopted (e.g. replacing Step Functions execution identity with Glue-only run_id, or introducing a second state store such as DynamoDB for loop prevention). Alternatives considered—including DynamoDB for quarantine state, Glue-derived run identity, and additive options such as execution start time or Glue lineage—are documented in Traceability Design under Design Boundaries: Traceability and Auditability First. The decision was to change nothing for run identity, enrichment, and loop prevention; the current design remains the single source of truth.

Framework Philosophy

Pillars inform trade-offs, but business requirements drive architecture.

For example, the quarantine layer choice prioritizes Reliability (audit trail, error isolation, retry tracking) and Security (compliance, perpetual retention) over Cost Optimization (storage overhead for invalid data). This is the correct trade-off for FinTech, where auditability and compliance are non-negotiable.

Similarly, choosing serverless over always-on compute prioritizes Cost Optimization (pay-per-query, no idle cost) and Sustainability (scale to zero) over Performance Efficiency (sub-second latency). This aligns with the batch OLAP workload where month-end reports can tolerate 30-second query times.
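The serverless and partition-pruning trade-offs can be made concrete with a back-of-envelope cost sketch. Assumptions: Athena's published pay-per-scan price of roughly $5 per TB, and the 200 GB figure (taken from the scan alarm threshold above) as an illustrative unpruned scan size:

```python
ATHENA_USD_PER_TB = 5.0  # assumed Athena list price per TB scanned


def query_cost_usd(scanned_gb: float) -> float:
    """Pay-per-query cost for a given scan size, in USD."""
    return round(scanned_gb / 1024 * ATHENA_USD_PER_TB, 4)


# With year/month pruning, a Q1 query over 5 years of retention reads
# 3 of 60 monthly partitions -- the 95% scan reduction cited above.
full_scan_gb = 200.0                      # illustrative unpruned scan
pruned_gb = full_scan_gb * (3 / 60)       # 10 GB after pruning
assert pruned_gb == 10.0
assert query_cost_usd(pruned_gb) < query_cost_usd(full_scan_gb)
```

With no queries running, the serverless stack incurs no compute cost at all, which is why pay-per-query fits the month-end batch pattern.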


For implementation details and code, see the Executive Summary deliverables section.


Implementation Details

Each design decision is implemented across multiple documents:

Data Lake Architecture:

ETL Pipeline:

SQL Analytics:

CI/CD & Security:

Observability:


See also


Stable anchors for cross-document links (see table above for content):

Parquet only (not Iceberg/Delta)

Year/month partitioning

Pandas + PySpark implementations

Quarantine + Condemned layers

Serverless (Glue/Athena) over always-on (EMR/Redshift)

run_id isolation (Step Functions execution ARN)

Explicit lifecycle rules (not Intelligent-Tiering)

OIDC authentication

Serverless over always-on

© 2026 Stephen Adei · CC BY 4.0