Architecture Boundaries
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document captures the explicit assumptions and edge cases considered for the data lake architecture design.
Architecture Assumptions
Workload Type
- OLAP (Analytical) Workload: Architecture optimized for analytical processing, not transactional (OLTP)
- Batch Processing: ETL runs on scheduled intervals (daily/monthly), not real-time streaming
- Append-Only Pattern: Data is immutable and time-partitioned for analytical queries
- Object Storage Pattern: S3 + Athena for scalable analytical workloads, not relational databases for transactions
Data Volume & Scale
- Current Scale: ~1.5M transactions/month, ~500MB/month raw CSV → ~50MB/month Parquet
- Scalability: Architecture designed to handle 10x growth without redesign
- Storage: S3 scales to exabytes (no practical limit)
- Query Pattern: Monthly reporting, ad-hoc analytics (not real-time operational queries)
Layer Transformation Patterns
- Bronze → Silver: 1:1 relationship
  - Assumption: One raw source produces one validated dataset
  - Rationale: Single source of truth ensures data consistency
- Silver → Gold: 1:N relationship
  - Assumption: One validated dataset can produce multiple business aggregations
  - Rationale: Different business use cases require different aggregations from the same source
  - Examples: `account_balances`, `monthly_reports`, `transaction_summaries`
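The 1:N fan-out from Silver to Gold can be made explicit as a small mapping; a minimal sketch, where the dataset paths are placeholders rather than the project's actual layout:

```python
# Hypothetical 1:N mapping from one Silver dataset to its Gold aggregations.
# Path names are illustrative placeholders.
GOLD_OUTPUTS = {
    "silver/transactions": [
        "gold/account_balances",
        "gold/monthly_reports",
        "gold/transaction_summaries",
    ],
}

def gold_targets(silver_dataset: str) -> list[str]:
    """Return every Gold aggregation derived from a Silver dataset."""
    return GOLD_OUTPUTS.get(silver_dataset, [])
```

Keeping this mapping in one place makes the fan-out auditable: adding a new Gold aggregation is a one-line change rather than a hidden dependency.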
Schema Evolution
- Parquet-Only Approach: Iceberg is a future enhancement, not implemented
- Versioning Strategy: Schema changes versioned via `schema_v` for backward compatibility
- Migration Pattern: New columns added via schema evolution, not breaking changes
- Backward Compatibility: Old schema versions remain queryable for historical data
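An additive-only versioning scheme like this can be sketched as follows; the schema registry, column names, and default values here are illustrative assumptions, not the project's actual schemas:

```python
# Sketch of additive schema evolution: each schema_v only adds columns,
# so a row written under an older version is upgraded by filling defaults
# for the columns it predates. Column names/defaults are hypothetical.
SCHEMAS = {
    1: {"txn_id": None, "amount": None},
    2: {"txn_id": None, "amount": None, "currency": "EUR"},  # v2 adds currency
}

def upgrade_row(row: dict, to_v: int) -> dict:
    """Project a row onto schema version to_v, defaulting missing columns."""
    return {col: row.get(col, default) for col, default in SCHEMAS[to_v].items()}
```

Because columns are only ever added, old partitions remain queryable under the latest schema, which is what makes the backward-compatibility guarantee above hold.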
Governance & Ownership
- Bronze Layer: Platform Team ownership (immutable audit trail)
- Silver Layer: Domain Teams ownership (validated analytics)
- Gold Layer: Business/Finance ownership (business contracts, reporting)
- Approval Workflows: Schema changes require Domain/Business approval + Platform implementation
Operational Assumptions
Data Retention
- Bronze Layer: Immutable, append-only (no deletion policy assumed)
- Silver Layer: Lifecycle policies transition to Glacier after 90 days
- Quarantine Layer: Indefinite retention for financial audit; transitions to Glacier after 5 years to reduce storage cost
- Condemned Data: Perpetual retention (no automatic deletion); transitions to Glacier after 5 years
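The Silver and Quarantine retention rules above map directly onto S3 lifecycle configuration; a sketch of that payload (the shape accepted by boto3's `put_bucket_lifecycle_configuration`), where the bucket prefixes are assumptions rather than the project's actual layout:

```python
# Lifecycle rules for the retention policy above, expressed as an S3
# LifecycleConfiguration payload. Prefixes are hypothetical.
LIFECYCLE = {
    "Rules": [
        {  # Silver: move validated data to Glacier after 90 days
            "ID": "silver-to-glacier",
            "Filter": {"Prefix": "silver/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
        {  # Quarantine: retained for audit, Glacier after ~5 years
            "ID": "quarantine-to-glacier",
            "Filter": {"Prefix": "quarantine/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 5 * 365, "StorageClass": "GLACIER"}],
        },
    ]
}
```

Note that Bronze deliberately has no rule here: it is append-only with no deletion policy assumed.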
Partitioning Strategy
- Bronze: Partitioned by `ingest_date` (YYYY-MM-DD) for audit trail
- Silver: Partitioned by `year/month` (transaction-time partition) for query performance
- Gold: Partitioned by `as_of_month` (YYYY-MM) for reporting periods
- Assumption: Partition pruning enables efficient queries on large datasets (100M+ rows)
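The three partition schemes translate to Hive-style key/value path prefixes, which is what Athena's partition pruning operates on. A minimal sketch, where the layer prefixes are placeholders:

```python
from datetime import date

# Hypothetical partition-path builders for each layer. Only the partition
# keys follow the scheme above; bucket/prefix names are placeholders.
def bronze_path(ingest_date: date) -> str:
    return f"bronze/ingest_date={ingest_date.isoformat()}/"

def silver_path(txn_date: date) -> str:
    return f"silver/year={txn_date.year}/month={txn_date.month:02d}/"

def gold_path(as_of: date) -> str:
    return f"gold/as_of_month={as_of.strftime('%Y-%m')}/"
```

A query filtered on these keys (e.g. `WHERE year = 2026 AND month = 3`) lets the engine skip every other partition's objects entirely, which is the efficiency assumption stated above.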
Data Quality
- Quarantine Rate: Assumed < 5% of input rows (alert threshold)
- Circuit Breaker: Halts the pipeline if the same error occurs more than 100 times per hour (prevents infinite retry cycles)
- Human Approval: Required before promoting Silver to production or reprocessing condemned data
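The circuit-breaker rule (same error >100 times/hour) can be sketched as a per-error-code sliding window; this is an illustrative implementation, not the project's actual code:

```python
from collections import defaultdict, deque

class CircuitBreaker:
    """Trip when the same error code repeats too often within one window."""

    def __init__(self, threshold: int = 100, window_s: int = 3600):
        self.threshold = threshold
        self.window_s = window_s
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, error_code: str, now_s: float) -> bool:
        """Record one error; return True if the pipeline should halt."""
        q = self._events[error_code]
        q.append(now_s)
        while q and now_s - q[0] > self.window_s:
            q.popleft()  # drop errors that fell outside the window
        return len(q) > self.threshold
```

Counting per error code (rather than all errors together) is what distinguishes an infinite retry loop on one bad input from ordinary scattered failures.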
Infrastructure
- AWS Services: S3, Glue, Athena, Step Functions, EventBridge
- Multi-AZ: AWS handles availability (99.99% SLA assumed)
- Cost Optimization: Lifecycle policies, partition pruning, Parquet compression
Edge Cases & Failure Scenarios
Missing Critical Components
- If the `_SUCCESS` Marker is Missing: Consumers cannot identify complete runs, leading to incomplete queries
- If `_LATEST.json` is Missing: No authoritative pointer to the current dataset, making promotion/rollback impossible
- If the Glue Data Catalog is Missing: Athena cannot locate tables, and all SQL queries fail
- If Quarantine Layer is Removed: Invalid data is silently dropped, audit trail lost
Data Loss Scenarios
- If Bronze Layer is Overwritten: Historical audit trail lost; backfills become impossible
- If `run_id` Isolation is Removed: Reruns overwrite previous outputs, causing data loss
- If Schema Versioning is Removed: Schema changes break existing consumers; no backward compatibility
Performance Edge Cases
- Partition Skew: Uneven data distribution across partitions may impact query performance
- Large Partition Sizes: Very large partitions (>1GB) may slow down queries despite pruning
- Cross-Partition Queries: Queries spanning many partitions may have higher costs and latency
Operational Edge Cases
- Concurrent Writes: Multiple ETL runs writing to the same partition (mitigated by `run_id` isolation)
- Schema Drift: Source schema changes without versioning break downstream consumers
- Backfill Conflicts: Backfills for overlapping time periods require careful coordination
System Resilience Principles
- Fail-Safe Defaults: System fails in a safe state (no partial data published)
- Defense in Depth: Multiple layers of validation prevent single points of failure
- Auditability: All operations are logged (CloudTrail, CloudWatch)
- Immutability: Critical layers (Bronze) are append-only
- Idempotency: Reruns are safe; each run writes to unique paths
See also
- Data Lake Architecture - Core storage design and medallion structure
- System Architecture Overview - End-to-end system view
- ETL Flow - Ingestion logic and flow
- SQL Breakdown - SQL analytics logic
- Runtime Scenarios - Operational behavior under various scenarios