Architecture Boundaries

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document captures the explicit assumptions and edge cases considered for the data lake architecture design.


Architecture Assumptions

Workload Type

  • OLAP (Analytical) Workload: Architecture optimized for analytical processing, not transactional (OLTP)
  • Batch Processing: ETL runs on scheduled intervals (daily/monthly), not real-time streaming
  • Append-Only Pattern: Data is immutable and time-partitioned for analytical queries
  • Object Storage Pattern: S3 + Athena for scalable analytical workloads, not relational databases for transactions

Data Volume & Scale

  • Current Scale: ~1.5M transactions/month, ~500MB/month raw CSV → ~50MB/month Parquet
  • Scalability: Architecture designed to handle 10x growth without redesign
  • Storage: S3 scales to exabytes (no practical limit)
  • Query Pattern: Monthly reporting, ad-hoc analytics (not real-time operational queries)

Layer Transformation Patterns

  • Bronze → Silver: 1:1 relationship

    • Assumption: One raw source produces one validated dataset
    • Rationale: Single source of truth ensures data consistency
  • Silver → Gold: 1:N relationship

    • Assumption: One validated dataset can produce multiple business aggregations
    • Rationale: Different business use cases require different aggregations from the same source
    • Examples: account_balances, monthly_reports, transaction_summaries
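The Silver → Gold 1:N fan-out above can be sketched as follows. This is a minimal illustration using plain Python dicts rather than the actual Glue/Athena jobs; the row fields and aggregation logic are assumptions chosen to mirror the example Gold tables named above (account_balances, monthly_reports).

```python
from collections import defaultdict

# Hypothetical silver rows: one validated transaction dataset.
silver = [
    {"account_id": "A1", "month": "2026-01", "amount": 100.0},
    {"account_id": "A1", "month": "2026-01", "amount": -40.0},
    {"account_id": "A2", "month": "2026-01", "amount": 250.0},
]

def account_balances(rows):
    """Gold table 1: net balance per account."""
    out = defaultdict(float)
    for r in rows:
        out[r["account_id"]] += r["amount"]
    return dict(out)

def monthly_reports(rows):
    """Gold table 2: transaction count and gross volume per month."""
    out = defaultdict(lambda: {"count": 0, "volume": 0.0})
    for r in rows:
        m = out[r["month"]]
        m["count"] += 1
        m["volume"] += abs(r["amount"])
    return dict(out)

# One silver input, two gold outputs — the 1:N relationship.
balances = account_balances(silver)
reports = monthly_reports(silver)
```

Each Gold function reads the same validated Silver input, which is what keeps the aggregations consistent with the single source of truth.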

Schema Evolution

  • Parquet-Only Approach: Iceberg is a future enhancement, not implemented
  • Versioning Strategy: Schema changes versioned via schema_v for backward compatibility
  • Migration Pattern: New columns added via schema evolution, not breaking changes
  • Backward Compatibility: Old schema versions remain queryable for historical data
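A sketch of the backward-compatibility contract described above: rows written under an older schema_v are projected onto the latest schema, with new columns filled from defaults. The schema registry, column names, and default values here are illustrative assumptions, not the project's actual schemas.

```python
# Hypothetical schema registry keyed by schema_v.
SCHEMAS = {
    1: ["txn_id", "amount"],
    2: ["txn_id", "amount", "currency"],  # v2 adds a new nullable column
}
DEFAULTS = {"currency": None}

def read_row(raw, schema_v):
    """Project a stored row onto the latest schema, filling columns the
    row's schema version predates with defaults. Old versions stay queryable."""
    declared = SCHEMAS[schema_v]
    assert set(raw) <= set(declared), "row has columns outside its declared schema"
    latest = SCHEMAS[max(SCHEMAS)]
    return {col: raw.get(col, DEFAULTS.get(col)) for col in latest}

old = read_row({"txn_id": "t1", "amount": 9.5}, schema_v=1)
new = read_row({"txn_id": "t2", "amount": 3.0, "currency": "EUR"}, schema_v=2)
```

Only additive, defaultable columns fit this pattern, which is why the migration rule above forbids breaking changes.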

Governance & Ownership

  • Bronze Layer: Platform Team ownership (immutable audit trail)
  • Silver Layer: Domain Teams ownership (validated analytics)
  • Gold Layer: Business/Finance ownership (business contracts, reporting)
  • Approval Workflows: Schema changes require Domain/Business approval + Platform implementation

Operational Assumptions

Data Retention

  • Bronze Layer: Immutable, append-only (no deletion policy assumed)
  • Silver Layer: Lifecycle policies transition to Glacier after 90 days
  • Quarantine Layer: Indefinite retention for financial audit; transitions to Glacier after 5 years to reduce storage cost
  • Condemned Data: Perpetual retention (no automatic deletion); transitions to Glacier after 5 years
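The retention rules above could be expressed as S3 lifecycle rules; a sketch in the dict shape that boto3's put_bucket_lifecycle_configuration accepts is shown below. The rule IDs and prefixes are assumptions, not the project's real bucket layout. Note there are deliberately no Expiration actions, matching the no-automatic-deletion policy.

```python
# Sketch of the retention policy as S3 lifecycle rules (config only; no AWS
# call is made). Prefixes and rule IDs are illustrative assumptions.
lifecycle = {
    "Rules": [
        {
            "ID": "silver-to-glacier-90d",
            "Filter": {"Prefix": "silver/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
        {
            "ID": "quarantine-to-glacier-5y",
            "Filter": {"Prefix": "quarantine/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 5 * 365, "StorageClass": "GLACIER"}],
        },
        # Bronze: no rule at all — immutable, append-only, never transitioned
        # or expired by policy.
    ]
}
```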

Partitioning Strategy

  • Bronze: Partitioned by ingest_date (YYYY-MM-DD) for audit trail
  • Silver: Partitioned by year/month (transaction-time partition) for query performance
  • Gold: Partitioned by as_of_month (YYYY-MM) for reporting periods
  • Assumption: Partition pruning enables efficient queries on large datasets (100M+ rows)
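The partition layout and pruning assumption can be made concrete with a small sketch. The path template mirrors the Silver year/month scheme above (Bronze's ingest_date and Gold's as_of_month follow the same pattern); the table name and prune helper are illustrative, standing in for the pruning Athena performs when a query filters on partition columns.

```python
from datetime import date

def silver_partition(d: date) -> str:
    """Silver partition prefix for a transaction date (year/month layout).
    Table name is a hypothetical example."""
    return f"silver/transactions/year={d.year}/month={d.month:02d}/"

def prune(partitions, year, month):
    """Keep only partitions matching the filter — a stand-in for the
    partition pruning Athena does before scanning any data."""
    key = f"year={year}/month={month:02d}/"
    return [p for p in partitions if p.endswith(key)]

parts = [silver_partition(date(2026, m, 1)) for m in (1, 2, 3)]
hit = prune(parts, 2026, 2)  # only one partition is scanned
```

Because only matching partitions are scanned, a monthly report over 100M+ rows touches a single month's files rather than the whole table.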

Data Quality

  • Quarantine Rate: Assumed < 5% of input rows (alert threshold)
  • Circuit Breaker: Halts the pipeline if the same error occurs more than 100 times per hour (prevents infinite retry cycles)
  • Human Approval: Required before promoting Silver to production or reprocessing condemned data
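The circuit-breaker rule above (halt on more than 100 occurrences of the same error per hour) can be sketched with a sliding window per error key. This is a minimal in-process illustration, not the pipeline's actual implementation; the class and error keys are hypothetical.

```python
import time
from collections import defaultdict, deque

class CircuitBreaker:
    """Trip when the same error fires more than `threshold` times within
    `window` seconds (defaults match the assumption: >100/hour)."""

    def __init__(self, threshold=100, window=3600):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)  # error_key -> timestamps

    def record(self, error_key, now=None):
        """Record one error occurrence; return True if the circuit opens
        (i.e. the pipeline should halt)."""
        now = time.time() if now is None else now
        q = self.events[error_key]
        q.append(now)
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q) > self.threshold

cb = CircuitBreaker()
# 101 occurrences of the same error within one hour trips the breaker.
tripped = any(cb.record("schema_mismatch", now=i) for i in range(101))
```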

Infrastructure

  • AWS Services: S3, Glue, Athena, Step Functions, EventBridge
  • Multi-AZ: AWS handles availability (99.99% SLA assumed)
  • Cost Optimization: Lifecycle policies, partition pruning, Parquet compression

Edge Cases & Failure Scenarios

Missing Critical Components

  • If _SUCCESS Marker is Missing: Consumers cannot distinguish complete runs from in-progress writes, so queries may read partial data
  • If _LATEST.json is Missing: No authoritative pointer to current dataset, making promotion/rollback impossible
  • If Glue Data Catalog is Missing: Athena cannot locate tables, all SQL queries fail
  • If Quarantine Layer is Removed: Invalid data is silently dropped, audit trail lost
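The first two failure modes above imply a consumer-side check: follow _LATEST.json to the current run and refuse any run without a _SUCCESS marker. A sketch follows, using a plain dict as a stand-in for S3; the paths and pointer format are assumptions about how _LATEST.json is laid out.

```python
import json

# Toy object store standing in for S3; keys are illustrative paths.
store = {
    "silver/transactions/run_id=r42/_SUCCESS": b"",
    "silver/transactions/run_id=r42/part-0.parquet": b"...",
    "silver/transactions/_LATEST.json": json.dumps({"run_id": "r42"}).encode(),
}

def resolve_latest(store, table_prefix):
    """Follow _LATEST.json to the current run, refusing runs that lack a
    _SUCCESS marker (i.e. incomplete or in-progress writes)."""
    pointer = store.get(f"{table_prefix}/_LATEST.json")
    if pointer is None:
        raise RuntimeError("no _LATEST.json: promotion/rollback impossible")
    run_id = json.loads(pointer)["run_id"]
    run_path = f"{table_prefix}/run_id={run_id}"
    if f"{run_path}/_SUCCESS" not in store:
        raise RuntimeError(f"run {run_id} has no _SUCCESS marker: incomplete")
    return run_path

path = resolve_latest(store, "silver/transactions")
```

Rollback under this scheme is just rewriting _LATEST.json to point at an earlier run_id, which is why losing the pointer makes promotion/rollback impossible.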

Data Loss Scenarios

  • If Bronze Layer is Overwritten: Historical audit trail lost; backfills become impossible
  • If run_id Isolation is Removed: Reruns overwrite previous outputs, causing data loss
  • If Schema Versioning is Removed: Schema changes break existing consumers; no backward compatibility
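The run_id isolation that prevents the second scenario above can be sketched in a few lines: every run writes under a fresh unique prefix, so a rerun can never overwrite an earlier run's output. The path template is an illustrative assumption.

```python
import uuid

def run_output_prefix(layer, table, run_id=None):
    """Each ETL run writes under its own run_id prefix; reruns get a new
    run_id, so previous outputs are never overwritten."""
    run_id = run_id or uuid.uuid4().hex
    return f"{layer}/{table}/run_id={run_id}/"

first = run_output_prefix("silver", "transactions")
rerun = run_output_prefix("silver", "transactions")  # distinct prefix
```

Combined with the _LATEST.json pointer, this also makes reruns idempotent from the consumer's point of view: old outputs remain intact until a new run is explicitly promoted.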

Performance Edge Cases

  • Partition Skew: Uneven data distribution across partitions may impact query performance
  • Large Partition Sizes: Very large partitions (>1GB) may slow down queries despite pruning
  • Cross-Partition Queries: Queries spanning many partitions may have higher costs and latency

Operational Edge Cases

  • Concurrent Writes: Multiple ETL runs writing to same partition (mitigated by run_id isolation)
  • Schema Drift: Source schema changes without versioning break downstream consumers
  • Backfill Conflicts: Backfills for overlapping time periods require careful coordination

System Resilience Principles

  1. Fail-Safe Defaults: System fails in a safe state (no partial data published)
  2. Defense in Depth: Multiple layers of validation prevent single points of failure
  3. Auditability: All operations are logged (CloudTrail, CloudWatch)
  4. Immutability: Critical layers (Bronze) are append-only
  5. Idempotency: Reruns are safe; each run writes to unique paths

© 2026 Stephen Adei · CC BY 4.0