
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Data Lake Architecture (Task 2 — High-Level Overview)

This document gives a high-level overview of the data lake design. For folder structure, schema evolution, safe publishing, failure modes, ownership, and runbooks, see the Data Lake Architecture Details reference.


Design at a glance

  • Medallion architecture: Bronze (raw) → Silver (validated) → Gold (business aggregates), with dedicated S3 buckets per layer.
  • Error handling: Quarantine (invalid rows, retry tracking) and Condemned (max attempts exceeded, human approval for reprocessing).
  • Partitioning: Bronze ingest_date; Silver year/month; Gold as_of_month.
  • Safe publishing: run_id isolation, _SUCCESS markers, _LATEST.json and current/ for stable consumption.
  • Governance: Schema versioning (schema_v), additive-only evolution, ownership per layer (Platform, Domain, Business).
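The per-layer partition keys above can be sketched as path builders. This is a minimal illustration, not code from the project: the bucket names (`lake-bronze`, etc.) and the `transactions` dataset are assumptions; only the partition key names (`ingest_date`, `year`/`month`, `as_of_month`) come from the design.

```python
from datetime import date

# Illustrative path builders for the three layers' partition schemes.
# Bucket and dataset names are hypothetical; partition keys follow the design.

def bronze_key(dataset: str, ingest: date) -> str:
    # Bronze: raw files partitioned by ingest_date
    return f"s3://lake-bronze/{dataset}/ingest_date={ingest.isoformat()}/"

def silver_key(dataset: str, year: int, month: int) -> str:
    # Silver: validated Parquet partitioned by year/month
    return f"s3://lake-silver/{dataset}/year={year}/month={month:02d}/"

def gold_key(dataset: str, as_of_month: str) -> str:
    # Gold: business aggregates partitioned by as_of_month (e.g. "2026-01")
    return f"s3://lake-gold/{dataset}/as_of_month={as_of_month}/"

print(bronze_key("transactions", date(2026, 1, 15)))
# s3://lake-bronze/transactions/ingest_date=2026-01-15/
```

Hive-style `key=value` directories keep the layout self-describing, so query engines can prune partitions without a separate catalog lookup.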

High-level flow

| Layer | Purpose | Partition | Format |
|---|---|---|---|
| Bronze | Raw, immutable audit trail | ingest_date | CSV.gz |
| Silver | Validated, quality-assured | year/month | Parquet |
| Gold | Business aggregates, reporting | as_of_month | Parquet |
| Quarantine | Invalid rows, retry metadata | ingest_date | Parquet |
| Condemned | No further retries; human approval | under quarantine | Parquet |
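The Quarantine-to-Condemned handoff can be sketched as a simple routing rule. The threshold value and function name here are assumptions for illustration; the source only specifies that rows exceeding the maximum attempts move to Condemned and need human approval to reprocess.

```python
MAX_ATTEMPTS = 3  # illustrative threshold; the real limit is a policy choice

def route_failed_row(attempts: int) -> str:
    """Decide where a row that failed validation lands (sketch)."""
    if attempts >= MAX_ATTEMPTS:
        # Condemned: no automatic retries; human approval required to reprocess
        return "quarantine/condemned"
    # Quarantine: invalid rows kept with retry metadata, partitioned by ingest_date
    return "quarantine"
```

Keeping Condemned nested under the quarantine prefix (as the table's "under quarantine" partition suggests) means a single prefix scan covers all failed data.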

Key design decisions

  • 1:1 Bronze → Silver: One raw source produces one validated dataset.
  • 1:N Silver → Gold: One Silver dataset feeds multiple Gold aggregations (e.g. account_balances, monthly_reports).
  • Run isolation: Every run uses a unique run_id; no overwrites; promotion via _LATEST.json and current/ after validation.
  • Schema evolution: Additive-only, versioned paths (schema_v=v1, v2); Parquet-only today; Iceberg optional later.
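The run-isolation and promotion steps can be sketched as follows. This is a local-filesystem stand-in for the S3 layout, with hypothetical helper names and file contents; only the `run_id=` directories, `_SUCCESS` marker, and `_LATEST.json` pointer come from the design.

```python
import json
import pathlib
import tempfile

def publish_run(root: pathlib.Path, run_id: str, payload: bytes) -> None:
    # Write into a unique, never-overwritten run_id directory
    run_dir = root / f"run_id={run_id}"
    run_dir.mkdir(parents=True)
    (run_dir / "part-0000.parquet").write_bytes(payload)  # placeholder data file
    (run_dir / "_SUCCESS").touch()  # marks the run complete after validation
    # Promotion is one small-file write, so readers flip between runs atomically
    (root / "_LATEST.json").write_text(json.dumps({"run_id": run_id}))

def latest_run(root: pathlib.Path) -> str:
    # Consumers resolve the stable pointer instead of listing run directories
    return json.loads((root / "_LATEST.json").read_text())["run_id"]

root = pathlib.Path(tempfile.mkdtemp())
publish_run(root, "20260115T0900Z-abc", b"...")
print(latest_run(root))  # 20260115T0900Z-abc
```

Because old run directories are never overwritten, a bad promotion is rolled back by rewriting `_LATEST.json` to point at the previous run.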

Where to go next

For folder structure, schema evolution, safe publishing, failure modes, ownership, and runbooks, see the Data Lake Architecture Details reference.

© 2026 Stephen Adei · CC BY 4.0