
Module: Architecture Summary

Purpose

This module provides a high-level architecture summary for stakeholder communications. It covers the architecture overview, key components, and data flow.

Used in

  • CTO communication (architecture overview)
  • Executive summaries (technical overview)
  • Architecture review meetings
  • Technical documentation

High-Level Architecture

Raw CSV (S3) → Metadata Enrichment → Loop Prevention → ETL (AWS Glue) → Validated Parquet (S3) → Analytics (Athena)

Branches: invalid data → Quarantine; max attempts / exact duplicates → Condemned

Data Flow

  1. Ingestion: Raw CSV files land in S3 (Bronze layer — immutable audit trail). Task 1.
  2. Transformation: ETL pipeline validates data, writes validated Parquet files (Silver layer). Task 1.
  3. Analytics: Business reporting queries run via Athena (Silver-layer queries; the Gold-layer structure is designed in Task 2 and the SQL aggregation is implemented in Task 3).
  4. Quarantine: Invalid data is preserved for audit and review, with loop prevention (max 3 attempts: attempt_count < 3 allows retry; attempt_count >= 3 is condemned), duplicate detection, and a circuit breaker. Task 1.
  5. Condemned: Rows exceeding the maximum attempts (attempt_count >= 3) or flagged as exact duplicates move to the condemned layer: no automatic retries, perpetual retention for financial audit, and human review and approval required before reprocessing or deletion. Task 1.
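The quarantine/condemned routing in steps 4 and 5 can be sketched as a small decision function. This is a minimal illustration, not the production Task 1 job: the field names (attempt_count, row_hash) and the in-memory duplicate set are assumptions.

```python
MAX_ATTEMPTS = 3  # attempt_count >= 3 means the row is condemned


def route_invalid_row(row: dict, seen_hashes: set) -> str:
    """Decide where an invalid row goes: quarantine (retryable) or condemned.

    `row` carries an `attempt_count` and a content `row_hash`; both field
    names are illustrative, not the production schema.
    """
    if row["row_hash"] in seen_hashes:
        return "condemned"  # exact duplicate: never retried automatically
    seen_hashes.add(row["row_hash"])
    if row["attempt_count"] >= MAX_ATTEMPTS:
        return "condemned"  # max attempts exhausted: human review required
    return "quarantine"  # attempt_count < 3: eligible for retry


# A first-time failure is quarantined; a max-attempts row or a duplicate is condemned.
seen: set = set()
print(route_invalid_row({"row_hash": "a1", "attempt_count": 1}, seen))  # quarantine
print(route_invalid_row({"row_hash": "b2", "attempt_count": 3}, seen))  # condemned
print(route_invalid_row({"row_hash": "a1", "attempt_count": 2}, seen))  # condemned (duplicate)
```

The duplicate check runs before the attempt-count check, so an exact duplicate is condemned regardless of how few attempts it has used.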

Key Design Principles

Bronze/Silver/Gold Medallion Architecture

  • Bronze: Immutable raw data (audit trail). Task 1.
  • Silver: Validated, analytics-ready data. Task 1.
  • Gold: Business contracts (reporting-ready). The Gold-layer structure, governance, and ownership model are described in Task 2 (Complete Architecture Design); the SQL aggregation pattern is covered in Task 3.

Run Isolation

  • Each ETL run writes to unique path (run_id)
  • Enables safe backfills, prevents data corruption
  • Full audit trail for reproducibility
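The run-isolation convention above can be illustrated with a small path builder. The bucket name, prefix layout, and function name are assumptions for illustration, not the production layout.

```python
from uuid import uuid4


def silver_run_path(bucket: str, dataset: str, run_id: str) -> str:
    """Build a unique Silver-layer output path for one ETL run.

    Each run writes under its own run_id, so a re-run (backfill) never
    overwrites a previous run's output. Layout is illustrative.
    """
    return f"s3://{bucket}/silver/{dataset}/run_id={run_id}/"


run_id = uuid4().hex  # one fresh id per ETL run
path = silver_run_path("example-data-lake", "transactions", run_id)
# e.g. s3://example-data-lake/silver/transactions/run_id=3f2a.../
```

Because every run gets its own prefix, a failed or repeated run can be discarded or compared against the previous one, which is what makes backfills safe.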

Schema Versioning

  • All Silver/Gold paths include schema_v (v1, v2, etc.)
  • Enables schema evolution without breaking consumers
  • Backward compatibility maintained
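A sketch of how schema_v can appear alongside run_id in Silver/Gold paths; again, the bucket and prefix layout are illustrative assumptions.

```python
def layer_path(bucket: str, layer: str, dataset: str, schema_v: str, run_id: str) -> str:
    """Compose a layer path that embeds both the schema version and the run id.

    Consumers pin a schema_v prefix (e.g. v1); a v2 rollout writes alongside
    it, so existing readers keep working. Layout is illustrative.
    """
    return f"s3://{bucket}/{layer}/{dataset}/schema_v={schema_v}/run_id={run_id}/"


# A v2 write lands beside v1 instead of replacing it:
v1 = layer_path("example-data-lake", "silver", "transactions", "v1", "run-001")
v2 = layer_path("example-data-lake", "silver", "transactions", "v2", "run-001")
```

Keeping v1 data in place while v2 is written is what maintains backward compatibility: consumers migrate to the new prefix on their own schedule.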

Technology Stack

  • Storage: AWS S3 (object storage, scalable, cost-effective)
  • ETL Engine: AWS Glue (Spark job, version 4.0); PySpark implementation recommended for production, Pandas (Python Shell) available for development/testing
  • Data Format: Parquet (columnar, compressed, optimized for analytics)
  • Query Engine: Amazon Athena (serverless SQL, pay-per-query)
  • Infrastructure: Terraform (version-controlled, reproducible)

Scalability

  • Current Scale: ~1.5M transactions/month
  • Designed For: 10x growth (15M transactions/month) without redesign
  • Storage: S3 scales to exabytes (no practical limit)
  • Compute: Auto-scales with data volume (serverless)

Last Updated

January 2026

Owner

Data Platform Team



© 2026 Stephen Adei · CC BY 4.0