
Module: Technical Details

Purpose

This module provides technical details about the data platform architecture, technology stack, scalability, reliability, and developer impact. It is designed for technical stakeholders (CTO, engineering leads, architects).

Use in

  • CTO communication (technical architecture overview)
  • Engineering team communications (technical specifications)
  • Architecture review meetings
  • Technical documentation

Architecture Overview

High-Level Design

Raw CSV (S3) → Metadata Enrichment → Loop Prevention → ETL (AWS Glue) → Validated Parquet (S3) → Analytics (Athena)

Side outputs: Quarantine (invalid data) and Condemned (max retry attempts reached / duplicates)

Key Architectural Decisions

  1. Bronze/Silver/Gold Medallion Architecture

    • Rationale: Industry-standard pattern, clear separation of concerns
    • Bronze: Immutable raw data (audit trail) ✅ Task 1
    • Silver: Validated, analytics-ready data ✅ Task 1
    • Gold: Business contracts (reporting-ready). Covered in Task 2: Complete Architecture Design (the Gold layer structure, governance, and ownership model are detailed there) and Task 3: SQL Aggregation Pattern
  2. Run Isolation via run_id

    • Rationale: Enables safe backfills, prevents data corruption
    • Pattern: Each ETL run writes to unique path (run_id=YYYYMMDDTHHMMSSZ)
    • Benefit: Can reprocess any historical period without risk
  3. Schema Versioning (schema_v)

    • Rationale: Enables schema evolution without breaking consumers
    • Pattern: All Silver/Gold paths include schema_v=v1/, schema_v=v2/, etc.
    • Benefit: Backward compatibility, safe migrations
  4. Serverless Architecture (AWS Glue + Athena)

    • Rationale: No infrastructure to manage, auto-scales, cost-effective
    • Alternative Considered: EMR clusters (rejected - higher cost, operational overhead)
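Decisions 2 and 3 together amount to a deterministic path convention. A minimal sketch of how such a path might be constructed (the bucket, dataset, and layer names are illustrative assumptions; only the run_id=YYYYMMDDTHHMMSSZ and schema_v=vN conventions come from the decisions above):

```python
from datetime import datetime, timezone
from typing import Optional

def silver_path(bucket: str, dataset: str, schema_version: str = "v1",
                run_time: Optional[datetime] = None) -> str:
    """Build a run-isolated, schema-versioned Silver path.

    Bucket and dataset names are hypothetical; the run_id and schema_v
    path segments follow the conventions described above.
    """
    run_time = run_time or datetime.now(timezone.utc)
    run_id = run_time.strftime("%Y%m%dT%H%M%SZ")  # e.g. 20260131T120000Z
    return f"s3://{bucket}/silver/{dataset}/schema_v={schema_version}/run_id={run_id}/"
```

Because every run writes under its own run_id= prefix, a failed run can simply be rerun with a fresh run_id while readers continue to see the last successful run.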

Technology Stack

  • Storage: AWS S3 - Object storage; cost-effective and scalable
  • ETL Engine: AWS Glue (Spark job, version 4.0) - Serverless, distributed processing, scales automatically. PySpark implementation recommended for production; Pandas (Python Shell) available for development/testing
  • Data Format: Parquet (Snappy) - Columnar, compressed, optimized for analytics
  • Query Engine: Amazon Athena - Serverless SQL, no infrastructure, pay-per-query
  • Orchestration: AWS Step Functions - Event-driven, serverless workflow
  • IaC: Terraform - Version-controlled infrastructure, reproducible

Scalability & Performance

Current Scale

  • Transaction Volume: ~1.5M transactions/month
  • Data Size: ~500MB/month raw CSV → ~50MB/month Parquet (10x compression)
  • Query Pattern: Monthly reporting, ad-hoc analytics

Scalability Design

  • Storage: S3 scales to exabytes (no limit)
  • Compute: AWS Glue auto-scales (no manual capacity planning)
  • Query Performance: < 30 seconds for month-end reports (100M row table) - Note: Estimated based on partition pruning analysis. Actual performance depends on query complexity and data distribution.
  • Scalability Test: ✅ Tested with 100M row simulation, partition pruning reduces scan by 95%+
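As an illustration of how partition pruning enters a query, here is a sketch of a month-end Athena query builder. The table and column names are assumptions for illustration; the document only specifies monthly reporting over partitioned Parquet:

```python
def month_end_query(table: str, year: int, month: int) -> str:
    """Build a month-end aggregation query whose WHERE clause references
    only partition columns, so Athena scans one month's partitions rather
    than the full table (table/column names here are illustrative)."""
    return (
        f"SELECT category, SUM(amount) AS total_amount, COUNT(*) AS txn_count\n"
        f"FROM {table}\n"
        f"WHERE year = {year} AND month = {month}\n"  # partition predicate -> pruning
        f"GROUP BY category"
    )
```

With Hive-style year/month partitions, a predicate like this lets Athena skip every other partition at planning time, which is the mechanism behind the 95%+ scan reduction cited above.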

Reliability & Resilience

Failure Modes & Handling

  1. ETL Job Failure: Rerun with new run_id (safe, no data loss)
  2. Data Quality Issues: Quarantine plus alert; investigate, then backfill. Enhanced loop prevention stops infinite retry cycles: records with attempt_count < 3 may be retried, records with attempt_count >= 3 are condemned, duplicate detection condemns repeats, and a circuit breaker halts the pipeline if the same error occurs more than 100 times in an hour. Condemned data requires human review and approval before reprocessing.
  3. Infrastructure Failure: AWS handles (multi-AZ, 99.99% SLA)
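The retry/condemn rules in item 2 can be sketched as a small decision function. Field names are assumptions; the thresholds (3 attempts, 100 errors/hour) come from the text above:

```python
MAX_ATTEMPTS = 3          # attempt_count >= 3 -> condemned
CIRCUIT_THRESHOLD = 100   # >100 identical errors/hour halts the pipeline

def triage(attempt_count: int, is_duplicate: bool,
           same_error_last_hour: int) -> str:
    """Return the action for a failed record: 'halt', 'condemn', or 'retry'."""
    if same_error_last_hour > CIRCUIT_THRESHOLD:
        return "halt"      # circuit breaker: stop the whole pipeline
    if is_duplicate or attempt_count >= MAX_ATTEMPTS:
        return "condemn"   # human review required before reprocessing
    return "retry"         # attempt_count < 3: safe to retry with a new run_id
```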

Data Loss Prevention

  • Immutable Bronze: Raw data never overwritten
  • Run Isolation: Failed runs don't corrupt previous runs
  • S3 Versioning: Can recover deleted files (if enabled)
  • Audit Trail: Full lineage via run_id + CloudWatch logs
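The "if enabled" caveat on S3 versioning is checkable at deploy time. A sketch built around the shape of boto3's GetBucketVersioning response; the helper takes the response dict so it can be exercised without AWS credentials:

```python
def versioning_enabled(response: dict) -> bool:
    """True if an S3 GetBucketVersioning response reports versioning on.

    With boto3 this would be called as:
        s3 = boto3.client("s3")
        versioning_enabled(s3.get_bucket_versioning(Bucket="my-bucket"))
    Note: the Status key is absent entirely if versioning was never enabled.
    """
    return response.get("Status") == "Enabled"
```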

Reliability Target: 99.5% uptime (batch processing, not real-time)

Developer Impact

Engineering Requirements

  • Build Phase: 2 FTE × 3 months
  • Skills Required: Python, AWS (Glue, S3, Athena), SQL, Terraform
  • Ongoing: 0.5 FTE total (0.2 FTE Data Engineer + 0.1 FTE DevOps + 0.1 FTE Infrastructure + 0.1 FTE Operations)

Developer Experience

  • Infrastructure as Code (Terraform)
  • CI/CD pipeline (GitHub Actions)
  • Local testing support (MinIO)
  • Comprehensive documentation

Last Updated

January 2026

Owner

Data Platform Team


© 2026 Stephen Adei. CC BY 4.0.