Module: Architecture Summary
Purpose
This module provides a high-level architecture summary for stakeholder communications. It includes architecture overview, key components, and data flow.
Use In
- CTO communication (architecture overview)
- Executive summaries (technical overview)
- Architecture review meetings
- Technical documentation
High-Level Architecture
Raw CSV (S3) → Metadata Enrichment → Loop Prevention → ETL (AWS Glue) → Validated Parquet (S3) → Analytics (Athena)
                                                            ↓
                          Quarantine (Invalid Data) + Condemned (Max Attempts/Duplicates)
Data Flow
- Ingestion: Raw CSV files land in S3 (Bronze layer — immutable audit trail). Task 1.
- Transformation: ETL pipeline validates data, writes validated Parquet files (Silver layer). Task 1.
- Analytics: Business reporting queries run via Athena against the Silver layer; the Gold layer structure is designed in Task 2 and the SQL aggregation is implemented in Task 3.
- Quarantine: Invalid data is preserved for audit and review, with loop prevention: rows with attempt_count < 3 are eligible for retry, rows with attempt_count >= 3 are condemned, and duplicate detection plus a circuit breaker guard against reprocessing loops. Task 1.
- Condemned: Rows exceeding max attempts (attempt_count >= 3) or exact duplicates are moved to the condemned layer: no automatic retries, perpetual retention for financial audit, and human review and approval required before reprocessing or deletion. Task 1.
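The quarantine/condemned routing rules above can be sketched as a small decision function. This is illustrative only; the function name, row fingerprinting, and signatures are assumptions, not the actual Task 1 implementation.

```python
import hashlib
import json

MAX_ATTEMPTS = 3  # rows at or past this count are condemned

def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's contents, used for exact-duplicate detection."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def route_invalid_row(row: dict, attempt_count: int, seen: set) -> str:
    """Return the destination layer for an invalid row."""
    fp = row_fingerprint(row)
    if fp in seen:
        return "condemned"   # exact duplicate: no retries, human review required
    seen.add(fp)
    if attempt_count >= MAX_ATTEMPTS:
        return "condemned"   # max attempts exhausted: human review required
    return "quarantine"      # attempt_count < 3: eligible for retry
```

The `seen` set stands in for whatever duplicate-detection store the pipeline actually uses; the key point is that duplicates and exhausted retries both terminate in the condemned layer rather than looping.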
Key Design Principles
Bronze/Silver/Gold Medallion Architecture
- Bronze: Immutable raw data (audit trail). Task 1.
- Silver: Validated, analytics-ready data. Task 1.
- Gold: Business contracts (reporting-ready). The Gold layer structure, governance, and ownership model are described in Task 2 (Complete Architecture Design); the SQL aggregation pattern is implemented in Task 3.
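As a rough illustration of the Gold-layer aggregation pattern (Task 3), reporting-ready rollups are derived from validated Silver rows. The field names ("month", "amount") and the pure-Python form are assumptions for clarity; the actual pattern runs as SQL in Athena.

```python
from collections import defaultdict

def monthly_totals(silver_rows):
    """Aggregate validated Silver rows into reporting-ready monthly totals.

    Mirrors the GROUP BY pattern the Gold layer expresses in Athena SQL;
    field names are illustrative.
    """
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)
```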
Run Isolation
- Each ETL run writes to a unique path (run_id), enabling safe backfills and preventing data corruption
- Full audit trail for reproducibility
Schema Versioning
- All Silver/Gold paths include schema_v (v1, v2, etc.), enabling schema evolution without breaking consumers
- Backward compatibility maintained
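Taken together, run isolation and schema versioning amount to a path convention like the following sketch. The prefix layout and names are assumed for illustration; the real structure may differ, but the two invariants hold: every run writes under its own run_id, and every path carries an explicit schema_v that consumers can pin.

```python
def silver_path(bucket: str, dataset: str, schema_version: str, run_id: str) -> str:
    """Build a run-isolated, schema-versioned Silver path (illustrative layout)."""
    return f"s3://{bucket}/silver/{dataset}/schema_v={schema_version}/run_id={run_id}/"

# Example: a backfill rerun gets a fresh run_id, so it can never
# overwrite a previous run's output in place.
path = silver_path("data-lake", "transactions", "v2", "run-001")
```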
Technology Stack
- Storage: AWS S3 (object storage, scalable, cost-effective)
- ETL Engine: AWS Glue 4.0 (Spark job) - PySpark implementation recommended for production; Pandas (Python Shell) available for development/testing
- Data Format: Parquet (columnar, compressed, optimized for analytics)
- Query Engine: Amazon Athena (serverless SQL, pay-per-query)
- Infrastructure: Terraform (version-controlled, reproducible)
Scalability
- Current Scale: ~1.5M transactions/month
- Designed For: 10x growth (15M transactions/month) without redesign
- Storage: S3 scales to exabytes (no practical limit)
- Compute: Auto-scales with data volume (serverless)
Last Updated
January 2026
Owner
Data Platform Team
Related Documentation
Communication Modules
- Communication Modules Overview - All available modules
- Technical Details Module - Detailed technical implementation
- Project Overview Module - Project description and context
Task Documentation
- Data Lake Architecture - Complete architecture design
- ETL Pipeline - ETL design and implementation
- SQL Query - SQL analytics query
- CI/CD Workflow - CI/CD design
Technical Documentation
- AWS Services Analysis - Service selection rationale
- PySpark Optimization - Performance considerations