Financial Data Pipeline & Data Lake Architecture
Technical Implementation Summary
Note: This is an alternative/simplified version. For the complete technical documentation with enhanced quarantine logic (metadata enrichment, loop prevention, circuit breaker, condemned layer), see Technical Reference.
End-to-end financial data pipeline processing 100M+ transactions using a Bronze/Silver/Gold medallion architecture. Transforms raw CSV data into validated, partitioned Parquet with automated validation, enhanced quarantine logic (with loop prevention and circuit breaker), and schema evolution support. Deployed via CI/CD with comprehensive monitoring and governance.
Architecture Context:
- Data Ingestion: Implements Bronze → Silver transformation logic
- Data Lake Architecture: Complete design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned), including folder organization, promotion workflows, and business metric definitions
- Analytical Query Design: Implements Silver → Gold aggregation patterns via SQL queries
Core Principle: An append-only raw layer and deterministic transformations keep raw inputs immutable while allowing curated outputs to be reproduced or recomputed for historical periods when necessary.
System Architecture
Data Lake Structure (Medallion Architecture)
Key Path Patterns
- Bronze: `bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz`
- Silver: `silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet`
- Gold: `gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet`
- Quarantine: `quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet`
- Condemned: `quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet`
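As an illustration of how these key patterns compose, here is a minimal Python sketch; the helper names (`new_run_id`, `bronze_key`, `silver_key`) are hypothetical and not part of the actual ETL codebase.

```python
from datetime import datetime, timezone

def new_run_id(suffix: str = "") -> str:
    """Timestamp-based run_id, e.g. 20260128_143000 or 20260128_143000_BACKFILL."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{suffix}" if suffix else stamp

def bronze_key(domain: str, dataset: str, ingest_date: str, run_id: str, filename: str) -> str:
    """Key for one raw file in the append-only Bronze layer."""
    return f"bronze/{domain}/{dataset}/ingest_date={ingest_date}/run_id={run_id}/{filename}"

def silver_key(domain: str, dataset: str, year: int, month: int, schema_v: str, run_id: str) -> str:
    """Prefix for one run's validated Parquet parts in Silver."""
    return f"silver/{domain}/{dataset}/year={year:04d}/month={month:02d}/schema_v={schema_v}/run_id={run_id}/"

# Example: Silver prefix for January 2026 under schema v2
print(silver_key("mortgages", "transactions", 2026, 1, "v2", new_run_id()))
```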
Data Flow & Validation Pipeline
Validation Rules
- Null Checks: All required fields (`TransactionID`, `CustomerID`, `TransactionAmount`, `Currency`, `TransactionTimestamp`) must be non-null
- Currency Validation: Currency code must be in the ISO-4217 allowlist
- Amount Type Check: `TransactionAmount` must be parseable as numeric (negative amounts allowed)
- Timestamp Parsing: `TransactionTimestamp` must be parseable as ISO-8601
- Duplicate Detection: No duplicate `CustomerID` + `TransactionTimestamp` (date part) combinations (see the sketch after this list)
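A minimal pandas sketch of these checks, assuming a DataFrame with the column names above; the abbreviated allowlist and the `DUPLICATE_ERROR` label are illustrative, while the other error labels follow the categories listed in the runbook.

```python
import pandas as pd

REQUIRED = ["TransactionID", "CustomerID", "TransactionAmount", "Currency", "TransactionTimestamp"]
ISO_4217 = {"EUR", "USD", "GBP"}  # abbreviated allowlist for illustration

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with a validation_error column (None means the row is valid)."""
    df = df.copy()
    df["validation_error"] = None

    # Null checks on all required fields
    nulls = df[REQUIRED].isna().any(axis=1)
    df.loc[nulls, "validation_error"] = "NULL_VALUE_ERROR"

    # Currency must be in the ISO-4217 allowlist
    bad_ccy = ~df["Currency"].isin(ISO_4217)
    df.loc[bad_ccy & df["validation_error"].isna(), "validation_error"] = "CURRENCY_ERROR"

    # Amount must parse as numeric (negative amounts are allowed)
    bad_amount = pd.to_numeric(df["TransactionAmount"], errors="coerce").isna()
    df.loc[bad_amount & df["validation_error"].isna(), "validation_error"] = "TYPE_ERROR"

    # Timestamp must parse as ISO-8601
    ts = pd.to_datetime(df["TransactionTimestamp"], errors="coerce", utc=True)
    df.loc[ts.isna() & df["validation_error"].isna(), "validation_error"] = "TIMESTAMP_ERROR"

    # No duplicate CustomerID + date combinations within the batch
    dupes = df.assign(_d=ts.dt.date).duplicated(["CustomerID", "_d"], keep="first")
    df.loc[dupes & df["validation_error"].isna(), "validation_error"] = "DUPLICATE_ERROR"
    return df
```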
Quarantine Behavior: Invalid rows are never silently dropped; every invalid row is preserved with a `validation_error` column and metadata (`row_hash`, `attempt_count`, `retry_history`, `ingest_date`, `run_id`, `source_file`). Rows exceeding the attempt limit (`attempt_count` >= 3) or exact duplicates are moved to the condemned layer, with no automatic retries. Human review and approval are required before reprocessing or deletion (see Human Validation Policy).
Loop Prevention: `TransactionID` deduplication (optional, via a Silver layer scan), duplicate detection against quarantine history, attempt limits (max 3 attempts: `attempt_count` < 3 allows retry; `attempt_count` >= 3 is condemned), and a circuit breaker (more than 100 identical errors per hour) together prevent infinite retry loops and ensure transaction-level idempotency. Human review and approval are required before reprocessing condemned data.
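The routing decision under these limits might look like the following sketch; the metadata field names (`attempt_count`, `validation_error`) come from the quarantine schema above, while the `route` helper, the in-memory error counter, and the hashing scheme are illustrative simplifications of the real circuit breaker.

```python
import hashlib
from collections import Counter

MAX_ATTEMPTS = 3
CIRCUIT_BREAKER_THRESHOLD = 100  # identical errors per hour

errors_this_hour: Counter = Counter()  # reset hourly in the real pipeline

def row_hash(row: dict) -> str:
    """Stable hash of the row contents, used for exact-duplicate detection."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def route(row: dict, seen_hashes: set) -> str:
    """Decide where a failed row goes: retry via quarantine, condemned, or halt."""
    if row_hash(row) in seen_hashes:
        return "condemned"    # DUPLICATE_FAILURE: exact duplicate already quarantined
    if row.get("attempt_count", 0) >= MAX_ATTEMPTS:
        return "condemned"    # MAX_ATTEMPTS: no automatic retries past 3
    errors_this_hour[row["validation_error"]] += 1
    if errors_this_hour[row["validation_error"]] > CIRCUIT_BREAKER_THRESHOLD:
        return "halt"         # circuit breaker: stop the run for human review
    return "quarantine"       # eligible for retry (attempt_count < 3)
```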
Human Validation: For financial data compliance, human approval is required before promoting Silver layer data to production and before deleting condemned data (see Human Validation Policy).
Schema Evolution Strategy
Storage format: Parquet-only, with additive changes and a schema registry in the Glue Data Catalog. This is sufficient for the current scale and requirements.
Future Enhancement: Apache Iceberg tables via AWS Glue for advanced schema evolution (add/rename/reorder columns without rewriting data files). Iceberg is a future consideration, not currently implemented.
Compatibility Policy
- Backward-compatible reads: Updated consumers can read both old and new data
- Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
- Additive changes only: New columns must be nullable/optional
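A small pandas illustration of why additive, nullable columns keep old data and old readers compatible; the `Channel` column is hypothetical.

```python
import pandas as pd

# Old schema (v1) and new schema (v2) with one additive, nullable column
v1 = pd.DataFrame({"TransactionID": ["t1"], "TransactionAmount": [100.0]})
v2 = pd.DataFrame({"TransactionID": ["t2"], "TransactionAmount": [-25.0], "Channel": ["web"]})

# A backward-compatible reader concatenates both; v1 rows get Channel = NaN
combined = pd.concat([v1, v2], ignore_index=True)
print(combined)
#   TransactionID  TransactionAmount Channel
# 0            t1              100.0     NaN
# 1            t2              -25.0     web
```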
CI/CD Pipeline
Key Artifacts
- ETL Implementation
- Python Dependencies
- Terraform Infrastructure
- CI/CD Workflow
- Configuration Template
Backfill Safety
- Determinism: Rerunning the same input produces the exact same counts
- Partitioning: Timestamps map to the correct `year={YYYY}/month={MM}` folder (see the sketch after this list)
- Quarantine: Invalid rows are never silently dropped
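A sketch of the deterministic partition mapping, assuming pandas with the pyarrow engine; `write_partitioned` and `out_dir` are illustrative names.

```python
import pandas as pd

def write_partitioned(df: pd.DataFrame, out_dir: str) -> None:
    """Map each row's timestamp to its year=YYYY/month=MM partition and write Parquet."""
    ts = pd.to_datetime(df["TransactionTimestamp"], utc=True)
    df = df.assign(year=ts.dt.strftime("%Y"), month=ts.dt.strftime("%m"))
    # pyarrow writes Hive-style folders: out_dir/year=2026/month=01/part-*.parquet
    df.to_parquet(out_dir, partition_cols=["year", "month"], compression="snappy")
```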
Failure Handling
Failed runs do not update `_LATEST.json` or the `current/` prefix. Each rerun uses a new, timestamp-based `run_id`.
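A sketch of the success-only pointer update using boto3; the `commit_run` helper, bucket, and pointer shape are illustrative assumptions. Because the pointer is written only after a run fully succeeds, failed runs can never move `_LATEST.json`.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def commit_run(bucket: str, dataset_prefix: str, run_id: str, metrics: dict) -> None:
    """Repoint _LATEST.json at a finished run. Called only on success."""
    pointer = {
        "run_id": run_id,
        "committed_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
    s3.put_object(
        Bucket=bucket,
        Key=f"{dataset_prefix}/_LATEST.json",
        Body=json.dumps(pointer).encode(),
        ContentType="application/json",
    )
```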
Monitoring & Observability
Structured Logging
All logs are emitted in JSON format for CloudWatch Logs Insights queries, e.g.:

```
fields @timestamp, run_id, metric_value.quarantine_rate
| filter metric_name = "ETLComplete"
```
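On the emitting side, a minimal sketch assuming JSON log lines on stdout (which Glue forwards to CloudWatch Logs); the `log_metric` helper is illustrative, and the field names match the query above.

```python
import json

def log_metric(metric_name: str, run_id: str, **metric_value) -> None:
    """Emit one structured JSON log line that Logs Insights can query by field."""
    print(json.dumps({
        "metric_name": metric_name,
        "run_id": run_id,
        "metric_value": metric_value,
    }))
    # CloudWatch supplies @timestamp from the log event time, so the query above works as-is.

log_metric("ETLComplete", "20260128_143000", quarantine_rate=0.012, rows_out=98_765)
```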
Run Completeness
A `_SUCCESS` marker with a metrics JSON payload is the "commit" signal for consumers; consumers only read runs where the marker is present.
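A sketch of the consumer-side check using boto3 `head_object`; bucket and prefix names are illustrative.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def run_is_committed(bucket: str, run_prefix: str) -> bool:
    """Consumers read a run only if its _SUCCESS marker exists."""
    try:
        s3.head_object(Bucket=bucket, Key=f"{run_prefix}/_SUCCESS")
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise
```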
Performance Optimization
| Aspect | Implementation | Impact |
|---|---|---|
| Format | CSV → Parquet (Snappy compression) | 60-80% size reduction |
| Partitioning | year/month hierarchy | 95% query scan reduction (Athena partition pruning) |
| Compression | Snappy codec | Fast read/write balance |
| Processing | Streaming reads (pandas) | Handles large files efficiently |
| Run Isolation | run_id per execution | Safe backfills, no overwrites |
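A sketch of the streaming-read pattern from the table, assuming pandas chunked CSV reads; the chunk size and paths are illustrative.

```python
import pandas as pd

def csv_to_parquet(src: str, dest_dir: str, chunksize: int = 500_000) -> None:
    """Stream a large gzipped CSV in chunks and write Snappy-compressed Parquet parts."""
    for i, chunk in enumerate(pd.read_csv(src, compression="gzip", chunksize=chunksize)):
        chunk.to_parquet(f"{dest_dir}/part-{i:05d}.parquet", compression="snappy")
```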
Athena Query Optimization
Partitioning by year/month minimizes scan cost (Athena charges per TB scanned).
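For example, a partition-pruned aggregation issued through boto3 (the table, database, and output location are assumptions); the predicates on the `year` and `month` partition columns are what let Athena skip every other partition.

```python
import boto3

athena = boto3.client("athena")

# Predicates on the partition columns prune the scan to a single month of data.
query = """
SELECT Currency, SUM(CAST(TransactionAmount AS decimal(18, 2))) AS total
FROM silver_transactions
WHERE year = '2026' AND month = '01'
GROUP BY Currency
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "finance_lake"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/queries/"},
)
```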
Technical Stack
Ownership & Governance
| Layer | Owner | Steward | Responsibility |
|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy |
| Quarantine | Data Quality Team | Data Quality Lead | Audit trail maintenance, resolution |
Access Control
- Bronze: Platform team only (read/write), Compliance (read-only for audit)
- Silver: Platform team (write), Domain teams (write), Analysts (read)
- Gold: Platform team (write), Business/Analysts (read via Athena)
- Quarantine: Platform team (write), Data Quality team (read/write), Compliance (read-only)
Schema Change Approval
- Silver: Domain Analyst + Platform review
- Gold: Finance Controller + Platform review
- All changes versioned via `schema_v` for backward compatibility
Runbook
How to Rerun/Backfill
- Trigger via AWS Step Functions or EventBridge with the specific `bronze` partition (see the sketch after this list)
- Specify the `ingest_date` partition to reprocess (e.g., `ingest_date=2026-01-21`)
- The new run writes to a unique `run_id` path (e.g., `run_id=20260128_BACKFILL`)
- After validation, update `_LATEST.json` and the `current/` prefix to point to the new run
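A sketch of the trigger step via boto3 Step Functions; the state machine ARN and input shape are assumptions.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Reprocess one bronze ingest_date partition under a fresh, labelled run_id.
sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:etl-pipeline",
    input=json.dumps({
        "ingest_date": "2026-01-21",
        "run_id": "20260128_BACKFILL",
    }),
)
```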
Where Logs Are
- CloudWatch Logs: `/aws-glue/jobs/output` (namespace: `Ohpen/ETL`)
- Structured JSON format for CloudWatch Logs Insights queries
How to Inspect Quarantine
- Query the `quarantine/` S3 path: `s3://quarantine-bucket/quarantine/mortgages/transactions/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet`
- Check the `validation_error` column for error categories (SCHEMA_ERROR, NULL_VALUE_ERROR, TYPE_ERROR, CURRENCY_ERROR, TIMESTAMP_ERROR)
- Check `attempt_count` and `row_hash` for loop-prevention analysis
- Query `quarantine/condemned/` for auto-condemned rows (DUPLICATE_FAILURE, MAX_ATTEMPTS)
- Use Athena or pandas to query the quarantine Parquet files (see the sketch after this list)
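A sketch of the pandas route, assuming `s3fs` is installed so pandas can read S3 URIs directly; the concrete date and `run_id` values are illustrative.

```python
import pandas as pd

# Read one run's quarantined rows straight from S3 (requires s3fs).
df = pd.read_parquet(
    "s3://quarantine-bucket/quarantine/mortgages/transactions/"
    "ingest_date=2026-01-21/run_id=20260121_120000/invalid_rows.parquet"
)

# Error categories and retry pressure at a glance.
print(df["validation_error"].value_counts())
print(df.groupby("validation_error")["attempt_count"].max())
```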
Escalation Path
- Infrastructure Issues (P1): Data Platform Team → On-call Engineer
- Data Quality Issues (P2): Data Quality Team → Domain Teams
- Business Metric Issues (P3): Domain Teams → Business
Links & References
Code & Infrastructure
- ETL Code: ETL Implementation
- IaC (Terraform): Terraform Infrastructure
- CI/CD Workflow: CI/CD Workflow Documentation
Documentation
- Data Lake Architecture: Architecture Design
- Validation Rules: Assumptions and Edge Cases
- Schema Evolution: Architecture Design (Section 4)
- CI/CD Details: CI/CD Workflow Documentation
Assumptions & Known Limitations
Assumptions
- Batch-first ingestion: Daily batches (streaming is optional upstream extension)
- History matters: Raw data is immutable; curated outputs are reproducible via `run_id` isolation
- Negative amounts allowed: Withdrawals/refunds are valid (fraud/outlier detection out of scope)
- Schema changes communicated: Schema evolution requires Domain/Business approval
Deployment Boundaries
- Cross-run `TransactionID` deduplication available (optional, via the `--silver-bucket` and `--silver-prefix` arguments)
- Within-run duplicates flagged (`CustomerID` + date combinations)
- No outlier detection or fraud/risk rules
- No currency conversion (multi-currency reporting requires FX rates)
- No real-time trading / millisecond SLAs
Edge Cases Handled
Missing required fields, invalid currencies, invalid amounts, malformed timestamps, duplicate account/date combinations, empty input files, and batches where every row is invalid.
Author: Platform Data Engineering Team | Date: January 2026 | Version: 1.0
Related Documentation
- Technical Reference - Complete documentation with enhanced features
- Stakeholder Email - Project summary communication
- Data Lake Architecture - Core storage design
- CI/CD Workflow - Automated deployment pipeline
- Executive Overview - High-level project summary