
Financial Data Pipeline & Data Lake Architecture

Technical Implementation Summary

Note: This is an alternative/simplified version. For the complete technical documentation with enhanced quarantine logic (metadata enrichment, loop prevention, circuit breaker, condemned layer), see Technical Reference.


End-to-end financial data pipeline processing 100M+ transactions through a Bronze/Silver/Gold medallion architecture. Transforms raw CSV into validated, partitioned Parquet, with enhanced quarantine logic (including loop prevention and a circuit breaker) and schema evolution support. Deployed via CI/CD with comprehensive monitoring and governance.

Architecture Context:

  • Data Ingestion: Implements Bronze → Silver transformation logic
  • Data Lake Architecture: Complete design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned), including folder organization, promotion workflows, and business metric definitions
  • Analytical Query Design: Implements Silver → Gold aggregation patterns via SQL queries

Core Principle: The raw layer is append-only and transformations are deterministic, so curated outputs can be reproduced or recomputed for any historical period while raw inputs remain immutable.


System Architecture


Data Lake Structure (Medallion Architecture)

Key Path Patterns

  • Bronze: bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz
  • Silver: silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet
  • Gold: gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet
  • Quarantine: quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
  • Condemned: quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet
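
The path patterns above compose mechanically from a few components. A minimal sketch in Python; the layout follows the patterns above, while the run_id format is an assumption based on the timestamp-based scheme described under Failure Handling:

from datetime import datetime, timezone

def silver_path(domain: str, dataset: str, year: int, month: int,
                schema_v: int, run_id: str) -> str:
    """Build a Silver-layer partition prefix following the pattern above."""
    return (f"silver/{domain}/{dataset}/year={year:04d}/month={month:02d}/"
            f"schema_v=v{schema_v}/run_id={run_id}/")

def new_run_id() -> str:
    """Timestamp-based run_id, matching the run isolation scheme."""
    return datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")

print(silver_path("mortgages", "transactions", 2026, 1, 2, new_run_id()))
# silver/mortgages/transactions/year=2026/month=01/schema_v=v2/run_id=20260128_093000/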



Data Flow & Validation Pipeline

Validation Rules

  1. Null Checks: All required fields (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp) must be non-null
  2. Currency Validation: Currency code must be in ISO-4217 allowlist
  3. Amount Type Check: TransactionAmount must be parseable as numeric (negative amounts allowed)
  4. Timestamp Parsing: TransactionTimestamp must be parseable as ISO-8601
  5. Duplicate Detection: No duplicate CustomerID + TransactionTimestamp (date part) combinations
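
The rules above translate directly into vectorized checks. A hedged sketch using pandas; column names come from the rules, most error labels follow the categories listed under How to Inspect Quarantine, the DUPLICATE_ERROR label is an assumption, and the ISO-4217 allowlist shown is a tiny illustrative subset:

import pandas as pd

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}  # assumption: the real allowlist is fuller
REQUIRED = ["TransactionID", "CustomerID", "TransactionAmount",
            "Currency", "TransactionTimestamp"]

def validate(df: pd.DataFrame) -> pd.Series:
    """Return a validation_error label per row; NA means the row is valid."""
    errors = pd.Series(pd.NA, index=df.index, dtype="string")
    # 1. Null checks on required fields
    errors[df[REQUIRED].isna().any(axis=1)] = "NULL_VALUE_ERROR"
    # 2. Currency must be in the ISO-4217 allowlist
    errors[errors.isna() & ~df["Currency"].isin(ALLOWED_CURRENCIES)] = "CURRENCY_ERROR"
    # 3. Amount must parse as numeric (negative values are allowed)
    amounts = pd.to_numeric(df["TransactionAmount"], errors="coerce")
    errors[errors.isna() & amounts.isna()] = "TYPE_ERROR"
    # 4. Timestamp must parse as ISO-8601
    ts = pd.to_datetime(df["TransactionTimestamp"], errors="coerce", utc=True)
    errors[errors.isna() & ts.isna()] = "TIMESTAMP_ERROR"
    # 5. No duplicate CustomerID + date combinations within the batch
    dup = df.assign(_date=ts.dt.date).duplicated(["CustomerID", "_date"], keep="first")
    errors[errors.isna() & dup] = "DUPLICATE_ERROR"
    return errors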

Quarantine Behavior: Invalid rows are never silently dropped. Every invalid row is preserved with a validation_error column and metadata (row_hash, attempt_count, retry_history, ingest_date, run_id, source_file). Rows that exceed the attempt limit (attempt_count >= 3) or are exact duplicates are moved to the condemned layer, where no automatic retries occur. Human review and approval are required before reprocessing or deletion (see Human Validation Policy).
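
The metadata enrichment can be sketched as follows; the column names mirror the list above, while the hashing scheme and retry_history encoding are assumptions for illustration:

import hashlib
import json
import pandas as pd

def enrich_for_quarantine(invalid: pd.DataFrame, errors: pd.Series,
                          ingest_date: str, run_id: str,
                          source_file: str) -> pd.DataFrame:
    """Attach the quarantine metadata columns to a frame of invalid rows."""
    out = invalid.copy()
    out["validation_error"] = errors
    # Stable content hash so the same bad row is recognizable across retries
    out["row_hash"] = [
        hashlib.sha256(json.dumps(r, sort_keys=True, default=str).encode()).hexdigest()
        for r in invalid.to_dict("records")
    ]
    out["attempt_count"] = 1               # incremented on each reprocessing attempt
    out["retry_history"] = json.dumps([])  # appended to by the retry workflow
    out["ingest_date"] = ingest_date
    out["run_id"] = run_id
    out["source_file"] = source_file
    return out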

Loop Prevention: Four mechanisms prevent infinite retry loops and ensure transaction-level idempotency: optional TransactionID deduplication via a Silver layer scan, duplicate detection against quarantine history, attempt limits (attempt_count < 3 allows retry; attempt_count >= 3 is condemned), and a circuit breaker that trips when more than 100 identical errors occur within an hour. Human review and approval are required before reprocessing condemned data.
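
The retry gate and circuit breaker reduce to two small predicates. The thresholds (3 attempts, 100 identical errors per hour) come from the text above; the function shapes and the Counter-based bookkeeping are assumptions:

from collections import Counter

MAX_ATTEMPTS = 3
CIRCUIT_BREAKER_THRESHOLD = 100  # identical errors per hour

def retry_allowed(attempt_count: int) -> bool:
    """attempt_count < 3 may retry; attempt_count >= 3 goes to condemned."""
    return attempt_count < MAX_ATTEMPTS

def circuit_tripped(errors_last_hour: Counter) -> bool:
    """Trip when any single error category exceeds the hourly threshold."""
    return any(n > CIRCUIT_BREAKER_THRESHOLD for n in errors_last_hour.values())

# Example: 250 CURRENCY_ERRORs in one hour trips the breaker, halting retries
# until a human investigates the upstream cause.
assert circuit_tripped(Counter({"CURRENCY_ERROR": 250}))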

Human Validation: For financial data compliance, human approval is required before promoting Silver layer data to production and before deleting condemned data (see Human Validation Policy).


Schema Evolution Strategy

Storage format: Parquet-only, with additive changes and a schema registry in the AWS Glue Data Catalog. This is sufficient for current scale and requirements.

Future Enhancement: Apache Iceberg tables via AWS Glue for advanced schema evolution (add/rename/reorder columns without rewriting data files). Iceberg is a future consideration, not currently implemented.

Compatibility Policy

  • Backward-compatible reads: Updated consumers can read both old and new data
  • Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
  • Additive changes only: New columns must be nullable/optional
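
A small runnable illustration of why additive-only changes stay backward-compatible: rows written before a nullable column existed simply surface nulls when read together with newer data. The channel column is hypothetical:

import pandas as pd

# v1 rows (written before the hypothetical 'channel' column existed)
v1 = pd.DataFrame({"TransactionID": ["t1"], "TransactionAmount": [10.0]})
# v2 rows (after an additive change introduced nullable 'channel')
v2 = pd.DataFrame({"TransactionID": ["t2"], "TransactionAmount": [-5.0],
                   "channel": ["web"]})

# Updated consumers reading v1 data see NaN for 'channel' rather than a
# failure; older consumers simply ignore the extra column in v2 files.
combined = pd.concat([v1, v2], ignore_index=True)
print(combined)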

CI/CD Pipeline

Key Artifacts

Backfill Safety

  • Determinism: Rerunning the same input produces exactly the same counts
  • Partitioning: Timestamps map to the correct year=YYYY/month=MM folder
  • Quarantine: Invalid rows are never silently dropped

Failure Handling

Failed runs do not update _LATEST.json or the current/ prefix. Each rerun uses a new, timestamp-based run_id.
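
The "commit on success" step can be sketched like this; _LATEST.json and the run_id scheme come from the text, while the bucket/key layout and function shape are assumptions (put_object is the standard boto3 S3 call):

import json
import boto3

s3 = boto3.client("s3")

def commit_run(bucket: str, dataset_prefix: str, run_id: str, metrics: dict) -> None:
    """Point _LATEST.json at a new run. Called only after validation passes,
    so a failed run never disturbs the last good pointer."""
    body = json.dumps({"run_id": run_id, "metrics": metrics})
    s3.put_object(Bucket=bucket,
                  Key=f"{dataset_prefix}/_LATEST.json",
                  Body=body.encode("utf-8"))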


Monitoring & Observability

Structured Logging

All logs are JSON format for CloudWatch Logs Insights queries:

fields @timestamp, run_id, metric_value.quarantine_rate
| filter metric_name = "ETLComplete"
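
The producer side of those log lines can be as simple as a JSON-per-line emitter. Field names (run_id, metric_name, metric_value) follow the query above; the logger setup itself is an assumption:

import json
import logging

log = logging.getLogger("etl")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def emit_metric(run_id: str, metric_name: str, **values) -> None:
    """Emit one JSON log line that CloudWatch Logs Insights can filter and aggregate."""
    log.info(json.dumps({"run_id": run_id,
                         "metric_name": metric_name,
                         "metric_value": values}))

emit_metric("20260128_093000", "ETLComplete", quarantine_rate=0.012, rows_out=98500)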

Run Completeness

_SUCCESS marker with metrics JSON is the "commit" signal for consumers. Consumers only read runs with _SUCCESS markers present.
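
A consumer-side guard, as a sketch: check for the marker before reading a run. head_object raising ClientError on a missing key is standard boto3 behavior; bucket and prefix names are illustrative:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def run_is_committed(bucket: str, run_prefix: str) -> bool:
    """True only if the run's _SUCCESS marker exists."""
    try:
        s3.head_object(Bucket=bucket, Key=f"{run_prefix}/_SUCCESS")
        return True
    except ClientError:
        return False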


Performance Optimization

  • Format: CSV → Parquet with Snappy compression. Impact: 60-80% size reduction
  • Partitioning: year/month hierarchy. Impact: 95% query scan reduction (Athena partition pruning)
  • Compression: Snappy codec. Impact: fast read/write balance
  • Processing: Streaming reads (pandas). Impact: handles large files efficiently
  • Run Isolation: run_id per execution. Impact: safe backfills, no overwrites

Athena Query Optimization

Partitioning by year/month minimizes scan cost (Athena charges per TB scanned).
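
As a sketch, a partition-pruned Athena query submitted via boto3; because year and month are partition columns, the WHERE clause limits the scan to a single month. The database, table, and output location names are assumptions:

import boto3

athena = boto3.client("athena")

query = """
SELECT Currency, SUM(TransactionAmount) AS total
FROM silver_transactions
WHERE year = 2026 AND month = 1  -- partition pruning: only 2026-01 is scanned
GROUP BY Currency
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "finance_lake"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)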


Technical Stack


Ownership & Governance

  • Bronze (owner: Data Platform Team; steward: Ingestion Lead): immutability, raw data preservation
  • Silver (owner: Domain Teams; steward: Domain Analyst): validation logic, schema evolution
  • Gold (owner: Business/Finance; steward: Finance Controller): business contracts, reporting accuracy
  • Quarantine (owner: Data Quality Team; steward: Data Quality Lead): audit trail maintenance, resolution

Access Control

  • Bronze: Platform team only (read/write), Compliance (read-only for audit)
  • Silver: Platform team (write), Domain teams (write), Analysts (read)
  • Gold: Platform team (write), Business/Analysts (read via Athena)
  • Quarantine: Platform team (write), Data Quality team (read/write), Compliance (read-only)

Schema Change Approval

  • Silver: Domain Analyst + Platform review
  • Gold: Finance Controller + Platform review
  • All changes versioned via schema_v for backward compatibility

Runbook

How to Rerun/Backfill

  1. Trigger via AWS Step Functions or EventBridge with a specific bronze partition
  2. Specify ingest_date partition to reprocess (e.g., ingest_date=2026-01-21)
  3. New run writes to unique run_id path (e.g., run_id=20260128_BACKFILL)
  4. After validation, update _LATEST.json and current/ prefix to point to new run
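
Steps 1-3 can be triggered programmatically, for example (the state machine ARN and input field names are assumptions; start_execution is the real Step Functions API call):

import json
import boto3

sfn = boto3.client("stepfunctions")

sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:etl-backfill",
    input=json.dumps({
        "ingest_date": "2026-01-21",    # bronze partition to reprocess
        "run_id": "20260128_BACKFILL",  # unique run_id path for isolation
    }),
)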

Where Logs Are

  • CloudWatch Logs: /aws-glue/jobs/output (namespace: Ohpen/ETL)
  • Structured JSON format for CloudWatch Logs Insights queries

How to Inspect Quarantine

  • Query quarantine/ S3 path: s3://quarantine-bucket/quarantine/mortgages/transactions/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
  • Check validation_error column for error categories (SCHEMA_ERROR, NULL_VALUE_ERROR, TYPE_ERROR, CURRENCY_ERROR, TIMESTAMP_ERROR)
  • Check attempt_count and row_hash for loop prevention analysis
  • Query quarantine/condemned/ for auto-condemned rows (DUPLICATE_FAILURE, MAX_ATTEMPTS)
  • Use Athena or pandas to query quarantine Parquet files
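
For a quick look with pandas, something like the following works; reading s3:// paths assumes s3fs is installed, and the ingest_date/run_id values are placeholders:

import pandas as pd

df = pd.read_parquet(
    "s3://quarantine-bucket/quarantine/mortgages/transactions/"
    "ingest_date=2026-01-21/run_id=20260128_093000/invalid_rows.parquet"
)

# Error breakdown and retry pressure at a glance
print(df["validation_error"].value_counts())
print(df.groupby("validation_error")["attempt_count"].max())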

Escalation Path

  • Infrastructure Issues (P1): Data Platform Team → On-call Engineer
  • Data Quality Issues (P2): Data Quality Team → Domain Teams
  • Business Metric Issues (P3): Domain Teams → Business

Code & Infrastructure

Documentation


Assumptions & Known Limitations

Assumptions

  • Batch-first ingestion: Daily batches (streaming is optional upstream extension)
  • History matters: Raw data is immutable; curated outputs are reproducible via run_id isolation
  • Negative amounts allowed: Withdrawals/refunds are valid (fraud/outlier detection out of scope)
  • Schema changes communicated: Schema evolution requires Domain/Business approval

Deployment Boundaries

  • Cross-run TransactionID deduplication available (optional, via --silver-bucket and --silver-prefix arguments)
  • Within-run duplicates flagged (CustomerID + date combinations)
  • No outlier detection or fraud/risk rules
  • No currency conversion (multi-currency reporting requires FX rates)
  • No real-time trading / millisecond SLAs

Edge Cases Handled

  • Missing required fields
  • Invalid currencies
  • Invalid amounts
  • Malformed timestamps
  • Duplicate account/date combinations
  • Empty input
  • All rows invalid


Author: Platform Data Engineering Team | Date: January 2026 | Version: 1.0


© 2026 Stephen Adei | CC BY 4.0