
Data Lake Architecture Details

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document is the detailed architecture reference. For a high-level overview, see Data Lake Architecture (Task 2).

It covers in depth:

  • Folder structure and S3 organization (with example paths)
  • Schema evolution strategies and compatibility
  • Safe publishing pattern, _LATEST.json, and current/ prefix
  • Failure mode analysis and resilience
  • Ownership, governance, and runbooks
  • Performance, cost, security, and production readiness

For governance workflows and approval matrices, see Governance & Workflow Diagrams.


Detailed Structure Diagram with Example Paths

Each medallion layer uses a dedicated bucket (e.g. ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine); names are deployment-specific. Paths below are relative to each bucket (e.g. s3://ohpen-bronze/bronze/..., s3://ohpen-silver/silver/...).

This detailed diagram shows example folder paths for each layer:


Complete Architecture Diagram

This detailed diagram shows the complete architecture design with all components, AWS services, and implementation status.

  • Ingestion (ETL Pipeline): Implements Bronze → Silver transformation only
  • Architecture: Complete data lake architecture design covering all layers
  • Analytics (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries

Athena & Glue Separation

  • Amazon Athena is a query engine that executes SQL queries. It reads Parquet files from S3 via Glue Catalog metadata. SQL queries execute on Athena compute infrastructure, not directly on S3 storage.
  • AWS Glue Data Catalog stores table schemas, partition metadata, and table locations. Glue does not execute queries; it provides metadata that Athena uses to locate and read Parquet files from S3.
  • Separation of Concerns: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines (Athena, Spark, Presto) to use the same Glue Catalog metadata.

Note: the diagram source is also available in diagram.mmd.


Complete Folder Structure Examples

Full S3 Path Structure

Paths below are logical; each layer has its own bucket: Bronze → ohpen-bronze, Silver → ohpen-silver, Gold → ohpen-gold, Quarantine → ohpen-quarantine. Prefix structure (e.g. silver/, gold/) is the same within each bucket.

# Bronze bucket
s3://ohpen-bronze/
└── bronze/                                      # Bronze Layer (Raw, Immutable)
    └── mortgages/
        └── transactions/
            └── ingest_date=2026-01-21/          # Partition by arrival time
                └── run_id=20260121T120000Z/     # Run isolation for idempotency
                    └── file.csv.gz              # Original source file (verbatim copy)

# Silver bucket
s3://ohpen-silver/
└── silver/                                      # Silver Layer (Validated & Enriched)
    └── mortgages/
        └── transactions/
            └── year=2026/                       # Business-time partition (optimized for queries)
                └── month=01/
                    └── schema_v=v1/             # Schema version for evolution
                        └── run_id=20260121T120000Z/  # Write-once run isolation
                            ├── _SUCCESS              # Atomic completion marker
                            └── part-00000.parquet    # Optimized storage format

# Quarantine bucket
s3://ohpen-quarantine/
├── quarantine/                                  # Invalid Data (Audit Trail)
│   └── mortgages/
│       └── transactions/
│           └── ingest_date=2026-01-21/
│               └── run_id=20260121T120000Z/
│                   └── invalid_rows.parquet     # Contains original data + validation_error column
└── condemned/                                   # Condemned Data (No More Retries)
    └── ingest_date=2026-01-21/
        └── run_id=20260121T120000Z/
            └── condemned_rows.parquet           # Max attempts exceeded (attempt_count >= 3) or exact duplicates

# Gold bucket
s3://ohpen-gold/
└── gold/                                        # Gold Layer (Business Views)
    └── finance/                                 # Domain: Finance
        ├── account_balances/                    # Aggregation 1: Month-end balances (1:N pattern)
        │   └── schema_v=v1/
        │       ├── _LATEST.json                 # Authority pointer to current run
        │       ├── run_id=20260121T120000Z/     # Historical run
        │       │   └── as_of_month=2024-03/
        │       │       └── part-00000.parquet
        │       └── current/                     # Stable prefix for SQL access
        │           └── as_of_month=2024-03/
        │               └── part-00000.parquet
        ├── monthly_reports/                     # Aggregation 2: Monthly business reports (1:N pattern)
        │   └── schema_v=v1/
        │       └── current/
        │           └── as_of_month=2024-03/
        │               └── part-00000.parquet
        └── transaction_summaries/               # Aggregation 3: Transaction summaries (1:N pattern)
            └── schema_v=v1/
                └── current/
                    └── as_of_month=2024-03/
                        └── part-00000.parquet

Note: The Gold layer follows the 1:N transformation pattern: one Silver dataset (silver/mortgages/transactions/) produces multiple Gold aggregations (account_balances, monthly_reports, transaction_summaries). Each Gold aggregation serves different business reporting needs while maintaining a single source of truth at Silver.


Safe Publishing Pattern & Authority

Write-Then-Publish Pattern

Each dataset run follows this pattern:

  1. Writes data files to a run_id-scoped prefix.
  2. Writes a manifest.json describing the output (optional but recommended).
  3. Writes a _SUCCESS marker once complete.

Consumers only read runs with _SUCCESS markers present.
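The consumer-side rule can be sketched as a small predicate over an S3 key listing. This is a minimal illustration, not pipeline code; the function names are hypothetical:

```python
def is_run_complete(object_keys, run_prefix):
    """A run is readable only when its _SUCCESS marker exists."""
    marker = run_prefix.rstrip("/") + "/_SUCCESS"
    return marker in set(object_keys)

def complete_runs(object_keys, dataset_prefix):
    """Return every run prefix under a dataset that has a _SUCCESS marker."""
    return sorted(
        key.rsplit("/_SUCCESS", 1)[0]
        for key in object_keys
        if key.startswith(dataset_prefix) and key.endswith("/_SUCCESS")
    )
```

In practice the key listing would come from `list_objects_v2`; runs without the marker are simply invisible to consumers.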

_LATEST.json (Control Plane)

A small control-plane object pointing to the authoritative run.

Location: gold/{domain}/{dataset}/schema_v={vN}/_LATEST.json

Content:

{
  "run_id": "20260121T120000Z",
  "schema_version": "v1",
  "published_at": "2026-01-21T12:00:00Z",
  "partition_path": "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z"
}

Used for:

  • Promotion: Update _LATEST.json to point to a new run after validation.
  • Rollback: Point _LATEST.json back to a previous run if issues are discovered.
  • Reproducibility: Track which run is currently authoritative.

current/ Prefix (Stable SQL Access)

A stable location for Amazon Athena queries via AWS Glue Data Catalog tables.

Pattern: gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet

Mechanism:

  1. New run writes to run_id={...}/ path.
  2. After validation, copy/symlink to current/ prefix.
  3. AWS Glue Data Catalog tables point to current/ for stable SQL semantics in Amazon Athena.

Benefits:

  • Stable SQL semantics for analysts (no need to update table locations).
  • Full lineage and rollback via run history (all run_id paths preserved).
  • Separation of concerns: current/ for queries, run_id/ for audit.

Example Flow:

  1. Write: gold/.../schema_v=v1/run_id=20260121T120000Z/as_of_month=2024-03/part-00000.parquet
  2. Validate: Check _SUCCESS, row counts, schema compliance
  3. Publish: Copy to gold/.../schema_v=v1/current/as_of_month=2024-03/part-00000.parquet
  4. Update: Write _LATEST.json pointing to run_id=20260121T120000Z

Known limitations

  • Copy-to-current is not atomic: The promote_silver Lambda copies Parquet files one-by-one to current/, then writes _LATEST.json. If the Lambda fails mid-copy, current/ may contain partial data. Only the _LATEST.json PUT is atomic.
  • Stale files can remain: When a new run has fewer files than the previous run, old Parquet files in current/ are not deleted and may remain (e.g. part-00003.parquet from a prior 5-file run when the new run has 3 files).
  • Stronger alternative: For single-object atomic commits, consider a pointer publish pattern: write data to run_id=.../ only, validate, then write a single _LATEST.json pointing to that path. Consumers resolve the pointer; no current/ copy. See docs/internal/verification/ETL_CAVEATS_VERIFICATION.md in the repository for upgrade options.

Schema Evolution Strategy

As business requirements change (e.g., adding TransactionType), the architecture must adapt without downtime.

Schema is a contract (not just columns)

In financial domains (mortgages today, investments long-term), schema changes are integration changes:

  • Producers and consumers evolve independently (different deploy cycles).
  • Data is often retained for years (audit + replay), so consumers must handle old shapes indefinitely.
  • The biggest risk is not "parsing fails", but semantic drift (a field keeps the same name but changes meaning).

Schema evolution is managed on two layers:

  1. Lake/analytics schema evolution (tables in Glue/Athena, currently Parquet-only; Iceberg is a future enhancement).
  2. Integration/event contract evolution (APIs/events between services). The same compatibility discipline applies even if payloads are JSON/Avro/Protobuf.

On both layers, the compatibility goals are:

  • Backward-compatible reads: updated consumers can read both old and new data.
  • Forward-compatible writes where possible: updated producers may emit additional fields without breaking older consumers.
  • No "dangerous defaults": avoid defaults that change interpretation (finance examples: default interest rate, default risk score, default LTV). Prefer NULL/missing and explicit "unknown".

Allowed changes (safe / zero-downtime)

  • Add a new nullable/optional field (e.g., TransactionType), and treat missing as "unknown".
  • Deprecate before remove: keep field, stop populating it, document replacement, remove only after all consumers are migrated.
  • Add new derived views instead of changing semantics: if a definition changes (e.g., "balance" definition), publish a new view/table/metric name.
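Treating a missing optional field as explicit "unknown" can be sketched as a reader-side helper (hypothetical code, assuming TransactionType was added in a later schema version):

```python
# Optional fields added after v1, mapped to their explicit "unknown" value.
# Never a dangerous default that changes interpretation.
OPTIONAL_FIELDS = {"TransactionType": "unknown"}

def read_compatible(record):
    """Return a record that satisfies the current schema even for old data.

    Old records lack fields added later; they are filled with an explicit
    'unknown' marker so consumers can read old and new shapes uniformly.
    """
    out = dict(record)
    for field, unknown in OPTIONAL_FIELDS.items():
        out.setdefault(field, unknown)
    return out
```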

Breaking changes (playbooks)

Some changes are fundamentally breaking (semantics or security), and require explicit playbooks:

  1. Rolling migration via flags (dual-read / dual-logic)

    • Add a new field + a version/flag (e.g., encryption_version, calculation_version).
    • Consumers implement both paths; producers start emitting the new version gradually.
    • Works when keeping old data is acceptable and you can support both versions for a period.

  2. Versioned streams/tables (v2) + migration

    • Create ..._v2 topic/table/view.
    • Backfill/migrate old records into v2 (e.g., re-encrypt, re-calculate) when old data must not remain.
    • Switch consumers to v2, then producers, then retire v1.
    • This is the safer option for security-driven changes (e.g., compromised encryption) or meaning changes where "dual interpretation" is risky.

Replayability constraints (long retention)

If events/records can be replayed years later:

  • New fields must remain optional; consumers must tolerate missing fields.
  • Renames are treated as breaking unless you keep the old field as an alias during a transition.
  • Any change that requires "all historical rows updated" implies either:
    • a Parquet backfill plan (reprocess raw data, write a new run_id), or
    • a versioned v2 dataset.
  • Note: Iceberg snapshots could provide an alternative approach in the future, but are not currently implemented.

Storage format: Parquet-only

The ETL pipeline uses Parquet-only format. This is sufficient for current scale and requirements.

Parquet Schema Evolution Strategy:

  1. Additive Changes Only: New columns are added to the end of the schema.
  2. Schema Registry: The Glue Data Catalog is the source of truth.
  3. Backfill on Demand: If old data needs the new column populated, AWS Step Functions triggers an AWS Glue reprocessing job (reading raw, writing to a new run_id in processed) and updates the AWS Glue Data Catalog pointer.

Future Enhancement: Iceberg Tables (Optional)

Note: Iceberg is a future consideration, not currently implemented.

For larger scale (10M+ transactions/month), Apache Iceberg tables via AWS Glue could provide:

  • Advanced schema evolution (add/rename/reorder columns without rewriting files)
  • Hidden partitioning (automatic partition management)
  • Time travel queries (query historical snapshots)

Current Approach: Parquet-only with schema versioning (schema_v=v1, schema_v=v2) provides sufficient schema evolution for current needs.

Type Change Example

Changing a column type (e.g., amount from string to decimal(18,2)) requires a new schema version and backfill:

  1. Create schema_v=v2: Update Glue Catalog table definition with new type in schema_v=v2 path.
  2. Backfill Historical Data: Reprocess all historical partitions to convert amount from string to decimal, writing to schema_v=v2/run_id=BACKFILL_YYYYMMDD/.
  3. Update Consumers: Migrate all consumers to query schema_v=v2 tables.
  4. Deprecate v1: After migration period, mark schema_v=v1 as deprecated (but preserve for audit).
  5. Why Breaking: Type changes are breaking because existing queries expecting string will fail on decimal, and data semantics change (e.g., string "100.50" vs decimal 100.50).
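The per-row conversion in step 2 can be sketched as follows. This is illustrative only; a real backfill would run in PySpark over whole partitions:

```python
from decimal import Decimal, ROUND_HALF_UP

def backfill_amount(row):
    """Convert a v1 string amount to the v2 decimal(18,2) representation."""
    out = dict(row)
    out["amount"] = Decimal(row["amount"]).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return out
```

Note that the rounding mode is part of the v2 contract: choosing ROUND_HALF_UP (vs banker's rounding) changes results for exact .xx5 values and must be documented with the schema version.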

Domain notes: mortgages and investments

  • Prefer immutable facts + effective dating: rather than overwriting a contract attribute, append a new record/event with effective_from (and optionally effective_to).
  • Treat amendments as new facts: interest rate resets, refinancing, contract changes, corporate actions should be modeled as new events/types/tables rather than redefining old fields.
  • Money & precision are part of the contract: keep currency explicit; avoid changing precision/rounding rules without a versioned definition (and often a v2 dataset).

Handling schema drift (runtime != documented)

  • Fail fast or quarantine non-conforming records (never silently coerce).
  • Emit drift metrics (unknown fields, missing required fields, new enum values) and trigger a review.
  • For lake ingestion, quarantine is the safety valve; for integrations/events, schema validation at the boundary is the goal.
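The fail-fast/quarantine routing can be sketched as a pure routing function (hypothetical helper; the real validation rules live in the ETL job):

```python
def route_row(row, required_fields, known_fields):
    """Route a row to silver or quarantine; never silently coerce.

    Non-conforming rows keep their original data and gain a
    validation_error column, matching the quarantine layout.
    """
    errors = []
    missing = [f for f in required_fields if f not in row or row[f] in (None, "")]
    if missing:
        errors.append("missing_required:" + ",".join(missing))
    unknown = [f for f in row if f not in known_fields]
    if unknown:
        errors.append("unknown_fields:" + ",".join(sorted(unknown)))
    if errors:
        return "quarantine", {**row, "validation_error": ";".join(errors)}
    return "silver", dict(row)
```

The error strings here would also feed the drift metrics (unknown fields, missing required fields) mentioned above.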

Backfills & Reprocessing

  • Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03).
  • New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL).
  • Promotion occurs by updating _LATEST.json and current/ after validation.
  • No historical data is overwritten; all run_id paths are preserved for audit.

Example Backfill Flow:

  1. Identify issue: Data quality problem in March 2024
  2. Reprocess: Read bronze/.../ingest_date=2024-03-*/..., write to silver/.../schema_v=v1/run_id=20260128_BACKFILL/year=2024/month=03/...
  3. Validate: Check _SUCCESS, row counts, quality metrics
  4. Promote: Update _LATEST.json, copy to current/
  5. Notify: Alert consumers of updated dataset
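The write-once output prefix in step 2 can be derived mechanically (an illustrative helper, following the path layout shown in the flow above):

```python
def backfill_output_prefix(dataset_prefix, schema_v, run_id, year, month):
    """Build the run_id-scoped output prefix for a partition-scoped backfill.

    Only the affected business-time partition is recomputed; writing under
    a fresh run_id guarantees no historical output is overwritten.
    """
    return (
        f"{dataset_prefix.rstrip('/')}/schema_v={schema_v}/"
        f"run_id={run_id}/year={year}/month={month:02d}/"
    )
```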

Ownership & Governance

Data Layers Ownership

| Layer | Owner | Steward | Responsibility | Implementation Status |
|---|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | Implemented |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | Implemented |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Architecture design (structure, governance, and ownership detailed in this reference); SQL aggregation pattern |

Error Handling Layers Ownership

| Layer | Owner | Steward | Reviewer | Responsibility | Implementation Status |
|---|---|---|---|---|---|
| Quarantine (Error Handling) | Data Platform Team | ETL Lead | Data Quality Team | Error detection, routing, retry logic, audit trail maintenance, infrastructure management | Implemented |
| Condemned (Error Handling) | Data Platform Team | ETL Lead | Data Quality Team | Exclusion management, perpetual retention (financial audit), compliance, infrastructure management | Implemented |

Note: The high-level Data Lake Architecture (Task 2) summarizes this design; this reference provides the complete detail across all layers (Bronze/Silver/Gold plus Quarantine and Condemned), including Gold folder organization, promotion workflows, and business metric definitions. Analytical Querying shows how to aggregate Silver → Gold via SQL, one example of the 1:N pattern in which a single Silver source produces multiple Gold aggregations. The ETL Pipeline implements the Bronze → Silver transformation.

Error Handling Layers: Rules

  • Quarantine Layer:

    • Platform Team manages all write operations (ETL pipeline only)
    • Quality Team has full read access for review and analysis
    • Retry decisions are automated (max 3 attempts: attempt_count < 3 allows retry; attempt_count >= 3 condemned) but can be overridden with Quality Team approval
    • All retries preserve audit trail (retry_history, attempt_count increments)
  • Condemned Layer:

    • Platform Team manages all write operations (automatic condemnation after max attempts)
    • Quality Team has full read access for review and compliance oversight
    • Reprocessing condemned data requires human approval workflow (not automatic)
    • Perpetual retention for financial audit (no automatic deletion)
    • Deletion only via explicit, approved process (legal/compliance approval)
  • Cross-Layer Interactions:

    • Quarantine → Silver (retry success): Platform Team manages, Quality Team monitors
    • Quarantine → Condemned (max attempts): Platform Team manages automatically, Quality Team reviews
    • Condemned → Reprocessing: Requires Quality Team recommendation + human approval
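The automated retry/condemn rule above can be sketched as a single decision function (illustrative; the override flag models Quality Team approval):

```python
MAX_ATTEMPTS = 3  # per the rules above: attempt_count < 3 allows retry

def quarantine_decision(attempt_count, quality_team_override=False):
    """Retry below the attempt cap; condemn at or above it unless overridden."""
    if attempt_count < MAX_ATTEMPTS or quality_team_override:
        return "retry"
    return "condemn"
```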

Data Layers: Rules

  • Bronze is immutable and restricted: Only platform team can write; no direct business user access.
  • Silver schema changes require domain + platform review: Domain team proposes, platform team implements and validates.
  • Gold changes require explicit approval and versioning: Finance controller approves schema changes; all changes are versioned via schema_v. (Infrastructure Design, SQL Aggregation Pattern)

Governance Workflow:

  1. Schema change request → Domain Analyst (Silver) or Finance Controller (Gold).
  2. Review → Platform team assesses technical feasibility.
  3. Approval → Domain/Business owner approves.
  4. Implementation → Platform team implements with new schema_v.
  5. Validation → Quality checks, backfill if needed.
  6. Promotion → Update _LATEST.json and current/ after validation (Gold layer only - not implemented in ETL job).

Note: This reference provides the complete data lake architecture design. Gold layer structure, governance, and ownership (including promotion workflow with _LATEST.json and current/ prefix) are described above. Analytical Querying shows how to create Gold aggregations (1:N pattern). The ETL Pipeline implements Bronze → Silver only.


Detailed ownership model: See Error Handling Layers: Detailed Ownership in Governance & Workflow Diagrams for expanded ownership details, access models, and workflows.

Detailed governance workflow: See Ownership & Governance in Governance & Workflow Diagrams for complete responsibility matrices and approval workflows.


Failure Mode Analysis

This section explicitly documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.

Component Failure Scenarios

If _SUCCESS Marker is Missing

Impact: Consumers cannot identify complete runs, leading to incomplete queries or query failures.

  • Symptom: Partial data may be read, causing incorrect analytics results
  • Root Cause: ETL job failed before writing _SUCCESS, or marker was accidentally deleted
  • Detection: Consumers check for _SUCCESS before reading; monitoring alerts on missing markers
  • Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS. Failed runs are automatically retried.

If _LATEST.json is Missing

Impact: No authoritative pointer to current dataset, making promotion/rollback impossible.

  • Symptom: Automated workflows cannot determine which dataset version to use; manual discovery required
  • Root Cause: Promotion workflow failed, or file was accidentally deleted
  • Detection: Promotion workflows verify _LATEST.json exists before proceeding
  • Mitigation: _LATEST.json is updated atomically after validation. Fallback mechanisms can query _SUCCESS markers to discover latest run if needed. Version control tracks _LATEST.json changes.

If Glue Data Catalog is Missing

Impact: Athena cannot locate tables, all SQL queries fail.

  • Symptom: All Athena queries return "Table not found" errors
  • Root Cause: Glue Catalog table definition deleted, or table location points to non-existent path
  • Detection: Query failures trigger immediate alerts; catalog health checks monitor table existence
  • Mitigation: Catalog definitions are version-controlled in Terraform. Automated tests verify catalog tables exist before deployment. Backup/restore procedures for catalog metadata.

If Quarantine Layer is Removed

Impact: Invalid data is silently dropped, audit trail lost.

  • Symptom: Data quality issues go undetected; no way to investigate or recover invalid records
  • Root Cause: Quarantine write failures, or quarantine layer access denied
  • Detection: Monitoring alerts on quarantine write failures; data quality metrics track expected vs actual row counts
  • Mitigation: Quarantine writes are mandatory and fail the ETL job if they fail. Quarantine layer has separate IAM policies to prevent accidental deletion. Regular audits verify quarantine data retention.

If Bronze Layer is Overwritten (Immutability Violation)

Impact: Historical audit trail lost; backfills become impossible.

  • Symptom: Cannot reproduce historical reports; audit compliance failures
  • Root Cause: Accidental overwrite, or immutability controls bypassed
  • Detection: S3 versioning alerts on overwrites; CloudTrail logs track all write operations
  • Mitigation: Bronze layer uses S3 versioning and object lock. IAM policies prevent overwrites (only append allowed). Regular backups of critical partitions.

If run_id Isolation is Removed

Impact: Reruns overwrite previous outputs, causing data loss and making backfills unsafe.

  • Symptom: Historical run outputs disappear; cannot rollback to previous versions
  • Root Cause: ETL job writes to fixed path instead of run_id-scoped path
  • Detection: Monitoring detects duplicate run_id paths or missing run isolation
  • Mitigation: run_id generation is mandatory and timestamp-based (unique per run). ETL code enforces run_id in all output paths. Code reviews verify run isolation patterns.
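run_id generation and path construction can be sketched as follows (illustrative helpers matching the timestamp format used in the path examples):

```python
from datetime import datetime, timezone

def make_run_id(now=None):
    """Timestamp-based run_id, unique per run (e.g. 20260121T120000Z)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")

def output_prefix(layer_prefix, run_id):
    """Every output path embeds the run_id, so reruns never overwrite."""
    return f"{layer_prefix.rstrip('/')}/run_id={run_id}/"
```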

If Schema Versioning is Removed

Impact: Schema changes break existing consumers; no backward compatibility.

  • Symptom: Existing queries fail after schema changes; data type mismatches
  • Root Cause: Schema changes applied without versioning; consumers not migrated
  • Detection: Query failures after schema changes; schema drift monitoring
  • Mitigation: All schema changes require new schema_v version. Migration playbooks ensure consumers are updated before old versions deprecated. Schema registry tracks all versions.

System Resilience Principles

  1. Fail-Safe Defaults: System fails in a safe state (no partial data published) rather than silently corrupting data.
  2. Defense in Depth: Multiple layers of validation (ETL validation, catalog checks, consumer checks) prevent single points of failure.
  3. Auditability: All operations are logged (CloudTrail, CloudWatch) enabling post-incident analysis.
  4. Immutability: Critical layers (Bronze) are append-only, preventing accidental data loss.
  5. Idempotency: Reruns are safe; each run writes to unique paths preventing overwrites.

Performance Benchmarks

This section provides detailed performance benchmarks collected from load testing and production runs.

ETL Performance Metrics

| Metric | Value | Notes |
|---|---|---|
| Throughput | ≥3K rows/second | Baseline performance for 100K rows (measured) |
| Processing Time (100K rows) | ≤30 seconds | Good baseline performance (measured) |
| Processing Time (1M rows) | 2-5 minutes | Typical production size |
| Processing Time (10M rows) | 15-30 minutes | Large dataset, may need optimization |
| PySpark Improvement | 5-10x faster | Compared to Pandas implementation |
| Max File Size | 100GB+ | PySpark handles large files (vs ~40MB for Pandas) |
| Scalability | 2-100 DPUs | Auto-scaling with data volume |

Query Performance Metrics

| Metric | Value | Notes |
|---|---|---|
| Month-End Report Query | < 30 seconds | For a 100M-row table with partition pruning |
| Partition Pruning Reduction | ~95% (approximation) | Only scans relevant year/month partitions (e.g., 3 of 60 months for 5-year retention = 95% reduction) |
| Athena Query Cost | $5 per TB scanned | Partition pruning minimizes scan costs |
| Query Optimization | Partition-based | year/month partitioning enables efficient queries |
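The ~95% figure follows directly from the partition counts:

```python
def scan_reduction(partitions_scanned, partitions_total):
    """Fraction of data skipped when only relevant partitions are read."""
    return 1 - partitions_scanned / partitions_total

# 3 months queried out of 60 retained (5 years) -> 95% of data never scanned.
reduction = scan_reduction(3, 60)
```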

Scalability Metrics

| Test Scenario | Result | Notes |
|---|---|---|
| 100M Row Test | Passed | Architecture supports large datasets |
| 10x Growth | Supported | No redesign required for 10x data volume |
| Partition Strategy | Optimized | year/month partitioning scales efficiently |

Performance Comparison: Pandas vs PySpark

| Metric | Pandas (Before) | PySpark (After) | Improvement |
|---|---|---|---|
| Processing Time (500MB) | 5-10 min | 1-2 min | 5-10x faster |
| Max File Size | ~40MB | 100GB+ | 2500x larger |
| Scalability | 1 DPU | 2-100 DPUs | 100x capacity |
| Cost per Run | $0.44/DPU-hr | $0.44/DPU-hr | 5-10x cost reduction (same hourly rate, faster execution) |

Production Readiness Details

This section provides detailed production readiness considerations for deploying this architecture in a production environment.

High Availability Details

RTO/RPO Targets:

  • RTO (Recovery Time Objective): < 4 hours (batch processing, not real-time)
  • RPO (Recovery Point Objective): < 24 hours (daily batch runs)

Multi-AZ Architecture:

  • S3: Automatic multi-AZ replication, 99.99% availability SLA
  • Glue: Distributed processing across multiple AZs, automatic failover
  • Athena: Serverless, no infrastructure to manage, automatic scaling
  • Step Functions: Multi-AZ by default, automatic retry with exponential backoff

Availability Monitoring:

  • CloudWatch alarms for service health
  • S3 bucket monitoring for availability metrics
  • Glue job success/failure tracking

Disaster Recovery Details

Current DR Strategy:

  • Immutable Bronze Layer: Raw data never overwritten, enables full reprocessing
  • S3 Versioning: Enabled on all buckets, can recover deleted files
  • Run Isolation: Each run writes to unique run_id path, preserves historical data
  • CloudWatch Logs: 30-day retention for audit trail

Optional Enhancements:

  • S3 Cross-Region Replication: Replicate Bronze/Silver layers to a secondary region (e.g., eu-west-1 → eu-central-1)
  • DR Runbook: Documented procedures for regional failover (if required)
  • Backup Strategy: S3 versioning provides point-in-time recovery

DR Testing:

  • Quarterly DR drills to validate recovery procedures
  • Test reprocessing from Bronze layer to validate data integrity

Cost Optimization Details

Storage Cost Optimization:

This Sankey diagram shows how monthly infrastructure costs are allocated across AWS services and storage tiers:

Cost Allocation:

  • Total Monthly Cost ($28.05) is allocated to:
    • Glue Compute ($26.40): ETL processing (pay-per-run, scales with data volume)
    • S3 Storage ($1.30): Data storage across all layers (Standard + Archive tiers)
    • Ancillary Services ($0.35): SQL query execution (Athena $0.20), Orchestration (Step Functions $0.06), Monitoring (CloudWatch $0.05), and other services (KMS, CloudTrail, SNS, SQS $0.04)
  • S3 Storage ($1.30) breakdown:
    • Standard Storage ($1.15): Active data in standard S3 storage
    • Archive Storage ($0.15): Long-term retention in Glacier tiers (transitioned via lifecycle policies)

Optimization Techniques:

  • Parquet Compression: 10x compression reduces storage needs (saves ~$0.12/month)
  • Partition Pruning: 95% scan reduction minimizes Athena query costs (saves ~$0.01/month)

Lifecycle Policies:

  • Quarantine Data: Transition to Glacier after 90 days; no automatic deletion (perpetual retention)
  • Condemned Data: Transition to Glacier after 5 years; no automatic deletion (perpetual retention)
  • Processed Data: Transition to Glacier after 90 days (optional)
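Under these policies, the lifecycle rules might look like the sketch below. The rule IDs and prefixes are assumptions; the boto3 call that would apply them is shown only as a comment:

```python
def glacier_rule(rule_id, prefix, days):
    """Transition objects under `prefix` to Glacier after `days` days.

    Deliberately no Expiration action: nothing is ever deleted
    automatically (perpetual retention for financial audit).
    """
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }

LIFECYCLE = {"Rules": [
    glacier_rule("quarantine-to-glacier", "quarantine/", 90),
    glacier_rule("condemned-to-glacier", "condemned/", 5 * 365),
]}

# Applied with (bucket name is deployment-specific):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ohpen-quarantine", LifecycleConfiguration=LIFECYCLE)
```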

Compute Cost Optimization:

  • Serverless Architecture: Pay-per-use model, no idle costs
  • Auto-Scaling: Glue DPUs scale automatically (2-100 DPUs based on data volume)
  • Query Optimization: Partition pruning reduces Athena scan costs by 95%+

Cost Monitoring:

  • AWS Cost Explorer integration with cost allocation tags
  • Monthly cost reviews to identify optimization opportunities
  • Cost alerts for unexpected spending

Security & Compliance Details

Encryption Strategy:

  • TLS in Transit: All API calls use HTTPS (TLS 1.2+)
  • S3 Encryption at Rest: SSE-S3 (AES256) by default for bronze/silver/artifacts buckets
  • SSE-KMS: Implemented for sensitive buckets (gold, quarantine) using customer-managed keys (CMK) for enhanced security and compliance
  • Athena Results: Encrypted with SSE-S3

Access Control Patterns:

Audit Logging:

  • CloudTrail: Logs all API calls (S3, Glue, Athena, IAM). Management events enabled for all services; selective data events enabled for high-risk buckets (ohpen-gold and ohpen-quarantine) to track object-level access
  • CloudWatch Logs: Structured logging with run_id, timestamps, error details
  • Data Lineage: Full audit trail via run_id isolation and metadata

Data Classification:

  • Sensitive Data: Financial transaction data (sensitive)
  • PII Handling: Minimal PII in raw layer, mask/tokenize in curated layers if required
  • Restricted Access: Raw/quarantine layers limited to Platform + Compliance teams

Compliance Alignment:

  • Immutable Audit Trail: Bronze layer is append-only, full history preserved
  • Data Retention: Perpetual retention for financial/audit data; deletion only with legal/compliance approval
  • Reproducible Reporting: Full lineage enables reproducible month-end reports

Operational Runbooks

Troubleshooting Common Issues:

  1. ETL Job Failures

    • Check CloudWatch logs for error details
    • Verify S3 permissions (IAM role has correct policies)
    • Check Glue job configuration (DPU allocation, timeout settings)
    • Verify source data format (CSV structure matches expected schema)
  2. High Quarantine Rate

    • Review validation rules (check for overly strict rules)
    • Check source data quality (investigate error patterns)
    • Review quarantine data to identify root causes
    • Adjust validation rules if needed (with domain team approval)
  3. Query Performance Issues

    • Verify partition pruning (check WHERE clause includes partition columns)
    • Check Athena query plans (use EXPLAIN to analyze query execution)
    • Optimize partition strategy (ensure partitions are not too large/small)
    • Review Parquet file sizes (avoid many tiny files, which degrade scan performance)

Performance Tuning Procedures:

  1. ETL Performance Tuning

    • Monitor Glue job metrics (duration, DPU utilization)
    • Adjust DPU allocation if jobs are slow (increase DPUs for large datasets)
    • Optimize PySpark code (use broadcast joins, repartition if needed)
    • Review partition sizes (ensure partitions are balanced)
  2. Query Performance Tuning

    • Analyze Athena query plans (identify expensive operations)
    • Optimize partition strategy (ensure partitions are sized appropriately)
    • Review Parquet file sizes (ensure files are 128MB-1GB for optimal performance)
    • Use columnar pruning (only select needed columns)

Incident Response Procedures:

  1. ETL Pipeline Failure

    • Detection: CloudWatch alarm triggers on job failure
    • Investigation: Check CloudWatch logs, verify S3 permissions, check Glue job status
    • Recovery: Rerun ETL job with new run_id (safe, no data loss)
    • Escalation: Notify Data Platform Team if issue persists
  2. Data Quality Issue

    • Detection: CloudWatch alarm triggers on high quarantine rate (>5%)
    • Investigation: Review quarantine data, identify error patterns
    • Recovery: Fix source data or adjust validation rules, reprocess affected partitions
    • Escalation: Notify Data Quality Team for review
  3. Query Performance Degradation

    • Detection: Monitor Athena query times, set up CloudWatch alarms
    • Investigation: Analyze query plans, check partition strategy
    • Recovery: Optimize queries, adjust partition strategy if needed
    • Escalation: Notify Data Platform Team for optimization review

See also
