Data Lake Architecture Details
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document is the detailed architecture reference. For a high-level overview, see Data Lake Architecture (Task 2).
It covers in depth:
- Folder structure and S3 organization (with example paths)
- Schema evolution strategies and compatibility
- Safe publishing pattern, _LATEST.json, and current/ prefix
- Failure mode analysis and resilience
- Ownership, governance, and runbooks
- Performance, cost, security, and production readiness
For governance workflows and approval matrices, see Governance & Workflow Diagrams.
Detailed Structure Diagram with Example Paths
Each medallion layer uses a dedicated bucket (e.g. ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine); names are deployment-specific. Paths below are relative to each bucket (e.g. s3://ohpen-bronze/bronze/..., s3://ohpen-silver/silver/...).
This detailed diagram shows example folder paths for each layer:
Complete Architecture Diagram
This detailed diagram shows the complete architecture design with all components, AWS services, and implementation status.
- Ingestion (ETL Pipeline): Implements Bronze → Silver transformation only
- Architecture: Complete data lake architecture design covering all layers
- Analytics (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries
Athena & Glue Separation
- Amazon Athena is a query engine that executes SQL queries. It reads Parquet files from S3 via Glue Catalog metadata. SQL queries execute on Athena compute infrastructure, not directly on S3 storage.
- AWS Glue Data Catalog stores table schemas, partition metadata, and table locations. Glue does not execute queries; it provides metadata that Athena uses to locate and read Parquet files from S3.
- Separation of Concerns: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines (Athena, Spark, Presto) to use the same Glue Catalog metadata.
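To make the separation concrete, here is a minimal sketch of querying the lake from Python. The database/table names (silver_db, transactions), the assumption that partition columns are registered as strings, and the output location are all illustrative, not taken from the deployment:

```python
def month_end_query(database: str, table: str, year: int, month: int) -> str:
    """Build a partition-pruned SQL string; Athena resolves the table to
    Parquet files on S3 via Glue Catalog metadata."""
    # Partition columns assumed to be registered as strings in the catalog.
    return (
        f"SELECT account_id, SUM(amount) AS balance "
        f'FROM "{database}"."{table}" '
        f"WHERE year = '{year}' AND month = '{month:02d}' "  # prunes partitions
        f"GROUP BY account_id"
    )

def run_query(athena_client, sql: str, output_location: str) -> str:
    """Submit via a boto3 Athena client (boto3.client('athena')); the SQL
    executes on Athena compute, not on S3 storage."""
    resp = athena_client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return resp["QueryExecutionId"]
```

Swapping Athena for Spark or Presto changes only run_query; the Glue Catalog metadata and the Parquet files stay the same.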
Note: the diagram source is also available in diagram.mmd.
Complete Folder Structure Examples
Full S3 Path Structure
Paths below are logical; each layer has its own bucket: Bronze → ohpen-bronze, Silver → ohpen-silver, Gold → ohpen-gold, Quarantine → ohpen-quarantine. Prefix structure (e.g. silver/, gold/) is the same within each bucket.
# Bronze bucket
s3://ohpen-bronze/
├── bronze/ # Bronze Layer (Raw, Immutable)
│ └── mortgages/
│ └── transactions/
│ └── ingest_date=2026-01-21/ # Partition by arrival time
│ └── run_id=20260121T120000Z/ # Run isolation for idempotency
│ └── file.csv.gz # Original source file (verbatim copy)
│
├── silver/ # Silver Layer (Validated & Enriched)
│ └── mortgages/
│ └── transactions/
│ └── year=2026/ # Business-time partition (optimized for queries)
│ └── month=01/
│ └── schema_v=v1/ # Schema version for evolution
│ └── run_id=20260121T120000Z/ # Write-once run isolation
│ ├── _SUCCESS # Atomic completion marker
│ └── part-00000.parquet # Optimized storage format
│
├── quarantine/ # Invalid Data (Audit Trail)
│ └── mortgages/
│ └── transactions/
│ └── ingest_date=2026-01-21/
│ └── run_id=20260121T120000Z/
│ └── invalid_rows.parquet # Contains original data + validation_error column
│ └── condemned/ # Condemned Data (No More Retries)
│ └── ingest_date=2026-01-21/
│ └── run_id=20260121T120000Z/
│ └── condemned_rows.parquet # Max attempts exceeded (attempt_count >= 3) or exact duplicates
│
└── gold/ # Gold Layer (Business Views)
└── finance/ # Domain: Finance
├── account_balances/ # Aggregation 1: Month-end balances (1:N pattern)
│ └── schema_v=v1/
│ ├── _LATEST.json # Authority pointer to current run
│ ├── run_id=20260121T120000Z/ # Historical run
│ │ └── as_of_month=2024-03/
│ │ └── part-00000.parquet
│ └── current/ # Stable prefix for SQL access
│ └── as_of_month=2024-03/
│ └── part-00000.parquet
├── monthly_reports/ # Aggregation 2: Monthly business reports (1:N pattern)
│ └── schema_v=v1/
│ └── current/
│ └── as_of_month=2024-03/
│ └── part-00000.parquet
└── transaction_summaries/ # Aggregation 3: Transaction summaries (1:N pattern)
└── schema_v=v1/
└── current/
└── as_of_month=2024-03/
└── part-00000.parquet
Note: The Gold layer follows the 1:N transformation pattern: one Silver dataset (silver/mortgages/transactions/) produces multiple Gold aggregations (account_balances, monthly_reports, transaction_summaries). Each Gold aggregation serves different business reporting needs while maintaining a single source of truth at Silver.
Safe Publishing Pattern & Authority
Write-Then-Publish Pattern
Each dataset run follows this pattern:
- Writes data files to a run_id-scoped prefix.
- Writes a manifest.json describing the output (optional but recommended).
- Writes a _SUCCESS marker once complete.
Consumers only read runs with _SUCCESS markers present.
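A consumer-side check might look like the following sketch. The S3 client is injected so any boto3-compatible client works (boto3.client("s3") in practice); bucket and prefix values are illustrative:

```python
def run_is_complete(s3_client, bucket: str, run_prefix: str) -> bool:
    """Treat a run as readable only if its _SUCCESS marker exists."""
    key = run_prefix.rstrip("/") + "/_SUCCESS"
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    # Only an exact key match counts as the completion marker.
    return any(obj["Key"] == key for obj in resp.get("Contents", []))
```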
_LATEST.json (Control Plane)
A small control-plane object pointing to the authoritative run.
Location: gold/{domain}/{dataset}/schema_v={vN}/_LATEST.json
Content:
{
"run_id": "20260121T120000Z",
"schema_version": "v1",
"published_at": "2026-01-21T12:00:00Z",
"partition_path": "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z"
}
Used for:
- Promotion: Update _LATEST.json to point to a new run after validation.
- Rollback: Point _LATEST.json back to a previous run if issues are discovered.
- Reproducibility: Track which run is currently authoritative.
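A minimal sketch of resolving the authority pointer, assuming a boto3-style S3 client; bucket and prefix names are illustrative:

```python
import json

def resolve_latest(s3_client, bucket: str, dataset_prefix: str) -> dict:
    """Fetch and parse the _LATEST.json control-plane object to find the
    authoritative run for a Gold dataset."""
    key = dataset_prefix.rstrip("/") + "/_LATEST.json"
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    # Contains run_id, schema_version, published_at, partition_path.
    return json.loads(body)
```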
current/ Prefix (Stable SQL Access)
A stable location for Amazon Athena queries via AWS Glue Data Catalog tables.
Pattern: gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
Mechanism:
- New run writes to run_id={...}/ path.
- After validation, copy/symlink to current/ prefix.
- AWS Glue Data Catalog tables point to current/ for stable SQL semantics in Amazon Athena.
Benefits:
- Stable SQL semantics for analysts (no need to update table locations).
- Full lineage and rollback via run history (all run_id paths preserved).
- Separation of concerns: current/ for queries, run_id/ for audit.
Example Flow:
- Write: gold/.../schema_v=v1/run_id=20260121T120000Z/as_of_month=2024-03/part-00000.parquet
- Validate: Check _SUCCESS, row counts, schema compliance
- Publish: Copy to gold/.../schema_v=v1/current/as_of_month=2024-03/part-00000.parquet
- Update: Write _LATEST.json pointing to run_id=20260121T120000Z
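The publish and update steps can be sketched as follows. This is a simplified illustration (not the production promotion code), with an injected boto3-style client; the per-file copy loop deliberately mirrors the non-atomic behavior of the real workflow:

```python
import json

def current_key_for(run_key: str, run_id: str) -> str:
    """Map a run-scoped key to its current/ equivalent."""
    prefix, sep, rest = run_key.partition(f"run_id={run_id}/")
    if not sep:
        raise ValueError(f"run_id={run_id} not found in key: {run_key}")
    return prefix + "current/" + rest

def promote(s3_client, bucket: str, run_keys: list, run_id: str,
            dataset_prefix: str) -> None:
    """Copy validated run files to current/, then write _LATEST.json.
    Only the final PUT is atomic; the copy loop is not."""
    for key in run_keys:
        s3_client.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=current_key_for(key, run_id),
        )
    pointer = {"run_id": run_id,
               "partition_path": f"{dataset_prefix}/run_id={run_id}"}
    s3_client.put_object(
        Bucket=bucket,
        Key=f"{dataset_prefix}/_LATEST.json",
        Body=json.dumps(pointer).encode(),
    )
```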
Known limitations
- Copy-to-current is not atomic: The promote_silver Lambda copies Parquet files one-by-one to current/, then writes _LATEST.json. If the Lambda fails mid-copy, current/ may contain partial data. Only the _LATEST.json PUT is atomic.
- Stale files can remain: When a new run has fewer files than the previous run, old Parquet files in current/ are not deleted and may remain (e.g. part-00003.parquet from a prior 5-file run when the new run has 3 files).
- Stronger alternative: For single-object atomic commits, consider a pointer publish pattern: write data to run_id=.../ only, validate, then write a single _LATEST.json pointing to that path. Consumers resolve the pointer; no current/ copy. See docs/internal/verification/ETL_CAVEATS_VERIFICATION.md in the repository for upgrade options.
Schema Evolution Strategy
As business requirements change (e.g., adding TransactionType), the architecture must adapt without downtime.
Schema is a contract (not just columns)
In financial domains (mortgages today, investments long-term), schema changes are integration changes:
- Producers and consumers evolve independently (different deploy cycles).
- Data is often retained for years (audit + replay), so consumers must handle old shapes indefinitely.
- The biggest risk is not "parsing fails", but semantic drift (a field keeps the same name but changes meaning).
Schema evolution is managed on two layers:
- Lake/analytics schema evolution (tables in Glue/Athena, currently Parquet-only; Iceberg is a future enhancement).
- Integration/event contract evolution (APIs/events between services). The same compatibility discipline applies even if payloads are JSON/Avro/Protobuf.
Compatibility policy (recommended default)
- Backward-compatible reads: updated consumers can read both old and new data.
- Forward-compatible writes where possible: updated producers may emit additional fields without breaking older consumers.
- No "dangerous defaults": avoid defaults that change interpretation (finance examples: default interest rate, default risk score, default LTV). Prefer NULL/missing and explicit "unknown".
Allowed changes (safe / zero-downtime)
- Add a new nullable/optional field (e.g., TransactionType), and treat missing as "unknown".
- Deprecate before remove: keep field, stop populating it, document replacement, remove only after all consumers are migrated.
- Add new derived views instead of changing semantics: if a definition changes (e.g., "balance" definition), publish a new view/table/metric name.
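As a sketch of the "treat missing as unknown" rule for an added optional field (the sentinel value and record shape are illustrative, not the actual pipeline contract):

```python
UNKNOWN = "unknown"

def normalize(record: dict) -> dict:
    """Consumers tolerate both old records (no TransactionType) and new
    ones; a missing or empty value becomes an explicit 'unknown', never
    a guessed business default."""
    out = dict(record)
    out["TransactionType"] = record.get("TransactionType") or UNKNOWN
    return out
```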
Breaking changes (playbooks)
Some changes are fundamentally breaking (semantics or security), and require explicit playbooks:
- Rolling migration via flags (dual-read / dual-logic)
  - Add a new field + a version/flag (e.g., encryption_version, calculation_version).
  - Consumers implement both paths; producers start emitting the new version gradually.
  - Works when keeping old data is acceptable and you can support both versions for a period.
- Versioned streams/tables (v2) + migration
  - Create ..._v2 topic/table/view.
  - Backfill/migrate old records into v2 (e.g., re-encrypt, re-calculate) when old data must not remain.
  - Switch consumers to v2, then producers, then retire v1.
  - This is the safer option for security-driven changes (e.g., compromised encryption) or meaning changes where "dual interpretation" is risky.
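The dual-read path of the first playbook can be sketched as follows. Field names and balance formulas here are hypothetical, purely to illustrate version dispatch:

```python
def effective_balance(record: dict) -> float:
    """Dual-logic consumer: dispatch on an explicit version flag so old
    and new records coexist during a rolling migration."""
    version = record.get("calculation_version", 1)  # old records predate the flag
    if version == 1:
        return record["principal"]                               # legacy definition
    if version == 2:
        return record["principal"] + record["accrued_interest"]  # new definition
    raise ValueError(f"unsupported calculation_version: {version}")
```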
Replayability constraints (long retention)
If events/records can be replayed years later:
- New fields must remain optional; consumers must tolerate missing fields.
- Renames are treated as breaking unless you keep the old field as an alias during a transition.
- Any change that requires "all historical rows updated" implies either:
- a Parquet backfill plan (reprocess raw data, write new run_id), or
- a versioned v2 dataset.
- (Note: Iceberg snapshots could provide an alternative approach in the future, but are not currently implemented.)
Storage format: Parquet-only
The ETL pipeline uses Parquet-only format. This is sufficient for current scale and requirements.
Parquet Schema Evolution Strategy:
- Additive Changes Only: New columns are added to the end of the schema.
- Schema Registry: The Glue Data Catalog is the source of truth.
- Backfill on Demand: If old data needs the new column populated, AWS Step Functions triggers an AWS Glue reprocessing job (reading raw, writing to a new run_id in processed) and updates the AWS Glue Data Catalog pointer.
Future Enhancement: Iceberg Tables (Optional)
Note: Iceberg is a future consideration, not currently implemented.
For larger scale (10M+ transactions/month), Apache Iceberg tables via AWS Glue could provide:
- Advanced schema evolution (add/rename/reorder columns without rewriting files)
- Hidden partitioning (automatic partition management)
- Time travel queries (query historical snapshots)
Current Approach: Parquet-only with schema versioning (schema_v=v1, schema_v=v2) provides sufficient schema evolution for current needs.
Type Change Example
Changing a column type (e.g., amount from string to decimal(18,2)) requires a new schema version and backfill:
- Create schema_v=v2: Update Glue Catalog table definition with new type in schema_v=v2 path.
- Backfill Historical Data: Reprocess all historical partitions to convert amount from string to decimal, writing to schema_v=v2/run_id=BACKFILL_YYYYMMDD/.
- Update Consumers: Migrate all consumers to query schema_v=v2 tables.
- Deprecate v1: After migration period, mark schema_v=v1 as deprecated (but preserve for audit).
- Why Breaking: Type changes are breaking because existing queries expecting string will fail on decimal, and data semantics change (e.g., string "100.50" vs decimal 100.50).
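The conversion step of such a backfill might look like this sketch. The HALF_UP rounding rule is an assumption; a real backfill would pin whatever rule the business defines:

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")  # decimal(18,2) scale

def cast_amount_v2(amount_str: str) -> Decimal:
    """v1 stored amount as a string; v2 stores decimal(18,2). Convert
    explicitly with a pinned rounding rule instead of implicit coercion."""
    return Decimal(amount_str).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
```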
Domain notes: mortgages and investments
- Prefer immutable facts + effective dating: rather than overwriting a contract attribute, append a new record/event with effective_from (and optionally effective_to).
- Treat amendments as new facts: interest rate resets, refinancing, contract changes, corporate actions should be modeled as new events/types/tables rather than redefining old fields.
- Money & precision are part of the contract: keep currency explicit; avoid changing precision/rounding rules without a versioned definition (and often a v2 dataset).
Handling schema drift (runtime != documented)
- Fail fast or quarantine non-conforming records (never silently coerce).
- Emit drift metrics (unknown fields, missing required fields, new enum values) and trigger a review.
- For lake ingestion, quarantine is the safety valve; for integrations/events, schema validation at the boundary is the goal.
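A minimal sketch of the fail-fast/quarantine split; the required-field list is illustrative, not the actual validation contract:

```python
REQUIRED_FIELDS = ("transaction_id", "account_id", "amount")  # illustrative

def route_records(records):
    """Never silently coerce: conforming rows go to Silver, everything
    else is quarantined with an explicit validation_error column."""
    valid, quarantined = [], []
    for rec in records:
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            quarantined.append({**rec, "validation_error": f"missing: {', '.join(missing)}"})
        else:
            valid.append(rec)
    return valid, quarantined
```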
Backfills & Reprocessing
- Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03).
- New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL).
- Promotion occurs by updating _LATEST.json and current/ after validation.
- No historical data is overwritten; all run_id paths are preserved for audit.
Example Backfill Flow:
- Identify issue: Data quality problem in March 2024
- Reprocess: Read bronze/.../ingest_date=2024-03-*/..., write to silver/.../schema_v=v1/run_id=20260128_BACKFILL/year=2024/month=03/...
- Validate: Check _SUCCESS, row counts, quality metrics
- Promote: Update _LATEST.json, copy to current/
- Notify: Alert consumers of updated dataset
Ownership & Governance
Data Layers Ownership
| Layer | Owner | Steward | Responsibility | Implementation Status |
|---|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | Implemented |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | Implemented |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Architecture Design (Gold layer structure, governance, and ownership model are best described in Architecture reference), SQL Aggregation Pattern |
Error Handling Layers Ownership
| Layer | Owner | Steward | Reviewer | Responsibility | Implementation Status |
|---|---|---|---|---|---|
| Quarantine (Error Handling) | Data Platform Team | ETL Lead | Data Quality Team | Error detection, routing, retry logic, audit trail maintenance, infrastructure management | Implemented |
| Condemned (Error Handling) | Data Platform Team | ETL Lead | Data Quality Team | Exclusion management, perpetual retention (financial audit), compliance, infrastructure management | Implemented |
Note: The high-level Data Lake Architecture (Task 2) summarizes this design; this reference document provides the complete detail covering all layers (Bronze/Silver/Gold + Quarantine + Condemned). The Gold layer structure, governance, and ownership model are detailed in this reference (folder organization, promotion workflows, business metric definitions). Analytical Querying shows how to aggregate Silver → Gold via SQL queries (shows how to create Gold aggregations - one example of the 1:N pattern where one Silver source can produce multiple Gold aggregations). The ETL Pipeline implements Bronze → Silver transformation.
Error Handling Layers: Rules
- Quarantine Layer:
- Platform Team manages all write operations (ETL pipeline only)
- Quality Team has full read access for review and analysis
- Retry decisions are automated (max 3 attempts: attempt_count < 3 allows retry; attempt_count >= 3 condemned) but can be overridden with Quality Team approval
- All retries preserve audit trail (retry_history, attempt_count increments)
- Condemned Layer:
- Platform Team manages all write operations (automatic condemnation after max attempts)
- Quality Team has full read access for review and compliance oversight
- Reprocessing condemned data requires human approval workflow (not automatic)
- Perpetual retention for financial audit (no automatic deletion)
- Deletion only via explicit, approved process (legal/compliance approval)
- Cross-Layer Interactions:
- Quarantine → Silver (retry success): Platform Team manages, Quality Team monitors
- Quarantine → Condemned (max attempts): Platform Team manages automatically, Quality Team reviews
- Condemned → Reprocessing: Requires Quality Team recommendation + human approval
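The automated retry/condemn decision described above can be sketched as:

```python
MAX_ATTEMPTS = 3

def next_action(attempt_count: int, quality_team_override: bool = False) -> str:
    """Quarantined rows are retried until attempt_count reaches 3, then
    condemned; only a Quality Team override re-opens retries."""
    if attempt_count < MAX_ATTEMPTS or quality_team_override:
        return "retry"
    return "condemn"
```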
Data Layers: Rules
- Bronze is immutable and restricted: Only platform team can write; no direct business user access.
- Silver schema changes require domain + platform review: Domain team proposes, platform team implements and validates.
- Gold changes require explicit approval and versioning: Finance controller approves schema changes; all changes are versioned via schema_v. (Infrastructure Design, SQL Aggregation Pattern)
Governance Workflow:
- Schema change request → Domain Analyst (Silver) or Finance Controller (Gold).
- Review → Platform team assesses technical feasibility.
- Approval → Domain/Business owner approves.
- Implementation → Platform team implements with new schema_v.
- Validation → Quality checks, backfill if needed.
- Promotion → Update _LATEST.json and current/ after validation (Gold layer only - not implemented in ETL job).
Note: This reference provides the complete data lake architecture design. Gold layer structure, governance, and ownership (including promotion workflow with _LATEST.json and current/ prefix) are described above. Analytical Querying shows how to create Gold aggregations (1:N pattern). The ETL Pipeline implements Bronze → Silver only.
Detailed ownership model: See Error Handling Layers: Detailed Ownership in Governance & Workflow Diagrams for expanded ownership details, access models, and workflows.
Detailed governance workflow: See Ownership & Governance in Governance & Workflow Diagrams for complete responsibility matrices and approval workflows.
Failure Mode Analysis
This section explicitly documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.
Component Failure Scenarios
If _SUCCESS Marker is Missing
Impact: Consumers cannot identify complete runs, leading to incomplete queries or query failures.
- Symptom: Partial data may be read, causing incorrect analytics results
- Root Cause: ETL job failed before writing _SUCCESS, or marker was accidentally deleted
- Detection: Consumers check for _SUCCESS before reading; monitoring alerts on missing markers
- Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS. Failed runs are automatically retried.
If _LATEST.json is Missing
Impact: No authoritative pointer to current dataset, making promotion/rollback impossible.
- Symptom: Automated workflows cannot determine which dataset version to use; manual discovery required
- Root Cause: Promotion workflow failed, or file was accidentally deleted
- Detection: Promotion workflows verify _LATEST.json exists before proceeding
- Mitigation: _LATEST.json is updated atomically after validation. Fallback mechanisms can query _SUCCESS markers to discover latest run if needed. Version control tracks _LATEST.json changes.
If Glue Data Catalog is Missing
Impact: Athena cannot locate tables, all SQL queries fail.
- Symptom: All Athena queries return "Table not found" errors
- Root Cause: Glue Catalog table definition deleted, or table location points to non-existent path
- Detection: Query failures trigger immediate alerts; catalog health checks monitor table existence
- Mitigation: Catalog definitions are version-controlled in Terraform. Automated tests verify catalog tables exist before deployment. Backup/restore procedures for catalog metadata.
If Quarantine Layer is Removed
Impact: Invalid data is silently dropped, audit trail lost.
- Symptom: Data quality issues go undetected; no way to investigate or recover invalid records
- Root Cause: Quarantine write failures, or quarantine layer access denied
- Detection: Monitoring alerts on quarantine write failures; data quality metrics track expected vs actual row counts
- Mitigation: Quarantine writes are mandatory and fail the ETL job if they fail. Quarantine layer has separate IAM policies to prevent accidental deletion. Regular audits verify quarantine data retention.
If Bronze Layer is Overwritten (Immutability Violation)
Impact: Historical audit trail lost; backfills become impossible.
- Symptom: Cannot reproduce historical reports; audit compliance failures
- Root Cause: Accidental overwrite, or immutability controls bypassed
- Detection: S3 versioning alerts on overwrites; CloudTrail logs track all write operations
- Mitigation: Bronze layer uses S3 versioning and object lock. IAM policies prevent overwrites (only append allowed). Regular backups of critical partitions.
If run_id Isolation is Removed
Impact: Reruns overwrite previous outputs, causing data loss and making backfills unsafe.
- Symptom: Historical run outputs disappear; cannot rollback to previous versions
- Root Cause: ETL job writes to fixed path instead of run_id-scoped path
- Detection: Monitoring detects duplicate run_id paths or missing run isolation
- Mitigation: run_id generation is mandatory and timestamp-based (unique per run). ETL code enforces run_id in all output paths. Code reviews verify run isolation patterns.
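A timestamp-based run_id and run-scoped output prefix can be generated as in this sketch:

```python
from datetime import datetime, timezone

def new_run_id() -> str:
    """Timestamp-based run_id, e.g. 20260121T120000Z: unique per run and
    lexicographically sortable."""
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def output_prefix(layer_prefix: str, run_id: str) -> str:
    """Every write goes under a run-scoped prefix, so reruns never
    overwrite previous outputs."""
    return f"{layer_prefix.rstrip('/')}/run_id={run_id}/"
```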
If Schema Versioning is Removed
Impact: Schema changes break existing consumers; no backward compatibility.
- Symptom: Existing queries fail after schema changes; data type mismatches
- Root Cause: Schema changes applied without versioning; consumers not migrated
- Detection: Query failures after schema changes; schema drift monitoring
- Mitigation: All schema changes require a new schema_v version. Migration playbooks ensure consumers are updated before old versions are deprecated. Schema registry tracks all versions.
System Resilience Principles
- Fail-Safe Defaults: System fails in a safe state (no partial data published) rather than silently corrupting data.
- Defense in Depth: Multiple layers of validation (ETL validation, catalog checks, consumer checks) prevent single points of failure.
- Auditability: All operations are logged (CloudTrail, CloudWatch) enabling post-incident analysis.
- Immutability: Critical layers (Bronze) are append-only, preventing accidental data loss.
- Idempotency: Reruns are safe; each run writes to unique paths preventing overwrites.
Performance Benchmarks
This section provides detailed performance benchmarks collected from load testing and production runs.
ETL Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Throughput | ≥3K rows/second | Baseline performance for 100K rows (measured) |
| Processing Time (100K rows) | ≤30 seconds | Good baseline performance (measured) |
| Processing Time (1M rows) | 2-5 minutes | Typical production size |
| Processing Time (10M rows) | 15-30 minutes | Large dataset, may need optimization |
| PySpark Improvement | 5-10x faster | Compared to Pandas implementation |
| Max File Size | 100GB+ | PySpark handles large files (vs ~40MB for Pandas) |
| Scalability | 2-100 DPUs | Auto-scaling with data volume |
Query Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Month-End Report Query | < 30 seconds | For 100M row table with partition pruning |
| Partition Pruning Reduction | ~95% | Only scans relevant year/month partitions (e.g., 3 of 60 months for 5-year retention ≈ 95% reduction) |
| Athena Query Cost | $5 per TB scanned | Partition pruning minimizes scan costs |
| Query Optimization | Partition-based | year/month partitioning enables efficient queries |
Scalability Metrics
| Test Scenario | Result | Notes |
|---|---|---|
| 100M Row Test | Passed | Architecture supports large datasets |
| 10x Growth | Supported | No redesign required for 10x data volume |
| Partition Strategy | Optimized | year/month partitioning scales efficiently |
Performance Comparison: Pandas vs PySpark
| Metric | Pandas (Before) | PySpark (After) | Improvement |
|---|---|---|---|
| Processing Time (500MB) | 5-10 min | 1-2 min | 5-10x faster |
| Max File Size | ~40MB | 100GB+ | 2500x larger |
| Scalability | 1 DPU | 2-100 DPUs | 100x capacity |
| Cost per Run | $0.44/DPU-hr | $0.44/DPU-hr | 5-10x lower cost per run (same DPU rate, 5-10x shorter runtime) |
Production Readiness Details
This section provides detailed production readiness considerations for deploying this architecture in a production environment.
High Availability Details
RTO/RPO Targets:
- RTO (Recovery Time Objective): < 4 hours (batch processing, not real-time)
- RPO (Recovery Point Objective): < 24 hours (daily batch runs)
Multi-AZ Architecture:
- S3: Automatic multi-AZ replication, 99.99% availability SLA
- Glue: Distributed processing across multiple AZs, automatic failover
- Athena: Serverless, no infrastructure to manage, automatic scaling
- Step Functions: Multi-AZ by default, automatic retry with exponential backoff
Availability Monitoring:
- CloudWatch alarms for service health
- S3 bucket monitoring for availability metrics
- Glue job success/failure tracking
Disaster Recovery Details
Current DR Strategy:
- Immutable Bronze Layer: Raw data never overwritten, enables full reprocessing
- S3 Versioning: Enabled on all buckets, can recover deleted files
- Run Isolation: Each run writes to a unique run_id path, preserving historical data
- CloudWatch Logs: 30-day retention for audit trail
Optional Enhancements:
- S3 Cross-Region Replication: Replicate Bronze/Silver layers to secondary region (e.g., eu-west-1 → eu-central-1)
- DR Runbook: Documented procedures for regional failover (if required)
- Backup Strategy: S3 versioning provides point-in-time recovery
DR Testing:
- Quarterly DR drills to validate recovery procedures
- Test reprocessing from Bronze layer to validate data integrity
Cost Optimization Details
Storage Cost Optimization:
This Sankey diagram shows how monthly infrastructure costs are allocated across AWS services and storage tiers:
Complete Cost Flow:
Cost Allocation Sankey:
Cost Allocation:
- Total Monthly Cost ($28.05) is allocated to:
- Glue Compute ($26.40): ETL processing (pay-per-run, scales with data volume)
- S3 Storage ($1.30): Data storage across all layers (Standard + Archive tiers)
- Ancillary Services ($0.35): SQL query execution (Athena $0.20), Orchestration (Step Functions $0.06), Monitoring (CloudWatch $0.05), and other services (KMS, CloudTrail, SNS, SQS $0.04)
- S3 Storage ($1.30) breakdown:
- Standard Storage ($1.15): Active data in standard S3 storage
- Archive Storage ($0.15): Long-term retention in Glacier tiers (transitioned via lifecycle policies)
Optimization Techniques:
- Parquet Compression: 10x compression reduces storage needs (saves ~$0.12/month)
- Partition Pruning: 95% scan reduction minimizes Athena query costs (saves ~$0.01/month)
Lifecycle Policies:
- Quarantine Data: Transition to Glacier after 90 days; no automatic deletion (perpetual retention)
- Condemned Data: Transition to Glacier after 5 years; no automatic deletion (perpetual retention)
- Processed Data: Transition to Glacier after 90 days (optional)
Compute Cost Optimization:
- Serverless Architecture: Pay-per-use model, no idle costs
- Auto-Scaling: Glue DPUs scale automatically (2-100 DPUs based on data volume)
- Query Optimization: Partition pruning reduces Athena scan costs by 95%+
Cost Monitoring:
- AWS Cost Explorer integration with cost allocation tags
- Monthly cost reviews to identify optimization opportunities
- Cost alerts for unexpected spending
Security & Compliance Details
Encryption Strategy:
- TLS in Transit: All API calls use HTTPS (TLS 1.2+)
- S3 Encryption at Rest: SSE-S3 (AES256) by default for bronze/silver/artifacts buckets
- SSE-KMS: Implemented for sensitive buckets (gold, quarantine) using customer-managed keys (CMK) for enhanced security and compliance
- Athena Results: Encrypted with SSE-S3
Access Control Patterns:
Audit Logging:
- CloudTrail: Logs all API calls (S3, Glue, Athena, IAM). Management events enabled for all services; selective data events enabled for high-risk buckets (ohpen-gold and ohpen-quarantine) to track object-level access
- CloudWatch Logs: Structured logging with run_id, timestamps, error details
- Data Lineage: Full audit trail via run_id isolation and metadata
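Structured logging keyed by run_id might be emitted as in this sketch (event and field names are illustrative):

```python
import json
import logging

def log_event(logger: logging.Logger, run_id: str, event: str, **fields) -> str:
    """Emit one JSON log line carrying run_id so CloudWatch Logs queries
    can reconstruct lineage per run."""
    line = json.dumps({"run_id": run_id, "event": event, **fields})
    logger.info(line)
    return line
```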
Data Classification:
- Sensitive Data: Financial transaction data (sensitive)
- PII Handling: Minimal PII in raw layer, mask/tokenize in curated layers if required
- Restricted Access: Raw/quarantine layers limited to Platform + Compliance teams
Compliance Alignment:
- Immutable Audit Trail: Bronze layer is append-only, full history preserved
- Data Retention: Perpetual retention for financial/audit data; deletion only with legal/compliance approval
- Reproducible Reporting: Full lineage enables reproducible month-end reports
Operational Runbooks
Troubleshooting Common Issues:
- ETL Job Failures
- Check CloudWatch logs for error details
- Verify S3 permissions (IAM role has correct policies)
- Check Glue job configuration (DPU allocation, timeout settings)
- Verify source data format (CSV structure matches expected schema)
- High Quarantine Rate
- Review validation rules (check for overly strict rules)
- Check source data quality (investigate error patterns)
- Review quarantine data to identify root causes
- Adjust validation rules if needed (with domain team approval)
- Query Performance Issues
- Verify partition pruning (check WHERE clause includes partition columns)
- Check Athena query plans (use EXPLAIN to analyze query execution)
- Optimize partition strategy (ensure partitions are not too large/small)
- Review Parquet file sizes (ensure files are not too small, causing many small files)
Performance Tuning Procedures:
- ETL Performance Tuning
- Monitor Glue job metrics (duration, DPU utilization)
- Adjust DPU allocation if jobs are slow (increase DPUs for large datasets)
- Optimize PySpark code (use broadcast joins, repartition if needed)
- Review partition sizes (ensure partitions are balanced)
- Query Performance Tuning
- Analyze Athena query plans (identify expensive operations)
- Optimize partition strategy (ensure partitions are sized appropriately)
- Review Parquet file sizes (ensure files are 128MB-1GB for optimal performance)
- Use columnar pruning (only select needed columns)
Incident Response Procedures:
- ETL Pipeline Failure
- Detection: CloudWatch alarm triggers on job failure
- Investigation: Check CloudWatch logs, verify S3 permissions, check Glue job status
- Recovery: Rerun ETL job with new run_id (safe, no data loss)
- Escalation: Notify Data Platform Team if issue persists
- Data Quality Issue
- Detection: CloudWatch alarm triggers on high quarantine rate (>5%)
- Investigation: Review quarantine data, identify error patterns
- Recovery: Fix source data or adjust validation rules, reprocess affected partitions
- Escalation: Notify Data Quality Team for review
- Query Performance Degradation
- Detection: Monitor Athena query times, set up CloudWatch alarms
- Investigation: Analyze query plans, check partition strategy
- Recovery: Optimize queries, adjust partition strategy if needed
- Escalation: Notify Data Platform Team for optimization review
See also
- Governance & Workflow Diagrams - Ownership models and governance workflows