Technical Reference
1. Overview
Scope: This design addresses the case-study requirements; a production deployment would be sized for actual volumes and compliance obligations.
This pipeline ingests raw transaction CSVs, validates financial integrity, and publishes optimized Parquet datasets for analytics. It is designed for auditability, reproducibility, and safe schema evolution.
2. Architecture & Data Flow
Raw (S3) -> Metadata Enrichment -> Loop Prevention -> ETL (Python/Glue) -> Processed/Quarantine/Condemned (Parquet) -> [Analytical Query Design](/docs/tasks/sql/SQL_BREAKDOWN)
Note: Analytical Querying reads from both Silver and Gold layers for analytics and reporting.
- Ingestion: Immutable CSVs land in `raw/transactions/ingest_date=YYYY-MM-DD/`.
- Metadata Enrichment: Each row is enriched with tracking metadata:
  - `row_hash` (SHA256): Unique identifier for deduplication
  - `source_file_id`: Source file identifier
  - `attempt_count`: Number of processing attempts (starts at 0)
  - `ingestion_timestamp`: When the row was first ingested
- Loop Prevention:
  - TransactionID Deduplication: Checks the Silver layer for existing TransactionID + event_date combinations; auto-condemns if found (optional, enabled via `--silver-bucket` and `--silver-prefix`)
  - Duplicate Detection: Checks quarantine history for exact duplicates (same `row_hash`); auto-condemns if found
  - Attempt Limit: Maximum 3 retry attempts per row; auto-condemns if exceeded
  - Circuit Breaker: Halts the pipeline if >100 identical errors occur within 1 hour (requires human intervention)
- Transformation:
  - Validation: Schema/required-column checks, null checks, numeric amount type check, ISO-4217 currency validation, timestamp parsing.
  - Quarantine: Invalid rows are isolated in `quarantine/` with error details, `attempt_count`, and retry history; never dropped silently.
  - Partitioning: Output partitioned by `year`/`month` for query performance (see Partitioning Strategy).
- Storage:
  - Format: Snappy-compressed Parquet (currently Parquet-only; Iceberg is a future enhancement).
  - Idempotency: Each run writes to a unique `run_id` path to prevent data corruption during retries.
  - Metadata: `run_id` and `ingest_time` are emitted in `_SUCCESS`; row-level quarantine includes `validation_error`, `row_hash`, `attempt_count`, and source context.
  - Human Approval: For financial-data compliance, human approval is required before promoting Silver layer data to production consumption (see Human Validation Policy - excluded from documentation).
- Condemned Layer: Rows exceeding max attempts or exact duplicates are moved to `quarantine/condemned/` for permanent retention (no automatic retries).
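The metadata-enrichment step above can be sketched as follows. This is a minimal illustration under stated assumptions: the `enrich_row` helper and the canonical-JSON hashing scheme are hypothetical, not the pipeline's actual implementation; the source only specifies that `row_hash` is a SHA256 over row content.

```python
import hashlib
import json
from datetime import datetime, timezone

def enrich_row(row: dict, source_file_id: str) -> dict:
    """Attach the tracking metadata described above to a raw row (sketch)."""
    # Canonical JSON serialization so identical content always hashes identically.
    canonical = json.dumps(row, sort_keys=True, default=str)
    return {
        **row,
        "row_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "source_file_id": source_file_id,
        "attempt_count": 0,  # incremented on each retry
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Because the hash is computed over row content only (not the metadata), a retried row keeps the same `row_hash`, which is what makes duplicate detection against quarantine history possible.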
2.1. Data Lake Folder Structure
The data platform follows a Bronze/Silver/Gold medallion architecture with explicit folder organization:
s3://data-lake-bucket/
├── bronze/ # Bronze Layer (Raw, Immutable)
│ └── mortgages/
│ └── transactions/
│ └── ingest_date=2026-01-21/ # Partition by arrival time
│ └── run_id=20260121T120000Z/ # Run isolation for idempotency
│ └── file.csv.gz # Original source file (verbatim copy)
│
├── silver/ # Silver Layer (Validated & Enriched)
│ └── mortgages/
│ └── transactions/
│ └── year=2026/ # Business-time partition (optimized for queries)
│ └── month=01/
│ └── schema_v=v1/ # Schema version for evolution
│ └── run_id=20260121T120000Z/ # Write-once run isolation
│ ├── _SUCCESS # Atomic completion marker
│ └── part-00000.parquet # Optimized storage format
│
├── quarantine/ # Invalid Data (Audit Trail)
│ └── mortgages/
│ └── transactions/
│ └── ingest_date=2026-01-21/
│ └── run_id=20260121T120000Z/
│ └── invalid_rows.parquet # Contains original data + validation_error column
│ └── condemned/ # Condemned Data (No More Retries)
│ └── ingest_date=2026-01-21/
│ └── run_id=20260121T120000Z/
│ └── condemned_rows.parquet # Max attempts exceeded or exact duplicates
│
└── gold/ # Gold Layer (Business Views)
└── finance/ # [Data Lake Architecture](/docs/reference/reference-data-lake-architecture#complete-folder-structure-examples) (folder organization, [governance](/docs/reference/reference-data-lake-architecture#ownership--governance))
├── account_balances/ # Aggregation 1: [SQL Aggregation Pattern](/docs/tasks/sql/SQL_BREAKDOWN#gold-layer-aggregation) (1:N pattern - one of multiple aggregations)
│ └── schema_v=v1/ # ETL Layer: Ingests to processed (Silver), SQL calculates Gold
│ ├── _LATEST.json # Authority pointer to current run
│ ├── run_id=20260121T120000Z/ # Historical run
│ │ └── as_of_month=2024-03/
│ │ └── part-00000.parquet
│ └── current/ # Stable prefix for SQL access
│ └── as_of_month=2024-03/
│ └── part-00000.parquet
├── monthly_reports/ # Aggregation 2: Additional business aggregations (1:N pattern)
│ └── schema_v=v1/
│ └── current/
│ └── as_of_month=2024-03/
│ └── part-00000.parquet
└── transaction_summaries/ # Aggregation 3: Additional business aggregations (1:N pattern)
└── schema_v=v1/
└── current/
└── as_of_month=2024-03/
└── part-00000.parquet
Transformation Cardinality: The Gold layer follows a 1:N pattern: one Silver dataset (silver/mortgages/transactions/) produces multiple Gold aggregations (account_balances, monthly_reports, transaction_summaries). Each aggregation serves different business reporting needs while maintaining a single source of truth at Silver.
Key Path Patterns:
- Bronze: `bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz`
- Silver: `silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet`
- Quarantine: `quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet`
- Condemned: `quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet`
- Gold: `gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet` (multiple aggregations per Silver source, 1:N relationship; see Data Lake Architecture and SQL Aggregation Pattern)
  - Examples: `gold/finance/account_balances/`, `gold/finance/monthly_reports/`, `gold/finance/transaction_summaries/`
- Gold (current): `gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet` (see Data Lake Architecture and SQL Aggregation Pattern)
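The key path patterns can be made concrete with small path builders. This is a sketch: the helper names (`silver_path`, `quarantine_path`) are assumptions for illustration; only the key patterns themselves come from the document.

```python
from datetime import datetime

def silver_path(domain: str, dataset: str, ts: datetime,
                schema_v: str, run_id: str) -> str:
    """Build a Silver-layer prefix: business-time partitioned, run-isolated."""
    return (f"silver/{domain}/{dataset}/year={ts.year}/month={ts.month:02d}/"
            f"schema_v={schema_v}/run_id={run_id}/")

def quarantine_path(domain: str, dataset: str, ingest_date: str, run_id: str) -> str:
    """Build a quarantine key: partitioned by arrival time, not business time."""
    return (f"quarantine/{domain}/{dataset}/ingest_date={ingest_date}/"
            f"run_id={run_id}/invalid_rows.parquet")
```

Note the deliberate asymmetry: Silver is partitioned by business time (`year`/`month` from the transaction timestamp) for query pruning, while quarantine is partitioned by `ingest_date` so operators can find everything a given run rejected.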
Note: The ETL pipeline covers Bronze → Silver only; Silver → Gold is handled by Analytical Querying (see Data Lake Architecture).
Condemned Retention Policy:
- Condemned rows are retained indefinitely for financial audit (perpetual retention)
- No automatic retries are performed on condemned items
- Manual review and intervention required for any reprocessing
- Human approval required before deletion (see Human Validation Policy for approval workflow - excluded from documentation)
2.2. Data Contract
Required Fields:
- `TransactionID` (string): Unique transaction identifier
- `CustomerID` (string): Customer identifier
- `TransactionAmount` (numeric): Transaction amount (negative values allowed for withdrawals/refunds)
- `Currency` (string): ISO-4217 currency code (e.g., EUR, USD, GBP)
- `TransactionTimestamp` (timestamp): Transaction timestamp (ISO-8601 format)
Partition Keys (derived from TransactionTimestamp):
- `year`: Year extracted from TransactionTimestamp
- `month`: Month extracted from TransactionTimestamp
Output Schema:
- All input fields preserved
- Partition columns added: `year`, `month`
- Metadata columns: `row_hash` (SHA256), `source_file_id`, `attempt_count`, `ingestion_timestamp`, `ingest_time`, `run_id`
- Quarantine columns: All of the above plus `validation_error`, `retry_history`
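The partition-key derivation in the contract can be sketched directly. The function name `derive_partitions` is an assumption for illustration; the rule itself (year/month extracted from the ISO-8601 `TransactionTimestamp`) is from the contract above.

```python
from datetime import datetime

REQUIRED_FIELDS = ["TransactionID", "CustomerID", "TransactionAmount",
                   "Currency", "TransactionTimestamp"]

def derive_partitions(transaction_timestamp: str) -> dict:
    """Derive year/month partition values from an ISO-8601 timestamp string."""
    ts = datetime.fromisoformat(transaction_timestamp)
    # Zero-padded month so S3 prefixes sort lexicographically (month=01 .. month=12).
    return {"year": ts.year, "month": f"{ts.month:02d}"}
```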
2.3. Validation & Quality Controls
Pre-Validation Checks (Loop Prevention):
- Metadata Enrichment → All rows receive tracking metadata:
  - `row_hash` (SHA256): Computed from row content for deduplication
  - `source_file_id`: Identifier of the source file
  - `attempt_count`: Starts at 0, increments on each retry
  - `ingestion_timestamp`: First ingestion time
- Duplicate Detection → Auto-condemn if exact duplicate found
  - Checks: Quarantine history for exact `row_hash` match
  - Failure action: Auto-condemn to `quarantine/condemned/` with `validation_error = "DUPLICATE_FAILURE"`
  - Purpose: Prevents infinite retry loops on identical bad data
- Attempt Limit Check → Auto-condemn after 3rd failure
  - Checks: `attempt_count < 3` allows retry; `attempt_count >= 3` is condemned (max 3 attempts: 0, 1, 2 allowed; 3 condemned)
  - Human review and approval required before reprocessing condemned data
  - Failure action: Auto-condemn to `quarantine/condemned/` with `validation_error = "MAX_ATTEMPTS"`
  - Purpose: Prevents infinite retry loops on persistently failing rows
Validation Rules (applied in order after pre-validation):
- Schema Validation → Quarantine if schema invalid
  - Checks: Required columns present, data types valid
  - Failure action: Row quarantined with `validation_error = "SCHEMA_ERROR"`, `attempt_count++`
- Null Checks → Quarantine if missing required fields
  - Checks: All required fields (`TransactionID`, `CustomerID`, `TransactionAmount`, `Currency`, `TransactionTimestamp`) must be non-null
  - Failure action: Row quarantined with `validation_error = "NULL_VALUE_ERROR"`, `attempt_count++`
- Amount Type Check → Quarantine if non-numeric
  - Checks: TransactionAmount must be parseable as numeric
  - Failure action: Row quarantined with `validation_error = "TYPE_ERROR"`, `attempt_count++`
  - Note: Negative amounts are allowed (withdrawals/refunds)
- Currency Validation → Quarantine if not ISO-4217
  - Checks: Currency code must be in allowed list (ISO-4217 standard)
  - Failure action: Row quarantined with `validation_error = "CURRENCY_ERROR"`, `attempt_count++`
- Timestamp Parsing → Quarantine if unparseable
  - Checks: TransactionTimestamp must be parseable as an ISO-8601 timestamp
  - Failure action: Row quarantined with `validation_error = "TIMESTAMP_ERROR"`, `attempt_count++`
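The ordered validation rules can be sketched as a single function that returns the first applicable error code. This is an illustration, not the pipeline's actual code: the `validate_row` name is an assumption, and `ISO_4217` here is only a tiny subset of the real allowed currency list.

```python
from datetime import datetime
from typing import Optional

ISO_4217 = {"EUR", "USD", "GBP"}  # illustrative subset of the allowed list

def validate_row(row: dict) -> Optional[str]:
    """Apply the validation rules in documented order; return error code or None."""
    required = ["TransactionID", "CustomerID", "TransactionAmount",
                "Currency", "TransactionTimestamp"]
    # 1. Schema validation: required columns present
    if any(field not in row for field in required):
        return "SCHEMA_ERROR"
    # 2. Null checks
    if any(row[field] is None for field in required):
        return "NULL_VALUE_ERROR"
    # 3. Amount type check (negative amounts are allowed)
    try:
        float(row["TransactionAmount"])
    except (TypeError, ValueError):
        return "TYPE_ERROR"
    # 4. Currency validation
    if row["Currency"] not in ISO_4217:
        return "CURRENCY_ERROR"
    # 5. Timestamp parsing
    try:
        datetime.fromisoformat(row["TransactionTimestamp"])
    except (TypeError, ValueError):
        return "TIMESTAMP_ERROR"
    return None
```

Returning only the first failure matches the documented ordering: a row missing `Currency` is quarantined as `SCHEMA_ERROR`, not `CURRENCY_ERROR`.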
Post-Quarantine Circuit Breaker:
- Circuit Breaker Check: After each quarantine action, check if >100 rows with the same error type occurred within the last hour
- Threshold Exceeded: Pipeline halts automatically, requires human intervention
- Purpose: Prevents processing storms from systemic data quality issues
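The circuit-breaker check can be sketched as a per-error-type sliding window. This is a minimal in-memory illustration under stated assumptions (the `CircuitBreaker` class and its API are hypothetical; a real Glue job would need persistent state, e.g. in DynamoDB or CloudWatch, across runs):

```python
from collections import deque
from datetime import datetime, timedelta

class CircuitBreaker:
    """Trip when more than `threshold` identical errors occur within `window`."""

    def __init__(self, threshold: int = 100, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._events: dict = {}  # error_type -> deque of timestamps

    def record(self, error_type: str, at: datetime) -> bool:
        """Record one quarantined row; return True if the breaker trips."""
        q = self._events.setdefault(error_type, deque())
        q.append(at)
        # Evict events that fell out of the sliding window.
        while q and at - q[0] > self.window:
            q.popleft()
        return len(q) > self.threshold
```

When `record` returns True the pipeline halts and pages the operations team; restart is manual, per the policy above.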
What Happens on Failure:
- Pre-validation failures (duplicates, max attempts exceeded: `attempt_count >= 3`) → Auto-condemn to `quarantine/condemned/`. Human review and approval required before reprocessing.
- Validation failures → Quarantine with `attempt_count` incremented
- Never silently dropped — all invalid rows preserved for audit and review
- Quarantine includes: original data + `validation_error` column + metadata (`row_hash`, `attempt_count`, `retry_history`, `ingest_date`, `run_id`, `source_file`)
- Valid rows proceed to Silver layer (validated Parquet)
Quarantine Aging & Review:
- < 7 days: Automated monitoring (no human action required)
- 7-30 days: Flagged for human review and decision
- > 30 days: Escalated for review (potential systemic issue)
- Resolution paths: Source Fix (new file, `attempt_count` resets), ETL Fix (same file, `attempt_count` preserved), or Condemn
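The aging thresholds above can be captured in a small triage classifier. A sketch for illustration only; the function name and the stage labels are assumptions, the age bands come from the policy above.

```python
from datetime import date

def triage(quarantined_on: date, today: date) -> str:
    """Classify a quarantined row by age, per the aging policy above."""
    age = (today - quarantined_on).days
    if age < 7:
        return "AUTOMATED_MONITORING"   # no human action required
    if age <= 30:
        return "HUMAN_REVIEW"           # flagged for review and decision
    return "ESCALATED"                  # potential systemic issue
```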
3. Reprocessing & Backfills
The design assumes an append-only raw layer. The backfill and reprocessing workflow includes loop prevention checks to avoid infinite retry cycles.
Backfill Process:
- Trigger Backfill: Platform Step Functions or EventBridge triggers the Platform Glue job for the specific `raw` partition.
- Metadata Generation: Each row receives `row_hash` (SHA256) and `attempt_count = 0` for new ingestions.
- New Version: Output writes to a new `run_id` folder (e.g., `processed/.../run_id=20260128_BACKFILL`).
- Publish: Update the Platform Glue Data Catalog to point to the new `run_id` (currently Parquet-only; Iceberg would provide automatic partition snapshots as a future enhancement).
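The publish step's `_LATEST.json` authority pointer can be sketched as below. The field names (`run_id`, `schema_v`, `published_at`) are assumptions for illustration; the document only specifies that `_LATEST.json` points to the current run and that failed runs must never update it.

```python
import json
from datetime import datetime, timezone

def build_latest_pointer(run_id: str, schema_v: str) -> str:
    """Build the _LATEST.json payload for a newly published run (illustrative shape)."""
    return json.dumps({
        "run_id": run_id,
        "schema_v": schema_v,
        "published_at": datetime.now(timezone.utc).isoformat(),
    }, indent=2)

# Idempotency rule: write this pointer only after the run's _SUCCESS marker
# exists, so consumers of the current/ prefix never see a partial run.
```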
Quarantine Remediation Scenarios
Scenario A: Source Provider Fix (New File)
- Trigger: Data provider sends a corrected CSV file with a new `source_file_id`
- Behavior: `attempt_count` resets to 0 (treated as new data)
- Process:
  - New file ingested to `raw/` with new `source_file_id`
  - New `row_hash` computed (may differ if data was corrected)
  - Original quarantined rows marked as superseded
  - New run writes to `run_id=YYYYMMDD_SOURCE_FIX`
Scenario B: ETL Logic Fix (Same File)
- Trigger: ETL validation rules updated to handle previously invalid data
- Behavior: `attempt_count` increments (preserves retry history)
- Process:
  - Check `attempt_count < 3` before reprocessing (max 3 attempts: 0, 1, 2 allowed; >= 3 condemned); human review and approval required before reprocessing condemned data
  - If < 3: Reprocess original data with updated rules, increment `attempt_count`, log to `retry_history`
  - If ≥ 3: Auto-condemn to `quarantine/condemned/` (no automatic retry)
  - New run writes to `run_id=YYYYMMDD_ETL_FIX`
Scenario C: Circuit Breaker Triggered
- Trigger: >100 rows with same error type within 1 hour window
- Behavior: Pipeline halts automatically, requires human intervention
- Process:
- Pipeline stops processing immediately
- Alert sent to operations team (Platform PagerDuty critical)
- Root cause investigation required
- Manual restart after systemic issue is resolved
Loop Prevention Guarantees:
- Duplicate Detection: Exact `row_hash` matches in quarantine history → Auto-condemn
- Attempt Limit: Maximum 3 attempts allowed per row (`attempt_count` 0, 1, 2 can retry; `attempt_count >= 3` condemned). Human review and approval required before reprocessing condemned data.
- Circuit Breaker: Prevents processing storms from systemic issues
- Metadata Tracking: `row_hash`, `source_file_id`, `attempt_count`, `retry_history` enable a full audit trail
3.1. Schema Evolution Strategy
Compatibility Policy:
- Backward-compatible reads: Updated consumers can read both old and new data
- Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
- Additive changes only: New columns must be nullable/optional
Schema Versioning:
- All Silver/Gold paths include `schema_v={version}` (e.g., `schema_v=v1`, `schema_v=v2`)
- Enables schema evolution without breaking consumers
- Backward compatibility maintained — old data remains accessible
Allowed Changes (Safe / Zero-Downtime):
- Add new nullable/optional field (e.g., `TransactionType`)
- Deprecate before remove: keep field, stop populating, document replacement
- Add new derived views instead of changing semantics
Breaking Changes (Require Playbook):
- Type changes require new schema version and backfill
- Renames require transition period with aliases
- Semantic changes require versioned v2 dataset
Recommended Approach:
- Current: Parquet-only with schema versioning (`schema_v=v1`, `schema_v=v2`)
- Future Enhancement: Apache Iceberg tables via Platform Glue for advanced schema evolution (add/rename/reorder columns without rewriting files)
- Fallback: Parquet-only with additive changes and schema registry in Platform Glue Data Catalog
See Data Lake Architecture for complete schema evolution strategy.
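The "additive changes only" policy can be enforced mechanically before a new `schema_v` is published. A sketch under stated assumptions: the `{column: {"type", "nullable"}}` schema shape and the `is_additive_change` helper are illustrative, not the pipeline's actual schema registry.

```python
def is_additive_change(old_schema: dict, new_schema: dict) -> bool:
    """Return True only if the proposed change is backward-compatible (additive).

    Schemas are modeled as {column: {"type": ..., "nullable": ...}}.
    """
    for col, spec in old_schema.items():
        # Removing or retyping an existing column is a breaking change
        # (requires a new schema version and backfill per the playbook).
        if col not in new_schema or new_schema[col]["type"] != spec["type"]:
            return False
    for col in set(new_schema) - set(old_schema):
        # New columns must be nullable/optional so old data stays readable.
        if not new_schema[col]["nullable"]:
            return False
    return True
```

A check like this would run in CI before the Glue Data Catalog is updated, rejecting any non-additive change that lacks an explicit new `schema_v`.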
4. Operational Monitoring
CloudWatch Metrics & Logging
The pipeline publishes custom metrics to Platform CloudWatch (namespace: Ohpen/ETL):
Volume Metrics:
- `InputRows`: Total rows read from raw layer
- `ValidRows`: Rows successfully written to processed layer
- `QuarantinedRows`: Rows sent to quarantine
- `CondemnedRows`: Rows auto-condemned (duplicates or max attempts exceeded: `attempt_count >= 3`). Human review and approval required before reprocessing.
Quality Metrics:
- `QuarantineRate`: Percentage of rows quarantined
- `ErrorTypeDistribution`: Breakdown by error type (SCHEMA_ERROR, NULL_VALUE_ERROR, etc.)
- `ValidationPassRate`: Percentage of rows passing all validations
- `RetryAttemptDistribution`: Distribution of `attempt_count` values
Performance Metrics:
- `DurationSeconds`: ETL execution time
- `EndToEndTime`: Total time from ingestion to availability
- `PartitionLag`: Delay in partition availability
Loop Prevention Metrics:
- `AvgAttemptCount`: Average `attempt_count` across all processed rows
- `DuplicateDetectionRate`: Percentage of rows flagged as duplicates
- `CircuitBreakerTriggers`: Count of circuit breaker activations
- `AutoCondemnationRate`: Percentage of rows auto-condemned
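Shaping the run counters into a CloudWatch payload can be sketched as a pure helper. The metric names match the tables above; the `build_metric_data` helper itself is an assumption for illustration. In the pipeline the returned list would be sent via `boto3.client("cloudwatch").put_metric_data(Namespace="Ohpen/ETL", MetricData=...)`.

```python
def build_metric_data(counts: dict) -> list:
    """Shape run counters into a CloudWatch PutMetricData-style payload (sketch)."""
    quarantine_rate = (100.0 * counts["QuarantinedRows"] / counts["InputRows"]
                       if counts["InputRows"] else 0.0)
    return [
        {"MetricName": "InputRows", "Value": counts["InputRows"], "Unit": "Count"},
        {"MetricName": "ValidRows", "Value": counts["ValidRows"], "Unit": "Count"},
        {"MetricName": "QuarantinedRows", "Value": counts["QuarantinedRows"], "Unit": "Count"},
        {"MetricName": "QuarantineRate", "Value": quarantine_rate, "Unit": "Percent"},
    ]
```

Keeping the payload builder pure (no AWS client inside) makes the 5% quarantine-rate alert threshold easy to unit-test in CI.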
All logs are structured JSON for Platform CloudWatch Logs Insights, enabling queries like:
fields @timestamp, run_id, metric_value.quarantine_rate
| filter metric_name = "ETLComplete"
Alert Thresholds & Actions
Critical Alerts (Platform PagerDuty + Auto-halt):
- T1: Quarantine Rate > 5% → Data Quality Team (P2 - 4 hours)
- T2: Job Failure → Data Platform Team (P1 - Immediate)
- T4: Zero rows processed → Data Platform Team (P1 - Immediate)
- T5: Circuit Breaker Triggered → Automatic pipeline halt + Data Platform Team (P1 - Immediate)
Warning Alerts (Platform Slack Notification):
- T3: Processing Time > 2x average → Data Platform Team (P3 - 8 hours)
- T6: Avg attempt_count > 1.5 → Data Quality Team (P2 - 4 hours)
- T7: Auto-condemn rate > 0.5% → Data Quality Team (P2 - 4 hours)
Dashboards & Visualization
- Platform CloudWatch Dashboard: Real-time metrics, retry trends, circuit breaker status
- Platform Grafana: Historical trends, loop detection patterns, long-term quality trends
- Daily Summary Report: Email digest with condemned items, quarantine summary, key metrics
Alert Ownership & Escalation
Alerts are routed to appropriate teams based on alert type:
- Infrastructure Alerts (Data Platform Team): Job failures, missing partitions, runtime anomalies, circuit breaker triggers
- Data Quality Alerts (Data Quality Team): Quarantine rate spikes (> 5%), validation failures, high `attempt_count`, auto-condemnation spikes
- Business Metric Alerts (Domain Teams / Business): Volume anomalies, SLA breaches
See CI/CD Workflow - Governance for complete alert ownership matrix and escalation paths.
4.2. Quarantine Lifecycle Management
Quarantine Lifecycle Stages:
1. Quarantine Entry: Row fails validation and is quarantined with error details, `row_hash`, `attempt_count`, and source context.
2. Daily Triage (Automated):
   - Age < 7 days: Automated monitoring (no human action)
   - Age 7-30 days: Flagged for human review
   - Age > 30 days: Auto-flagged for condemnation review (potential systemic issue)
3. Human Review: Data team analyzes error patterns, identifies root cause, and determines resolution action.
4. Resolution Actions:
   - Source System Fix: Notify provider, request corrected file, re-ingest (new `source_file_id`, `attempt_count` resets)
   - ETL Logic Fix: Update validation rules, backfill from raw (same `source_file_id`, `attempt_count` increments)
   - Manual Correction: Create corrected CSV, inject into raw, mark for reprocessing
5. Reprocessing: Re-run validation against fixed data or updated rules.
6. Terminal Disposition:
   - Success: Row successfully reprocessed, moved to processed layer
   - Condemned: Moved to `quarantine/condemned/` for permanent retention (no automatic retries)
Retention Policy: Condemned items retained in perpetuity for financial audit; no automatic deletion; deletion only via approved process.
4.1. CI/CD + Deployment
CI Pipeline (Platform GitHub Actions):
- Validation: Runs on every Pull Request
  - `ruff` linting (code style)
  - `pytest` unit tests (partition logic, null handling, quarantine checks)
- Artifact Build: Packages Python ETL code, tags with Git SHA (e.g., `etl-v1.0.0-a1b2c3d.zip`)
- Deployment (CD):
  - Uploads artifact to S3 (Code Bucket)
  - Terraform `plan` & `apply` to update Platform infrastructure (Glue Jobs, IAM, Buckets)
  - Updates Glue Job to point to new artifact
Backfill Safety Checks:
- Determinism: Rerunning the same input produces exactly the same counts
- Partitioning: Timestamps map to the correct `year=YYYY/month=MM` folder
- Quarantine: Invalid rows never silently dropped
Failure Handling:
- Failed runs do not update `_LATEST.json` or the `current/` prefix
- Each rerun uses a new `run_id` (timestamp-based)
- Previous failed runs remain in storage (audit trail)
See CI/CD Workflow for complete workflow details.
5. Cost & Scalability
Storage Costs:
- Raw Layer: S3 Infrequent Access ($0.0125/GB/month) for immutable source data
- Processed Layer: S3 Standard with Snappy compression (~90% size reduction vs CSV, i.e. roughly 10x: 500 MB → 50 MB)
- Quarantine/Condemned: S3 Glacier for long-term retention (compliance/audit)
Compute Costs:
- Platform Glue Serverless: $0.44/DPU-hour, auto-scales with data volume
- No Infrastructure Management: Pay per job execution, no idle costs
- Horizontal Scaling: Glue DPUs scale automatically with data volume
Query Costs:
- Platform Athena: $5 per TB scanned (see SQL Queries)
- Partitioning Strategy: `year`/`month` partitioning reduces typical query scans to 1-5% of total data
- Cost Optimization: Partition pruning minimizes scan costs significantly
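The effect of partition pruning on Athena cost can be made concrete with a back-of-envelope estimator. A sketch for illustration (the helper name is an assumption; the $5/TB price and the 1-5% scan fraction come from the figures above).

```python
def athena_scan_cost(total_tb: float, scan_fraction: float,
                     price_per_tb: float = 5.0) -> float:
    """Estimate a query's Athena cost given the fraction of data scanned.

    E.g., year/month pruning that touches 2% of a 10 TB dataset scans
    0.2 TB, versus 10 TB for a full scan.
    """
    return total_tb * scan_fraction * price_per_tb
```

At the document's 1-5% scan range, a 10 TB dataset costs roughly $0.50-$2.50 per query instead of $50 for an unpartitioned full scan.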
Scalability:
- Horizontal Scaling: Glue DPUs scale automatically with data volume
- S3 Unlimited Capacity: No storage bottlenecks
- No Single Point of Failure: Distributed architecture ensures high availability
6. Security & Governance (production expectations)
Data Ownership Model
Ownership shifts from technical → business as data moves downstream:
- Bronze Layer: Data Platform Team (immutability, ingestion reliability) - Data Ingestion & Transformation
- Silver Layer: Domain Teams (validation rules, schema evolution) - Data Ingestion & Transformation
- Gold Layer: Business/Finance (metric definitions, reporting accuracy) - Data Lake Architecture, SQL Aggregation
Access Control & Permissions
- Encryption: TLS in transit; S3 encryption at rest (SSE-S3 for bronze/silver/artifacts; SSE-KMS with CMK for gold/quarantine)
- IAM Roles: Least privilege per layer (Platform: RW on Bronze/Silver, Domain: RW on Silver, Business: R on Gold)
- Prefix-Scoped Permissions: IAM policies are scoped to S3 prefixes for fine-grained access control:
  - Platform Team: `s3://bucket/bronze/*`, `s3://bucket/silver/*`, `s3://bucket/quarantine/*`
  - Domain Teams: `s3://bucket/silver/{domain}/*` (write), `s3://bucket/gold/{domain}/*` (read)
  - Business/Analysts: `s3://bucket/gold/*` (read-only)
  - Compliance: `s3://bucket/bronze/*`, `s3://bucket/quarantine/*` (read-only for audit)
- Restricted Access: `raw/` and `quarantine/` restricted to Platform and Compliance teams
- PII Handling: Keep PII minimal in `raw/` where possible; mask/tokenize in curated layers if required
Governance & Auditability
- Immutability: Bronze layer is immutable (append-only, no overwrites)
- Run Isolation: Each run writes to a unique `run_id` path for an audit trail
- Metadata: `_SUCCESS` markers include run metrics; Platform CloudWatch logs capture full lineage
- Schema Versioning: All schema changes versioned via `schema_v` for backward compatibility
- Change Approval: Schema changes require Domain/Business approval + Platform implementation
For complete governance model, see CI/CD Workflow documentation (Section 5: Governance and Compliance).
7. Runbook
How to Rerun:
- Trigger via Platform Step Functions or EventBridge with a specific `raw` partition
- Specify the `ingest_date` partition to reprocess (e.g., `ingest_date=2026-01-21`)
- New run writes to a unique `run_id` path (e.g., `run_id=20260128_BACKFILL`)
- After validation, update `_LATEST.json` and the `current/` prefix to point to the new run
Where Logs Are:
- Platform CloudWatch Logs: `/aws-glue/jobs/output` (namespace: `Ohpen/ETL`)
- Structured JSON format for Platform CloudWatch Logs Insights queries
- Query example: `fields @timestamp, run_id, metric_value.quarantine_rate | filter metric_name = "ETLComplete"`
How to Inspect Quarantine:
- Query the `quarantine/` S3 path: `s3://quarantine-bucket/quarantine/mortgages/transactions/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet`
- Check the `validation_error` column for error categories (SCHEMA_ERROR, NULL_VALUE_ERROR, TYPE_ERROR, CURRENCY_ERROR, TIMESTAMP_ERROR)
- Check the `attempt_count` column to identify rows approaching the max retry limit (≥2)
- Check `row_hash` for duplicate detection analysis
- Review `retry_history` to understand previous remediation attempts
- Use Athena SQL queries or pandas to query quarantine Parquet files
- Filter by `validation_error` to see a breakdown by error type
- Filter by `attempt_count` to identify high-retry items
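The quarantine-inspection steps above can be sketched as a small summary helper. For illustration only: `quarantine_breakdown` is a hypothetical name, and the rows would normally be read from `invalid_rows.parquet` (e.g. via pandas or Athena) rather than passed as dicts.

```python
from collections import Counter

def quarantine_breakdown(rows: list) -> dict:
    """Summarize quarantined rows by validation_error, flagging high-retry items.

    Each row is a dict with at least validation_error and attempt_count keys.
    """
    by_error = Counter(r["validation_error"] for r in rows)
    # attempt_count >= 2 means the row is one failure away from the 3-attempt cap.
    near_limit = Counter(r["validation_error"] for r in rows
                         if r["attempt_count"] >= 2)
    return {err: {"rows": n, "near_limit": near_limit.get(err, 0)}
            for err, n in by_error.items()}
```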
How to Review Condemned Items:
- Query the `quarantine/condemned/` S3 path: `s3://quarantine-bucket/quarantine/mortgages/transactions/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet`
- Check `validation_error` for the condemnation reason (DUPLICATE_FAILURE, MAX_ATTEMPTS)
- Review `attempt_count` (< 3 means retries were still allowed; >= 3 means condemned)
- Check `row_hash` to identify exact duplicates
- No automatic retries are performed on condemned items
- Manual intervention required for any reprocessing
How to Handle Circuit Breaker:
- Detection: Pipeline halts automatically when >100 identical errors occur within 1 hour
- Immediate Actions:
- Check Platform CloudWatch logs for circuit breaker trigger details
- Identify the error type causing the threshold breach
- Review quarantine data for the problematic error pattern
- Investigate root cause (systemic data quality issue vs ETL bug)
- Resolution:
- If Source Issue: Contact data provider, request corrected data
- If ETL Issue: Update validation rules, deploy fix, manually restart pipeline
- If Temporary Spike: Wait for error rate to drop below threshold, manually restart
- Restart: After root cause is resolved, manually trigger pipeline restart
- Prevention: Monitor the `CircuitBreakerTriggers` metric and investigate patterns before the threshold is reached
Escalation Path:
- Infrastructure Issues (P1): Data Platform Team — job failures, missing partitions, runtime anomalies, circuit breaker triggers
- Data Quality Issues (P2): Data Quality Team — quarantine rate spikes (>5%), validation failures, high `attempt_count` (>1.5 avg), auto-condemnation spikes (>0.5%)
- Business Metric Issues (P3): Domain Teams / Business — volume anomalies, SLA breaches
Common Operations:
- Backfill: Trigger reprocessing for a specific date range via Step Functions
- Source Fix: New file with a new `source_file_id` → `attempt_count` resets to 0
- ETL Fix: Same file with updated rules → `attempt_count` increments (check < 3 before reprocessing; >= 3 condemned). Human review and approval required before reprocessing condemned data.
- Schema Update: Update the Platform Glue Data Catalog, create a new `schema_v`, backfill if needed (see Schema Evolution)
- Quarantine Review: Query quarantine Parquet, check `attempt_count` and `row_hash`, identify root cause, fix source data or ETL rules, reprocess
- Condemned Review: Query condemned Parquet, analyze patterns, determine if manual intervention is needed
- Circuit Breaker Recovery: Investigate root cause, fix systemic issue, manually restart pipeline
8. Links & Related Documentation
- ETL Complete Reference - All ETL diagrams, pseudocode, and implementation code
- Architecture Boundaries - Design assumptions and edge case handling
- Data Lake Architecture - Core storage design
- Analytical Query Design - SQL analytics logic
- CI/CD Workflow - CI (`ci.yml`) and CD (`cd.yml`), design and Terraform
- Stakeholder Email - Reporting templates
- Communication Overview - Task 5 documentation
- Testing Guide - Comprehensive testing documentation
- Excluded from docs: Human Validation Policy, CI/CD Testing, Terraform/ETL full code (in implementation package)
Code & Infrastructure
- ETL: ETL Implementation; Terraform: CI/CD Artifacts
Monitoring & Dashboards
- CloudWatch Dashboard: Namespace `Ohpen/ETL`
- Metrics: `InputRows`, `ValidRows`, `QuarantinedRows`, `QuarantineRate`, `DurationSeconds`
- Logs: CloudWatch Logs Insights (namespace: `Ohpen/ETL`)
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- Unified Testing Convention - Testing standards (excluded from documentation)
- AWS Services Analysis - Service selection rationale
9. Assumptions & Known Limitations (case study scope)
- Batch-first ingestion (streaming is an optional upstream extension, not required here).
- Negative amounts are allowed (withdrawals/refunds); anomaly detection is out of scope.
- Loop Prevention: Maximum 3 retry attempts per row (`attempt_count` limit); exact duplicates in quarantine history are auto-condemned.
- Circuit Breaker: Pipeline halts automatically when >100 identical errors occur within a 1-hour window; requires human intervention to restart.
- Condemned Items: Rows exceeding max attempts (`attempt_count >= 3`) or exact duplicates are moved to `quarantine/condemned/` with no automatic retries. Human review and approval is required before any reprocessing or deletion.
- TransactionID Deduplication: Checks the Silver layer for existing TransactionID + event_date combinations to prevent duplicate processing across runs (optional feature).
- Duplicate Detection: Checks quarantine history for exact `row_hash` matches (not business-key deduplication within processed data).
- Config separation: a config.yaml template is shipped (excluded from documentation); the code currently uses CLI args/env vars.