Architecture Boundaries

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document captures the explicit assumptions and edge cases considered for the ETL pipeline.


Input Assumptions

Schema / Columns

  • Required columns are exactly: TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp
  • If any required column is missing, the job fails fast with a clear error
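The fail-fast schema check above can be sketched as follows. This is an illustrative Pandas-side sketch, not the pipeline's actual implementation; the function name is hypothetical, though the column names are taken from this document.

```python
REQUIRED_COLUMNS = {"TransactionID", "CustomerID", "TransactionAmount",
                    "Currency", "TransactionTimestamp"}

def assert_schema(columns):
    """Fail fast with a clear error when any required column is missing."""
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
```

Because schema validation runs before any per-row checks, a missing column aborts the whole job rather than quarantining rows.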

Currency Handling

  • Currency codes are validated against an ISO-4217 allowlist subset for the case study
  • Assumption: codes are provided as 3-letter uppercase strings in the input
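A minimal sketch of the allowlist check, assuming a small hypothetical ISO-4217 subset (the actual allowlist contents are not specified in this document). Per the assumption above, codes must already be 3-letter uppercase strings, so lowercase input is rejected rather than normalized.

```python
# Hypothetical ISO-4217 subset for the case study (illustrative values)
CURRENCY_ALLOWLIST = {"EUR", "USD", "GBP", "CHF", "JPY"}

def currency_error(code):
    """Return "CURRENCY_ERROR" for codes outside the allowlist, else None."""
    if isinstance(code, str) and code in CURRENCY_ALLOWLIST:
        return None
    return "CURRENCY_ERROR"
```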

Timestamp Handling

  • TransactionTimestamp is parsed with UTC semantics (Pandas: pd.to_datetime(..., utc=True); PySpark: to_timestamp() with formats that interpret Z as UTC). In both runtimes the assumption is the same: timestamps are ISO-like or otherwise parseable; unparseable values are quarantined with TIMESTAMP_ERROR.
  • Partitioning uses the parsed timestamp's year and month (transaction-time partition)
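The Pandas side of the two bullets above can be sketched as one step: parse with UTC semantics, mark unparseable rows, and derive the transaction-time partition columns. The function name and the parsed_ts/year/month column names are illustrative, not the pipeline's actual names.

```python
import pandas as pd

def parse_timestamps(df):
    """Parse TransactionTimestamp with UTC semantics; unparseable -> TIMESTAMP_ERROR."""
    parsed = pd.to_datetime(df["TransactionTimestamp"], utc=True, errors="coerce")
    out = df.copy()
    out["parsed_ts"] = parsed
    # Transaction-time partition columns (NaN for unparseable rows)
    out["year"] = parsed.dt.year
    out["month"] = parsed.dt.month
    out.loc[parsed.isna(), "validation_error"] = "TIMESTAMP_ERROR"
    return out
```

`errors="coerce"` turns unparseable values into NaT instead of raising, so the bad rows can be routed to quarantine rather than crashing the run.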

Amount Handling

  • Negative amounts are allowed (e.g., withdrawals/refunds)
  • Detecting suspicious values is a separate outlier/fraud control, not a schema validation rule

Operational Assumptions

Idempotency & Reruns

  • Each run writes under a unique run_id (run isolation)
  • Reruns/backfills do not overwrite previous run's outputs directly

Quarantine as Audit Trail

  • Data is never silently dropped. Invalid rows remain accessible for audit/debug
  • All quarantine data includes full metadata for traceability
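A sketch of what a quarantine record with full metadata might look like. The field names are taken from elsewhere in this document (run_id, validation_error, attempt_count, row_hash); the exact schema and function name here are illustrative assumptions.

```python
def quarantine_record(raw_row, error, run_id, attempt_count, row_hash):
    """Wrap an invalid row with the metadata the audit trail needs."""
    return {
        "run_id": run_id,                  # which run produced this record
        "validation_error": error,         # why the row was quarantined
        "attempt_count": attempt_count,    # used for loop prevention
        "row_hash": row_hash,              # used for exact-duplicate detection
        "raw_row": raw_row,                # original data, never dropped
    }
```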

Data Retention

  • Condemned data: retained indefinitely for financial audit
  • Lifecycle: Transitions to Glacier after 5 years for cost; no automatic deletion. Deletion only via explicit, approved process (see HUMAN_VALIDATION_POLICY)
  • Rationale: Financial and audit data must be retained indefinitely; deletion requires legal/compliance approval and change control

Runtime and orchestration

  • Runtime choice: Pandas (single-file / small batch) → AWS Lambda; PySpark (large or partitioned) → AWS Glue. Same validation and quarantine logic in both.
  • Trigger: after initial ingestion (raw file upload to S3 Bronze), EventBridge matches the s3:ObjectCreated event and starts the Step Function, which runs validation and testing and then invokes Lambda (Pandas) or Glue (PySpark) for the ETL.
  • Run identity: run_id is the execution identifier.
    • When orchestrated by Step Functions, it equals the execution name ($$.Execution.Name) and is passed to the ETL job as --run-key (argument name). The ETL does not generate a run_id in this case; the Step Functions execution is the single source of truth.
    • When run standalone (e.g. local or ad-hoc), the ETL generates run_id locally (e.g. timestamp-based).
    • No separate run table is maintained; run history is available via Step Functions (ListExecutions, DescribeExecution, GetExecutionHistory) and CloudWatch Logs.
  • S3 ACL control: Applied at the Step Function level: new (and optionally existing) objects in the bronze/silver/gold/quarantine buckets can have ACLs set or updated as part of the orchestration so access is consistent and auditable.
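The run-identity rule can be sketched as a small resolver: prefer the orchestrator-supplied --run-key, and fall back to a locally generated timestamp-based id for standalone runs. The function name and the exact id format are illustrative assumptions.

```python
import time

def resolve_run_id(run_key=None):
    """Use the Step Functions-supplied --run-key when present;
    otherwise generate a timestamp-based run_id for standalone runs."""
    if run_key:
        return run_key  # orchestrated: execution name is the source of truth
    return time.strftime("run-%Y%m%dT%H%M%SZ", time.gmtime())
```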

Edge Cases Handled

Schema Validation Errors

  • Missing required columns: The job fails fast with a clear error (cannot proceed without schema). This is consistent with the fail-fast behavior described in the Input Assumptions section above.
  • Type mismatches: Handled during per-row validation → TYPE_ERROR (for TransactionAmount) or TIMESTAMP_ERROR (for TransactionTimestamp), not SCHEMA_ERROR. Type validation occurs after schema validation passes.

Missing Required Fields

  • Any row missing one or more required fields is quarantined with validation_error = "NULL_VALUE_ERROR"
  • attempt_count is incremented

Invalid Currency Codes

  • Rows with currencies outside the allowlist are quarantined with validation_error = "CURRENCY_ERROR"
  • attempt_count is incremented

Invalid Amount Types

  • Rows where TransactionAmount is not parseable as a number are quarantined with validation_error = "TYPE_ERROR"
  • attempt_count is incremented

Malformed/Unparseable Timestamps

  • Rows where TransactionTimestamp cannot be parsed are quarantined with validation_error = "TIMESTAMP_ERROR"
  • attempt_count is incremented

Duplicate Account/Date Combinations

  • Rows with duplicate CustomerID + TransactionTimestamp (date part) combinations are flagged and quarantined
  • validation_error = "Duplicate account/date combination" (or appended to existing errors)
  • attempt_count is incremented
  • Note: This is business logic duplicate detection (separate from exact hash duplicate detection)
  • Note: Duplicates are preserved (not dropped) for audit and business review
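The business-logic duplicate check above can be sketched in Pandas: group on CustomerID plus the date part of the timestamp, flag every member of a duplicate group (keep=False), and preserve the rows rather than dropping them. The function name is hypothetical.

```python
import pandas as pd

def flag_account_date_duplicates(df):
    """Flag (not drop) rows sharing a CustomerID + transaction-date combination."""
    dates = pd.to_datetime(df["TransactionTimestamp"], utc=True,
                           errors="coerce").dt.date
    # keep=False marks every row in a duplicate group, not just the later ones
    dup = df.assign(_date=dates).duplicated(subset=["CustomerID", "_date"],
                                            keep=False)
    out = df.copy()
    out.loc[dup, "validation_error"] = "Duplicate account/date combination"
    return out
```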

Exact Duplicate Detection (Loop Prevention)

  • Rows with exact row_hash match in quarantine history are auto-condemned with validation_error = "DUPLICATE_FAILURE"
  • Row is moved to condemned layer (no automatic retries)
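One way to compute a deterministic row_hash for exact-duplicate matching is to canonicalize the row before hashing, so key order and value formatting cannot produce spurious mismatches. The exact hashing scheme used by the pipeline is not specified here; this is an assumed sketch.

```python
import hashlib
import json

def row_hash(row):
    """Deterministic hash of a row's values for exact-duplicate detection."""
    # sort_keys makes the hash independent of dict key order;
    # default=str canonicalizes non-JSON types such as timestamps
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```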

Max Attempts Exceeded (Loop Prevention)

  • Rows are allowed at most three attempts (attempt_count values 0, 1, and 2); once attempt_count >= 3, the row is auto-condemned with validation_error = "MAX_ATTEMPTS"
  • Row is moved to condemned layer (no automatic retries)
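The attempt-limit rule above reduces to a simple guard. The function name is illustrative; the MAX_ATTEMPTS value of 3 and the "MAX_ATTEMPTS" error code are taken from this document.

```python
MAX_ATTEMPTS = 3  # attempt_count values 0, 1, 2 are allowed

def condemn_if_exhausted(row):
    """Auto-condemn rows that have reached the retry limit (no automatic retries)."""
    if row.get("attempt_count", 0) >= MAX_ATTEMPTS:
        row["validation_error"] = "MAX_ATTEMPTS"
        row["condemned"] = True
    return row
```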

Circuit Breaker Triggered

  • When >100 same errors occur within 1 hour:
    • Pipeline halts with RuntimeError
    • No data is written (valid or quarantine)
    • Requires human intervention to investigate and restart
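A minimal in-memory sketch of the circuit breaker: count occurrences of each error code inside a sliding one-hour window and raise RuntimeError once the threshold is exceeded. The class and method names are hypothetical; the >100-in-1-hour threshold comes from this document.

```python
from collections import deque

class CircuitBreaker:
    """Halt the pipeline when too many identical errors occur within a window."""
    def __init__(self, threshold=100, window_seconds=3600):
        self.threshold = threshold
        self.window = window_seconds
        self.events = {}  # error code -> deque of event timestamps (seconds)

    def record(self, error_code, now):
        q = self.events.setdefault(error_code, deque())
        q.append(now)
        # drop events that have aged out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) > self.threshold:
            raise RuntimeError(
                f"Circuit breaker: >{self.threshold} {error_code} errors "
                f"within {self.window}s; human intervention required")
```

Raising before any output is written is what guarantees the "no data is written" property: the exception aborts the run ahead of both the valid and quarantine write phases.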

Multiple Issues in One Row

  • If multiple validation issues occur, the error message is preserved/extended so it remains explainable

Empty Input

  • If the DataFrame is empty, the job produces no output files and logs a warning (no crash)

All Rows Invalid

  • If every row is invalid, processed output is empty and all rows land in quarantine (still a valid outcome)

Intentionally Out of Scope

These are realistic production requirements but were intentionally not implemented to keep the solution aligned with the case study scope:

  • Cross-run deduplication (e.g., repeated TransactionID across different ingestion runs/files). Within-run duplicates are flagged
  • Outlier detection (e.g., unusually large values, abnormal patterns) and fraud/risk rules
  • Currency conversion (multi-currency reporting requires FX rates, valuation time, and consistent rounding policy)
  • Late-arriving corrections beyond rerunning a period (in production you'd define correction events + reconciliation rules)

Architecture Boundaries and Caveats

These boundaries affect correctness and auditability. They are documented so reviewers understand the current design and potential upgrade paths.

Promotion (copy-then-pointer)

  • Promotion copies Parquet files to current/, then writes _LATEST.json. The copy phase is not atomic; only the _LATEST.json PUT is. If the promote_silver Lambda fails mid-copy, current/ may contain partial data. Consumers should treat _LATEST.json as the commit boundary.
  • When a new run has fewer files than the previous run, old Parquet files in current/ are not deleted and may remain (stale files).
  • See Data Lake Architecture - Known limitations for details and the pointer-publish alternative.
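From the consumer side, treating _LATEST.json as the commit boundary means reading the pointer first and loading only the files it names, never listing current/ directly (which may contain partial or stale Parquet files). The manifest field name "files" and the function shape below are assumptions; the actual pointer schema is not specified in this document.

```python
import json

def load_latest_manifest(read_bytes):
    """Consumer-side sketch: resolve the committed file set via _LATEST.json.

    read_bytes is any callable mapping an object key to its raw bytes
    (e.g. a thin wrapper over an S3 GetObject call)."""
    manifest = json.loads(read_bytes("current/_LATEST.json"))
    return manifest["files"]  # assumed field name listing committed Parquet files
```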

Concurrency

  • There is no per-partition lock. EventBridge cannot set a custom Step Functions execution name; each trigger gets a unique UUID. Two concurrent runs (e.g. duplicate S3 events, schedule + S3 overlap) can both pass deduplication and invariants. Promotion picks the most recent by _SUCCESS modified time.
  • Upgrade path: add a Lambda intermediary that calls StartExecution(name="transactions:2024-01") for per-partition locking. See docs/internal/verification/ETL_CAVEATS_VERIFICATION.md.

Quarantine loop-prevention

  • PySpark (production): Quarantine history is read from S3 Parquet; duplicate row_hash detection is implemented.
  • Pandas (dev/small batch): check_quarantine_history returns empty (stub). Loop prevention via attempt_count and Silver dedup still applies; quarantine-history duplicate detection is skipped.

Where This Is Validated

The edge cases above are validated in:

  • Unit Tests (excluded from documentation): nulls, invalid currency, malformed timestamps, empty input, all-invalid, determinism, partition columns
  • Integration Tests (excluded from documentation): end-to-end flow, partitioned output structure, _SUCCESS marker, quarantine output

High-level test documentation: Testing Guide
