Quality Requirements & Scenarios
Overview
This document consolidates quality goals that were previously scattered across other documents (satisfying arc42 Chapter 10: Quality Requirements). It provides a quality tree showing the priority hierarchy, quality scenarios with measurable criteria, and an explicit mapping to the AWS Well-Architected Framework pillars.
Code and documentation standards: Python code follows PEP 8 (style), PEP 257 (docstrings), and PEP 484/526 (type hints where used). Linting and formatting are enforced via Ruff. Architecture and documentation align with AWS Well-Architected Framework and arc42.
Extended context: Design Decisions Summary, Data Lake Architecture.
Quality Tree (Priority Hierarchy)
For FinTech data platforms, quality requirements follow a strict priority hierarchy driven by compliance and auditability mandates: auditability > cost > performance > latency.
Rationale: Real-time/streaming workloads are out of scope for this case (batch CSV-in-S3), and FinTech compliance places auditability above cost, performance, and latency. This prioritization drives the architectural decisions:
- Auditability first: Immutable Bronze layer, perpetual retention of Condemned data, CloudTrail logging
- Cost second: Partition pruning (95% reduction), lifecycle policies (Glacier transitions), serverless model (no idle costs)
- Performance third: Query optimization (<30s for 100M rows), ETL throughput (≥3K rows/s), scalability (tested to 100M rows)
- Latency last: Batch processing acceptable; sub-second latency not required for month-end reports
Consistency model
- Silver (business semantics): Exactly-once per business event. Each distinct transaction is represented at most once in Silver for a given business key (e.g. TransactionID, tx_date), achieved by run_id isolation and TransactionID deduplication in the ETL.
- Promotion: Write-then-publish. The ETL writes to an isolated run_id path; only after the promotion gate does the Lambda copy data to the current/ prefix and update _LATEST.json (the publish step is sketched below). From a consumer's perspective, reading from current/ after promotion gives a consistent, deduplicated snapshot for that promotion.
- Reads (S3 / Athena): Within a region, S3 provides strong read-after-write consistency for individual objects; however, a promotion copies multiple objects and is not atomic, and there is no read-your-writes guarantee across regions. The design does not rely on strong consistency for batch analytics.
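As a minimal sketch of this write-then-publish step, assuming a hypothetical bucket layout (the real promotion Lambda in this repo may differ):

```python
import json
import boto3

s3 = boto3.client("s3")

def promote_run(bucket: str, run_id: str, execution_arn: str) -> None:
    """Copy a validated run from its isolated run_id path to current/,
    then publish _LATEST.json last so readers never treat a half-promoted
    prefix as current. (Illustrative layout; the real Lambda may differ.)"""
    src_prefix = f"silver/runs/{run_id}/"  # hypothetical path layout
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dst_key = "silver/current/" + obj["Key"][len(src_prefix):]
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                Key=dst_key,
            )
    # Publish the pointer only after all copies succeed (write-then-publish).
    s3.put_object(
        Bucket=bucket,
        Key="silver/_LATEST.json",
        Body=json.dumps({"run_id": run_id, "execution_arn": execution_arn}).encode(),
    )
```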
Data quality dimensions
The four core data quality dimensions are explicitly addressed and mapped to implementation and SLAs:
| Dimension | Definition | Implementation | SLA / target | Evidence |
|---|---|---|---|---|
| Completeness | No missing required data; expected records present | Required-field validation (schema/null check), promotion gate (row counts), month-end sign-off | <5% quarantine rate; 0 data loss on failure | ETL Flow - Validation, Quality Scenarios (data loss, quarantine rate) |
| Validity | Data conforms to schema and business rules | Schema check, currency allowlist, amount type, timestamp parse; invalid rows → quarantine | <5% invalid rows per run; circuit breaker at >100 same errors/hour | ETL Flow - Validation, ARCHITECTURE - Circuit Breaker |
| Freshness | Data is up to date for its purpose | Trigger-to-_SUCCESS latency; max(TransactionTimestamp) in Silver vs now | ETL <60 min end-to-end; freshness metric in stakeholder comms | Executive Summary (promotion gate, metrics), RUNBOOKS (operational metrics) ("How up-to-date") |
| Uniqueness | At most one record per business key in Silver | TransactionID dedup (loop prevention), row_hash for exact duplicates; condemned on duplicate | Exactly-once per (TransactionID, tx_date) in Silver | ETL Flow - Ingestion semantics, Data Model Design Rationale, ETL Flow - Loop Prevention |
These dimensions are reflected in the Quality Scenarios Table and in Compliance & Controls (e.g. SOC 1/ISAE 3402, BCBS 239).
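As an illustration of the Uniqueness row above, a minimal PySpark sketch of the two-stage deduplication (row_hash for exact duplicates, then one row per business key); the S3 path and column names such as ingested_at are assumptions, not the repo's exact schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Hypothetical Bronze path for one run; the real layout may differ.
df = spark.read.parquet("s3://example-bucket/bronze/run_id=2024-01-31T00-00-00/")

# Stage 1: drop exact duplicates via a hash over all columns (the row_hash idea).
df = df.withColumn("row_hash", F.sha2(F.concat_ws("||", *df.columns), 256))
df = df.dropDuplicates(["row_hash"]).drop("row_hash")

# Stage 2: keep at most one row per business key (TransactionID, tx_date),
# preferring the latest ingestion, i.e. exactly-once per business event in Silver.
w = Window.partitionBy("TransactionID", "tx_date").orderBy(F.col("ingested_at").desc())
silver = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
```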
Quality Scenarios Table
| Scenario | Quality Attribute | Measure | Implementation | Evidence |
|---|---|---|---|---|
| 100M row month-end query | Performance | <30 seconds, <$1 cost | Partition pruning + Parquet columnar format | SQL Breakdown - Query scans 5M rows after pruning (95% reduction) |
| ETL run quarantine rate | Reliability | <5% invalid rows | Validation engine + circuit breaker | ETL Flow - Promotion gate checks quarantine rate before Silver publish |
| Data loss on failure | Reliability | 0 data loss | run_id isolation + _SUCCESS markers | Data Lake Architecture - Each run writes to unique path, no overwrites |
| Infrastructure change audit | Security | 100% API calls logged | CloudTrail management events | Tooling & Controls - Management events enabled for all services |
| Schema change backwards compatibility | Maintainability | 0 breaking changes | Additive-only + schema_v versioning | Data Lake Architecture - New nullable columns, versioned paths |
| End-to-end traceability | Auditability | 100% runs traceable | Step Functions execution ARN as canonical identifier | Traceability Design - AWS-native identifiers, no custom tracking |
| Cost predictability | Cost Optimization | 95% scan reduction for Q1 queries | Year/month partitioning | Design Decisions Summary - Partition pruning reduces Athena costs |
| ETL throughput | Performance | ≥3K rows/second | PySpark optimizations | PySpark Implementation Summary - Measured: 100K rows processed in ≤30s |
| Deployment rollback | Reliability | Smoke tests run in CD; rollback available | Smoke job after deploy; rollback_terraform.sh (manual or approval-gated) | Rollback Playbook - Smoke runs in CD; automated rollback optional / manual trigger |
| Quarantine retry limit | Reliability | Max 3 attempts, then condemned | Loop prevention + attempt_count tracking | ETL Flow - attempt_count 0, 1, 2 allowed; >=3 condemned |
AWS Well-Architected Framework Mapping
This solution addresses all six AWS Well-Architected Framework pillars. The table below shows how key implementations map to each pillar:
| Pillar | Key Implementation | Location | Quality Attribute |
|---|---|---|---|
| Security | KMS encryption (gold/quarantine buckets), IAM least privilege (prefix-scoped policies), CloudTrail audit trail (management + data events), S3 versioning (accidental deletion protection), OIDC authentication (no static keys in CI/CD) | main.tf lines 131-144 (encryption), 262-358 (IAM), 1031-1064 (CloudTrail); .github/workflows/cd.yml line 36 (OIDC) | Auditability (highest priority) |
| Cost Optimization | Partition pruning (95% scan reduction for Q1 queries on 5-year retention), lifecycle policies (Glacier after 90 days), serverless architecture (no idle cost), Parquet compression (10x reduction vs CSV), cost anomaly alarms | balance_history_2024_q1.sql (partition pruning), main.tf lines 184-258 (lifecycle policies), CloudWatch alarms for Athena scan cost | Cost Predictability (high priority) |
| Performance Efficiency | PySpark optimizations (vectorized operations, broadcast joins, adaptive execution), Parquet columnar format (predicate pushdown), ~128MB file sizing (Spark/Athena sweet spot), year/month partitioning (query-aligned) | ingest_transactions_spark.py lines 101-103 (adaptive execution), s3_operations_spark.py lines 122-128 (file sizing), partition design | Performance (medium priority) |
| Reliability | Quarantine + condemned layers (error isolation), SQS DLQ (3 retries then dead-letter queue), run_id isolation (idempotency, safe reruns), Glue auto-retry (exponential backoff), automated rollback (smoke tests + rollback script), S3 versioning | validator.py (quarantine logic), main.tf lines 731-737 (Glue retry), 1088-1107 (SQS DLQ), CD workflow (rollback) | Reliability (high priority) |
| Operational Excellence | CI/CD automation with comprehensive validation, Infrastructure-as-code (Terraform), structured logging (run_id correlation across services), CloudWatch metrics/alarms (with RunId and ExecutionArn dimensions), traceability design doc (AWS-native identifiers) | .github/workflows/ci.yml (CI), .github/workflows/cd.yml (CD), main.tf (IaC), TRACEABILITY_DESIGN.md (observability patterns) | Auditability (highest priority) |
| Sustainability | Lifecycle policies (auto-Glacier transition reduces power consumption), Parquet compression (storage efficiency, less data transferred), partition pruning (compute efficiency, 95% less processing), serverless architecture (no idle resources, scale to zero) | main.tf lifecycle rules (lines 184-258), Parquet compression (all writes), partition pruning (query patterns), serverless design (Glue, Athena, Lambda) | Cost Optimization (high priority) |
Framework Philosophy
Pillars inform trade-offs, but business requirements drive architecture.
For example, the quarantine layer choice prioritizes Reliability (audit trail, error isolation, retry tracking) and Security (compliance, perpetual retention) over Cost Optimization (storage overhead for invalid data). This is the correct trade-off for FinTech, where auditability and compliance are non-negotiable.
Similarly, choosing serverless over always-on compute prioritizes Cost Optimization (pay-per-query, no idle cost) and Sustainability (scale to zero) over Performance Efficiency (sub-second latency). This aligns with the batch OLAP workload where month-end reports can tolerate 30-second query times.
Traceability and auditability have top priority, and changes that would degrade them are not adopted (e.g., replacing Step Functions execution identity with a Glue-only run_id, or introducing a second state store such as DynamoDB for loop prevention). Alternatives considered, including DynamoDB for quarantine state, Glue-derived run identity, and additive options such as execution start time or Glue lineage, are documented in Traceability Design under Design Boundaries: Traceability and Auditability First. The decision was to change nothing for run identity, enrichment, and loop prevention; the current design remains the single source of truth.
Quality Requirements Summary
Non-Functional Requirements
| Requirement | Priority | Target | Implementation |
|---|---|---|---|
| Auditability | HIGHEST | 100% runs traceable, immutable audit trail | Step Functions execution ARN, CloudTrail logging, Bronze layer immutability |
| Cost Predictability | HIGH | 95% scan reduction, $27/month baseline | Partition pruning, lifecycle policies, serverless architecture |
| Reliability | HIGH | <5% quarantine rate, 0 data loss, automated rollback | Validation engine, run_id isolation, promotion gate, rollback procedures |
| Performance | MEDIUM | <30s for 100M row queries, ≥3K rows/s ETL throughput | Partition pruning, Parquet format, PySpark optimizations |
| Maintainability | MEDIUM | 0 breaking schema changes, backward compatibility | Additive-only schema evolution, schema_v versioning |
| Latency | LOW | Batch processing acceptable, no real-time requirements | Serverless batch processing, scheduled runs |
Quality Scenarios (Detailed)
Scenario 1: 100M Row Month-End Query
Context: Finance team queries account balance history for Q1 2024 (100M row table).
Quality Attribute: Performance
Measure:
- Query execution time: <30 seconds
- Query cost: <$1 (Athena scan cost)
- Data scanned: <5M rows (95% reduction via partition pruning)
Implementation:
- Year/month partitioning: WHERE year=2024 AND month IN ('01', '02', '03') (see the sketch below)
- Parquet columnar format: Predicate pushdown, column pruning
- Window functions: Efficient carry-forward logic
Evidence: SQL Breakdown - Query optimization patterns
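For concreteness, a hedged sketch of issuing this partition-pruned query through boto3; the database name, output location, and column names are placeholders, and the production query lives in balance_history_2024_q1.sql:

```python
import boto3

athena = boto3.client("athena")

# Partition pruning: year/month predicates let Athena skip ~95% of the data,
# scanning only the Q1 2024 partitions instead of the full 100M-row table.
QUERY = """
SELECT account_id, balance_date, balance  -- column names are illustrative
FROM balance_history
WHERE year = 2024 AND month IN ('01', '02', '03')
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "analytics"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```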
Scenario 2: ETL Run Quarantine Rate
Context: ETL processes 1M rows, some fail validation.
Quality Attribute: Reliability
Measure:
- Quarantine rate: <5% invalid rows
- Promotion gate: Blocks promotion if quarantine rate ≥5%
- Alert threshold: Platform team notified if quarantine rate exceeds threshold
Implementation:
- Validation engine: Schema + domain validation
- Promotion gate: Lambda read_run_summary checks quarantine rate before Silver publish (gate logic sketched below)
- Circuit breaker: Halts pipeline if >100 same errors/hour
Evidence: ETL Flow - Validation and promotion gate logic
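A minimal sketch of the promotion-gate check, assuming a run summary JSON with valid_rows and quarantined_rows fields (the field names and path are illustrative, not the actual read_run_summary contract):

```python
import json
import boto3

s3 = boto3.client("s3")
QUARANTINE_THRESHOLD = 0.05  # promotion is blocked at >= 5%

def check_promotion_gate(bucket: str, run_id: str) -> bool:
    """Return True only if the run may be promoted to Silver.
    Summary field names and path are illustrative."""
    key = f"silver/runs/{run_id}/run_summary.json"  # hypothetical path
    summary = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    total = summary["valid_rows"] + summary["quarantined_rows"]
    rate = summary["quarantined_rows"] / total if total else 1.0
    return rate < QUARANTINE_THRESHOLD
```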
Scenario 3: Data Loss on Failure
Context: ETL run fails mid-execution, or deployment fails.
Quality Attribute: Reliability
Measure:
- Data loss: 0 (zero tolerance)
- Partial writes: Prevented via write-then-publish pattern
- Rollback capability: Automated rollback to previous known-good state
Implementation:
- run_id isolation: Each run writes to unique path, no overwrites
- _SUCCESS markers: Consumers only read complete runs
- Staging pointer pattern: CD deploys exact artifact CI validated
Evidence: Data Lake Architecture - Safe publishing pattern
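On the consumer side, a hedged sketch of honoring the _SUCCESS marker so that partial writes from a failed run are never read (the bucket layout is a placeholder):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def run_is_complete(bucket: str, run_id: str) -> bool:
    """A run is readable only if its _SUCCESS marker exists;
    partial output from a failed run is simply never picked up."""
    try:
        s3.head_object(Bucket=bucket, Key=f"silver/runs/{run_id}/_SUCCESS")
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise
```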
Scenario 4: End-to-End Traceability
Context: Audit requires tracing single transaction through all systems.
Quality Attribute: Auditability
Measure:
- Traceability: 100% runs traceable via execution ARN
- Correlation: All services include run_id and execution_arn in logs/metrics
- Audit trail: CloudTrail logs all API calls with execution context
Implementation:
- Step Functions execution ARN: Canonical identifier propagated to all services
- CloudWatch metrics: RunId and ExecutionArn dimensions
- S3 metadata: _SUCCESS and _LATEST.json store correlation identifiers
Evidence: Traceability Design - Run identity propagation patterns
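As a sketch of this correlation pattern, emitting a CloudWatch metric that carries both identifiers as dimensions (namespace and metric name are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_rows_processed(run_id: str, execution_arn: str, rows: int) -> None:
    """Every metric carries the run identity, so a single execution ARN
    can be traced from Step Functions through Glue, Lambda, and alarms."""
    cloudwatch.put_metric_data(
        Namespace="DataPlatform/ETL",  # placeholder namespace
        MetricData=[{
            "MetricName": "RowsProcessed",  # placeholder metric name
            "Dimensions": [
                {"Name": "RunId", "Value": run_id},
                {"Name": "ExecutionArn", "Value": execution_arn},
            ],
            "Value": float(rows),
            "Unit": "Count",
        }],
    )
```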
See also
- Design Decisions Summary - Trade-off analysis for quality decisions
- Data Lake Architecture - Architecture patterns implementing quality requirements
- ETL Flow - ETL validation and quality gates
- Testing Guide - Validation and resilience testing
- Traceability Design - Observability patterns