© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Quality Requirements & Scenarios

Overview

This document consolidates scattered quality goals (satisfies arc42 Chapter 10: Quality Requirements). It provides a quality tree showing priority hierarchy, quality scenarios with measurable criteria, and explicit mapping to AWS Well-Architected Framework pillars.

Code and documentation standards: Python code follows PEP 8 (style), PEP 257 (docstrings), and PEP 484/526 (type hints where used). Linting and formatting are enforced via Ruff. Architecture and documentation align with AWS Well-Architected Framework and arc42.

Extended context: Design Decisions Summary, Data Lake Architecture.


Quality Tree (Priority Hierarchy)

For FinTech data platforms, quality requirements follow a strict priority hierarchy driven by compliance and auditability mandates:

Rationale: Real-time/streaming requirements are out of scope for this case (batch CSV-in-S3). FinTech compliance mandates auditability > cost > performance > latency. This prioritization drives the following architectural decisions:

  • Auditability first: Immutable Bronze layer, perpetual retention of Condemned data, CloudTrail logging
  • Cost second: Partition pruning (95% reduction), lifecycle policies (Glacier transitions), serverless model (no idle costs)
  • Performance third: Query optimization (<30s for 100M rows), ETL throughput (≥3K rows/s), scalability (tested to 100M rows)
  • Latency last: Batch processing acceptable; sub-second latency not required for month-end reports

Consistency model

  • Silver (business semantics): Exactly-once per business event. Each distinct transaction is represented at most once in Silver for a given business key (e.g. TransactionID, tx_date), achieved by run_id isolation and TransactionID deduplication in the ETL.
  • Promotion: Write-then-publish. The ETL writes to an isolated run_id path; only after the promotion gate does the Lambda copy data to the current/ prefix and update _LATEST.json. From a consumer’s perspective, reading from current/ after promotion gives a consistent, deduplicated snapshot for that promotion.
  • Reads (S3 / Athena): Within a single region, S3 now provides strong read-after-write consistency for object operations, but the design does not rely on it. There is no read-your-writes guarantee across regions or across multiple readers, and none is needed for batch analytics: consumers read from current/ only after promotion.
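The exactly-once guarantee per business key can be sketched as a deduplication pass over a run's rows. This is an illustrative sketch, not the project's ETL code: the (TransactionID, tx_date) key comes from the consistency model above, while the helper itself is hypothetical.

```python
def dedupe_by_business_key(rows):
    """Keep at most one row per (TransactionID, tx_date) business key.

    Later duplicates are routed to the condemned list rather than Silver,
    mirroring the 'condemned on duplicate' rule in the uniqueness dimension.
    """
    seen = set()
    silver, condemned = [], []
    for row in rows:
        key = (row["TransactionID"], row["tx_date"])
        if key in seen:
            condemned.append(row)
        else:
            seen.add(key)
            silver.append(row)
    return silver, condemned
```

In the real pipeline this logic runs inside the ETL before the promotion gate, so current/ only ever exposes a deduplicated snapshot.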

Data quality dimensions

The four core data quality dimensions are explicitly addressed and mapped to implementation and SLAs:

| Dimension | Definition | Implementation | SLA / target | Evidence |
|---|---|---|---|---|
| Completeness | No missing required data; expected records present | Required-field validation (schema/null check), promotion gate (row counts), month-end sign-off | <5% quarantine rate; 0 data loss on failure | ETL Flow - Validation; Quality Scenarios (data loss, quarantine rate) |
| Validity | Data conforms to schema and business rules | Schema check, currency allowlist, amount type, timestamp parse; invalid rows → quarantine | <5% invalid rows per run; circuit breaker at >100 same errors/hour | ETL Flow - Validation; ARCHITECTURE - Circuit Breaker |
| Freshness | Data is up to date for its purpose | Trigger-to-_SUCCESS latency; max(TransactionTimestamp) in Silver vs now | ETL <60 min end-to-end; freshness metric in stakeholder comms | Executive Summary (promotion gate, metrics); RUNBOOKS (operational metrics, "How up-to-date") |
| Uniqueness | At most one record per business key in Silver | TransactionID dedup (loop prevention), row_hash for exact duplicates; condemned on duplicate | Exactly-once per (TransactionID, tx_date) in Silver | ETL Flow - Ingestion semantics; Data Model Design Rationale; ETL Flow - Loop Prevention |

These dimensions are reflected in the Quality Scenarios Table and in Compliance & Controls (e.g. SOC 1/ISAE 3402, BCBS 239).
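The validity dimension can be sketched as a row-level check that routes failures to quarantine. The checks mirror the table above (required fields, currency allowlist, amount type, timestamp parse), but the field names, allowlist values, and helpers are illustrative assumptions, not the project's validator.

```python
from datetime import datetime

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}  # illustrative allowlist
REQUIRED_FIELDS = ("TransactionID", "amount", "currency", "TransactionTimestamp")

def validate_row(row):
    """Return a list of validation errors; an empty list means the row is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            errors.append(f"missing:{field}")
    if row.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency:not_in_allowlist")
    try:
        float(row.get("amount", ""))
    except (TypeError, ValueError):
        errors.append("amount:not_numeric")
    try:
        datetime.fromisoformat(str(row.get("TransactionTimestamp", "")))
    except ValueError:
        errors.append("timestamp:unparseable")
    return errors

def route(rows):
    """Split rows into valid (Silver candidates) and quarantine."""
    valid, quarantine = [], []
    for row in rows:
        (quarantine if validate_row(row) else valid).append(row)
    return valid, quarantine
```

Keeping the per-row error list (rather than a boolean) is what makes the quarantine layer auditable: each quarantined row records why it failed.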


Quality Scenarios Table

| Scenario | Quality Attribute | Measure | Implementation | Evidence |
|---|---|---|---|---|
| 100M row month-end query | Performance | <30 seconds, <$1 cost | Partition pruning + Parquet columnar format | SQL Breakdown - Query scans 5M rows after pruning (95% reduction) |
| ETL run quarantine rate | Reliability | <5% invalid rows | Validation engine + circuit breaker | ETL Flow - Promotion gate checks quarantine rate before Silver publish |
| Data loss on failure | Reliability | 0 data loss | run_id isolation + _SUCCESS markers | Data Lake Architecture - Each run writes to unique path, no overwrites |
| Infrastructure change audit | Security | 100% API calls logged | CloudTrail management events | Tooling & Controls - Management events enabled for all services |
| Schema change backwards compatibility | Maintainability | 0 breaking changes | Additive-only + schema_v versioning | Data Lake Architecture - New nullable columns, versioned paths |
| End-to-end traceability | Auditability | 100% runs traceable | Step Functions execution ARN as canonical identifier | Traceability Design - AWS-native identifiers, no custom tracking |
| Cost predictability | Cost Optimization | 95% scan reduction for Q1 queries | Year/month partitioning | Design Decisions Summary - Partition pruning reduces Athena costs |
| ETL throughput | Performance | ≥3K rows/second | PySpark optimizations | PySpark Implementation Summary - Measured: 100K rows processed in ≤30s |
| Deployment rollback | Reliability | Smoke tests run in CD; rollback available | Smoke job after deploy; rollback_terraform.sh (manual or approval-gated) | Rollback Playbook - Smoke runs in CD; automated rollback optional / manual trigger |
| Quarantine retry limit | Reliability | Max 3 attempts, then condemned | Loop prevention + attempt_count tracking | ETL Flow - attempt_count 0, 1, 2 allowed; >=3 condemned |
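The quarantine retry limit in the last row reduces to a small decision rule. The sketch below is illustrative: attempt_count is the field named in the source, while the function and destination labels are hypothetical.

```python
MAX_ATTEMPTS = 3  # attempt_count 0, 1, 2 are retryable; >= 3 is condemned

def next_destination(attempt_count):
    """Decide where a quarantined row goes on its next processing cycle."""
    if attempt_count >= MAX_ATTEMPTS:
        return "condemned"   # perpetual retention, no further retries
    return "quarantine"      # eligible for another validation attempt
```

Bounding retries this way is what prevents reprocessing loops: a row that keeps failing validation can never cycle through quarantine indefinitely.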

AWS Well-Architected Framework Mapping

This solution addresses all six AWS Well-Architected Framework pillars. The table below shows how key implementations map to each pillar:

| Pillar | Key Implementation | Location | Quality Attribute |
|---|---|---|---|
| Security | KMS encryption (gold/quarantine buckets), IAM least privilege (prefix-scoped policies), CloudTrail audit trail (management + data events), S3 versioning (accidental deletion protection), OIDC authentication (no static keys in CI/CD) | main.tf lines 131-144 (encryption), 262-358 (IAM), 1031-1064 (CloudTrail); .github/workflows/cd.yml line 36 (OIDC) | Auditability (highest priority) |
| Cost Optimization | Partition pruning (95% scan reduction for Q1 queries on 5-year retention), lifecycle policies (Glacier after 90 days), serverless architecture (no idle cost), Parquet compression (10x reduction vs CSV), cost anomaly alarms | balance_history_2024_q1.sql (partition pruning), main.tf lines 184-258 (lifecycle policies), CloudWatch alarms for Athena scan cost | Cost Predictability (high priority) |
| Performance Efficiency | PySpark optimizations (vectorized operations, broadcast joins, adaptive execution), Parquet columnar format (predicate pushdown), ~128MB file sizing (Spark/Athena sweet spot), year/month partitioning (query-aligned) | ingest_transactions_spark.py lines 101-103 (adaptive execution), s3_operations_spark.py lines 122-128 (file sizing), partition design | Performance (medium priority) |
| Reliability | Quarantine + condemned layers (error isolation), SQS DLQ (3 retries then dead-letter queue), run_id isolation (idempotency, safe reruns), Glue auto-retry (exponential backoff), automated rollback (smoke tests + rollback script), S3 versioning | validator.py (quarantine logic), main.tf lines 731-737 (Glue retry), 1088-1107 (SQS DLQ), CD workflow (rollback) | Reliability (high priority) |
| Operational Excellence | CI/CD automation with comprehensive validation, Infrastructure-as-Code (Terraform), structured logging (run_id correlation across services), CloudWatch metrics/alarms (with RunId and ExecutionArn dimensions), traceability design doc (AWS-native identifiers) | .github/workflows/ci.yml (CI), .github/workflows/cd.yml (CD), main.tf (IaC), TRACEABILITY_DESIGN.md (observability patterns) | Auditability (highest priority) |
| Sustainability | Lifecycle policies (auto-Glacier transition reduces power consumption), Parquet compression (storage efficiency, less data transferred), partition pruning (compute efficiency, 95% less processing), serverless architecture (no idle resources, scale to zero) | main.tf lifecycle rules (lines 184-258), Parquet compression (all writes), partition pruning (query patterns), serverless design (Glue, Athena, Lambda) | Cost Optimization (high priority) |

Framework Philosophy

Pillars inform trade-offs, but business requirements drive architecture.

For example, the quarantine layer choice prioritizes Reliability (audit trail, error isolation, retry tracking) and Security (compliance, perpetual retention) over Cost Optimization (storage overhead for invalid data). This is the correct trade-off for FinTech, where auditability and compliance are non-negotiable.

Similarly, choosing serverless over always-on compute prioritizes Cost Optimization (pay-per-query, no idle cost) and Sustainability (scale to zero) over Performance Efficiency (sub-second latency). This aligns with the batch OLAP workload where month-end reports can tolerate 30-second query times.

Traceability and auditability have top priority. Changes that would deteriorate them are not adopted (e.g., replacing Step Functions execution identity with Glue-only run_id, or introducing a second state store such as DynamoDB for loop prevention). Alternatives considered—including DynamoDB for quarantine state, Glue-derived run identity, and additive options such as execution start time or Glue lineage—are documented in Traceability Design under Design Boundaries: Traceability and Auditability First. The decision was to change nothing for run identity, enrichment, and loop prevention; the current design remains the single source of truth.


Quality Requirements Summary

Non-Functional Requirements

| Requirement | Priority | Target | Implementation |
|---|---|---|---|
| Auditability | HIGHEST | 100% runs traceable, immutable audit trail | Step Functions execution ARN, CloudTrail logging, Bronze layer immutability |
| Cost Predictability | HIGH | 95% scan reduction, $27/month baseline | Partition pruning, lifecycle policies, serverless architecture |
| Reliability | HIGH | <5% quarantine rate, 0 data loss, automated rollback | Validation engine, run_id isolation, promotion gate, rollback procedures |
| Performance | MEDIUM | <30s for 100M row queries, ≥3K rows/s ETL throughput | Partition pruning, Parquet format, PySpark optimizations |
| Maintainability | MEDIUM | 0 breaking schema changes, backward compatibility | Additive-only schema evolution, schema_v versioning |
| Latency | LOW | Batch processing acceptable, no real-time requirements | Serverless batch processing, scheduled runs |

Quality Scenarios (Detailed)

Scenario 1: 100M Row Month-End Query

Context: Finance team queries account balance history for Q1 2024 (100M row table).

Quality Attribute: Performance

Measure:

  • Query execution time: <30 seconds
  • Query cost: <$1 (Athena scan cost)
  • Data scanned: <5M rows (95% reduction via partition pruning)

Implementation:

  • Year/month partitioning: WHERE year=2024 AND month IN ('01', '02', '03')
  • Parquet columnar format: Predicate pushdown, column pruning
  • Window functions: Efficient carry-forward logic

Evidence: SQL Breakdown - Query optimization patterns
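The pruning predicate above can be generated programmatically. This sketch builds the quarterly partition filter as a SQL string; the year/month column names follow this document, while the helper itself is illustrative, not the project's query code.

```python
def quarter_filter(year, quarter):
    """Build a partition-pruning WHERE clause for a given quarter.

    Matches the year/month partition layout, so Athena scans only the
    three relevant month partitions instead of the full table.
    """
    months = [f"{m:02d}" for m in range(3 * (quarter - 1) + 1, 3 * quarter + 1)]
    in_list = ", ".join(f"'{m}'" for m in months)
    return f"WHERE year={year} AND month IN ({in_list})"
```

For Q1 2024 this yields exactly the predicate shown in the implementation list, restricting the scan to 3 of the roughly 60 partitions in a 5-year retention window.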


Scenario 2: ETL Run Quarantine Rate

Context: ETL processes 1M rows, some fail validation.

Quality Attribute: Reliability

Measure:

  • Quarantine rate: <5% invalid rows
  • Promotion gate: Blocks promotion if quarantine rate ≥5%
  • Alert threshold: Platform team notified if quarantine rate exceeds threshold

Implementation:

  • Validation engine: Schema + domain validation
  • Promotion gate: Lambda read_run_summary checks quarantine rate before Silver publish
  • Circuit breaker: Halts pipeline if >100 same errors/hour

Evidence: ETL Flow - Validation and promotion gate logic
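The gate and circuit-breaker checks above can be sketched as a single decision function over a run summary. The thresholds come from this scenario; the run-summary field names and the function itself are illustrative assumptions, not the read_run_summary Lambda's actual code.

```python
QUARANTINE_THRESHOLD = 0.05   # block promotion at >= 5% invalid rows
CIRCUIT_BREAKER_LIMIT = 100   # halt at > 100 identical errors per hour

def promotion_gate(run_summary):
    """Decide whether a run may be promoted to Silver.

    Quarantine rate must stay below the threshold, and no single error
    type may have tripped the circuit breaker.
    """
    total = run_summary["rows_total"]
    rate = run_summary["rows_quarantined"] / total if total else 0.0
    if rate >= QUARANTINE_THRESHOLD:
        return False, f"quarantine rate {rate:.1%} >= 5%"
    for error, count in run_summary.get("errors_per_hour", {}).items():
        if count > CIRCUIT_BREAKER_LIMIT:
            return False, f"circuit breaker: {error} x{count}/hour"
    return True, "promote"
```

Returning a reason string alongside the verdict keeps every blocked promotion explainable, which is what the platform-team alert carries.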


Scenario 3: Data Loss on Failure

Context: ETL run fails mid-execution, or deployment fails.

Quality Attribute: Reliability

Measure:

  • Data loss: 0 (zero tolerance)
  • Partial writes: Prevented via write-then-publish pattern
  • Rollback capability: Automated rollback to previous known-good state

Implementation:

  • run_id isolation: Each run writes to unique path, no overwrites
  • _SUCCESS markers: Consumers only read complete runs
  • Staging pointer pattern: CD deploys exact artifact CI validated

Evidence: Data Lake Architecture - Safe publishing pattern
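The consumer side of run_id isolation can be sketched as a listing filter: only runs with a _SUCCESS marker are visible. The prefix layout and function below are illustrative assumptions, not the project's actual paths.

```python
def latest_complete_run(object_keys, prefix="silver/transactions/"):
    """Return the newest run_id under `prefix` that has a _SUCCESS marker.

    Consumers ignore runs without the marker, so a failed mid-run write can
    never be read as data: each run lives in its own run_id path, and
    incomplete runs are simply invisible.
    """
    complete = sorted(
        key[len(prefix):].split("/")[0]
        for key in object_keys
        if key.startswith(prefix) and key.endswith("/_SUCCESS")
    )
    return complete[-1] if complete else None
```

Because each run writes to a fresh path and nothing is overwritten, "rollback" for data amounts to pointing consumers back at the previous complete run.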


Scenario 4: End-to-End Traceability

Context: Audit requires tracing single transaction through all systems.

Quality Attribute: Auditability

Measure:

  • Traceability: 100% runs traceable via execution ARN
  • Correlation: All services include run_id and execution_arn in logs/metrics
  • Audit trail: CloudTrail logs all API calls with execution context

Implementation:

  • Step Functions execution ARN: Canonical identifier propagated to all services
  • CloudWatch metrics: RunId and ExecutionArn dimensions
  • S3 metadata: _SUCCESS and _LATEST.json store correlation identifiers

Evidence: Traceability Design - Run identity propagation patterns
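The correlation pattern above amounts to stamping both identifiers on every log line. This is a minimal sketch assuming JSON-structured logs shipped to CloudWatch; the function name and extra fields are hypothetical.

```python
import json

def log_event(run_id, execution_arn, event, **fields):
    """Emit one structured log line carrying both correlation identifiers.

    Every service logs run_id and execution_arn, so a single transaction can
    be traced across Glue, Lambda, and Step Functions from CloudWatch alone.
    """
    record = {"run_id": run_id, "execution_arn": execution_arn, "event": event}
    record.update(fields)
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

With both identifiers on every line, a CloudWatch Logs Insights filter on either field reconstructs the full run timeline without any custom tracking store.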

