© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Traceability Design Principles
Overview
This document establishes the design principles for end-to-end traceability in the case study OLAP analytics platform. These principles ensure that every execution, data artifact, and operational event can be traced back to its origin without introducing unnecessary complexity or custom identifiers. Ohpen core banking (OLTP) is upstream and out of scope (Scope & Assumptions).
Core Principle: AWS-Native Identifiers
The design uses AWS-native identifiers and does not introduce custom UUIDs or separate run tables.
This principle aligns with our Design Decisions Summary for serverless-first architecture:
- Reduces operational complexity (no separate tracking infrastructure)
- Leverages AWS's built-in traceability (execution history via Step Functions, logs via CloudWatch, metrics)
- Ensures identifiers are unique, immutable, and automatically managed
- Prevents identifier proliferation and correlation gaps (see Tooling & Controls for service selection rationale)
Terminology Convention
📋 TERMINOLOGY STANDARD
- `run_id`: The identifier value itself (use in documentation, logs, S3 paths, CloudWatch dimensions)
- `--run-key`: CLI argument name only (implementation detail for passing run_id to Glue/Lambda)
- `execution name` or `Step Functions execution name`: AWS resource property that becomes run_id when orchestrated
- `execution ARN`: Full AWS resource identifier (canonical identifier for traceability)
Use `run_id` consistently in prose. Reserve `--run-key` for CLI examples only.
Quality Assurance Terminology
This solution uses multiple quality assurance concepts at different stages. This table clarifies the distinctions:
| Term | Scope | When | Purpose | Implementation |
|---|---|---|---|---|
| Row Validation | ETL logic | During Bronze → Silver transformation | Schema compliance, data type checks, business rules | Pandas/PySpark validation engine in ETL code |
| Promotion Gate | Deployment | Before Silver run → production (current/ prefix) | Quality control before making a run's Silver data the live, queryable dataset (not Silver→Gold) | Lambda function checks quarantine rate and row counts (blocks on critical errors) |
| Testing (CI/CD) | Code quality | On git push / PR | Verify transformation logic correctness | pytest unit tests, integration tests in GitHub Actions |
| Monitoring | Operations | Post-deployment, runtime | Detect anomalies, failures, performance issues | CloudWatch alarms, metrics, dashboards |
Key Distinction: Validation ensures data quality, promotion gate ensures release safety, testing ensures code correctness, monitoring ensures operational health.
Canonical Identifier
Primary: Step Functions Execution ARN
Canonical identifier: Step Functions Execution ARN
Format: arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME
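As a minimal sketch of how the format above can be used, the following splits an execution ARN into its components so that the execution name (the run_id when orchestrated) can be recovered from the canonical identifier. The `ExecutionArn` type and `parse_execution_arn` helper are hypothetical names introduced for illustration; they are not part of the platform code.

```python
from typing import NamedTuple


class ExecutionArn(NamedTuple):
    """Components of a Step Functions execution ARN (hypothetical helper type)."""
    region: str
    account: str
    state_machine: str
    execution_name: str


def parse_execution_arn(arn: str) -> ExecutionArn:
    """Split arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME."""
    parts = arn.split(":")
    # Eight colon-separated fields are expected; field 2 is the service and
    # field 5 is the resource type.
    if len(parts) != 8 or parts[2] != "states" or parts[5] != "execution":
        raise ValueError(f"not a Step Functions execution ARN: {arn!r}")
    return ExecutionArn(
        region=parts[3],
        account=parts[4],
        state_machine=parts[6],
        execution_name=parts[7],
    )
```

Because the ARN embeds the execution name, a service that receives only the ARN can still derive the run_id without an extra lookup.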
Why Step Functions Execution ARN?
Step Functions Execution ARN was chosen as the canonical identifier because:
- Unique per orchestration - 1:1 relationship with each ETL run when orchestrated
- Immutable - Cannot be changed after execution creation
- Queryable via AWS APIs - `ListExecutions`, `DescribeExecution`, `GetExecutionHistory` provide full execution context
- Correlates to CloudWatch Logs - Execution ARN appears in log streams, enabling log correlation
- No custom tracking infrastructure needed - Leverages AWS's built-in execution history and audit trail
- Available throughout execution lifecycle - Passed to all downstream services (Glue, Lambda) for end-to-end correlation
Alternative approaches considered and rejected:
- Glue JobRunId: Would make Glue the canonical source, breaking correlation when Lambda is used for small batches
- Custom UUID: Requires separate tracking table (DynamoDB or database), adds operational complexity
- Timestamp-based ID: Not guaranteed unique for parallel runs
Properties:
- Unique per execution
- Immutable
- Links to execution history via `DescribeExecution`, `GetExecutionHistory`
- Available throughout the execution lifecycle
- Correlates to CloudWatch Logs (log streams tagged with execution ARN)
Usage:
- Passed to all downstream services (Glue ETL, Lambda promotion gates)
- Stored in S3 metadata files (`_SUCCESS`, `_LATEST.json`) as described in Data Lake Architecture
- Included in CloudWatch metric dimensions (see CI/CD Workflow - Operational Monitoring)
- Included in SNS/SQS failure alerts (see Audit & Notifications)
- Used in CloudTrail for audit correlation (see Tooling & Controls)
Supporting Identifiers
The following identifiers support the canonical Step Functions Execution ARN. They form a hierarchy where each identifier traces back to the execution ARN.
Identifier Hierarchy:
Step Functions Execution ARN (canonical)
├── run_id (execution name)
├── EventBridge Event ID (trigger source)
└── Glue JobRunId
    └── Lambda RequestId (promotion)
        └── SNS/SQS MessageId (alerts)
Hierarchy Rules:
- Step Functions Execution ARN is the root - all other identifiers trace back to it
- run_id is derived from execution name (when orchestrated) or generated locally (when standalone)
- EventBridge Event ID links trigger event to execution
- Glue JobRunId and Lambda RequestId are service-specific and stored in Step Functions state
- SNS/SQS MessageIds are ephemeral but traceable via Step Functions failure handling
1. run_id (Step Functions Execution Name)
Definition: The execution identifier value, derived from Step Functions execution name when orchestrated, or generated locally when run standalone.
Format: YYYYMMDDTHHMMSSZ (ISO 8601 basic format, UTC) or the Step Functions execution name
Properties:
- Human-readable
- Used in S3 paths for run isolation
- Included in CloudWatch Logs structured fields
- Available in CloudWatch metric dimensions
Usage:
- S3 paths: `s3://bucket/prefix/run_id=VALUE/`
- CloudWatch metric dimension: `RunId`
- CloudWatch Logs field: `run_id`
- Success marker: `_SUCCESS` JSON
Terminology:
- `run_id`: The identifier value (preferred term in documentation and logs)
- `--run-key`: The CLI argument name when passing to Glue/Lambda (implementation detail)
When orchestrated by Step Functions:
run_id = $$.Execution.Name
When run standalone:
run_id = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
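The two cases above can be combined into one resolution step at job start. The sketch below assumes the job reads `--run-key` and `--execution-arn` from its command line (the argument names used in this document); `parse_args`, `resolve_run_id`, and `run_prefix` are hypothetical helper names, and the bucket and prefix values are placeholders.

```python
import argparse
from datetime import datetime, timezone
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Read traceability arguments; unknown arguments are ignored, not rejected."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-key", default=None)
    parser.add_argument("--execution-arn", default=None)
    # parse_known_args tolerates extra arguments injected by the runtime.
    args, _unknown = parser.parse_known_args(argv)
    return args


def resolve_run_id(run_key: Optional[str]) -> str:
    """Use the orchestrator-supplied value, else a UTC timestamp for standalone runs."""
    if run_key:
        return run_key
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def run_prefix(bucket: str, prefix: str, run_id: str) -> str:
    """Build the run-isolated S3 prefix s3://bucket/prefix/run_id=VALUE/."""
    return f"s3://{bucket}/{prefix.strip('/')}/run_id={run_id}/"
```

Resolving run_id once and passing it down keeps every S3 path, log field, and metric dimension for a run on the same value.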
2. Glue JobRunId
Definition: AWS Glue's native job run identifier
Format: jr_XXXXX (Glue-generated)
Properties:
- Unique per Glue job execution
- Immutable
- Links to Glue CloudWatch Logs: `/aws-glue/jobs/output` (log stream contains JobRunId)
- Returned by `glue:startJobRun.sync` in Step Functions
Usage:
- Stored in Step Functions state: `$.glue_result.JobRunId`
- Passed to Lambda: `promotion_input.glue_job_run_id`
- Stored in `_LATEST.json` for traceability
- Included in failure alerts (when available)
Correlation:
Step Functions Execution ARN → ValidateOutput state → $.glue_result.JobRunId
3. EventBridge Event ID
Definition: AWS EventBridge native event identifier for trigger provenance
Format: UUID (EventBridge-generated)
Properties:
- Unique per event
- Immutable
- Available in EventBridge event detail
Usage:
- Captured from S3 `ObjectCreated` events
- Passed to Glue: `--trigger-event-id`
- Stored in `_SUCCESS` marker: `trigger.event_id`
Correlation:
S3 Upload → EventBridge event → Step Functions → Glue → _SUCCESS marker
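To illustrate the first link in that chain, the sketch below extracts trigger provenance from an S3 "Object Created" event as delivered by EventBridge and shapes it like the `trigger` block of the `_SUCCESS` marker. The `trigger_from_event` name is hypothetical, and the sample values in the usage are invented for illustration.

```python
from typing import Any, Dict


def trigger_from_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Map an EventBridge S3 event onto the trigger fields stored in _SUCCESS."""
    s3_object = event.get("detail", {}).get("object", {})
    return {
        "event_id": event["id"],        # EventBridge-generated UUID
        "event_time": event["time"],    # ISO 8601 timestamp of the event
        "etag": s3_object.get("etag"),
        # Only present when bucket versioning is enabled.
        "version_id": s3_object.get("version-id"),
    }


sample_event = {
    "id": "abc-123",
    "detail-type": "Object Created",
    "source": "aws.s3",
    "time": "2026-01-29T12:00:00Z",
    "detail": {
        "bucket": {"name": "example-bronze-bucket"},  # placeholder name
        "object": {"key": "incoming/file.csv", "etag": "d41d8cd9"},
    },
}
trigger = trigger_from_event(sample_event)
```

The resulting dictionary can be handed to the job as `--trigger-event-id` (the `event_id` field) and written into the success marker unchanged.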
4. Lambda RequestId
Definition: AWS Lambda native request identifier
Format: UUID (Lambda-generated)
Properties:
- Unique per Lambda invocation
- Available in `context.aws_request_id`
- Links to CloudWatch Logs log stream
Usage:
- Logged in structured logs: `request_id` field
- Available in CloudWatch Logs Insights queries
Correlation:
Step Functions → Lambda invoke → context.aws_request_id → CloudWatch Logs
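A minimal sketch of that correlation, assuming the Lambda receives `execution_arn` and `run_key` in its event as described under the propagation rules. The handler body and the log message are illustrative only; the actual promotion logic is out of scope here.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Log one JSON line that ties the Lambda RequestId to the execution."""
    log_context = {
        "request_id": context.aws_request_id,
        "execution_arn": event.get("execution_arn"),
        "run_id": event.get("run_key"),
    }
    # JSON log lines keep request_id, execution_arn, and run_id available as
    # separate fields for log queries.
    logger.info(json.dumps({"message": "promotion gate started", **log_context}))
    return log_context
```

Emitting all three identifiers on the same line means a search on any one of them surfaces the other two.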
5. SNS MessageId / SQS MessageId
Definition: AWS SNS/SQS native message identifiers
Properties:
- Unique per message
- Immutable
- Links to message delivery and consumption events
Usage:
- Available in Lambda event when consuming from SQS
- Should be logged by SQS consumer Lambda (if implemented)
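If an SQS consumer Lambda is implemented, its logging could look like the sketch below. It assumes the message body is the JSON alert payload carrying `execution_arn` and `run_id`; when SNS delivers to SQS without raw message delivery, the body is an SNS envelope instead and would need one more level of unwrapping.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event: Dict[str, Any], context: Any) -> List[Dict[str, Any]]:
    """Log SQS message identifiers next to the pipeline identifiers they carry."""
    logged = []
    for record in event.get("Records", []):
        entry = {
            "message_id": record["messageId"],
            "receipt_handle": record["receiptHandle"],
            "request_id": context.aws_request_id,
        }
        # Assumption: the body is a JSON object; anything else is tolerated.
        try:
            body = json.loads(record.get("body", ""))
        except (json.JSONDecodeError, TypeError):
            body = {}
        if isinstance(body, dict):
            entry["execution_arn"] = body.get("execution_arn")
            entry["run_id"] = body.get("run_id")
        logger.info(json.dumps(entry))
        logged.append(entry)
    return logged
```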
Design Boundaries: Traceability and Auditability First
Traceability and auditability have top priority. No change that deteriorates them is acceptable.
The following alternatives were considered and rejected, or accepted as additive-only with documented tradeoffs. The decision was to change nothing for run identity, enrichment, and loop prevention so the current design remains the single source of truth.
Rejected: Glue-derived run_id
Using Glue JobRunId and job start time as run identity (instead of Step Functions execution name) was considered. Rejected: Step Functions execution is the canonical run identity; promotion Lambdas and S3 paths expect run_key from Step Functions. Switching to Glue-only run_id would dilute the execution chain and require broad changes. Traceability stays Step Functions–centric.
Rejected: DynamoDB for loop prevention
Using DynamoDB to store row_hash/attempt_count for duplicate detection (instead of loading quarantine Parquet from S3 each run) was considered. Rejected: (1) Traceability and auditability are top priority—quarantine in S3 is the single source of truth for audit; a second state store would complicate lineage. (2) Batch workload: one S3 read per run is sufficient; DynamoDB would be overkill. (3) No duplicate state: one authoritative store is maintained (S3) for quarantine and condemned data.
Enrichment: job-derived properties only
Metadata enrichment (row_hash, source_file_id, attempt_count, ingestion_timestamp) uses properties derived at job start or from row data, not values fetched from AWS APIs during enrichment. run_id and ingest_time come from Step Functions (when orchestrated) or from local job start; enrichment stamps rows with that context. This keeps a single, clear chain and avoids coupling enrichment to extra AWS calls.
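The following sketch shows enrichment that depends only on row data and job-start context, with no AWS calls. The hashing scheme (SHA-256 over a key-sorted JSON rendering of the row) is an assumption made for illustration; this document does not specify how row_hash is computed, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def row_hash(row: Dict[str, Any]) -> str:
    """Deterministic hash of the row content (assumed scheme: SHA-256 of sorted JSON)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def enrich(
    row: Dict[str, Any],
    *,
    run_id: str,
    source_file_id: str,
    ingest_time: str,
    attempt_count: int = 0,
) -> Dict[str, Any]:
    """Stamp a row with context fixed at job start; nothing is fetched here."""
    return {
        **row,
        "row_hash": row_hash(row),
        "source_file_id": source_file_id,
        "attempt_count": attempt_count,
        "ingestion_timestamp": ingest_time,
        "run_id": run_id,
    }
```

Because every input is passed in explicitly, the same row and context always produce the same enriched record, which keeps reruns comparable.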
Additive options considered (tradeoffs; these were not adopted)
Passing execution start time from Step Functions as ingest_time, Glue Data Catalog lineage, S3 object tags, and expanded CloudWatch/CloudTrail correlation were evaluated as additive options (they do not replace current identifiers). Each has tradeoffs (coupling, cost, maintenance, drift). They are not adopted in the current design; the current design is sufficient for traceability and auditability. If adopted later, they must remain additive and must not replace or obscure Step Functions execution identity or S3-based audit trail.
Identifier Propagation Rules
Rule 1: Always propagate execution_arn
Every downstream service invocation must receive the Step Functions execution ARN:
Step Functions → Glue: --execution-arn
Step Functions → Lambda: event.execution_arn
Glue → CloudWatch Metrics: dimension ExecutionArn
Glue → S3 (_SUCCESS): execution_arn field
Rule 2: Always propagate run_id
Every data artifact and log entry must include run_id:
Step Functions → Glue: --run-key (becomes run_id)
Glue → S3 paths: run_id=VALUE/
Glue → CloudWatch Logs: run_id field
Glue → CloudWatch Metrics: dimension RunId
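Rules 1 and 2 can be expressed directly in the state machine definition by reading both identifiers from the Step Functions context object. The sketch below writes two task states as Python dictionaries that serialize to Amazon States Language. The state names, job name, and function name are hypothetical placeholders, and the `$.glue_result.JobRunId` path follows the convention used in this document.

```python
import json

# Hypothetical state names and resource names; only the identifier wiring matters.
STATES = {
    "RunGlueEtl": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {
            "JobName": "example-etl-job",
            "Arguments": {
                # $$ reads from the context object: Name is the run_id,
                # Id is the execution ARN.
                "--run-key.$": "$$.Execution.Name",
                "--execution-arn.$": "$$.Execution.Id",
            },
        },
        "ResultPath": "$.glue_result",
        "Next": "PromotionGate",
    },
    "PromotionGate": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "example-promotion-gate",
            "Payload": {
                "execution_arn.$": "$$.Execution.Id",
                "run_key.$": "$$.Execution.Name",
                "glue_job_run_id.$": "$.glue_result.JobRunId",
            },
        },
        "End": True,
    },
}

definition = json.dumps({"StartAt": "RunGlueEtl", "States": STATES}, indent=2)
```

Sourcing both values from the context object in every task avoids threading them through state input, where an intermediate state could drop or overwrite them.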
Rule 3: Store correlation in metadata files
All S3 metadata files must include traceability identifiers:
_SUCCESS marker:
{
  "run_id": "20260129T120000Z",
  "execution_arn": "arn:aws:states:...",
  "trigger": {
    "event_id": "abc-123",
    "event_time": "2026-01-29T12:00:00Z",
    "etag": "...",
    "version_id": "..."
  },
  "metrics": {...}
}
_LATEST.json marker:
{
  "run_id": "20260129T120000Z",
  "glue_job_run_id": "jr_abc123",
  "execution_arn": "arn:aws:states:...",
  "promoted_at": "2026-01-29T12:05:00Z",
  "schema_version": "v1"
}
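One way to enforce Rule 3 is to build both markers through small helpers that refuse to serialize when a required identifier is missing. This is a sketch under that assumption; the helper names are hypothetical, and `glue_job_run_id` is treated as optional because a Glue run is not always involved.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def _require(**identifiers: Optional[str]) -> None:
    """Raise if any required traceability identifier is empty or None."""
    missing = [name for name, value in identifiers.items() if not value]
    if missing:
        raise ValueError(f"missing traceability identifiers: {missing}")


def build_success_marker(
    run_id: str, execution_arn: str, trigger: Dict[str, Any], metrics: Dict[str, Any]
) -> str:
    """Serialize the _SUCCESS marker body."""
    _require(run_id=run_id, execution_arn=execution_arn)
    return json.dumps(
        {"run_id": run_id, "execution_arn": execution_arn,
         "trigger": trigger, "metrics": metrics},
        indent=2,
    )


def build_latest_marker(
    run_id: str,
    execution_arn: str,
    glue_job_run_id: Optional[str] = None,
    schema_version: str = "v1",
    promoted_at: Optional[str] = None,
) -> str:
    """Serialize the _LATEST.json marker body."""
    _require(run_id=run_id, execution_arn=execution_arn)
    if promoted_at is None:
        promoted_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(
        {"run_id": run_id, "glue_job_run_id": glue_job_run_id,
         "execution_arn": execution_arn, "promoted_at": promoted_at,
         "schema_version": schema_version},
        indent=2,
    )
```

Failing fast at marker-build time surfaces a broken propagation chain in the run that caused it, rather than during a later audit.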
Rule 4: Include dimensions in CloudWatch metrics
All CloudWatch metrics must include dimensions for filtering:
dimensions = [
    {'Name': 'RunId', 'Value': run_id},
    {'Name': 'ExecutionArn', 'Value': execution_arn}
]
This enables:
- Filtering metrics by specific execution
- Alarming on per-run anomalies
- Correlating metrics with logs and S3 outputs
Rule 5: Enrich failure alerts with context
All failure alerts (SNS) must include:
- `execution_arn`: Link to Step Functions execution
- `run_id` (or `run_key`): Link to S3 outputs
- `glue_job_run_id`: Link to Glue CloudWatch Logs (when available)
- `timestamp`: When the failure occurred
This enables:
- One-click navigation from alert to logs
- One-click navigation from alert to S3 outputs
- Automated incident response
Anti-Patterns (Prohibited)
Prohibited: generate custom UUIDs
# BAD: Custom UUID introduces unnecessary identifier
import uuid
run_id = str(uuid.uuid4())

# GOOD: Use the Step Functions execution name, or a UTC timestamp when standalone
from datetime import datetime, timezone
run_id = args.run_key or datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
Prohibited: maintain separate run tables
# BAD: Separate DynamoDB/RDS table duplicates AWS execution history
create_table('etl_runs', columns=['run_id', 'status', 'start_time', ...])
# GOOD: Use Step Functions ListExecutions, DescribeExecution
executions = sfn.list_executions(stateMachineArn=state_machine_arn, statusFilter='RUNNING')
Prohibited: publish metrics without dimensions
# BAD: Metrics cannot be filtered by run or execution
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{'MetricName': 'InputRows', 'Value': 1000, 'Unit': 'Count'}]
)
# GOOD: Metrics can be filtered by run_id and execution_arn
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{
        'MetricName': 'InputRows',
        'Value': 1000,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'RunId', 'Value': run_id},
            {'Name': 'ExecutionArn', 'Value': execution_arn}
        ]
    }]
)
Prohibited: send alerts without correlation context
# BAD: Alert cannot be linked to execution or outputs
sns.publish(
    TopicArn=topic_arn,
    Message='ETL failed',
    Subject='Failure'
)
# GOOD: Alert includes all correlation identifiers
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({
        'alert_type': 'step_functions_failure',
        'execution_arn': execution_arn,
        'run_id': run_id,
        'glue_job_run_id': glue_job_run_id,
        'timestamp': timestamp
    }),
    Subject='Ohpen ETL Pipeline Failure'
)
Audit & Compliance
CloudTrail Integration
CloudTrail provides audit logging for AWS API calls. Management events are recorded by default; data events are recorded only where they are explicitly enabled:
- Management events: All infrastructure changes (Glue, S3, IAM, Step Functions)
- Data events: S3 object-level operations (GetObject, PutObject) for sensitive buckets (Gold, Quarantine)
Correlation:
- CloudTrail events include `userIdentity` (IAM role/user)
- Step Functions executions are logged with execution ARN
- Glue job runs are logged with JobRunId
- Lambda invocations are logged with RequestId
No additional work required: CloudTrail is provisioned and logs are retained per compliance policy.
Execution History Reconstruction
To reconstruct exactly what happened for a given execution:
- Query by execution ARN: `aws stepfunctions describe-execution --execution-arn ARN`
- Get execution history: `aws stepfunctions get-execution-history --execution-arn ARN`
- Find Glue JobRunId: Extract from `$.glue_result.JobRunId` in execution output
- Find CloudWatch Logs: Search by `run_id` or `execution_arn` in Logs Insights
- Find S3 outputs: `s3://bucket/prefix/run_id=VALUE/`
- Find CloudTrail events: Filter by `execution_arn` or `JobRunId` in CloudTrail logs
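The reconstruction steps above can be sketched as a single helper. It takes an already-created boto3 Step Functions client as a parameter (for example `boto3.client("stepfunctions")`), so the function itself has no AWS dependency. The function names, the bucket and prefix arguments, and the fields selected in the Logs Insights query are assumptions for illustration, and only the first page of execution history is read.

```python
import json
from typing import Any, Dict


def logs_insights_query(run_id: str, limit: int = 200) -> str:
    """Build a Logs Insights query that filters structured logs by run_id."""
    return (
        "fields @timestamp, @message\n"
        f'| filter run_id = "{run_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


def reconstruct(sfn_client: Any, execution_arn: str, bucket: str, prefix: str) -> Dict[str, Any]:
    """Collect the identifiers needed to follow one execution across services."""
    execution = sfn_client.describe_execution(executionArn=execution_arn)
    # First page only; follow nextToken for long-running executions.
    history = sfn_client.get_execution_history(executionArn=execution_arn, maxResults=1000)
    # Output is a JSON string and is absent for failed or running executions.
    output = json.loads(execution.get("output") or "{}")
    run_id = execution["name"]
    return {
        "run_id": run_id,
        "status": execution["status"],
        "glue_job_run_id": output.get("glue_result", {}).get("JobRunId"),
        "event_types": [e["type"] for e in history["events"]],
        "s3_prefix": f"s3://{bucket}/{prefix.strip('/')}/run_id={run_id}/",
        "logs_query": logs_insights_query(run_id),
    }
```

Passing the client in keeps the helper easy to exercise with a stub, and leaves credential and region handling to the caller.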
Implementation Checklist
When adding new services or workflows, verify:
- Step Functions execution ARN is passed to all downstream services
- run_id is included in all S3 paths, logs, and metrics
- CloudWatch metrics include `RunId` and `ExecutionArn` dimensions
- Failure alerts include `execution_arn`, `run_id`, and service-specific identifiers
- Metadata files (`_SUCCESS`, `_LATEST.json`) store all correlation identifiers
- No custom UUIDs or separate run tables are introduced
- Lambda functions log `context.aws_request_id`
- SQS consumers log `MessageId` and `ReceiptHandle`
See also
- Data Lake Architecture - How run_id enables safe backfills and run isolation
- ETL Flow - How run_id is used in ETL pipeline metadata and S3 paths
- CI/CD Workflow - Step Functions orchestration and execution ARN propagation
- Audit & Notifications - CloudTrail integration and SNS/SQS message correlation
- Runtime Scenarios - Operational flows showing traceability in action
Summary
Key Takeaways:
- Use Step Functions Execution ARN as the canonical identifier
- Propagate `execution_arn` and `run_id` to all services
- Store correlation identifiers in S3 metadata files
- Include dimensions in CloudWatch metrics
- Enrich failure alerts with full context
- Never introduce custom UUIDs or separate run tables
- Leverage AWS-native identifiers for traceability