© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Traceability Design Principles

Overview

This document establishes the design principles for end-to-end traceability in the case study OLAP analytics platform. These principles ensure that every execution, data artifact, and operational event can be traced back to its origin without introducing unnecessary complexity or custom identifiers. Ohpen core banking (OLTP) is upstream and out of scope (see Scope & Assumptions).

Core Principle: AWS-Native Identifiers

The design uses AWS-native identifiers and does not introduce custom UUIDs or separate run tables.

This principle aligns with our Design Decisions Summary for serverless-first architecture:

  • Reduces operational complexity (no separate tracking infrastructure)
  • Leverages AWS's built-in traceability (execution history via Step Functions, logs via CloudWatch, metrics)
  • Ensures identifiers are unique, immutable, and automatically managed
  • Prevents identifier proliferation and correlation gaps (see Tooling & Controls for service selection rationale)

Terminology Convention

📋 TERMINOLOGY STANDARD

  • run_id: The identifier value itself (use in documentation, logs, S3 paths, CloudWatch dimensions)
  • --run-key: CLI argument name only (implementation detail for passing run_id to Glue/Lambda)
  • execution name or Step Functions execution name: AWS resource property that becomes run_id when orchestrated
  • execution ARN: Full AWS resource identifier (canonical identifier for traceability)

Use run_id consistently in prose. Reserve --run-key for CLI examples only.


Quality Assurance Terminology

This solution uses multiple quality assurance concepts at different stages. This table clarifies the distinctions:

| Term | Scope | When | Purpose | Implementation |
|---|---|---|---|---|
| Row Validation | ETL logic | During Bronze → Silver transformation | Schema compliance, data type checks, business rules | Pandas/PySpark validation engine in ETL code |
| Promotion Gate | Deployment | Before Silver run → production (current/ prefix) | Quality control before making a run's Silver data the live, queryable dataset (not Silver→Gold) | Lambda function checks quarantine rate and row counts (blocks on critical errors) |
| Testing (CI/CD) | Code quality | On git push / PR | Verify transformation logic correctness | pytest unit tests, integration tests in GitHub Actions |
| Monitoring | Operations | Post-deployment, runtime | Detect anomalies, failures, performance issues | CloudWatch alarms, metrics, dashboards |

Key Distinction: Validation ensures data quality, promotion gate ensures release safety, testing ensures code correctness, monitoring ensures operational health.
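
To make the promotion gate row concrete, here is a minimal sketch of the kind of check such a Lambda might run. The function name, the 5% threshold, and the reconciliation rule (valid plus quarantined rows must equal input rows) are illustrative assumptions, not values taken from this design.

```python
from typing import Tuple


def promotion_allowed(
    input_rows: int,
    valid_rows: int,
    quarantined_rows: int,
    max_quarantine_rate: float = 0.05,  # hypothetical threshold
) -> Tuple[bool, str]:
    """Decide whether a Silver run may be promoted to the current/ prefix."""
    if input_rows <= 0:
        return False, "no input rows"
    # Assumed rule: every input row is either valid or quarantined.
    if valid_rows + quarantined_rows != input_rows:
        return False, "row counts do not reconcile"
    rate = quarantined_rows / input_rows
    if rate > max_quarantine_rate:
        return False, f"quarantine rate {rate:.1%} exceeds {max_quarantine_rate:.1%}"
    return True, "ok"
```

Returning a reason string alongside the boolean lets the gate's decision be logged next to run_id and execution_arn.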


Canonical Identifier

Primary: Step Functions Execution ARN

Canonical identifier: Step Functions Execution ARN

Format: arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME
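
Because the execution name is the last segment of the ARN, run_id can be recovered from the ARN alone. The sketch below assumes the standard execution ARN format shown above; ExecutionRef and parse_execution_arn are hypothetical names used for illustration.

```python
from typing import NamedTuple


class ExecutionRef(NamedTuple):
    region: str
    account: str
    state_machine: str
    run_id: str  # the Step Functions execution name


def parse_execution_arn(arn: str) -> ExecutionRef:
    """Split arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME."""
    parts = arn.split(":")
    # The format above has exactly eight colon-separated fields.
    if len(parts) != 8 or parts[2] != "states" or parts[5] != "execution":
        raise ValueError(f"not a Step Functions execution ARN: {arn!r}")
    return ExecutionRef(parts[3], parts[4], parts[6], parts[7])
```

For example, parsing an ARN that ends in :etl-pipeline:20260129T120000Z yields state_machine "etl-pipeline" and run_id "20260129T120000Z".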

Why Step Functions Execution ARN?

Step Functions Execution ARN was chosen as the canonical identifier because:

  1. Unique per orchestration - 1:1 relationship with each ETL run when orchestrated
  2. Immutable - Cannot be changed after execution creation
  3. Queryable via AWS APIs - ListExecutions, DescribeExecution, GetExecutionHistory provide full execution context
  4. Correlates to CloudWatch Logs - Execution ARN appears in log streams, enabling log correlation
  5. No custom tracking infrastructure needed - Leverages AWS's built-in execution history and audit trail
  6. Available throughout execution lifecycle - Passed to all downstream services (Glue, Lambda) for end-to-end correlation

Alternative approaches considered and rejected:

  • Glue JobRunId: Would make Glue the canonical source, breaking correlation when Lambda is used for small batches
  • Custom UUID: Requires separate tracking table (DynamoDB or database), adds operational complexity
  • Timestamp-based ID: Not guaranteed unique for parallel runs

Properties:

  • Unique per execution
  • Immutable
  • Links to execution history via DescribeExecution, GetExecutionHistory
  • Available throughout the execution lifecycle
  • Correlates to CloudWatch Logs (log streams tagged with execution ARN)

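Usage:

  • Passed to Glue: --execution-arn
  • Passed to Lambda: event.execution_arn
  • CloudWatch metric dimension: ExecutionArn
  • _SUCCESS marker: execution_arn field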


Supporting Identifiers

The following identifiers support the canonical Step Functions Execution ARN. They form a hierarchy where each identifier traces back to the execution ARN.

Identifier Hierarchy:

Step Functions Execution ARN (canonical)
├── run_id (execution name)
├── EventBridge Event ID (trigger source)
├── Glue JobRunId
├── Lambda RequestId (promotion)
└── SNS/SQS MessageId (alerts)

Hierarchy Rules:

  • Step Functions Execution ARN is the root - all other identifiers trace back to it
  • run_id is derived from execution name (when orchestrated) or generated locally (when standalone)
  • EventBridge Event ID links trigger event to execution
  • Glue JobRunId and Lambda RequestId are service-specific and stored in Step Functions state
  • SNS/SQS MessageIds are ephemeral but traceable via Step Functions failure handling

1. run_id (Step Functions Execution Name)

Definition: The execution identifier value, derived from Step Functions execution name when orchestrated, or generated locally when run standalone.

Format: YYYYMMDDTHHMMSSZ (ISO 8601 basic format, UTC) or Step Functions execution name

Properties:

  • Human-readable
  • Used in S3 paths for run isolation
  • Included in CloudWatch Logs structured fields
  • Available in CloudWatch metric dimensions

Usage:

  • S3 paths: s3://bucket/prefix/run_id=VALUE/
  • CloudWatch metric dimension: RunId
  • CloudWatch Logs field: run_id
  • Success marker: _SUCCESS JSON

Terminology:

  • run_id: The identifier value (preferred term in documentation and logs)
  • --run-key: The CLI argument name when passing to Glue/Lambda (implementation detail)

When orchestrated by Step Functions:

run_id = $$.Execution.Name

When run standalone:

run_id = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
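
The two cases can be folded into one helper so that every entry point resolves run_id the same way. This is a minimal sketch; resolve_run_id and run_prefix are hypothetical names, and the bucket and prefix values are placeholders.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_run_id(run_key: Optional[str] = None, now: Optional[datetime] = None) -> str:
    """Use the orchestrator-supplied value if present, else a UTC timestamp."""
    if run_key:
        return run_key
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def run_prefix(bucket: str, prefix: str, run_id: str) -> str:
    """Build the run-isolated S3 location s3://bucket/prefix/run_id=VALUE/."""
    return f"s3://{bucket}/{prefix.strip('/')}/run_id={run_id}/"
```

The optional now parameter exists only to make the standalone branch deterministic in tests.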

2. Glue JobRunId

Definition: AWS Glue's native job run identifier

Format: jr_XXXXX (Glue-generated)

Properties:

  • Unique per Glue job execution
  • Immutable
  • Links to Glue CloudWatch Logs: /aws-glue/jobs/output (log stream contains JobRunId)
  • Returned by the glue:startJobRun.sync service integration in Step Functions

Usage:

  • Stored in Step Functions state: $.glue_result.JobRunId
  • Passed to Lambda: promotion_input.glue_job_run_id
  • Stored in _LATEST.json for traceability
  • Included in failure alerts (when available)

Correlation:

Step Functions Execution ARN → ValidateOutput state → $.glue_result.JobRunId
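
As an illustration of how this correlation could be wired up, the sketch below expresses a Step Functions task state as a Python dict. The job name and the Next target are assumptions; $$.Execution.Name and $$.Execution.Id are the context-object fields for the execution name and execution ARN. Which field of the integration result carries the job run ID depends on the response shape, so treat the $.glue_result.JobRunId path used in this document as something to verify against a real execution.

```python
import json

# Task state that starts the Glue job and stores its result under $.glue_result.
glue_task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
        "JobName": "example-etl-job",  # hypothetical job name
        "Arguments": {
            # The ".$" suffix tells Step Functions to resolve the value as a path.
            "--run-key.$": "$$.Execution.Name",
            "--execution-arn.$": "$$.Execution.Id",
        },
    },
    "ResultPath": "$.glue_result",
    "Next": "ValidateOutput",  # assumed to be the following state
}

print(json.dumps(glue_task_state, indent=2))
```

Using ResultPath rather than replacing the whole state input keeps the original trigger context available to later states.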

3. EventBridge Event ID

Definition: AWS EventBridge native event identifier for trigger provenance

Format: UUID (EventBridge-generated)

Properties:

  • Unique per event
  • Immutable
  • Available as the top-level id field of the EventBridge event envelope

Usage:

  • Captured from S3 ObjectCreated events
  • Passed to Glue: --trigger-event-id
  • Stored in _SUCCESS marker: trigger.event_id

Correlation:

S3 Upload → EventBridge event → Step Functions → Glue → _SUCCESS marker
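
A minimal sketch of capturing trigger provenance from the event, assuming the standard EventBridge envelope (top-level id and time) and the S3 "Object Created" detail layout (detail.object.etag, detail.object.version-id). The function name trigger_from_event is hypothetical; the output keys mirror the trigger block of the _SUCCESS marker.

```python
def trigger_from_event(event: dict) -> dict:
    """Extract the identifiers stored under "trigger" in the _SUCCESS marker."""
    obj = event.get("detail", {}).get("object", {})
    return {
        "event_id": event["id"],      # EventBridge-generated UUID
        "event_time": event["time"],  # ISO 8601 timestamp from the envelope
        "etag": obj.get("etag"),
        "version_id": obj.get("version-id"),  # absent on unversioned buckets
    }


sample_event = {
    "id": "abc-123",
    "time": "2026-01-29T12:00:00Z",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "bucket"}, "object": {"key": "in/file.csv", "etag": "e1"}},
}
print(trigger_from_event(sample_event))
```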

4. Lambda RequestId

Definition: AWS Lambda native request identifier

Format: UUID (Lambda-generated)

Properties:

  • Unique per Lambda invocation
  • Available in context.aws_request_id
  • Links to CloudWatch Logs log stream

Usage:

  • Logged in structured logs: request_id field
  • Available in CloudWatch Logs Insights queries

Correlation:

Step Functions → Lambda invoke → context.aws_request_id → CloudWatch Logs
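
A minimal sketch of a handler that emits one structured log line tying the Lambda RequestId to the run. The event field names run_id and execution_arn are assumptions about the payload Step Functions passes; context.aws_request_id is the attribute Lambda provides.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    """Log correlation identifiers as JSON so Logs Insights can filter on them."""
    record = {
        "request_id": context.aws_request_id,
        "run_id": event.get("run_id"),                # assumed event field
        "execution_arn": event.get("execution_arn"),  # per Rule 1
        "message": "promotion started",
    }
    logger.info(json.dumps(record))
    return record
```

Emitting the record as a single JSON object lets Logs Insights discover request_id, run_id, and execution_arn as fields without a parse step.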

5. SNS MessageId / SQS MessageId

Definition: AWS SNS/SQS native message identifiers

Properties:

  • Unique per message
  • Immutable
  • Links to message delivery and consumption events

Usage:

  • Available in Lambda event when consuming from SQS
  • Should be logged by SQS consumer Lambda (if implemented)
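
If an SQS consumer Lambda is implemented, the logging could look like the sketch below. It assumes the standard SQS event shape (Records[].messageId) and, for SNS-to-SQS delivery without raw message delivery, an SNS envelope in the body carrying MessageId. The name sqs_handler and the log field names are hypothetical.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def sqs_handler(event, context):
    """Log the SQS MessageId (and the SNS MessageId, if present) per record."""
    logged = []
    for record in event.get("Records", []):
        entry = {
            "request_id": context.aws_request_id,
            "sqs_message_id": record["messageId"],
        }
        try:
            body = json.loads(record.get("body", ""))
        except ValueError:
            body = None  # body is not JSON; nothing more to correlate
        if isinstance(body, dict) and "MessageId" in body:
            entry["sns_message_id"] = body["MessageId"]
        logger.info(json.dumps(entry))
        logged.append(entry)
    return logged
```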

Design Boundaries: Traceability and Auditability First

Traceability and auditability have top priority. No change that deteriorates them is acceptable.

The following alternatives were considered and rejected, or accepted as additive-only with documented tradeoffs. The decision was to change nothing for run identity, enrichment, and loop prevention so the current design remains the single source of truth.

Rejected: Glue-derived run_id

Using Glue JobRunId and job start time as run identity (instead of Step Functions execution name) was considered. Rejected: Step Functions execution is the canonical run identity; promotion Lambdas and S3 paths expect run_key from Step Functions. Switching to Glue-only run_id would dilute the execution chain and require broad changes. Traceability stays Step Functions–centric.

Rejected: DynamoDB for loop prevention

Using DynamoDB to store row_hash/attempt_count for duplicate detection (instead of loading quarantine Parquet from S3 each run) was considered. Rejected: (1) Traceability and auditability are top priority: quarantine in S3 is the single source of truth for audit, and a second state store would complicate lineage. (2) Batch workload: one S3 read per run is sufficient; DynamoDB would be overkill. (3) No duplicate state: one authoritative store is maintained (S3) for quarantine and condemned data.

Enrichment: job-derived properties only

Metadata enrichment (row_hash, source_file_id, attempt_count, ingestion_timestamp) uses properties derived at job start or from row data, not values fetched from AWS APIs during enrichment. run_id and ingest_time come from Step Functions (when orchestrated) or from local job start; enrichment stamps rows with that context. This keeps a single, clear chain and avoids coupling enrichment to extra AWS calls.
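
A minimal sketch of enrichment that uses only job-start context and row data, with no AWS calls. The hashing scheme (SHA-256 over a key-sorted JSON encoding) is an assumption chosen for illustration; the source does not specify how row_hash is computed.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Deterministic hash of the row content, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def enrich(row: dict, *, run_id: str, ingest_time: str,
           source_file_id: str, attempt_count: int = 0) -> dict:
    """Stamp a row with metadata derived at job start or from the row itself."""
    return {
        **row,
        "row_hash": row_hash(row),
        "source_file_id": source_file_id,
        "attempt_count": attempt_count,
        "ingestion_timestamp": ingest_time,
        "run_id": run_id,
    }
```

Because run_id and ingest_time are passed in once per job, every row in a run carries the same context, which keeps the chain back to the execution unambiguous.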

Additive options considered (tradeoffs; these were not adopted)

Passing execution start time from Step Functions as ingest_time, Glue Data Catalog lineage, S3 object tags, and expanded CloudWatch/CloudTrail correlation were evaluated as additive options (they do not replace current identifiers). Each has tradeoffs (coupling, cost, maintenance, drift). They are not adopted in the current design; the current design is sufficient for traceability and auditability. If adopted later, they must remain additive and must not replace or obscure Step Functions execution identity or S3-based audit trail.


Identifier Propagation Rules

Rule 1: Always propagate execution_arn

Every downstream service invocation must receive the Step Functions execution ARN:

Step Functions → Glue:     --execution-arn
Step Functions → Lambda:   event.execution_arn
Glue → CloudWatch Metrics: dimension ExecutionArn
Glue → S3 (_SUCCESS):      execution_arn field

Rule 2: Always propagate run_id

Every data artifact and log entry must include run_id:

Step Functions → Glue:     --run-key (becomes run_id)
Glue → S3 paths:           run_id=VALUE/
Glue → CloudWatch Logs:    run_id field
Glue → CloudWatch Metrics: dimension RunId

Rule 3: Store correlation in metadata files

All S3 metadata files must include traceability identifiers:

_SUCCESS marker:

{
  "run_id": "20260129T120000Z",
  "execution_arn": "arn:aws:states:...",
  "trigger": {
    "event_id": "abc-123",
    "event_time": "2026-01-29T12:00:00Z",
    "etag": "...",
    "version_id": "..."
  },
  "metrics": {...}
}

_LATEST.json marker:

{
  "run_id": "20260129T120000Z",
  "glue_job_run_id": "jr_abc123",
  "execution_arn": "arn:aws:states:...",
  "promoted_at": "2026-01-29T12:05:00Z",
  "schema_version": "v1"
}
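
One way to enforce this rule is to validate a marker before writing it. The sketch below is an assumption about how such a check might look; the required-field lists are read off the two examples above, and missing_identifiers is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Dotted paths that must be present and non-empty in each marker file.
REQUIRED: Dict[str, Tuple[str, ...]] = {
    "_SUCCESS": ("run_id", "execution_arn", "trigger.event_id"),
    "_LATEST.json": ("run_id", "execution_arn", "glue_job_run_id"),
}


def missing_identifiers(marker_name: str, marker: dict) -> List[str]:
    """Return the required identifier paths that are absent or empty."""
    missing = []
    for path in REQUIRED[marker_name]:
        node = marker
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(path)
    return missing
```

A writer could refuse to upload the marker, or fail the run, when the returned list is non-empty.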

Rule 4: Include dimensions in CloudWatch metrics

All CloudWatch metrics must include dimensions for filtering:

dimensions = [
    {'Name': 'RunId', 'Value': run_id},
    {'Name': 'ExecutionArn', 'Value': execution_arn},
]

This enables:

  • Filtering metrics by specific execution
  • Alarming on per-run anomalies
  • Correlating metrics with logs and S3 outputs

Rule 5: Enrich failure alerts with context

All failure alerts (SNS) must include:

  • execution_arn: Link to Step Functions execution
  • run_id (or run_key): Link to S3 outputs
  • glue_job_run_id: Link to Glue CloudWatch Logs (when available)
  • timestamp: When the failure occurred

This enables:

  • One-click navigation from alert to logs
  • One-click navigation from alert to S3 outputs
  • Automated incident response
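
A minimal sketch of building the alert body so that the mandatory identifiers are enforced and glue_job_run_id is included only when available. The function name is hypothetical; the alert_type value follows the example used elsewhere in this document.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_failure_alert(execution_arn: str, run_id: str,
                        glue_job_run_id: Optional[str] = None,
                        timestamp: Optional[str] = None) -> str:
    """Return the JSON message body for a failure alert."""
    if not execution_arn or not run_id:
        raise ValueError("execution_arn and run_id are mandatory in alerts")
    payload = {
        "alert_type": "step_functions_failure",
        "execution_arn": execution_arn,
        "run_id": run_id,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }
    # The Glue run ID does not exist if the failure happened before Glue started.
    if glue_job_run_id:
        payload["glue_job_run_id"] = glue_job_run_id
    return json.dumps(payload)
```

The returned string can be passed as the Message argument of an SNS publish call.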

Anti-Patterns (Prohibited)

Prohibited: generate custom UUIDs

# BAD: Custom UUID introduces unnecessary identifier
import uuid
run_id = str(uuid.uuid4())

# GOOD: Use Step Functions execution name or timestamp
# (args is the parsed CLI namespace; --run-key is populated when orchestrated)
from datetime import datetime, timezone
run_id = args.run_key or datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')

Prohibited: maintain separate run tables

# BAD: Separate DynamoDB/RDS table duplicates AWS execution history
create_table('etl_runs', columns=['run_id', 'status', 'start_time', ...])

# GOOD: Use Step Functions ListExecutions, DescribeExecution
import boto3
sfn = boto3.client('stepfunctions')
executions = sfn.list_executions(stateMachineArn=state_machine_arn, statusFilter='RUNNING')

Prohibited: publish metrics without dimensions

import boto3
cloudwatch = boto3.client('cloudwatch')

# BAD: Metrics cannot be filtered by run or execution
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{'MetricName': 'InputRows', 'Value': 1000, 'Unit': 'Count'}]
)

# GOOD: Metrics can be filtered by run_id and execution_arn
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{
        'MetricName': 'InputRows',
        'Value': 1000,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'RunId', 'Value': run_id},
            {'Name': 'ExecutionArn', 'Value': execution_arn}
        ]
    }]
)

Prohibited: send alerts without correlation context

import json
import boto3
sns = boto3.client('sns')

# BAD: Alert cannot be linked to execution or outputs
sns.publish(
    TopicArn=topic_arn,
    Message='ETL failed',
    Subject='Failure'
)

# GOOD: Alert includes all correlation identifiers
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({
        'alert_type': 'step_functions_failure',
        'execution_arn': execution_arn,
        'run_id': run_id,
        'glue_job_run_id': glue_job_run_id,
        'timestamp': timestamp
    }),
    Subject='Ohpen ETL Pipeline Failure'
)

Audit & Compliance

CloudTrail Integration

CloudTrail provides automatic audit logging for all AWS API calls:

  • Management events: All infrastructure changes (Glue, S3, IAM, Step Functions)
  • Data events: S3 object-level operations (GetObject, PutObject) for sensitive buckets (Gold, Quarantine)

Correlation:

  • CloudTrail events include userIdentity (IAM role/user)
  • Step Functions executions are logged with execution ARN
  • Glue job runs are logged with JobRunId
  • Lambda invocations are logged with RequestId

No additional work required: CloudTrail is provisioned and logs are retained per compliance policy.

Execution History Reconstruction

To reconstruct exactly what happened for a given execution:

  1. Query by execution ARN: aws stepfunctions describe-execution --execution-arn ARN
  2. Get execution history: aws stepfunctions get-execution-history --execution-arn ARN
  3. Find Glue JobRunId: Extract from $.glue_result.JobRunId in execution output
  4. Find CloudWatch Logs: Search by run_id or execution_arn in Logs Insights
  5. Find S3 outputs: s3://bucket/prefix/run_id=VALUE/
  6. Find CloudTrail events: Filter by execution_arn or JobRunId in CloudTrail logs
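
Steps 1, 3, 4, and 5 can be sketched as a single lookup. The function below takes a Step Functions client (for example boto3.client('stepfunctions')) as a parameter and calls only describe_execution; the function name, the bucket and prefix arguments, and the Logs Insights query text are illustrative assumptions. It also assumes the final execution output still contains glue_result, which holds only if later states preserve it.

```python
import json


def reconstruct(sfn, execution_arn: str, bucket: str, prefix: str) -> dict:
    """Collect the pointers needed to investigate one execution."""
    desc = sfn.describe_execution(executionArn=execution_arn)
    # "output" is a JSON string and is absent while running or after a failure.
    output = json.loads(desc.get("output") or "{}")
    run_id = desc["name"]  # the execution name doubles as run_id
    return {
        "execution_arn": execution_arn,
        "status": desc["status"],
        "run_id": run_id,
        "glue_job_run_id": output.get("glue_result", {}).get("JobRunId"),
        "s3_prefix": f"s3://{bucket}/{prefix}/run_id={run_id}/",
        "logs_insights_query": (
            f'fields @timestamp, @message | filter run_id = "{run_id}" '
            "| sort @timestamp asc"
        ),
    }
```

When the output does not carry glue_result (for instance on a failed execution), the JobRunId can still be recovered from the state-level events returned by get-execution-history.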

Implementation Checklist

When adding new services or workflows, verify:

  • Step Functions execution ARN is passed to all downstream services
  • run_id is included in all S3 paths, logs, and metrics
  • CloudWatch metrics include RunId and ExecutionArn dimensions
  • Failure alerts include execution_arn, run_id, and service-specific identifiers
  • Metadata files (_SUCCESS, _LATEST.json) store all correlation identifiers
  • No custom UUIDs or separate run tables are introduced
  • Lambda functions log context.aws_request_id
  • SQS consumers log MessageId and ReceiptHandle


Summary

Key Takeaways:

  1. Use Step Functions Execution ARN as the canonical identifier
  2. Propagate execution_arn and run_id to all services
  3. Store correlation identifiers in S3 metadata files
  4. Include dimensions in CloudWatch metrics
  5. Enrich failure alerts with full context
  6. Never introduce custom UUIDs or separate run tables
  7. Leverage AWS-native identifiers for traceability
© 2026 Stephen Adei · CC BY 4.0