© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Traceability Design Principles

Overview

This document establishes the design principles for end-to-end traceability in the case study OLAP analytics platform. These principles ensure that every execution, data artifact, and operational event can be traced back to its origin without introducing unnecessary complexity or custom identifiers. Ohpen core banking (OLTP) is upstream and out of scope (see Scope & Assumptions).

Core Principle: AWS-Native Identifiers

The design uses AWS-native identifiers and does not introduce custom UUIDs or separate run tables.

This principle aligns with our Design Decisions Summary for serverless-first architecture:

Reduces operational complexity (no separate tracking infrastructure)
Leverages AWS's built-in traceability (execution history via Step Functions, logs via CloudWatch, metrics)
Ensures identifiers are unique, immutable, and automatically managed
Prevents identifier proliferation and correlation gaps (see Tooling & Controls for service selection rationale)

Terminology Convention

📋 TERMINOLOGY STANDARD

run_id: The identifier value itself (use in documentation, logs, S3 paths, CloudWatch dimensions)

--run-key: CLI argument name only (implementation detail for passing run_id to Glue/Lambda)

execution name or Step Functions execution name: AWS resource property that becomes run_id when orchestrated

execution ARN: Full AWS resource identifier (canonical identifier for traceability)

Use run_id consistently in prose. Reserve --run-key for CLI examples only.

Quality Assurance Terminology

This solution uses multiple quality assurance concepts at different stages. This table clarifies the distinctions:

Term	Scope	When	Purpose	Implementation
Row Validation	ETL logic	During Bronze → Silver transformation	Schema compliance, data type checks, business rules	Pandas/PySpark validation engine in ETL code
Promotion Gate	Deployment	Before Silver run → production (`current/` prefix)	Quality control before making a run's Silver data the live, queryable dataset (not Silver→Gold)	Lambda function checks quarantine rate and row counts (blocks on critical errors)
Testing (CI/CD)	Code quality	On git push / PR	Verify transformation logic correctness	pytest unit tests, integration tests in GitHub Actions
Monitoring	Operations	Post-deployment, runtime	Detect anomalies, failures, performance issues	CloudWatch alarms, metrics, dashboards

Key Distinction: Validation ensures data quality, promotion gate ensures release safety, testing ensures code correctness, monitoring ensures operational health.
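
To make the Promotion Gate row concrete, the check it describes (quarantine rate plus row counts) can be sketched as a pure function. This is a minimal sketch, not the platform's actual Lambda: the function name, the GateDecision type, and the 5% default threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    promote: bool
    reason: str


def evaluate_promotion(input_rows: int, valid_rows: int, quarantined_rows: int,
                       max_quarantine_rate: float = 0.05) -> GateDecision:
    """Decide whether a Silver run may be promoted to the current/ prefix.

    The 5% default threshold is a hypothetical value, not taken from the source.
    """
    if input_rows <= 0:
        return GateDecision(False, "no input rows")
    # Every input row must end up either valid or quarantined.
    if valid_rows + quarantined_rows != input_rows:
        return GateDecision(False, "row counts do not reconcile")
    rate = quarantined_rows / input_rows
    if rate > max_quarantine_rate:
        return GateDecision(False, f"quarantine rate {rate:.2%} exceeds threshold")
    return GateDecision(True, "ok")
```

Keeping the decision separate from the Lambda handler makes the gate easy to unit test in CI/CD, which matches the Testing row of the table.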

Canonical Identifier

Primary: Step Functions Execution ARN

Canonical identifier: Step Functions Execution ARN

Format: arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME
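
Because the execution name is the last segment of the ARN, run_id can be recovered from the ARN alone. The following is a minimal sketch of that parsing, assuming the standard aws partition and the format shown above; the ExecutionArn type and parse_execution_arn name are hypothetical.

```python
from typing import NamedTuple


class ExecutionArn(NamedTuple):
    region: str
    account: str
    state_machine: str
    execution_name: str  # doubles as run_id when orchestrated


def parse_execution_arn(arn: str) -> ExecutionArn:
    """Split a Step Functions execution ARN into its components."""
    # maxsplit=7 keeps everything after the state machine name in one piece.
    parts = arn.split(":", 7)
    if (len(parts) != 8
            or parts[:3] != ["arn", "aws", "states"]
            or parts[5] != "execution"):
        raise ValueError(f"not a Step Functions execution ARN: {arn!r}")
    return ExecutionArn(parts[3], parts[4], parts[6], parts[7])
```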

Why Step Functions Execution ARN?

Step Functions Execution ARN was chosen as the canonical identifier because:

Unique per orchestration - 1:1 relationship with each ETL run when orchestrated
Immutable - Cannot be changed after execution creation
Queryable via AWS APIs - ListExecutions, DescribeExecution, GetExecutionHistory provide full execution context
Correlates to CloudWatch Logs - Execution ARN appears in log streams, enabling log correlation
No custom tracking infrastructure needed - Leverages AWS's built-in execution history and audit trail
Available throughout execution lifecycle - Passed to all downstream services (Glue, Lambda) for end-to-end correlation

Alternative approaches considered and rejected:

Glue JobRunId: Would make Glue the canonical source, breaking correlation when Lambda is used for small batches
Custom UUID: Requires separate tracking table (DynamoDB or database), adds operational complexity
Timestamp-based ID: Not guaranteed unique for parallel runs

Properties:

Unique per execution
Immutable
Links to execution history via DescribeExecution, GetExecutionHistory
Available throughout the execution lifecycle
Correlates to CloudWatch Logs (log streams tagged with execution ARN)

Usage:

Passed to all downstream services (Glue ETL, Lambda promotion gates)
Stored in S3 metadata files (_SUCCESS, _LATEST.json) as described in Data Lake Architecture
Included in CloudWatch metric dimensions (see CI/CD Workflow - Operational Monitoring)
Included in SNS/SQS failure alerts (see Audit & Notifications)
Used in CloudTrail for audit correlation (see Tooling & Controls)

Supporting Identifiers

The following identifiers support the canonical Step Functions Execution ARN. They form a hierarchy where each identifier traces back to the execution ARN.

Identifier Hierarchy:

Step Functions Execution ARN (canonical)
├── run_id (execution name)
├── EventBridge Event ID (trigger source)
└── Glue JobRunId
    └── Lambda RequestId (promotion)
        └── SNS/SQS MessageId (alerts)

Hierarchy Rules:

Step Functions Execution ARN is the root - all other identifiers trace back to it
run_id is derived from execution name (when orchestrated) or generated locally (when standalone)
EventBridge Event ID links trigger event to execution
Glue JobRunId and Lambda RequestId are service-specific and stored in Step Functions state
SNS/SQS MessageIds are ephemeral but traceable via Step Functions failure handling

1. run_id (Step Functions Execution Name)

Definition: The execution identifier value, derived from Step Functions execution name when orchestrated, or generated locally when run standalone.

Format: YYYYMMDDTHHMMSSZ (ISO 8601 basic format, UTC) or Step Functions execution name

Properties:

Human-readable
Used in S3 paths for run isolation
Included in CloudWatch Logs structured fields
Available in CloudWatch metric dimensions

Usage:

S3 paths: s3://bucket/prefix/run_id=VALUE/
CloudWatch metric dimension: RunId
CloudWatch Logs field: run_id
Success marker: _SUCCESS JSON

Terminology:

run_id: The identifier value (preferred term in documentation and logs)
--run-key: The CLI argument name when passing to Glue/Lambda (implementation detail)

When orchestrated by Step Functions:

run_id = $$.Execution.Name

When run standalone:

run_id = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
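
The two cases above can be folded into one resolver. This is a sketch under stated assumptions: argparse stands in for whatever argument parsing the job really uses (a Glue job would more likely read its arguments through the Glue utilities), and the resolve_run_id name and injectable now parameter are hypothetical, added only to make the fallback testable.

```python
import argparse
from datetime import datetime, timezone
from typing import Callable, Optional, Sequence


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def resolve_run_id(argv: Optional[Sequence[str]] = None,
                   now: Callable[[], datetime] = _utc_now) -> str:
    """Return --run-key when supplied (orchestrated), else a UTC timestamp."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-key", default=None)
    # Ignore unrelated job arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    if args.run_key:
        return args.run_key
    return now().strftime("%Y%m%dT%H%M%SZ")
```

Passing the clock in as a parameter keeps the standalone branch deterministic in tests without changing the production default.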

2. Glue JobRunId

Definition: AWS Glue's native job run identifier

Format: jr_XXXXX (Glue-generated)

Properties:

Unique per Glue job execution
Immutable
Links to Glue CloudWatch Logs: /aws-glue/jobs/output (log stream contains JobRunId)
Returned by the glue:startJobRun.sync service integration in Step Functions

Usage:

Stored in Step Functions state: $.glue_result.JobRunId
Passed to Lambda: promotion_input.glue_job_run_id
Stored in _LATEST.json for traceability
Included in failure alerts (when available)

Correlation:

Step Functions Execution ARN → ValidateOutput state → $.glue_result.JobRunId
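
As an illustration of this correlation, the JobRunId can be read back from the execution output, which Step Functions returns as a JSON string. The sketch assumes the glue_result key is still present in the final output; the function name is hypothetical.

```python
import json
from typing import Optional


def extract_glue_job_run_id(execution_output: str) -> Optional[str]:
    """Read $.glue_result.JobRunId from an execution output JSON string."""
    state = json.loads(execution_output)
    # Tolerate executions that failed before the Glue step produced a result.
    return (state.get("glue_result") or {}).get("JobRunId")
```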

3. EventBridge Event ID

Definition: AWS EventBridge native event identifier for trigger provenance

Format: UUID (EventBridge-generated)

Properties:

Unique per event
Immutable
Available as the top-level id field of the EventBridge event envelope

Usage:

Captured from S3 ObjectCreated events
Passed to Glue: --trigger-event-id
Stored in _SUCCESS marker: trigger.event_id

Correlation:

S3 Upload → EventBridge event → Step Functions → Glue → _SUCCESS marker
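
The trigger block stored in the _SUCCESS marker can be derived directly from the EventBridge event. This is a minimal sketch assuming the S3 "Object Created" event shape delivered through EventBridge, where the event ID and time sit on the envelope and the object attributes sit under detail.object; the build_trigger_block name is hypothetical.

```python
from typing import Any, Dict


def build_trigger_block(event: Dict[str, Any]) -> Dict[str, Any]:
    """Map an EventBridge S3 event to the trigger fields of _SUCCESS."""
    obj = event.get("detail", {}).get("object", {})
    return {
        "event_id": event["id"],        # envelope-level event identifier
        "event_time": event["time"],    # envelope-level event timestamp
        "etag": obj.get("etag"),
        # Only present when bucket versioning is enabled.
        "version_id": obj.get("version-id"),
    }
```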

4. Lambda RequestId

Definition: AWS Lambda native request identifier

Format: UUID (Lambda-generated)

Properties:

Unique per Lambda invocation
Available in context.aws_request_id
Links to CloudWatch Logs log stream

Usage:

Logged in structured logs: request_id field
Available in CloudWatch Logs Insights queries

Correlation:

Step Functions → Lambda invoke → context.aws_request_id → CloudWatch Logs
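
A sketch of how a Lambda function might emit request_id alongside the run identifiers in one structured log line. The event field names follow the propagation rules in this document (event.execution_arn, run_id); the handler and helper names are hypothetical.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def build_log_record(event: Dict[str, Any], context: Any, message: str) -> Dict[str, Any]:
    """Combine the Lambda request ID with the run-level identifiers."""
    return {
        "message": message,
        "request_id": context.aws_request_id,
        "run_id": event.get("run_id"),
        "execution_arn": event.get("execution_arn"),
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    record = build_log_record(event, context, "promotion gate invoked")
    # One JSON object per line keeps the fields queryable in Logs Insights.
    logger.info(json.dumps(record))
    return record
```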

5. SNS MessageId / SQS MessageId

Definition: AWS SNS/SQS native message identifiers

Properties:

Unique per message
Immutable
Links to message delivery and consumption events

Usage:

Available in Lambda event when consuming from SQS
Should be logged by SQS consumer Lambda (if implemented)
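
If an SQS consumer Lambda is implemented, logging could look like the sketch below. It assumes the message body is the alert JSON itself (for example, SNS raw message delivery); with the default SNS envelope the alert would sit inside the body's Message field instead. The sqs_handler name is hypothetical.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def sqs_handler(event: Dict[str, Any], context: Any) -> List[Dict[str, Any]]:
    """Log MessageId with the run identifiers for every SQS record."""
    entries = []
    for record in event.get("Records", []):
        try:
            body = json.loads(record.get("body", ""))
        except json.JSONDecodeError:
            body = {}
        if not isinstance(body, dict):
            body = {}
        entry = {
            "message_id": record["messageId"],
            "request_id": context.aws_request_id,
            "execution_arn": body.get("execution_arn"),
            "run_id": body.get("run_id"),
        }
        logger.info(json.dumps(entry))
        entries.append(entry)
    return entries
```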

Design Boundaries: Traceability and Auditability First

Traceability and auditability have top priority. No change that deteriorates them is acceptable.

The following alternatives were considered and rejected, or accepted as additive-only with documented tradeoffs. The decision was to change nothing for run identity, enrichment, and loop prevention so the current design remains the single source of truth.

Rejected: Glue-derived run_id

Using Glue JobRunId and job start time as run identity (instead of Step Functions execution name) was considered. Rejected: Step Functions execution is the canonical run identity; promotion Lambdas and S3 paths expect the run_id supplied by Step Functions (passed as --run-key). Switching to a Glue-only run_id would dilute the execution chain and require broad changes. Traceability stays Step Functions–centric.

Rejected: DynamoDB for loop prevention

Using DynamoDB to store row_hash/attempt_count for duplicate detection (instead of loading quarantine Parquet from S3 each run) was considered. Rejected: (1) Traceability and auditability are top priority—quarantine in S3 is the single source of truth for audit; a second state store would complicate lineage. (2) Batch workload: one S3 read per run is sufficient; DynamoDB would be overkill. (3) No duplicate state: one authoritative store is maintained (S3) for quarantine and condemned data.

Enrichment: job-derived properties only

Metadata enrichment (row_hash, source_file_id, attempt_count, ingestion_timestamp) uses properties derived at job start or from row data, not values fetched from AWS APIs during enrichment. run_id and ingest_time come from Step Functions (when orchestrated) or from local job start; enrichment stamps rows with that context. This keeps a single, clear chain and avoids coupling enrichment to extra AWS calls.
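
A sketch of enrichment that uses only row data and job-start context, with no AWS calls. The hashing scheme (SHA-256 over key-sorted JSON) and the prior_attempts lookup are assumptions for illustration; the source does not specify how row_hash or attempt_count are computed.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Mapping, Optional


def row_hash(row: Mapping[str, Any]) -> str:
    """Hash a row deterministically, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def enrich(rows: Iterable[Mapping[str, Any]], run_id: str, ingest_time: str,
           source_file_id: str,
           prior_attempts: Optional[Mapping[str, int]] = None) -> List[Dict[str, Any]]:
    """Stamp each row with context fixed at job start."""
    prior = prior_attempts or {}  # hypothetical: hash -> attempts seen in quarantine
    enriched = []
    for row in rows:
        digest = row_hash(row)
        enriched.append({
            **row,
            "row_hash": digest,
            "source_file_id": source_file_id,
            "run_id": run_id,
            "ingestion_timestamp": ingest_time,
            "attempt_count": prior.get(digest, 0) + 1,
        })
    return enriched
```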

Additive options considered (tradeoffs; these were not adopted)

Passing execution start time from Step Functions as ingest_time, Glue Data Catalog lineage, S3 object tags, and expanded CloudWatch/CloudTrail correlation were evaluated as additive options (they do not replace current identifiers). Each has tradeoffs (coupling, cost, maintenance, drift). They are not adopted in the current design; the current design is sufficient for traceability and auditability. If adopted later, they must remain additive and must not replace or obscure Step Functions execution identity or S3-based audit trail.

Identifier Propagation Rules

Rule 1: Always propagate execution_arn

Every downstream service invocation must receive the Step Functions execution ARN:

Step Functions → Glue:       --execution-arn
Step Functions → Lambda:     event.execution_arn
Glue → CloudWatch Metrics:   dimension ExecutionArn
Glue → S3 (_SUCCESS):        execution_arn field

Rule 2: Always propagate run_id

Every data artifact and log entry must include run_id:

Step Functions → Glue:       --run-key (becomes run_id)
Glue → S3 paths:            run_id=VALUE/
Glue → CloudWatch Logs:     run_id field
Glue → CloudWatch Metrics:  dimension RunId
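
Rules 1 and 2 can both be satisfied in the Glue task state, using the Step Functions context object ($$.Execution.Id is the execution ARN, $$.Execution.Name is the execution name). The sketch below expresses the state as a Python dict that serializes to Amazon States Language; the job name "silver-etl" and the terminal "End" flag are placeholders, not taken from the actual state machine.

```python
import json

# Hypothetical Glue task state; only the Arguments mapping reflects the rules above.
GLUE_TASK_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
        "JobName": "silver-etl",  # placeholder job name
        "Arguments": {
            # Rule 1: propagate the execution ARN.
            "--execution-arn.$": "$$.Execution.Id",
            # Rule 2: the execution name becomes run_id inside the job.
            "--run-key.$": "$$.Execution.Name",
        },
    },
    # Keeps the Glue result (including JobRunId) at $.glue_result.
    "ResultPath": "$.glue_result",
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps(GLUE_TASK_STATE, indent=2))
```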

Rule 3: Store correlation in metadata files

All S3 metadata files must include traceability identifiers:

_SUCCESS marker:

{
  "run_id": "20260129T120000Z",
  "execution_arn": "arn:aws:states:...",
  "trigger": {
    "event_id": "abc-123",
    "event_time": "2026-01-29T12:00:00Z",
    "etag": "...",
    "version_id": "..."
  },
  "metrics": {...}
}

_LATEST.json marker:

{
  "run_id": "20260129T120000Z",
  "glue_job_run_id": "jr_abc123",
  "execution_arn": "arn:aws:states:...",
  "promoted_at": "2026-01-29T12:05:00Z",
  "schema_version": "v1"
}
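
A small check like the following could guard Rule 3 before a marker is written. The required-field sets are inferred from the two examples above and are an assumption, as is the function name.

```python
from typing import Any, List, Mapping

# Assumed minimum correlation fields per marker, based on the examples above.
REQUIRED_FIELDS = {
    "_SUCCESS": ("run_id", "execution_arn", "trigger"),
    "_LATEST.json": ("run_id", "glue_job_run_id", "execution_arn"),
}


def missing_correlation_fields(marker_name: str, marker: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a marker."""
    return [field for field in REQUIRED_FIELDS[marker_name] if not marker.get(field)]
```

An empty result means the marker carries every identifier needed to trace it back to the execution.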

Rule 4: Include dimensions in CloudWatch metrics

All CloudWatch metrics must include dimensions for filtering:

dimensions = [
    {'Name': 'RunId', 'Value': run_id},
    {'Name': 'ExecutionArn', 'Value': execution_arn}
]

This enables:

Filtering metrics by specific execution
Alarming on per-run anomalies
Correlating metrics with logs and S3 outputs

Rule 5: Enrich failure alerts with context

All failure alerts (SNS) must include:

execution_arn: Link to Step Functions execution
run_id: Link to S3 outputs
glue_job_run_id: Link to Glue CloudWatch Logs (when available)
timestamp: When the failure occurred

This enables:

One-click navigation from alert to logs
One-click navigation from alert to S3 outputs
Automated incident response
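
One way to support one-click navigation is to add ready-made links to the alert payload. This is a sketch only: the console URL pattern is an assumption about the Step Functions console layout and may change, and the bucket and prefix parameters are placeholders.

```python
from typing import Any, Dict, Optional


def build_failure_alert(execution_arn: str, run_id: str, timestamp: str,
                        bucket: str, prefix: str,
                        glue_job_run_id: Optional[str] = None) -> Dict[str, Any]:
    """Assemble an alert payload with correlation IDs and navigation links."""
    region = execution_arn.split(":")[3]
    return {
        "alert_type": "step_functions_failure",
        "execution_arn": execution_arn,
        "run_id": run_id,
        "glue_job_run_id": glue_job_run_id,  # None when Glue never started
        "timestamp": timestamp,
        "links": {
            # Assumed console URL pattern; verify against the current console.
            "execution": (
                f"https://{region}.console.aws.amazon.com/states/home"
                f"?region={region}#/executions/details/{execution_arn}"
            ),
            "s3_output": f"s3://{bucket}/{prefix}/run_id={run_id}/",
        },
    }
```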

Anti-Patterns (Prohibited)

Prohibited: generate custom UUIDs

# BAD: Custom UUID introduces unnecessary identifier
import uuid
run_id = str(uuid.uuid4())

# GOOD: Use Step Functions execution name (passed as --run-key) or a UTC timestamp
from datetime import datetime, timezone
run_id = args.run_key or datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')

Prohibited: maintain separate run tables

# BAD: Separate DynamoDB/RDS table duplicates AWS execution history
create_table('etl_runs', columns=['run_id', 'status', 'start_time', ...])

# GOOD: Use Step Functions ListExecutions, DescribeExecution
executions = sfn.list_executions(stateMachineArn=state_machine_arn, statusFilter='RUNNING')

Prohibited: publish metrics without dimensions

# BAD: Metrics cannot be filtered by run or execution
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{'MetricName': 'InputRows', 'Value': 1000, 'Unit': 'Count'}]
)

# GOOD: Metrics can be filtered by run_id and execution_arn
cloudwatch.put_metric_data(
    Namespace='Ohpen/ETL',
    MetricData=[{
        'MetricName': 'InputRows',
        'Value': 1000,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'RunId', 'Value': run_id},
            {'Name': 'ExecutionArn', 'Value': execution_arn}
        ]
    }]
)

Prohibited: send alerts without correlation context

# BAD: Alert cannot be linked to execution or outputs
sns.publish(
    TopicArn=topic_arn,
    Message='ETL failed',
    Subject='Failure'
)

# GOOD: Alert includes all correlation identifiers
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({
        'alert_type': 'step_functions_failure',
        'execution_arn': execution_arn,
        'run_id': run_id,
        'glue_job_run_id': glue_job_run_id,
        'timestamp': timestamp
    }),
    Subject='Ohpen ETL Pipeline Failure'
)

Audit & Compliance

CloudTrail Integration

CloudTrail provides automatic audit logging for all AWS API calls:

Management events: All infrastructure changes (Glue, S3, IAM, Step Functions)
Data events: S3 object-level operations (GetObject, PutObject) for sensitive buckets (Gold, Quarantine)

Correlation:

CloudTrail events include userIdentity (IAM role/user)
Step Functions executions are logged with execution ARN
Glue job runs are logged with JobRunId
Lambda invocations are logged with RequestId

No additional work required: CloudTrail is provisioned and logs are retained per compliance policy.

Execution History Reconstruction

To reconstruct exactly what happened for a given execution:

Query by execution ARN: aws stepfunctions describe-execution --execution-arn ARN
Get execution history: aws stepfunctions get-execution-history --execution-arn ARN
Find Glue JobRunId: Extract from $.glue_result.JobRunId in execution output
Find CloudWatch Logs: Search by run_id or execution_arn in Logs Insights
Find S3 outputs: s3://bucket/prefix/run_id=VALUE/
Find CloudTrail events: Filter by execution_arn or JobRunId in CloudTrail logs
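
The first steps above can be sketched in Python with the Step Functions DescribeExecution API. The function takes the client as a parameter (for example, one created with boto3.client("stepfunctions")) so that it can be exercised with a stub; the reconstruct_run name, the bucket and prefix parameters, and the assumption that glue_result survives in the execution output are all illustrative.

```python
import json
from typing import Any, Dict


def reconstruct_run(sfn_client: Any, execution_arn: str,
                    bucket: str, prefix: str) -> Dict[str, Any]:
    """Collect the pointers needed to investigate a single execution."""
    desc = sfn_client.describe_execution(executionArn=execution_arn)
    run_id = desc["name"]
    # "output" is a JSON string and is absent while the execution is running.
    output = json.loads(desc.get("output") or "{}")
    return {
        "status": desc["status"],
        "run_id": run_id,
        "glue_job_run_id": (output.get("glue_result") or {}).get("JobRunId"),
        "s3_prefix": f"s3://{bucket}/{prefix}/run_id={run_id}/",
        "logs_insights_query": (
            "fields @timestamp, @message "
            f'| filter run_id = "{run_id}" '
            "| sort @timestamp asc"
        ),
    }
```

The returned query string can be pasted into CloudWatch Logs Insights against the relevant log groups.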

Implementation Checklist

When adding new services or workflows, verify:

Step Functions execution ARN is passed to all downstream services
run_id is included in all S3 paths, logs, and metrics
CloudWatch metrics include RunId and ExecutionArn dimensions
Failure alerts include execution_arn, run_id, and service-specific identifiers
Metadata files (_SUCCESS, _LATEST.json) store all correlation identifiers
No custom UUIDs or separate run tables are introduced
Lambda functions log context.aws_request_id
SQS consumers log MessageId and ReceiptHandle

Summary

Key Takeaways:

Use Step Functions Execution ARN as the canonical identifier
Propagate execution_arn and run_id to all services
Store correlation identifiers in S3 metadata files
Include dimensions in CloudWatch metrics
Enrich failure alerts with full context
Never introduce custom UUIDs or separate run tables
Leverage AWS-native identifiers for traceability
