© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Backfill Playbook

Use Case: Reprocess historical data after schema fixes, bug fixes, or data corrections
Safety Guarantee: run_id isolation prevents overwriting existing data

Overview

This playbook provides a safe, repeatable procedure for backfilling historical data in the ETL pipeline. The run_id isolation pattern ensures that backfills never overwrite existing Silver/Gold data, making the process safe for production use. Part of Operational Runbooks.

GenAI: Backfill playbook text (“what to reprocess and in what order”) and post-mortem drafts from run logs are natural fits for Bedrock. See GenAI in the Ohpen Case & Opportunities.


When Backfills Are Needed

Common Scenarios

  1. Schema Evolution

    • New column added to schema that requires recalculating historical data
    • Example: Adding TransactionType field to existing transactions
  2. Bug Fixes in Validation Logic

    • Validation rule was too strict, rejecting valid transactions
    • Example: Currency code validation incorrectly rejected "GBP"
  3. Data Quality Corrections

    • Upstream system corrected historical data
    • Example: Customer IDs were renumbered in source system
  4. Quarantine Recovery

    • Previously quarantined rows now pass validation after fix
    • Example: Timestamp parser fixed to handle additional formats
  5. Business Logic Changes

    • Calculation logic changed (e.g., interest rate formula)
    • Requires reprocessing to apply new logic to historical data

Safety Guarantees

run_id Isolation Pattern

Every ETL run writes to a unique S3 path determined by run_id:

silver/mortgages/transactions/
  year=2024/
    month=01/
      schema_v=v1/
        run_id=20260121T120000Z/           # Original run
        run_id=backfill-20260129T150000Z/  # Backfill run (separate path)

Key Safety Properties:

  1. No Overwrites: Backfills write to new run_id paths, never overwrite existing data
  2. Audit Trail: Both original and backfill data retained for comparison
  3. Rollback Safety: Can delete backfill data without affecting original
  4. Parallel Execution: Multiple backfills can run concurrently (different date ranges)
  5. Idempotency: Re-running same backfill creates new run_id, does not corrupt existing
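
The isolation property boils down to pure path construction: given a run_id, the Silver prefix is fully determined, so two runs can never collide. As a minimal sketch (`build_silver_path` is a hypothetical helper for illustration, not part of the pipeline code):

```shell
# Sketch: a unique run_id maps to its own Silver prefix, so writes never collide.
build_silver_path() {
  local year="$1" month="$2" schema_v="$3" run_id="$4"
  echo "silver/mortgages/transactions/year=${year}/month=${month}/schema_v=${schema_v}/run_id=${run_id}/"
}

build_silver_path 2024 01 v1 "20260121T120000Z"
build_silver_path 2024 01 v1 "backfill-20260129T150000Z"
```

Because the original and the backfill produce distinct prefixes, deleting one leaves the other untouched.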

Prerequisites

Before starting a backfill:

1. Verify Bronze Layer Data Exists

aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive

2. Identify Affected Date Range

  • Determine which ingest_date partitions need reprocessing
  • Example: January 2024 = ingest_date=2024-01-01 through ingest_date=2024-01-31

3. Back Up Existing Silver Data (Optional)

# Create backup prefix in S3
aws s3 sync s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/ \
  s3://ohpen-silver/backup-pre-backfill-20260129/year=2024/month=01/

4. Test on Single Date First

  • Always test backfill on one date before running full date range
  • Verify output matches expectations
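
To make the affected range explicit before triggering anything, it can help to enumerate the target ingest_date partitions up front. A throwaway sketch for the January 2024 example (`seq -w` zero-pads the day to match the partition naming):

```shell
# List every ingest_date partition for January 2024 (zero-padded days 01..31).
list_january_dates() {
  for day in $(seq -w 1 31); do
    echo "ingest_date=2024-01-$day"
  done
}

list_january_dates
```

Cross-check this list against the `aws s3 ls` output above so no partition is silently skipped.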

Step-by-Step Backfill Procedure

Step 1: Generate Unique Backfill run_id

Format: backfill-{original_date}-{timestamp}

# Example for backfilling 2024-01-15 data
export BACKFILL_RUN_ID="backfill-2024-01-15-$(date -u +%Y%m%dT%H%M%SZ)"
echo "Backfill run_id: $BACKFILL_RUN_ID"

Output: backfill-2024-01-15-20260129T150000Z
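
If the run_id is assembled by hand, a quick format check can catch typos before anything is triggered. The regex below is an assumption derived from the examples in this playbook, not an enforced contract:

```shell
# Hypothetical sanity check: does the run_id look like backfill-YYYY-MM-DD-<UTC timestamp>?
check_backfill_run_id() {
  [[ "$1" =~ ^backfill-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{8}T[0-9]{6}Z$ ]]
}

check_backfill_run_id "backfill-2024-01-15-20260129T150000Z" && echo "run_id format OK"
```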

Step 2: Identify Source Data in Bronze Layer

# List available Bronze data for target date
aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive

# Expected output:
# bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz

Note the S3 key for use in Step Functions input.

Step 3: Prepare Step Functions Input JSON

Create input file backfill-input.json:

{
  "run_key": "backfill-2024-01-15-20260129T150000Z",
  "s3_bucket": "ohpen-bronze",
  "s3_key": "bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz",
  "schema_version": "v1",
  "trigger_source": "manual_backfill",
  "trigger_event_id": "backfill-2024-01-15",
  "trigger_event_time": "2026-01-29T15:00:00Z"
}

Key Points:

  • run_key: Use the backfill run_id generated in Step 1
  • s3_key: Full path to Bronze data (including original run_id)
  • trigger_source: Set to manual_backfill for tracking
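
Before triggering, a minimal key check on the input file can catch a malformed JSON early. A sketch using grep (`jq -e 'has(...)'` is more robust if jq is available); the helper name is illustrative:

```shell
# Minimal sanity check on the Step Functions input file (illustrative only).
validate_backfill_input() {
  local f="$1" key
  for key in run_key s3_bucket s3_key schema_version; do
    grep -q "\"$key\"" "$f" || { echo "missing key: $key" >&2; return 1; }
  done
  echo "input OK: $f"
}

# Usage: validate_backfill_input backfill-input.json
```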

Step 4: Trigger Step Functions Execution

Method A: AWS Console

  1. Navigate to AWS Console → Step Functions
  2. Select state machine: ohpen-etl-orchestration
  3. Click "Start execution"
  4. Paste backfill-input.json content
  5. Name execution: backfill-2024-01-15-20260129T150000Z
  6. Click "Start execution"

Method B: AWS CLI

aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
  --name "backfill-2024-01-15-20260129T150000Z" \
  --input file://backfill-input.json

Step 5: Monitor Execution

Watch execution progress:

# Get execution ARN from Step 4 output
export EXECUTION_ARN="arn:aws:states:eu-west-1:ACCOUNT_ID:execution:ohpen-etl-orchestration:backfill-2024-01-15-20260129T150000Z"

# Poll execution status
aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query 'status'

Expected states: RUNNING → SUCCEEDED (or FAILED on errors)

Check CloudWatch Logs:

  • Glue job logs: /aws-glue/jobs/output
  • Search for: run_id: backfill-2024-01-15-20260129T150000Z
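
Instead of polling by hand, the describe-execution call can be wrapped in a small loop. A sketch (`poll_until_done` is a hypothetical helper; the 30-second default interval is arbitrary):

```shell
# Poll a status command until it reports something other than RUNNING.
# $1: a command that prints the current status; $2: poll interval in seconds (default 30).
poll_until_done() {
  local status interval="${2:-30}"
  while :; do
    status=$($1)
    echo "status: $status"
    [ "$status" = "RUNNING" ] || break
    sleep "$interval"
  done
}

# Usage:
# poll_until_done "aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query status --output text"
```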

Step 6: Validate Backfill Output

Check Silver Layer:

# Verify backfill data exists
aws s3 ls s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

# Expected output:
# part-00000.parquet
# part-00001.parquet
# _SUCCESS

Read _SUCCESS metadata:

aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq .

# Verify:
# - run_id matches backfill run_id
# - execution_arn matches Step Functions execution
# - metrics show expected row counts

Compare with Original Data (if testing):

# Original run row count
aws s3 cp s3://ohpen-silver/.../run_id=20240115T120000Z/_SUCCESS - | jq '.metrics.valid_rows'

# Backfill run row count
aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq '.metrics.valid_rows'

# Row counts should match (unless bug fix changed validation)
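
The two jq extractions above can feed a small comparison helper so a match or mismatch is printed explicitly rather than eyeballed (illustrative sketch; the helper name is an assumption):

```shell
# Compare valid_rows counts from the original and backfill _SUCCESS metadata.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "row counts match ($1)"
  else
    echo "row count mismatch: original=$1 backfill=$2"
  fi
}

# Usage, with counts extracted via jq as shown above:
# compare_counts "$ORIGINAL_VALID_ROWS" "$BACKFILL_VALID_ROWS"
```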

Step 7: Update Gold Layer (If Needed)

If Gold layer aggregations depend on Silver data:

Option A: Trigger Gold Layer Aggregation

# Run Gold aggregation SQL on both run_ids
# Run Gold aggregation SQL on both run_ids via Athena
# (assumes the workgroup's default query result location is already configured)
aws athena start-query-execution --query-string "
  SELECT account_id, SUM(amount) AS total
  FROM silver_transactions
  WHERE year=2024 AND month=01 AND run_id IN (
    '20240115T120000Z',
    'backfill-2024-01-15-20260129T150000Z'
  )
  GROUP BY account_id
"

Option B: Point _LATEST.json to Backfill Data

Only if backfill should replace original data as canonical:

# Update _LATEST.json to point to backfill run
aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
  "run_id": "backfill-2024-01-15-20260129T150000Z",
  "updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json

Warning: This makes the backfill data the "current" data. Original data still exists but is not the default query target.

Step 8: Verify End-to-End

Run sample Athena query:

SELECT COUNT(*), MIN(tx_date), MAX(tx_date)
FROM silver_transactions
WHERE year=2024 AND month=01 AND run_id='backfill-2024-01-15-20260129T150000Z';

Expected: Row counts match Bronze input, date range matches target date.


Parallel Backfills

Safe Concurrent Execution

Multiple backfills can run in parallel safely because each writes to a unique run_id path:

# Backfill January 1-15 (Execution 1)
run_id: backfill-2024-01-01-to-15-20260129T150000Z

# Backfill January 16-31 (Execution 2, runs concurrently)
run_id: backfill-2024-01-16-to-31-20260129T160000Z

# No conflicts - different run_id paths

Recommendations:

  1. Keep backfill date ranges non-overlapping for clarity
  2. Use descriptive run_id names that include date range
  3. Monitor Glue DPU quota (max concurrent jobs)

Batch Backfill Script

For backfilling multiple dates:

#!/bin/bash
# batch-backfill.sh - Backfill multiple dates

# Backfill every day of January 2024 (seq -w zero-pads the day to 01..31)
for day in $(seq -w 1 31); do
  target_date="2024-01-$day"
  BACKFILL_RUN_ID="backfill-$target_date-$(date -u +%Y%m%dT%H%M%SZ)"

  # Create input JSON
  cat > "backfill-input-$target_date.json" <<EOF
{
  "run_key": "$BACKFILL_RUN_ID",
  "s3_bucket": "ohpen-bronze",
  "s3_key": "bronze/mortgages/transactions/ingest_date=$target_date/run_id=<ORIGINAL_RUN_ID>/transactions.csv.gz",
  "schema_version": "v1",
  "trigger_source": "manual_backfill"
}
EOF

  # Trigger execution
  aws stepfunctions start-execution \
    --state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
    --name "$BACKFILL_RUN_ID" \
    --input "file://backfill-input-$target_date.json"

  echo "Started backfill for $target_date: $BACKFILL_RUN_ID"

  # Optional: wait between executions to avoid overwhelming Glue
  sleep 60
done

Cleanup After Backfill

Option 1: Retain Both Runs (Recommended)

  • Retain audit trail showing before/after comparison
  • Disk space impact: 2x storage for backfilled partitions
  • No action needed

Option 2: Delete Original Data (Use with Caution)

Only if:

  • Backfill verified correct
  • Business approved data replacement
  • Audit trail documented elsewhere

# Delete original run_id data
aws s3 rm s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=20240115T120000Z/ --recursive

# Verify backfill data still exists
aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

Warning: This is irreversible. Ensure backups exist.
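
Given how irreversible this step is, it may be worth wrapping the delete in a guard that refuses to run until the backfill's _SUCCESS marker has been downloaded and inspected locally. A hypothetical sketch that echoes instead of deleting until the real command is swapped in:

```shell
# Guarded delete: require a locally verified backfill _SUCCESS file first.
safe_delete_original() {
  local backfill_success="$1" original_prefix="$2"
  if [ ! -f "$backfill_success" ]; then
    echo "abort: $backfill_success not found; refusing to delete original" >&2
    return 1
  fi
  # Dry run by default; replace echo with: aws s3 rm "$original_prefix" --recursive
  echo "would delete: $original_prefix"
}
```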

Option 3: Delete Backfill Data (If Test Failed)

# Delete failed backfill
aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

# Original data unaffected

Troubleshooting

Issue: Step Functions Execution Failed

Check: Glue job logs for error details

aws logs filter-log-events \
  --log-group-name /aws-glue/jobs/output \
  --filter-pattern "backfill-2024-01-15" \
  --limit 50

Common errors:

  • Bronze data not found → Verify S3 key in input JSON
  • Validation errors → Check if schema_version matches data
  • Permission errors → Verify IAM role has S3 read access

Issue: Backfill Data Missing in Silver Layer

Check: _SUCCESS marker exists

aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS

If missing: Job may have failed silently. Check Glue job status.

Issue: Duplicate Data in Athena Queries

Cause: Query not filtering by run_id

Solution: Always include run_id in WHERE clause:

-- Correct: Filter by specific run_id
SELECT * FROM silver_transactions WHERE run_id='backfill-2024-01-15-20260129T150000Z';

-- Wrong: Scans all run_ids (includes original + backfill = duplicates)
SELECT * FROM silver_transactions WHERE year=2024 AND month=01;

Best Practices

  1. Always test on single date first before full date range backfill
  2. Use descriptive run_id names that include date range and purpose
  3. Document backfill reason in execution notes or wiki
  4. Verify output metrics match expectations (row counts, validation pass rates)
  5. Keep backfill audit trail (do not delete original data immediately)
  6. Monitor Glue DPU quota during large backfills
  7. Communicate backfill schedule to stakeholders (may impact query performance)

Rollback Procedure

If backfill produces incorrect data:

Step 1: Delete Backfill Data

aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

Step 2: Verify Original Data Intact

aws s3 ls s3://ohpen-silver/.../run_id=20240115T120000Z/ --recursive

Step 3: Revert _LATEST.json (If Updated)

aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
  "run_id": "20240115T120000Z",
  "updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json

Step 4: Fix Issue, Re-run Backfill

  • Identify root cause of incorrect output
  • Fix validation logic, schema, or input data
  • Generate new run_id and rerun (Steps 1-6)

Summary Checklist

Before backfill:

  • Identify affected date range
  • Verify Bronze data exists
  • Test on single date
  • Generate unique backfill run_id

During backfill:

  • Trigger Step Functions with backfill input JSON
  • Monitor execution in CloudWatch/Step Functions console
  • Check for errors in Glue logs

After backfill:

  • Verify Silver layer output exists
  • Compare metrics with original run (if applicable)
  • Update Gold layer or _LATEST.json if needed
  • Run sample Athena queries to verify correctness
  • Document backfill in wiki/ticket
