© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Backfill Playbook

Use Case: Reprocess historical data after schema fixes, bug fixes, or data corrections
Safety Guarantee: run_id isolation prevents overwriting existing data

Overview

This playbook provides a safe, repeatable procedure for backfilling historical data in the ETL pipeline. The run_id isolation pattern ensures that backfills never overwrite existing Silver/Gold data, making the process safe for production use. Part of Operational Runbooks.

GenAI: Backfill playbook text (“what to reprocess and in what order”) and post-mortem drafts from run logs are natural fits for Bedrock. See GenAI in the Ohpen Case & Opportunities.


When Backfills Are Needed

Common Scenarios

  1. Schema Evolution

    • New column added to schema that requires recalculating historical data
    • Example: Adding TransactionType field to existing transactions
  2. Bug Fixes in Validation Logic

    • Validation rule was too strict, rejecting valid transactions
    • Example: Currency code validation incorrectly rejected "GBP"
  3. Data Quality Corrections

    • Upstream system corrected historical data
    • Example: Customer IDs were renumbered in source system
  4. Quarantine Recovery

    • Previously quarantined rows now pass validation after fix
    • Example: Timestamp parser fixed to handle additional formats
  5. Business Logic Changes

    • Calculation logic changed (e.g., interest rate formula)
    • Requires reprocessing to apply new logic to historical data

Safety Guarantees

run_id Isolation Pattern

Every ETL run writes to a unique S3 path determined by run_id:

silver/mortgages/transactions/
  year=2024/
    month=01/
      schema_v=v1/
        run_id=20260121T120000Z/           # Original run
        run_id=backfill-20260129T150000Z/  # Backfill run (separate path)

Key Safety Properties:

  1. No Overwrites: Backfills write to new run_id paths, never overwrite existing data
  2. Audit Trail: Both original and backfill data retained for comparison
  3. Rollback Safety: Can delete backfill data without affecting original
  4. Parallel Execution: Multiple backfills can run concurrently (different date ranges)
  5. Idempotency: Re-running same backfill creates new run_id, does not corrupt existing
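
The isolation property boils down to pure path construction: given a run_id, the Silver prefix is fully determined, so two runs can never collide. As a minimal sketch (`build_silver_path` is a hypothetical helper for illustration, not part of the pipeline code):

```shell
# Sketch: a unique run_id maps to its own Silver prefix, so writes never collide.
build_silver_path() {
  local year="$1" month="$2" schema_v="$3" run_id="$4"
  echo "silver/mortgages/transactions/year=${year}/month=${month}/schema_v=${schema_v}/run_id=${run_id}/"
}

build_silver_path 2024 01 v1 "20260121T120000Z"
build_silver_path 2024 01 v1 "backfill-20260129T150000Z"
```

Because the original and the backfill produce distinct prefixes, deleting one leaves the other untouched.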

Prerequisites

Before starting a backfill:

1. Verify Bronze Layer Data Exists

aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive

2. Identify Affected Date Range

  • Determine which ingest_date partitions need reprocessing
  • Example: January 2024 = ingest_date=2024-01-01 through ingest_date=2024-01-31

3. Back Up Existing Silver Data (Optional)

# Create backup prefix in S3
aws s3 sync s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/ \
  s3://ohpen-silver/backup-pre-backfill-20260129/year=2024/month=01/

4. Test on Single Date First

  • Always test backfill on one date before running full date range
  • Verify output matches expectations
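
To make the affected range explicit before triggering anything, it can help to enumerate the target ingest_date partitions up front. A throwaway sketch for the January 2024 example (`seq -w` zero-pads the day to match the partition naming):

```shell
# List every ingest_date partition for January 2024 (zero-padded days 01..31).
list_january_dates() {
  for day in $(seq -w 1 31); do
    echo "ingest_date=2024-01-$day"
  done
}

list_january_dates
```

Cross-check this list against the `aws s3 ls` output above so no partition is silently skipped.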

Step-by-Step Backfill Procedure

Step 1: Generate Unique Backfill run_id

Format: backfill-{original_date}-{timestamp}

# Example for backfilling 2024-01-15 data
export BACKFILL_RUN_ID="backfill-2024-01-15-$(date -u +%Y%m%dT%H%M%SZ)"
echo "Backfill run_id: $BACKFILL_RUN_ID"

Output: backfill-2024-01-15-20260129T150000Z
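
If the run_id is assembled by hand, a quick format check can catch typos before anything is triggered. The regex below is an assumption derived from the examples in this playbook, not an enforced contract:

```shell
# Hypothetical sanity check: does the run_id look like backfill-YYYY-MM-DD-<UTC timestamp>?
check_backfill_run_id() {
  [[ "$1" =~ ^backfill-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{8}T[0-9]{6}Z$ ]]
}

check_backfill_run_id "backfill-2024-01-15-20260129T150000Z" && echo "run_id format OK"
```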

Step 2: Identify Source Data in Bronze Layer

# List available Bronze data for target date
aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive

# Expected output:
# bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz

Note the S3 key for use in Step Functions input.

Step 3: Prepare Step Functions Input JSON

Create input file backfill-input.json:

{
  "run_key": "backfill-2024-01-15-20260129T150000Z",
  "s3_bucket": "ohpen-bronze",
  "s3_key": "bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz",
  "schema_version": "v1",
  "trigger_source": "manual_backfill",
  "trigger_event_id": "backfill-2024-01-15",
  "trigger_event_time": "2026-01-29T15:00:00Z"
}

Key Points:

  • run_key: Use the backfill run_id generated in Step 1
  • s3_key: Full path to Bronze data (including original run_id)
  • trigger_source: Set to manual_backfill for tracking
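
Before triggering, a minimal key check on the input file can catch a malformed JSON early. A sketch using grep (`jq -e 'has(...)'` is more robust if jq is available); the helper name is illustrative:

```shell
# Minimal sanity check on the Step Functions input file (illustrative only).
validate_backfill_input() {
  local f="$1" key
  for key in run_key s3_bucket s3_key schema_version; do
    grep -q "\"$key\"" "$f" || { echo "missing key: $key" >&2; return 1; }
  done
  echo "input OK: $f"
}

# Usage: validate_backfill_input backfill-input.json
```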

Step 4: Trigger Step Functions Execution

Method A: AWS Console

  1. Navigate to AWS Console → Step Functions
  2. Select state machine: ohpen-etl-orchestration
  3. Click "Start execution"
  4. Paste backfill-input.json content
  5. Name execution: backfill-2024-01-15-20260129T150000Z
  6. Click "Start execution"

Method B: AWS CLI

aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
  --name "backfill-2024-01-15-20260129T150000Z" \
  --input file://backfill-input.json

Step 5: Monitor Execution

Watch execution progress:

# Get execution ARN from Step 4 output
export EXECUTION_ARN="arn:aws:states:eu-west-1:ACCOUNT_ID:execution:ohpen-etl-orchestration:backfill-2024-01-15-20260129T150000Z"

# Poll execution status
aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query 'status'

Expected states: RUNNING → SUCCEEDED (or FAILED on errors)

Check CloudWatch Logs:

  • Glue job logs: /aws-glue/jobs/output
  • Search for: run_id: backfill-2024-01-15-20260129T150000Z
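
Instead of polling by hand, the describe-execution call can be wrapped in a small loop. A sketch (`poll_until_done` is a hypothetical helper; the 30-second default interval is arbitrary):

```shell
# Poll a status command until it reports something other than RUNNING.
# $1: a command that prints the current status; $2: poll interval in seconds (default 30).
poll_until_done() {
  local status interval="${2:-30}"
  while :; do
    status=$($1)
    echo "status: $status"
    [ "$status" = "RUNNING" ] || break
    sleep "$interval"
  done
}

# Usage:
# poll_until_done "aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query status --output text"
```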

Step 6: Validate Backfill Output

Check Silver Layer:

# Verify backfill data exists
aws s3 ls s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

# Expected output:
# part-00000.parquet
# part-00001.parquet
# _SUCCESS

Read _SUCCESS metadata:

aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq .

# Verify:
# - run_id matches backfill run_id
# - execution_arn matches Step Functions execution
# - metrics show expected row counts

Compare with Original Data (if testing):

# Original run row count
aws s3 cp s3://ohpen-silver/.../run_id=20240115T120000Z/_SUCCESS - | jq '.metrics.valid_rows'

# Backfill run row count
aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq '.metrics.valid_rows'

# Row counts should match (unless bug fix changed validation)
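
The two jq extractions above can feed a small comparison helper so a match or mismatch is printed explicitly rather than eyeballed (illustrative sketch; the helper name is an assumption):

```shell
# Compare valid_rows counts from the original and backfill _SUCCESS metadata.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "row counts match ($1)"
  else
    echo "row count mismatch: original=$1 backfill=$2"
  fi
}

# Usage, with counts extracted via jq as shown above:
# compare_counts "$ORIGINAL_VALID_ROWS" "$BACKFILL_VALID_ROWS"
```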

Step 7: Update Gold Layer (If Needed)

If Gold layer aggregations depend on Silver data:

Option A: Trigger Gold Layer Aggregation

# Run Gold aggregation SQL on both run_ids
# Run Gold aggregation SQL on both run_ids via Athena
# (assumes the workgroup's default query result location is already configured)
aws athena start-query-execution --query-string "
  SELECT account_id, SUM(amount) AS total
  FROM silver_transactions
  WHERE year=2024 AND month=01 AND run_id IN (
    '20240115T120000Z',
    'backfill-2024-01-15-20260129T150000Z'
  )
  GROUP BY account_id
"

Option B: Point _LATEST.json to Backfill Data

Only if backfill should replace original data as canonical:

# Update _LATEST.json to point to backfill run
aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
  "run_id": "backfill-2024-01-15-20260129T150000Z",
  "updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json

Warning: This makes the backfill data the "current" data. Original data still exists but is not the default query target.

Step 8: Verify End-to-End

Run sample Athena query:

SELECT COUNT(*), MIN(tx_date), MAX(tx_date)
FROM silver_transactions
WHERE year=2024 AND month=01 AND run_id='backfill-2024-01-15-20260129T150000Z';

Expected: Row counts match Bronze input, date range matches target date.


Parallel Backfills

Safe Concurrent Execution

Multiple backfills can run in parallel safely because each writes to a unique run_id path:

# Backfill January 1-15 (Execution 1)
run_id: backfill-2024-01-01-to-15-20260129T150000Z

# Backfill January 16-31 (Execution 2, runs concurrently)
run_id: backfill-2024-01-16-to-31-20260129T160000Z

# No conflicts - different run_id paths

Recommendations:

  1. Keep backfill date ranges non-overlapping for clarity
  2. Use descriptive run_id names that include date range
  3. Monitor Glue DPU quota (max concurrent jobs)

Batch Backfill Script

For backfilling multiple dates:

#!/bin/bash
# batch-backfill.sh - Backfill multiple dates

# Backfill every day of January 2024 (seq -w zero-pads the day to 01..31)
for day in $(seq -w 1 31); do
  target_date="2024-01-$day"
  BACKFILL_RUN_ID="backfill-$target_date-$(date -u +%Y%m%dT%H%M%SZ)"

  # Create input JSON
  cat > "backfill-input-$target_date.json" <<EOF
{
  "run_key": "$BACKFILL_RUN_ID",
  "s3_bucket": "ohpen-bronze",
  "s3_key": "bronze/mortgages/transactions/ingest_date=$target_date/run_id=<ORIGINAL_RUN_ID>/transactions.csv.gz",
  "schema_version": "v1",
  "trigger_source": "manual_backfill"
}
EOF

  # Trigger execution
  aws stepfunctions start-execution \
    --state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
    --name "$BACKFILL_RUN_ID" \
    --input "file://backfill-input-$target_date.json"

  echo "Started backfill for $target_date: $BACKFILL_RUN_ID"

  # Optional: wait between executions to avoid overwhelming Glue
  sleep 60
done

Cleanup After Backfill

Option 1: Retain Both Runs (Recommended)

  • Retain audit trail showing before/after comparison
  • Disk space impact: 2x storage for backfilled partitions
  • No action needed

Option 2: Delete Original Data (Use with Caution)

Only if:

  • Backfill verified correct
  • Business approved data replacement
  • Audit trail documented elsewhere

# Delete original run_id data
aws s3 rm s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=20240115T120000Z/ --recursive

# Verify backfill data still exists
aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

Warning: This is irreversible. Ensure backups exist.
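
Given how irreversible this step is, it may be worth wrapping the delete in a guard that refuses to run until the backfill's _SUCCESS marker has been downloaded and inspected locally. A hypothetical sketch that echoes instead of deleting until the real command is swapped in:

```shell
# Guarded delete: require a locally verified backfill _SUCCESS file first.
safe_delete_original() {
  local backfill_success="$1" original_prefix="$2"
  if [ ! -f "$backfill_success" ]; then
    echo "abort: $backfill_success not found; refusing to delete original" >&2
    return 1
  fi
  # Dry run by default; replace echo with: aws s3 rm "$original_prefix" --recursive
  echo "would delete: $original_prefix"
}
```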

Option 3: Delete Backfill Data (If Test Failed)

# Delete failed backfill
aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

# Original data unaffected

Troubleshooting

Issue: Step Functions Execution Failed

Check: Glue job logs for error details

aws logs filter-log-events \
  --log-group-name /aws-glue/jobs/output \
  --filter-pattern "backfill-2024-01-15" \
  --limit 50

Common errors:

  • Bronze data not found → Verify S3 key in input JSON
  • Validation errors → Check if schema_version matches data
  • Permission errors → Verify IAM role has S3 read access

Issue: Backfill Data Missing in Silver Layer

Check: _SUCCESS marker exists

aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS

If missing: Job may have failed silently. Check Glue job status.

Issue: Duplicate Data in Athena Queries

Cause: Query not filtering by run_id

Solution: Always include run_id in WHERE clause:

-- Correct: Filter by specific run_id
SELECT * FROM silver_transactions WHERE run_id='backfill-2024-01-15-20260129T150000Z';

-- Wrong: Scans all run_ids (includes original + backfill = duplicates)
SELECT * FROM silver_transactions WHERE year=2024 AND month=01;

Best Practices

  1. Always test on single date first before full date range backfill
  2. Use descriptive run_id names that include date range and purpose
  3. Document backfill reason in execution notes or wiki
  4. Verify output metrics match expectations (row counts, validation pass rates)
  5. Keep backfill audit trail (do not delete original data immediately)
  6. Monitor Glue DPU quota during large backfills
  7. Communicate backfill schedule to stakeholders (may impact query performance)

Rollback Procedure

If backfill produces incorrect data:

Step 1: Delete Backfill Data

aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive

Step 2: Verify Original Data Intact

aws s3 ls s3://ohpen-silver/.../run_id=20240115T120000Z/ --recursive

Step 3: Revert _LATEST.json (If Updated)

aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
  "run_id": "20240115T120000Z",
  "updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json

Step 4: Fix Issue, Re-run Backfill

  • Identify root cause of incorrect output
  • Fix validation logic, schema, or input data
  • Generate new run_id and rerun (Steps 1-6)

Summary Checklist

Before backfill:

  • Identify affected date range
  • Verify Bronze data exists
  • Test on single date
  • Generate unique backfill run_id

During backfill:

  • Trigger Step Functions with backfill input JSON
  • Monitor execution in CloudWatch/Step Functions console
  • Check for errors in Glue logs

After backfill:

  • Verify Silver layer output exists
  • Compare metrics with original run (if applicable)
  • Update Gold layer or _LATEST.json if needed
  • Run sample Athena queries to verify correctness
  • Document backfill in wiki/ticket
