© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Backfill Playbook
Use Case: Reprocess historical data after schema fixes, bug fixes, or data corrections
Safety Guarantee: run_id isolation prevents overwriting existing data
Overview
This playbook provides a safe, repeatable procedure for backfilling historical data in the ETL pipeline. The run_id isolation pattern ensures that backfills never overwrite existing Silver/Gold data, making the process safe for production use. Part of Operational Runbooks.
GenAI: Backfill playbook text (“what to reprocess and in what order”) and post-mortem drafts from run logs are natural fits for Bedrock. See GenAI in the Ohpen Case & Opportunities.
When Backfills Are Needed
Common Scenarios
1. Schema Evolution
- New column added to schema that requires recalculating historical data
- Example: Adding a TransactionType field to existing transactions
2. Bug Fixes in Validation Logic
- Validation rule was too strict, rejecting valid transactions
- Example: Currency code validation incorrectly rejected "GBP"
3. Data Quality Corrections
- Upstream system corrected historical data
- Example: Customer IDs were renumbered in the source system
4. Quarantine Recovery
- Previously quarantined rows now pass validation after a fix
- Example: Timestamp parser fixed to handle additional formats
5. Business Logic Changes
- Calculation logic changed (e.g., interest rate formula)
- Requires reprocessing to apply the new logic to historical data
Safety Guarantees
run_id Isolation Pattern
Every ETL run writes to a unique S3 path determined by run_id:
silver/mortgages/transactions/
year=2024/
month=01/
schema_v=v1/
run_id=20260121T120000Z/ # Original run
run_id=backfill-20260129T150000Z/ # Backfill run (separate path)
Key Safety Properties:
- ✅ No Overwrites: Backfills write to new run_id paths, never overwriting existing data
- ✅ Audit Trail: Both original and backfill data are retained for comparison
- ✅ Rollback Safety: Backfill data can be deleted without affecting the original
- ✅ Parallel Execution: Multiple backfills can run concurrently (different date ranges)
- ✅ Idempotency: Re-running the same backfill creates a new run_id and does not corrupt existing data
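The isolation property can be sketched in a few lines of Python. The path layout mirrors the tree above; the helper function itself is illustrative, not part of the pipeline code:

```python
def silver_partition_path(year: int, month: int, schema_v: str, run_id: str) -> str:
    """Build the Silver-layer S3 prefix for one run.

    Because run_id is the leaf of the prefix, two runs over the same
    partition can never write to the same objects.
    """
    return (
        "silver/mortgages/transactions/"
        f"year={year}/month={month:02d}/schema_v={schema_v}/run_id={run_id}/"
    )

original = silver_partition_path(2024, 1, "v1", "20260121T120000Z")
backfill = silver_partition_path(2024, 1, "v1", "backfill-20260129T150000Z")
assert original != backfill  # distinct prefixes: a backfill cannot overwrite
```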
Prerequisites
Before starting a backfill:
1. Verify Bronze Layer Data Exists
aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive
2. Identify Affected Date Range
- Determine which ingest_date partitions need reprocessing
- Example: January 2024 = ingest_date=2024-01-01 through ingest_date=2024-01-31
3. Backup Current Silver/Gold Data (Optional but Recommended)
# Create backup prefix in S3
aws s3 sync s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/ \
s3://ohpen-silver/backup-pre-backfill-20260129/year=2024/month=01/
4. Test on Single Date First
- Always test backfill on one date before running full date range
- Verify output matches expectations
Step-by-Step Backfill Procedure
Step 1: Generate Unique Backfill run_id
Format: backfill-{original_date}-{timestamp}
# Example for backfilling 2024-01-15 data
export BACKFILL_RUN_ID="backfill-2024-01-15-$(date -u +%Y%m%dT%H%M%SZ)"
echo "Backfill run_id: $BACKFILL_RUN_ID"
Output: backfill-2024-01-15-20260129T150000Z
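The same run_id can be generated and sanity-checked in Python. This is a sketch: the regex simply encodes the backfill-{original_date}-{timestamp} convention above, and the function name is illustrative:

```python
import re
from datetime import datetime, timezone

BACKFILL_RUN_ID = re.compile(r"^backfill-\d{4}-\d{2}-\d{2}-\d{8}T\d{6}Z$")

def make_backfill_run_id(original_date: str, now: datetime = None) -> str:
    """Build backfill-{original_date}-{UTC timestamp} and validate its shape."""
    now = now or datetime.now(timezone.utc)
    run_id = f"backfill-{original_date}-{now.strftime('%Y%m%dT%H%M%SZ')}"
    if not BACKFILL_RUN_ID.match(run_id):
        raise ValueError(f"malformed backfill run_id: {run_id}")
    return run_id

# With a fixed timestamp, reproduces the example from Step 1:
fixed = datetime(2026, 1, 29, 15, 0, tzinfo=timezone.utc)
assert make_backfill_run_id("2024-01-15", fixed) == "backfill-2024-01-15-20260129T150000Z"
```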
Step 2: Identify Source Data in Bronze Layer
# List available Bronze data for target date
aws s3 ls s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=2024-01-15/ --recursive
# Expected output:
# bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz
Note the S3 key for use in Step Functions input.
Step 3: Prepare Step Functions Input JSON
Create input file backfill-input.json:
{
"run_key": "backfill-2024-01-15-20260129T150000Z",
"s3_bucket": "ohpen-bronze",
"s3_key": "bronze/mortgages/transactions/ingest_date=2024-01-15/run_id=20240115T120000Z/transactions.csv.gz",
"schema_version": "v1",
"trigger_source": "manual_backfill",
"trigger_event_id": "backfill-2024-01-15",
"trigger_event_time": "2026-01-29T15:00:00Z"
}
Key Points:
- run_key: Use the backfill run_id generated in Step 1
- s3_key: Full path to the Bronze data (including the original run_id)
- trigger_source: Set to manual_backfill for tracking
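The input file can also be built programmatically, which avoids hand-editing JSON per date. A sketch: the field names come from the JSON above, but the helper (and its choice of trigger_event_id) is illustrative:

```python
import json
from datetime import datetime, timezone

def build_backfill_input(run_key: str, s3_key: str,
                         bucket: str = "ohpen-bronze",
                         schema_version: str = "v1") -> str:
    """Serialize a Step Functions input payload for a manual backfill."""
    if not run_key.startswith("backfill-"):
        raise ValueError("backfill run_key must start with 'backfill-'")
    payload = {
        "run_key": run_key,
        "s3_bucket": bucket,
        "s3_key": s3_key,
        "schema_version": schema_version,
        "trigger_source": "manual_backfill",
        # trigger_event_id reuses the run_key here for simplicity (assumption)
        "trigger_event_id": run_key,
        "trigger_event_time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(payload, indent=2)
```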
Step 4: Trigger Step Functions Execution
Method A: AWS Console
1. Navigate to AWS Console → Step Functions
2. Select state machine: ohpen-etl-orchestration
3. Click "Start execution"
4. Paste backfill-input.json content
5. Name the execution: backfill-2024-01-15-20260129T150000Z
6. Click "Start execution"
Method B: AWS CLI
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
--name "backfill-2024-01-15-20260129T150000Z" \
--input file://backfill-input.json
Step 5: Monitor Execution
Watch execution progress:
# Get execution ARN from Step 4 output
export EXECUTION_ARN="arn:aws:states:eu-west-1:ACCOUNT_ID:execution:ohpen-etl-orchestration:backfill-2024-01-15-20260129T150000Z"
# Poll execution status
aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query 'status'
Expected states: RUNNING → SUCCEEDED (or FAILED if errors)
Check CloudWatch Logs:
- Glue job logs: /aws-glue/jobs/output
- Search for: run_id: backfill-2024-01-15-20260129T150000Z
Step 6: Validate Backfill Output
Check Silver Layer:
# Verify backfill data exists
aws s3 ls s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=backfill-2024-01-15-20260129T150000Z/ --recursive
# Expected output:
# part-00000.parquet
# part-00001.parquet
# _SUCCESS
Read _SUCCESS metadata:
aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq .
# Verify:
# - run_id matches backfill run_id
# - execution_arn matches Step Functions execution
# - metrics show expected row counts
Compare with Original Data (if testing):
# Original run row count
aws s3 cp s3://ohpen-silver/.../run_id=20240115T120000Z/_SUCCESS - | jq '.metrics.valid_rows'
# Backfill run row count
aws s3 cp s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS - | jq '.metrics.valid_rows'
# Row counts should match (unless bug fix changed validation)
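The row-count comparison can be scripted once the two markers are downloaded. A sketch, assuming the metrics.valid_rows field shown above is present in both _SUCCESS payloads:

```python
def valid_rows_match(original: dict, backfill: dict, tolerance: int = 0) -> bool:
    """Compare valid_rows between two parsed _SUCCESS payloads.

    tolerance > 0 is useful when a bug fix intentionally changed how
    many rows pass validation.
    """
    diff = abs(original["metrics"]["valid_rows"] - backfill["metrics"]["valid_rows"])
    return diff <= tolerance

# Identical counts pass; a validation-logic fix may need a tolerance
assert valid_rows_match({"metrics": {"valid_rows": 1000}},
                        {"metrics": {"valid_rows": 1000}})
```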
Step 7: Update Gold Layer (If Needed)
If Gold layer aggregations depend on Silver data:
Option A: Trigger Gold Layer Aggregation
# Run the Gold aggregation SQL over both run_ids via the Athena CLI
# (assumes the workgroup has a query result location configured)
aws athena start-query-execution \
--query-string "
SELECT account_id, SUM(amount) AS total
FROM silver_transactions
WHERE year=2024 AND month=01 AND run_id IN (
'20240115T120000Z',
'backfill-2024-01-15-20260129T150000Z'
)
GROUP BY account_id
"
Option B: Point _LATEST.json to Backfill Data
Only if backfill should replace original data as canonical:
# Update _LATEST.json to point to backfill run
aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
"run_id": "backfill-2024-01-15-20260129T150000Z",
"updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json
Warning: This makes the backfill data the "current" data. Original data still exists but is not the default query target.
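The jq pipeline above is just a key merge. The same non-destructive merge in Python, as a sketch: only run_id and updated_at change, every other field of the _LATEST.json payload is preserved:

```python
from datetime import datetime, timezone

def point_latest_to(latest: dict, run_id: str, now: datetime = None) -> dict:
    """Return a copy of the _LATEST.json payload pointing at run_id."""
    now = now or datetime.now(timezone.utc)
    return {**latest,
            "run_id": run_id,
            "updated_at": now.strftime("%Y-%m-%dT%H:%M:%SZ")}
```

Because the function returns a new dict, the original payload can be kept around for the rollback step described later.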
Step 8: Verify End-to-End
Run sample Athena query:
SELECT COUNT(*), MIN(tx_date), MAX(tx_date)
FROM silver_transactions
WHERE year=2024 AND month=01 AND run_id='backfill-2024-01-15-20260129T150000Z';
Expected: Row counts match Bronze input, date range matches target date.
Parallel Backfills
Safe Concurrent Execution
Multiple backfills can run in parallel safely because each writes to a unique run_id path:
# Backfill January 1-15 (Execution 1)
run_id: backfill-2024-01-01-to-15-20260129T150000Z
# Backfill January 16-31 (Execution 2, runs concurrently)
run_id: backfill-2024-01-16-to-31-20260129T160000Z
# No conflicts - different run_id paths
Recommendations:
- Keep backfill date ranges non-overlapping for clarity
- Use descriptive run_id names that include the date range
- Monitor Glue DPU quota (max concurrent jobs)
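Non-overlap between concurrent backfill ranges is easy to check before triggering anything. A sketch, treating each backfill's date range as inclusive on both ends:

```python
from datetime import date

def ranges_overlap(a_start: date, a_end: date,
                   b_start: date, b_end: date) -> bool:
    """Two inclusive ranges overlap iff each starts on or before the other ends."""
    return a_start <= b_end and b_start <= a_end

# January 1-15 vs January 16-31: safe to run concurrently
assert not ranges_overlap(date(2024, 1, 1), date(2024, 1, 15),
                          date(2024, 1, 16), date(2024, 1, 31))
```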
Batch Backfill Script
For backfilling multiple dates:
#!/bin/bash
# batch-backfill.sh - Backfill multiple dates
START_DATE="2024-01-01"
END_DATE="2024-01-31"
# Generate the date range (inclusive; requires GNU date for -d arithmetic)
current="$START_DATE"
while [[ ! "$current" > "$END_DATE" ]]; do
date="$current"
current=$(date -u -d "$current + 1 day" +%F)
export BACKFILL_RUN_ID="backfill-$date-$(date -u +%Y%m%dT%H%M%SZ)"
# Create input JSON
cat > backfill-input-$date.json <<EOF
{
"run_key": "$BACKFILL_RUN_ID",
"s3_bucket": "ohpen-bronze",
"s3_key": "bronze/mortgages/transactions/ingest_date=$date/run_id=<ORIGINAL_RUN_ID>/transactions.csv.gz",
"schema_version": "v1",
"trigger_source": "manual_backfill"
}
EOF
# Trigger execution
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
--name "$BACKFILL_RUN_ID" \
--input file://backfill-input-$date.json
echo "Started backfill for $date: $BACKFILL_RUN_ID"
# Optional: Wait between executions to avoid overwhelming Glue
sleep 60
done
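If GNU date is not available on your platform, the same inclusive date range can be generated in Python and fed to the script one date per line. The helper name is illustrative:

```python
from datetime import date, timedelta

def backfill_dates(start: str, end: str):
    """Yield ISO dates from start through end, inclusive."""
    d, stop = date.fromisoformat(start), date.fromisoformat(end)
    while d <= stop:
        yield d.isoformat()
        d += timedelta(days=1)

# Month boundaries are handled by the date arithmetic
assert list(backfill_dates("2024-01-30", "2024-02-01")) == \
    ["2024-01-30", "2024-01-31", "2024-02-01"]
```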
Cleanup After Backfill
Option 1: Keep Both Original and Backfill Data (Recommended)
- Retain audit trail showing before/after comparison
- Disk space impact: 2x storage for backfilled partitions
- No action needed
Option 2: Delete Original Data (Use with Caution)
Only if:
- Backfill verified correct
- Business approved data replacement
- Audit trail documented elsewhere
# Delete original run_id data
aws s3 rm s3://ohpen-silver/silver/mortgages/transactions/year=2024/month=01/schema_v=v1/run_id=20240115T120000Z/ --recursive
# Verify backfill data still exists
aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive
Warning: This is irreversible. Ensure backups exist.
Option 3: Delete Backfill Data (If Test Failed)
# Delete failed backfill
aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive
# Original data unaffected
Troubleshooting
Issue: Step Functions Execution Failed
Check: Glue job logs for error details
aws logs filter-log-events \
--log-group-name /aws-glue/jobs/output \
--filter-pattern "backfill-2024-01-15" \
--limit 50
Common errors:
- Bronze data not found → Verify S3 key in input JSON
- Validation errors → Check if schema_version matches data
- Permission errors → Verify IAM role has S3 read access
Issue: Backfill Data Missing in Silver Layer
Check: _SUCCESS marker exists
aws s3 ls s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/_SUCCESS
If missing: Job may have failed silently. Check Glue job status.
Issue: Duplicate Data in Athena Queries
Cause: Query not filtering by run_id
Solution: Always include run_id in WHERE clause:
-- Correct: Filter by specific run_id
SELECT * FROM silver_transactions WHERE run_id='backfill-2024-01-15-20260129T150000Z';
-- Wrong: Scans all run_ids (includes original + backfill = duplicates)
SELECT * FROM silver_transactions WHERE year=2024 AND month=01;
Best Practices
- Always test on single date first before full date range backfill
- Use descriptive run_id names that include the date range and purpose
- Document the backfill reason in execution notes or the wiki
- Verify output metrics match expectations (row counts, validation pass rates)
- Keep backfill audit trail (do not delete original data immediately)
- Monitor Glue DPU quota during large backfills
- Communicate backfill schedule to stakeholders (may impact query performance)
Rollback Procedure
If backfill produces incorrect data:
Step 1: Delete Backfill Data
aws s3 rm s3://ohpen-silver/.../run_id=backfill-2024-01-15-20260129T150000Z/ --recursive
Step 2: Verify Original Data Intact
aws s3 ls s3://ohpen-silver/.../run_id=20240115T120000Z/ --recursive
Step 3: Revert _LATEST.json (If Updated)
aws s3 cp s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json - | jq '. + {
"run_id": "20240115T120000Z",
"updated_at": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}' | aws s3 cp - s3://ohpen-silver/silver/mortgages/transactions/_LATEST.json
Step 4: Fix Issue, Re-run Backfill
- Identify root cause of incorrect output
- Fix validation logic, schema, or input data
- Generate a new run_id and rerun (Steps 1-6)
Summary Checklist
Before backfill:
- Identify affected date range
- Verify Bronze data exists
- Test on single date
- Generate a unique backfill run_id
During backfill:
- Trigger Step Functions with backfill input JSON
- Monitor execution in CloudWatch/Step Functions console
- Check for errors in Glue logs
After backfill:
- Verify Silver layer output exists
- Compare metrics with original run (if applicable)
- Update the Gold layer or _LATEST.json if needed
- Run sample Athena queries to verify correctness
- Document backfill in wiki/ticket
See also
- run_id Isolation Design: Data Lake Architecture
- Step Functions Orchestration: CI/CD Workflow
- Traceability Design: Traceability Design