© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Deployment Rollback Playbook
Use Case: Recover from failed infrastructure deployments
Safety: Terraform state versioning enables safe rollbacks to previous configurations
Overview
This playbook provides procedures for rolling back infrastructure changes when deployments fail or introduce issues. It covers post-deployment smoke tests, manual (operator-initiated) rollback, and optional automated rollback, which is not implemented in the CD workflow. Part of Operational Runbooks.
GenAI: Rollback impact summaries (“which tables/jobs are affected”) and security finding explanations are natural fits for Bedrock. See GenAI in the Ohpen Case & Opportunities.
When Rollback Is Needed
Common Scenarios
- Failed Smoke Tests
  - Infrastructure deployed but basic health checks fail
  - Example: Glue job exists but not executable, S3 bucket created but not accessible
- Configuration Errors
  - Incorrect IAM permissions prevent ETL execution
  - Wrong S3 bucket names or ARNs in state machine
- Breaking Changes
  - Schema evolution breaks existing Athena queries
  - Lambda function code change causes runtime errors
- Performance Degradation
  - New Glue job configuration causes 10x duration increase
  - CloudWatch alarms trigger immediately after deployment
- Security Issues
  - IAM policy too permissive (discovered in audit)
  - Encryption disabled accidentally
Rollback Strategy Overview
Key Concepts
- Terraform State Versioning: S3 backend stores previous state versions; rollback re-applies from a previous Git revision (see rollback script).
- Smoke Tests: Basic health checks run post-deployment in CD; implemented in tasks/devops_cicd/scripts/smoke_tests.sh.
- Rollback: Manual (or approval-gated). The CD workflow does not run rollback automatically when smoke fails; the workflow fails and an operator runs the rollback script if needed.
- Manual Rollback: Operator-initiated using rollback_terraform.sh (dry-run by default; use --apply to execute).
Post-Deployment Smoke Tests (Implemented)
How It Works
CD Workflow (tasks/devops_cicd/.github/workflows/cd.yml):
- Terraform apply completes (job: deploy).
- Smoke tests run in a separate job (smoke) via tasks/devops_cicd/scripts/smoke_tests.sh.
- If tests PASS: Deployment and smoke succeed; workflow is green.
- If tests FAIL: The smoke job fails and the CD workflow fails. No automatic rollback runs; an operator can run the rollback script manually or trigger a rollback workflow with approval.
Smoke Test Checks
The smoke_tests.sh script verifies:
- S3 buckets exist and are accessible (ohpen-bronze, ohpen-silver, ohpen-artifacts)
- Glue job ohpen-transaction-etl-spark exists
- Step Functions state machine ohpen-etl-orchestration exists and is ACTIVE
- Lambda ohpen-read-run-summary exists
Optional: set SMOKE_INVOKE_LAMBDA=1 to invoke the Lambda (proves invocability).
Timeout: Script is designed to complete in under 2 minutes.
Retry: One retry after 30s on failure (for eventual consistency).
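The single-retry behaviour described above can be pictured with a small wrapper. This is a minimal sketch and not the contents of smoke_tests.sh: the function name run_with_one_retry and the RETRY_DELAY_SECONDS variable are assumptions made for illustration, while the 30-second default and the bucket name come from this playbook.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical knob; the 30s default matches the delay described above.
RETRY_DELAY_SECONDS="${RETRY_DELAY_SECONDS:-30}"

# Run a check command; on failure wait once, then retry a single time.
# Returns 0 if either attempt succeeds, 1 otherwise.
run_with_one_retry() {
  if "$@"; then
    return 0
  fi
  echo "check failed, retrying in ${RETRY_DELAY_SECONDS}s: $*" >&2
  sleep "$RETRY_DELAY_SECONDS"
  if "$@"; then
    return 0
  fi
  return 1
}

# Example (not executed here):
# run_with_one_retry aws s3api head-bucket --bucket ohpen-bronze
```

Retrying exactly once keeps the run short while still absorbing the brief window in which a freshly created resource may not yet be visible.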
Automated Rollback (Optional / Future)
Fully automated rollback on smoke failure is not implemented in the CD workflow by design: rollback is destructive and should be a deliberate step. To add it later:
- Add a CD job that runs only when the smoke job fails.
- That job runs rollback_terraform.sh --apply (or uses a protected environment with approval).
- Optionally send SNS to ohpen-etl-failures with rollback details.
Current recommendation: keep rollback manual or approval-gated; use the rollback script when smoke fails or when issues are discovered later.
Manual Rollback Procedure
Use Cases for Manual Rollback
- Smoke tests failed in CD (no automatic rollback runs), or smoke tests passed but issues were discovered later
- Need to roll back to state older than previous deployment
- Testing rollback procedure (dry-run)
Prerequisites
- AWS CLI configured with appropriate credentials
- Terraform installed (same version as CD workflow: 1.5.0)
- Access to S3 state bucket: ohpen-terraform-state
- IAM permissions for the managed services: s3:*, states:*, glue:*, etc.
Step 1: List Available State Versions
# List all state versions (sorted by modification time)
aws s3api list-object-versions \
--bucket ohpen-terraform-state \
--prefix ohpen-data-lake/terraform.tfstate \
--query 'Versions[*].[VersionId,LastModified,IsLatest]' \
--output table
Output:
-------------------------------------------------------------------
| ListObjectVersions |
+-----------------------------+-------------------+---------------+
| VersionId | LastModified | IsLatest |
+-----------------------------+-------------------+---------------+
| ABC123DEF456 | 2026-01-29T15:00 | True | <- Current (bad)
| GHI789JKL012 | 2026-01-28T14:00 | False | <- Previous (good)
| MNO345PQR678 | 2026-01-27T13:00 | False | <- Older
+-----------------------------+-------------------+---------------+
Identify target version: Copy VersionId of desired state (e.g., GHI789JKL012 for previous deployment)
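Picking the previous version by eye is error-prone during an incident, so the selection can also be scripted. The following is a sketch under stated assumptions: list_state_versions and pick_previous_version are hypothetical helper names that do not exist in the repository, and the rows are assumed to be the tab-separated text output of the AWS CLI with comparable timestamps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Bucket and key are the values used in this playbook.
STATE_BUCKET="ohpen-terraform-state"
STATE_KEY="ohpen-data-lake/terraform.tfstate"

# Emit one "VersionId<TAB>LastModified<TAB>IsLatest" row per stored version.
# The Key filter drops other objects that merely share the prefix.
list_state_versions() {
  aws s3api list-object-versions \
    --bucket "$STATE_BUCKET" \
    --prefix "$STATE_KEY" \
    --query "Versions[?Key=='${STATE_KEY}'].[VersionId,LastModified,IsLatest]" \
    --output text
}

# Read those rows on stdin and print the newest version that is not the
# current one (hypothetical helper for this sketch).
pick_previous_version() {
  sort -t $'\t' -k2,2r |
    awk -F '\t' 'tolower($3) == "false" && !found { print $1; found = 1 }'
}

# Example (not executed here):
# TARGET_VERSION_ID="$(list_state_versions | pick_previous_version)"
```

The newest non-current version is only a default; if the previous deployment was also bad, choose an older VersionId from the table by hand.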
Step 2: Download Target State Version
# Set target version ID
export TARGET_VERSION_ID="GHI789JKL012"
# Download target state
aws s3api get-object \
--bucket ohpen-terraform-state \
--key ohpen-data-lake/terraform.tfstate \
--version-id "$TARGET_VERSION_ID" \
terraform.tfstate.backup
Verify download:
# Check file size (should be ~100-500KB for typical state)
ls -lh terraform.tfstate.backup
# Verify JSON structure (jq needs the whole file; piping through head truncates the JSON and makes jq fail)
jq .version terraform.tfstate.backup
Step 3: Back Up Pre-Rollback State (Safety)
# Download current state before rollback
aws s3 cp s3://ohpen-terraform-state/ohpen-data-lake/terraform.tfstate \
terraform.tfstate.current-backup
Important: Keep this backup in case rollback needs to be undone.
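With both files on disk, it can help to compare their serial and lineage before pushing anything. Terraform checks both on terraform state push: it refuses a file whose lineage differs from the remote state, or whose serial is lower than the remote serial, unless -force is given. The sketch below assumes Terraform's usual pretty-printed state layout with one field per line; state_field and describe_push are hypothetical helper names, and jq is the more robust tool when it is available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the first value found for a field such as "serial" or "lineage".
# Assumes one field per line, as in Terraform's pretty-printed state files.
state_field() {
  local field="$1" file="$2"
  awk -v key="\"${field}\":" '
    $1 == key {
      value = $2
      gsub(/[",]/, "", value)   # strip quotes and the trailing comma
      print value
      exit
    }' "$file"
}

# Describe what a plain `terraform state push` of CANDIDATE over CURRENT
# would run into. Prints one line; does not touch any state.
describe_push() {
  local candidate="$1" current="$2"
  local c_serial r_serial
  c_serial="$(state_field serial "$candidate")"
  r_serial="$(state_field serial "$current")"
  if [ "$(state_field lineage "$candidate")" != "$(state_field lineage "$current")" ]; then
    echo "lineage differs: the files belong to different state histories"
  elif [ "$c_serial" -lt "$r_serial" ]; then
    echo "serial $c_serial < $r_serial: a plain push will be rejected"
  else
    echo "serial $c_serial >= $r_serial: a plain push is accepted"
  fi
}

# Example (not executed here):
# describe_push terraform.tfstate.backup terraform.tfstate.current-backup
```

A differing lineage usually means the wrong object was downloaded; stop and re-check the bucket and key before going further.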
Data Rollback (Reverting _LATEST.json)
Owner: Data Platform Team. Reverting the promoted Silver (or Gold) pointer is a data rollback, not infrastructure rollback. Follow change control and, where applicable, Human Validation Policy (post-hoc overrides and rollback ownership).
Use case: A promoted run was bad (e.g. high quarantine rate discovered after promotion); revert _LATEST.json and current/ to the previous run so consumers see the prior good data.
Procedure (Silver example):
- Identify the previous good run_id (from CloudWatch, _SUCCESS metadata, or run history).
- Update _LATEST.json to point to that run (e.g. copy from the run’s metadata or construct the pointer).
- Copy that run’s data to the current/ prefix (or update the pointer as per your safe-publishing pattern).
- Log the rollback (who, when, reason, previous run_id) in your incident/audit log.
See Backfill Playbook - Revert _LATEST.json for an example of reverting _LATEST.json.
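The Silver procedure above could be scripted roughly as follows. Everything about the layout is an assumption made for illustration: the run_id=<id>/ folder naming, the {"run_id": ...} pointer schema, and the helper names run and revert_latest are hypothetical, so adapt them to the real safe-publishing pattern. The sketch prints the commands by default and executes them only when APPLY=1.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Execute the command when APPLY=1, otherwise only print it (dry-run).
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

# revert_latest BUCKET PREFIX RUN_ID
# Hypothetical layout: <prefix>/run_id=<id>/ holds one run's data,
# <prefix>/current/ the published copy, <prefix>/_LATEST.json the pointer.
revert_latest() {
  local bucket="$1" prefix="$2" run_id="$3"
  local base="s3://${bucket}/${prefix}"
  local pointer
  pointer="$(mktemp)"
  # Hypothetical pointer schema; use the one your pipeline actually writes.
  printf '{"run_id": "%s"}\n' "$run_id" > "$pointer"

  # Same order as the procedure: pointer first, then the data copy.
  run aws s3 cp "$pointer" "${base}/_LATEST.json"
  run aws s3 sync "${base}/run_id=${run_id}/" "${base}/current/" --delete

  # Audit line for the incident log: who, when, and which run was targeted.
  echo "audit: target=${base} run_id=${run_id} operator=${USER:-unknown} time=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Example (not executed here; prefix and run_id are placeholders):
# revert_latest ohpen-silver transactions 20260128T140000Z
```

Note that --delete removes objects from current/ that are absent in the source run, which is what makes the published copy match the previous run exactly; review the dry-run output before setting APPLY=1.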
Step 4: Push Previous State
# Navigate to Terraform working directory
cd /path/to/ohpen-case-2026/tasks/devops_cicd/infra/terraform
# Initialize Terraform with backend config
terraform init \
-backend-config="bucket=ohpen-terraform-state" \
-backend-config="key=ohpen-data-lake/terraform.tfstate" \
-backend-config="region=eu-west-1" \
-reconfigure
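# Push previous state. Terraform refuses the push when the remote state has a
# higher serial than the file being pushed (typical for an older version);
# after confirming the target version, add -force to override that check.
terraform state push terraform.tfstate.backup
Output (from terraform init): "Successfully configured the backend "s3"!"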
Step 5: Terraform Plan (Verify Rollback Changes)
# Run plan to see what will change
terraform plan -out=rollback.tfplan
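# Review plan output carefully
# Expected: Resources revert to previous configuration
# (plan compares the checked-out configuration with the state, so the working
# tree must be at the Git revision that matches the restored state)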
Critical Review Points:
- Resources updated/recreated match expected rollback targets
- ❌ No unexpected deletions (would cause data loss)
- ❌ No new resources created (indicates wrong state version selected)
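The review points above can be backed by a quick mechanical check of the plan summary line. This is a sketch that supplements the manual review and does not replace it: plan_counts and plan_is_update_only are hypothetical helper names, and the parsing assumes the plan was produced with -no-color so the summary line contains no ANSI codes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read `terraform plan -no-color` output on stdin and print
# "<add> <change> <destroy>" taken from the summary line.
# Prints nothing when the plan reports no changes.
plan_counts() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy\.$/\1 \2 \3/p'
}

# Read "<add> <change> <destroy>" on stdin; return 0 only when the plan
# neither adds nor destroys anything.
plan_is_update_only() {
  local add change destroy
  read -r add change destroy || return 1
  [ "$add" -eq 0 ] && [ "$destroy" -eq 0 ]
}

# Example (not executed here):
# terraform plan -no-color -out=rollback.tfplan | tee plan.txt
# plan_counts < plan.txt | plan_is_update_only || echo "review adds/destroys by hand"
```

A replaced resource shows up as one add plus one destroy, so a non-zero count is a prompt to read the plan in detail, not proof that the wrong state version was selected.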
Step 6: Apply Rollback
# Apply rollback plan
terraform apply rollback.tfplan
# Monitor progress
# Expected duration: 2-5 minutes
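Expected output: Apply complete! Resources: X added, Y changed, Z destroyed. (Terraform counts a recreated resource as both added and destroyed; any destroy without a matching add needs investigation.)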
Step 7: Verify Infrastructure Health
Run smoke tests manually:
# Navigate to scripts directory
cd /path/to/ohpen-case-2026/tasks/devops_cicd/scripts
# Run smoke tests
./smoke_tests.sh
# Expected: All tests pass
Test ETL pipeline:
# Trigger test Step Functions execution
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
--name "rollback-verification-$(date -u +%Y%m%dT%H%M%SZ)" \
--input '{"run_key":"rollback-test","s3_bucket":"ohpen-bronze","s3_key":"test-data.csv"}'
# Monitor execution
aws stepfunctions describe-execution \
--execution-arn "..." \
--query 'status'
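Running describe-execution once only shows a snapshot, so monitoring usually means polling until the execution leaves RUNNING. The loop below is a sketch: wait_for_execution, POLL_SECONDS and MAX_POLLS are hypothetical names and defaults chosen for illustration, and the execution ARN is whatever start-execution returned.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll a Step Functions execution until it leaves RUNNING.
# Returns 0 on SUCCEEDED, 1 on any other terminal status or on timeout.
wait_for_execution() {
  local execution_arn="$1"
  local poll_seconds="${POLL_SECONDS:-15}" max_polls="${MAX_POLLS:-40}"
  local status i
  for ((i = 0; i < max_polls; i++)); do
    status="$(aws stepfunctions describe-execution \
      --execution-arn "$execution_arn" \
      --query 'status' --output text)"
    echo "status: $status"
    case "$status" in
      SUCCEEDED) return 0 ;;
      RUNNING) sleep "$poll_seconds" ;;
      *) return 1 ;;   # FAILED, TIMED_OUT, ABORTED, ...
    esac
  done
  echo "gave up after ${max_polls} polls" >&2
  return 1
}

# Example (not executed here; pass the ARN printed by start-execution):
# wait_for_execution "$EXECUTION_ARN" && echo "ETL verification passed"
```

The poll limit keeps the verification step from hanging indefinitely if the pipeline stalls after the rollback.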
Step 8: Document Rollback
Update incident log:
## Rollback: [Date]
**Reason**: [Why rollback was needed]
**Deployment SHA**: [Git commit that was rolled back]
**Rolled back to**: [Git commit of restored state]
**Rollback performed by**: [Operator name]
**Rollback timestamp**: [ISO 8601 timestamp]
**Verification**: [Smoke tests passed? ETL pipeline tested?]
**Root cause**: [What caused the need for rollback]
**Prevention**: [How to prevent similar issues]
Recovery from Failed Rollback
If Rollback Fails
Scenario: terraform apply during rollback fails with errors
Recovery procedure:
Step 1: Restore Pre-Rollback State
# Push the backed-up current state
terraform state push terraform.tfstate.current-backup
Step 2: Diagnose Issue
# Check Terraform error messages
terraform plan
# Common issues:
# - Resource dependencies prevent deletion
# - IAM permissions insufficient
# - API rate limits hit
Step 3: Manual Resource Cleanup (If Necessary)
# Example: Delete stuck resource manually
aws glue delete-job --job-name ohpen-transaction-etl-stuck
# Then retry rollback
terraform plan
terraform apply
Step 4: Escalate to AWS Support (If Blocked)
- Open AWS Support case
- Provide Terraform error logs
- Include resource ARNs that are stuck
Rollback Limitations & Considerations
What Rollback CANNOT Fix
- Data Loss: Rollback does NOT restore deleted S3 data
  - Mitigation: S3 versioning enabled (can restore objects manually)
- Historical Executions: Past Step Functions executions are not affected
  - Mitigation: New executions use rolled-back configuration
- External Dependencies: Third-party integrations may be out of sync
  - Mitigation: Document external dependencies; coordinate rollback
Rollback Risks
- State Drift: Manual changes to infrastructure (outside Terraform) are lost
  - Prevention: Never modify infrastructure manually; always use Terraform
- Concurrent Deployments: Rollback during active deployment causes conflicts
  - Prevention: Lock deployments (only one CD run at a time)
- Partial Rollback: Some resources may fail to revert
  - Mitigation: Verify all resources post-rollback; manually fix if needed
When NOT to Rollback
- Data pipeline is running: Wait for current ETL runs to complete
- Rollback would cause data loss: Evaluate if forward fix is safer
- Issue is non-critical: Consider hotfix deployment instead
Rollback Testing (Dry-Run)
Test Rollback Without Affecting Production
Use case: Verify rollback procedure works before emergency
Procedure:
- Use separate AWS account (staging/test)
- Deploy infrastructure
- Intentionally break configuration (e.g., wrong IAM policy)
- Run a manual rollback (or the automated rollback job, if one has been added)
- Verify infrastructure restored correctly
Frequency: Quarterly rollback drill (practice makes perfect)
Rollback Metrics & Monitoring
Key Metrics to Track
- Rollback Frequency: How often rollbacks occur
  - Target: <1 rollback per quarter
  - Alert: >2 rollbacks per month indicates CI/CD issues
- Rollback Duration: Time from failure detection to restored state
  - Target: <10 minutes (automated), <30 minutes (manual)
- Rollback Success Rate: % of rollbacks that succeed on first attempt
  - Target: >95%
- Mean Time to Recovery (MTTR): Time from deployment failure to fully operational
  - Target: <1 hour
Dashboards
CloudWatch Dashboard (future enhancement):
- Deployment success/failure count (past 30 days)
- Rollback count and duration
- MTTR trend
Rollback Scripts Reference
Rollback Script
Location: tasks/devops_cicd/scripts/rollback_terraform.sh
Behavior: Re-applies Terraform from a previous Git revision (default: HEAD~1). Does not use Terraform state push; it checks out the target ref and runs terraform init, plan, and (if --apply) apply. Dry-run by default; use --apply to execute.
Usage:
# From repo root; requires AWS credentials and Terraform
./tasks/devops_cicd/scripts/rollback_terraform.sh [--dry-run] [--apply] [--target-version REF]
- --dry-run (default): Print what would be done; no state or infra changes.
- --apply: Actually checkout target ref and run terraform apply.
- --target-version REF: Git ref (SHA, tag, or branch); default HEAD~1.
Audit: Script prints timestamp, current HEAD, target ref, and dry-run flag to stdout. For production, run with --dry-run first, then --apply when rollback is confirmed.
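The dry-run-by-default pattern behind these flags can be sketched as below. This is not the repository's rollback_terraform.sh; it only illustrates how the documented flags might map onto a mode and a target ref, and the names parse_rollback_args, audit_line, MODE and TARGET_REF are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Parse the documented flags into MODE and TARGET_REF.
# Nothing is executed unless --apply is passed explicitly.
parse_rollback_args() {
  MODE="dry-run"
  TARGET_REF="HEAD~1"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dry-run) MODE="dry-run" ;;
      --apply) MODE="apply" ;;
      --target-version)
        [ "$#" -ge 2 ] || { echo "--target-version needs a REF" >&2; return 2; }
        TARGET_REF="$2"
        shift
        ;;
      *) echo "unknown argument: $1" >&2; return 2 ;;
    esac
    shift
  done
}

# Print an audit line with timestamp, current HEAD, target ref, and mode.
audit_line() {
  local head_sha="$1"
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) head=${head_sha} target=${TARGET_REF} mode=${MODE}"
}

# Example (not executed here):
# parse_rollback_args "$@"
# audit_line "$(git rev-parse HEAD)"
```

Making the destructive path opt-in means a mistyped or incomplete command falls back to printing the plan of action instead of changing infrastructure.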
Smoke Tests Script
Location: tasks/devops_cicd/scripts/smoke_tests.sh
Checks performed:
- S3 bucket existence and accessibility (ohpen-bronze, ohpen-silver, ohpen-artifacts)
- Glue job ohpen-transaction-etl-spark existence
- Step Functions state machine ohpen-etl-orchestration existence and ACTIVE status
- Lambda ohpen-read-run-summary existence
- Optional: Lambda invoke (set SMOKE_INVOKE_LAMBDA=1)
Exit codes:
- 0: All tests passed
- 1: One or more tests failed (after one retry)
Rollback Checklist
Pre-Rollback
- Verify rollback is necessary (issue cannot be forward-fixed)
- Identify target state version (previous known-good state)
- Notify stakeholders of impending rollback
- Stop any active ETL runs (if safe to do so)
- Backup current Terraform state
During Rollback
- Download target state version
- Push target state to S3 backend
- Run terraform plan to verify changes
- Apply rollback configuration
- Monitor rollback progress
Post-Rollback
- Run smoke tests to verify infrastructure health
- Test ETL pipeline end-to-end
- Verify monitoring/alerting operational
- Document rollback in incident log
- Analyze root cause and create prevention plan
- Communicate rollback completion to stakeholders
See also
- CI/CD Workflow - Rollback procedures and smoke tests
- Traceability Design - Run identity and execution context
- Backfill Playbook - Reprocessing historical data
Summary
Key Takeaways:
- ⚠️ Automated rollback on smoke test failure is not implemented by design; the CD workflow fails and an operator decides whether to roll back
- ✅ Manual rollback available for issues discovered later
- ✅ Terraform state versioning enables safe rollbacks to any previous state
- ⚠️ Rollback does NOT restore deleted data (use S3 versioning for that)
- ✅ Test rollback procedure quarterly (practice before emergency)
Emergency Rollback Hotline: [escalation contacts]
Next Steps:
- Decide whether to automate rollback; the scripts it would use (smoke_tests.sh, rollback_terraform.sh) already exist
- If so, update the CD workflow to call rollback on smoke test failures, behind an approval gate
- Test rollback in staging environment
- Document rollback in team runbooks