
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Deployment Rollback Playbook

Use Case: Recover from failed infrastructure deployments
Safety: Terraform state versioning enables safe rollbacks to previous configurations

Overview

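This playbook provides procedures for rolling back infrastructure changes when deployments fail or introduce issues. Rollback is operator-initiated: smoke tests in the CD workflow detect a bad deployment and fail the workflow, but they do not roll anything back automatically. Fully automated rollback is covered as an optional future extension. Part of Operational Runbooks.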

GenAI: Rollback impact summaries (“which tables/jobs are affected”) and security finding explanations are natural fits for Bedrock. See GenAI in the Ohpen Case & Opportunities.


When Rollback Is Needed

Common Scenarios

  1. Failed Smoke Tests

    • Infrastructure deployed but basic health checks fail
    • Example: Glue job exists but not executable, S3 bucket created but not accessible
  2. Configuration Errors

    • Incorrect IAM permissions prevent ETL execution
    • Wrong S3 bucket names or ARNs in state machine
  3. Breaking Changes

    • Schema evolution breaks existing Athena queries
    • Lambda function code change causes runtime errors
  4. Performance Degradation

    • New Glue job configuration causes 10x duration increase
    • CloudWatch alarms trigger immediately after deployment
  5. Security Issues

    • IAM policy too permissive (discovered in audit)
    • Encryption disabled accidentally

Rollback Strategy Overview

Key Concepts

  1. Terraform State Versioning: S3 backend stores previous state versions; rollback re-applies from a previous Git revision (see rollback script).
  2. Smoke Tests: Basic health checks run post-deployment in CD; implemented in tasks/devops_cicd/scripts/smoke_tests.sh.
  3. Rollback: Manual (or approval-gated). The CD workflow does not run rollback automatically when smoke fails; the workflow fails and an operator runs the rollback script if needed.
  4. Manual Rollback: Operator-initiated using rollback_terraform.sh (dry-run by default; use --apply to execute).

Post-Deployment Smoke Tests (Implemented)

How It Works

CD Workflow (tasks/devops_cicd/.github/workflows/cd.yml):

  1. Terraform apply completes (job: deploy).
  2. Smoke tests run in a separate job (smoke) via tasks/devops_cicd/scripts/smoke_tests.sh.
  3. If tests PASS: Deployment and smoke succeed; workflow is green.
  4. If tests FAIL: The smoke job fails and the CD workflow fails. No automatic rollback runs; an operator can run the rollback script manually or trigger a rollback workflow with approval.

Smoke Test Checks

The smoke_tests.sh script verifies:

  • S3 buckets exist and are accessible (ohpen-bronze, ohpen-silver, ohpen-artifacts)
  • Glue job ohpen-transaction-etl-spark exists
  • Step Functions state machine ohpen-etl-orchestration exists and is ACTIVE
  • Lambda ohpen-read-run-summary exists

Optional: set SMOKE_INVOKE_LAMBDA=1 to invoke the Lambda (proves invocability). Timeout: Script is designed to complete in under 2 minutes. Retry: One retry after 30s on failure (for eventual consistency).
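The retry behaviour described above (one retry after 30s) can be sketched as a small wrapper. This is an illustration only, not the contents of smoke_tests.sh: run_with_retry, check_bucket, check_glue_job, and the SMOKE_RETRY_DELAY variable are hypothetical names, and the real script may structure its checks differently.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and SMOKE_RETRY_DELAY are assumptions, not taken from smoke_tests.sh.

# Run a check command; if it fails, wait once and try exactly one more time.
run_with_retry() {
  local delay="${SMOKE_RETRY_DELAY:-30}"
  if "$@"; then
    return 0
  fi
  echo "check failed, retrying in ${delay}s: $*" >&2
  sleep "$delay"
  "$@"
}

# Example checks built on AWS CLI calls that exit non-zero when the resource is missing.
check_bucket() { aws s3api head-bucket --bucket "$1" >/dev/null 2>&1; }
check_glue_job() { aws glue get-job --job-name "$1" >/dev/null 2>&1; }
```

A caller would then write, for example, run_with_retry check_bucket ohpen-bronze || exit 1, so a transient eventual-consistency failure gets one more chance before the script exits with 1.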

Automated Rollback (Optional / Future)

Fully automated rollback on smoke failure is not implemented in the CD workflow by design: rollback is destructive and should be a deliberate step. To add it later:

  1. Add a CD job that runs only when the smoke job fails.
  2. That job runs rollback_terraform.sh --apply (or uses a protected environment with approval).
  3. Optionally send SNS to ohpen-etl-failures with rollback details.

Current recommendation: keep rollback manual or approval-gated; use the rollback script when smoke fails or when issues are discovered later.
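If automated rollback is added later, the three steps above could look roughly like the following. This is a sketch under assumptions: smoke_then_rollback is a hypothetical function, the SNS topic ARN is passed in by the caller (only the topic name ohpen-etl-failures comes from this playbook), and a real CD job would still need an approval gate.

```shell
#!/usr/bin/env bash
# Sketch only: smoke_then_rollback is a hypothetical helper, not part of the CD workflow.
SMOKE_SCRIPT="${SMOKE_SCRIPT:-./tasks/devops_cicd/scripts/smoke_tests.sh}"
ROLLBACK_SCRIPT="${ROLLBACK_SCRIPT:-./tasks/devops_cicd/scripts/rollback_terraform.sh}"

# Prints one of: smoke-ok, rolled-back, rollback-failed.
# Returns 0 when smoke passes, 1 after a completed rollback, 2 if rollback itself fails.
smoke_then_rollback() {
  local topic_arn="$1"
  if "$SMOKE_SCRIPT" >&2; then
    echo "smoke-ok"
    return 0
  fi
  if ! "$ROLLBACK_SCRIPT" --apply >&2; then
    echo "rollback-failed"
    return 2
  fi
  # Notify the failure topic; a notification error must not mask the rollback result.
  aws sns publish \
    --topic-arn "$topic_arn" \
    --subject "Rollback applied" \
    --message "Smoke tests failed; rollback applied at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2 || true
  echo "rolled-back"
  return 1
}
```

The function still returns non-zero after a successful rollback so that the workflow stays red and the failed deployment gets investigated.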


Manual Rollback Procedure

Use Cases for Manual Rollback

  • Smoke tests failed in CD and the operator decides to roll back
  • Smoke tests passed but issues were discovered later
  • Need to roll back to state older than previous deployment
  • Testing rollback procedure (dry-run)

Prerequisites

  1. AWS CLI configured with appropriate credentials
  2. Terraform installed (same version as CD workflow: 1.5.0)
  3. Access to S3 state bucket: ohpen-terraform-state
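  4. IAM permissions covering the state bucket and every service Terraform manages: s3:*, states:*, glue:*, lambda:*, iam:*, etc. (Terraform has no IAM namespace of its own; it acts with the caller's AWS permissions.)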

Step 1: List Available State Versions

# List all state versions (sorted by modification time)
aws s3api list-object-versions \
--bucket ohpen-terraform-state \
--prefix ohpen-data-lake/terraform.tfstate \
--query 'Versions[*].[VersionId,LastModified,IsLatest]' \
--output table

Output:

-------------------------------------------------------------------
| ListObjectVersions |
+-----------------------------+-------------------+---------------+
| VersionId | LastModified | IsLatest |
+-----------------------------+-------------------+---------------+
| ABC123DEF456 | 2026-01-29T15:00 | True | <- Current (bad)
| GHI789JKL012 | 2026-01-28T14:00 | False | <- Previous (good)
| MNO345PQR678 | 2026-01-27T13:00 | False | <- Older
+-----------------------------+-------------------+---------------+

Identify target version: Copy VersionId of desired state (e.g., GHI789JKL012 for previous deployment)
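When the target is simply the previous deployment, the version ID can be selected with a JMESPath query instead of copying it by hand. A minimal sketch, assuming S3 lists versions of a key newest first (its documented ordering); previous_state_version is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: previous_state_version is a hypothetical helper.

# Print the VersionId of the most recent non-current version of the state object.
# The Key filter excludes other objects that merely share the prefix.
previous_state_version() {
  local bucket="$1" key="$2"
  aws s3api list-object-versions \
    --bucket "$bucket" \
    --prefix "$key" \
    --query "Versions[?Key=='${key}' && IsLatest==\`false\`] | [0].VersionId" \
    --output text
}
```

Usage: export TARGET_VERSION_ID="$(previous_state_version ohpen-terraform-state ohpen-data-lake/terraform.tfstate)". Check the value against the table output before continuing; the query prints None when no older version exists.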

Step 2: Download Target State Version

# Set target version ID
export TARGET_VERSION_ID="GHI789JKL012"

# Download target state
aws s3api get-object \
--bucket ohpen-terraform-state \
--key ohpen-data-lake/terraform.tfstate \
--version-id "$TARGET_VERSION_ID" \
terraform.tfstate.backup

Verify download:

# Check file size (should be ~100-500KB for typical state)
ls -lh terraform.tfstate.backup

# Verify JSON structure (parse the whole file; a truncated head is not valid JSON)
jq '.version, .serial, .lineage' terraform.tfstate.backup

Step 3: Back Up Pre-Rollback State (Safety)

# Download current state before rollback
aws s3 cp s3://ohpen-terraform-state/ohpen-data-lake/terraform.tfstate \
terraform.tfstate.current-backup

Important: Keep this backup in case rollback needs to be undone.
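To avoid overwriting an earlier backup during repeated rollback attempts, the copy can be written to a timestamped file and checked for content. This is a sketch with a hypothetical helper name (backup_current_state) and a hypothetical file naming scheme; the steps in this playbook use the fixed name terraform.tfstate.current-backup.

```shell
#!/usr/bin/env bash
# Sketch only: backup_current_state and the file naming scheme are assumptions.

# Copy the current remote state to a timestamped local file and print its name.
# Fails if the download errors or produces an empty file.
backup_current_state() {
  local bucket="$1" key="$2"
  local dest
  dest="terraform.tfstate.pre-rollback.$(date -u +%Y%m%dT%H%M%SZ)"
  aws s3 cp "s3://${bucket}/${key}" "$dest" >&2 || return 1
  if [ ! -s "$dest" ]; then
    echo "backup is empty: $dest" >&2
    return 1
  fi
  echo "$dest"
}
```

Usage: BACKUP_FILE="$(backup_current_state ohpen-terraform-state ohpen-data-lake/terraform.tfstate)". Record the printed file name in the incident log so the recovery procedure can find it.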


Data Rollback (Reverting _LATEST.json)

Owner: Data Platform Team. Reverting the promoted Silver (or Gold) pointer is a data rollback, not infrastructure rollback. Follow change control and, where applicable, Human Validation Policy (post-hoc overrides and rollback ownership).

Use case: A promoted run was bad (e.g. high quarantine rate discovered after promotion); revert _LATEST.json and current/ to the previous run so consumers see the prior good data.

Procedure (Silver example):

  1. Identify the previous good run_id (from CloudWatch, _SUCCESS metadata, or run history).
  2. Update _LATEST.json to point to that run (e.g. copy from the run’s metadata or construct the pointer).
  3. Copy that run’s data to the current/ prefix (or update the pointer as per your safe-publishing pattern).
  4. Log the rollback (who, when, reason, previous run_id) in your incident/audit log.

See Backfill Playbook - Revert _LATEST.json for an example of reverting _LATEST.json.
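Steps 2 and 4 of the data rollback can be combined by writing the audit fields into the pointer itself. The sketch below is hypothetical throughout: the real _LATEST.json schema is defined by the pipeline's safe-publishing pattern and may differ, the field names are invented for illustration, and the pointer location is supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: the pointer fields below are assumptions, not the pipeline's actual schema.

# Print a pointer document for the given run.
# Arguments must not contain double quotes or backslashes (no JSON escaping is done).
build_latest_pointer() {
  local run_id="$1" operator="$2" reason="$3"
  printf '{"run_id":"%s","rolled_back_by":"%s","reason":"%s","updated_at":"%s"}\n' \
    "$run_id" "$operator" "$reason" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Upload the pointer to the S3 URI given as the first argument.
publish_latest_pointer() {
  local pointer_uri="$1"
  shift
  build_latest_pointer "$@" | aws s3 cp - "$pointer_uri" --content-type application/json
}
```

Print the pointer with build_latest_pointer first and review it before calling publish_latest_pointer, since consumers read the new pointer as soon as it is uploaded.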


Step 4: Push Previous State

# Navigate to Terraform working directory
cd /path/to/ohpen-case-2026/tasks/devops_cicd/infra/terraform

# Initialize Terraform with backend config
terraform init \
-backend-config="bucket=ohpen-terraform-state" \
-backend-config="key=ohpen-data-lake/terraform.tfstate" \
-backend-config="region=eu-west-1" \
-reconfigure

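# Push previous state.
# The target state carries a lower serial than the current remote state, so
# Terraform refuses the push unless -force is given. Only force the push after
# the Step 3 backup exists.
terraform state push -force terraform.tfstate.backup

Output: terraform init should print "Successfully configured the backend "s3"!"; a successful terraform state push exits with status 0.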

Step 5: Terraform Plan (Verify Rollback Changes)

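# Run plan to see what will change.
# The plan compares the pushed state against the configuration in the working
# directory, so check out the known-good Git revision first; planning against
# the failed revision would propose the bad changes again.
terraform plan -out=rollback.tfplan

# Review plan output carefully
# Expected: Resources revert to previous configuration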

Critical Review Points:

  • Resources updated/recreated match expected rollback targets
  • ❌ No unexpected deletions (would cause data loss)
  • ❌ No new resources created (indicates wrong state version selected)
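The review points above can be backed by a mechanical check before a human reads the plan. A minimal sketch using terraform plan's -detailed-exitcode option (0 = no changes, 1 = error, 2 = changes present) and the "will be destroyed" marker that Terraform prints in human-readable plans; plan_rollback and count_planned_destroys are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and do not replace a manual plan review.

# Create rollback.tfplan and classify the result as no-changes, changes, or error.
plan_rollback() {
  local rc=0
  terraform plan -detailed-exitcode -out=rollback.tfplan >&2 || rc=$?
  case "$rc" in
    0) echo "no-changes" ;;
    2) echo "changes" ;;
    *) echo "error"; return 1 ;;
  esac
}

# Count resources the saved plan would destroy outright (replacements are not counted).
count_planned_destroys() {
  terraform show -no-color "$1" | grep -c 'will be destroyed' || true
}
```

A result of no-changes after pushing the old state suggests the configuration and state already agree, which usually means the wrong state version or Git revision was selected. A non-zero destroy count should stop the rollback until each deletion is understood.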

Step 6: Apply Rollback

# Apply rollback plan
terraform apply rollback.tfplan

# Monitor progress
# Expected duration: 2-5 minutes

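Expected output: Apply complete! Resources: X added, Y changed, Z destroyed. Recreated resources count toward both "added" and "destroyed"; any destroy beyond those replacements is unexpected and should be investigated.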

Step 7: Verify Infrastructure Health

Run smoke tests manually:

# Navigate to scripts directory
cd /path/to/ohpen-case-2026/tasks/devops_cicd/scripts

# Run smoke tests
./smoke_tests.sh

# Expected: All tests pass

Test ETL pipeline:

# Trigger test Step Functions execution
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:eu-west-1:ACCOUNT_ID:stateMachine:ohpen-etl-orchestration" \
--name "rollback-verification-$(date -u +%Y%m%dT%H%M%SZ)" \
--input '{"run_key":"rollback-test","s3_bucket":"ohpen-bronze","s3_key":"test-data.csv"}'

# Monitor execution
aws stepfunctions describe-execution \
--execution-arn "..." \
--query 'status'
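Because describe-execution returns a single snapshot, verification usually needs a polling loop. A sketch with a hypothetical helper, wait_for_execution; the default attempt count and delay are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_execution and its defaults (60 attempts, 10s apart) are assumptions.

# Poll an execution until it reaches a terminal status or the attempt budget runs out.
# Prints the final status. Returns 0 on SUCCEEDED, 1 on a terminal failure,
# 2 if the AWS call fails, 3 if the execution is still running at the end.
wait_for_execution() {
  local arn="$1" attempts="${2:-60}" delay="${3:-10}"
  local status i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(aws stepfunctions describe-execution \
      --execution-arn "$arn" --query 'status' --output text)" || return 2
    case "$status" in
      SUCCEEDED) echo "$status"; return 0 ;;
      FAILED|TIMED_OUT|ABORTED) echo "$status"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "STILL_RUNNING"
  return 3
}
```

Pass the executionArn returned by start-execution as the first argument.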

Step 8: Document Rollback

Update incident log:

## Rollback: [Date]

**Reason**: [Why rollback was needed]
**Deployment SHA**: [Git commit that was rolled back]
**Rolled back to**: [Git commit of restored state]
**Rollback performed by**: [Operator name]
**Rollback timestamp**: [ISO 8601 timestamp]
**Verification**: [Smoke tests passed? ETL pipeline tested?]
**Root cause**: [What caused the need for rollback]
**Prevention**: [How to prevent similar issues]
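The mechanical fields of this template can be pre-filled at rollback time so that only the analysis fields remain to be written. A sketch with a hypothetical helper, write_rollback_entry; where the entry is stored is left to the team's incident process.

```shell
#!/usr/bin/env bash
# Sketch only: write_rollback_entry is a hypothetical helper.

# Print an incident-log entry with the date and timestamp filled in (UTC).
# Verification, root cause, and prevention stay as placeholders for the operator.
write_rollback_entry() {
  local reason="$1" bad_sha="$2" good_sha="$3" operator="$4"
  cat <<EOF
## Rollback: $(date -u +%Y-%m-%d)

**Reason**: ${reason}
**Deployment SHA**: ${bad_sha}
**Rolled back to**: ${good_sha}
**Rollback performed by**: ${operator}
**Rollback timestamp**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Verification**: [Smoke tests passed? ETL pipeline tested?]
**Root cause**: [What caused the need for rollback]
**Prevention**: [How to prevent similar issues]
EOF
}
```

The SHAs can come from git rev-parse HEAD and git rev-parse HEAD~1 when the default rollback target was used.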

Recovery from Failed Rollback

If Rollback Fails

Scenario: terraform apply during rollback fails with errors

Recovery procedure:

Step 1: Restore Pre-Rollback State

# Push the state that was backed up in Step 3 of the manual procedure
terraform state push terraform.tfstate.current-backup

Step 2: Diagnose Issue

# Check Terraform error messages
terraform plan

# Common issues:
# - Resource dependencies prevent deletion
# - IAM permissions insufficient
# - API rate limits hit

Step 3: Manual Resource Cleanup (If Necessary)

# Example: Delete stuck resource manually
aws glue delete-job --job-name ohpen-transaction-etl-stuck

# Then retry rollback
terraform plan
terraform apply

Step 4: Escalate to AWS Support (If Blocked)

  • Open AWS Support case
  • Provide Terraform error logs
  • Include resource ARNs that are stuck

Rollback Limitations & Considerations

What Rollback CANNOT Fix

  1. Data Loss: Rollback does NOT restore deleted S3 data

    • Mitigation: S3 versioning enabled (can restore objects manually)
  2. Historical Executions: Past Step Functions executions are not affected

    • Mitigation: New executions use rolled-back configuration
  3. External Dependencies: Third-party integrations may be out of sync

    • Mitigation: Document external dependencies; coordinate rollback

Rollback Risks

  1. State Drift: Manual changes to infrastructure (outside Terraform) are lost

    • Prevention: Never modify infrastructure manually; always use Terraform
  2. Concurrent Deployments: Rollback during active deployment causes conflicts

    • Prevention: Lock deployments (only one CD run at a time)
  3. Partial Rollback: Some resources may fail to revert

    • Mitigation: Verify all resources post-rollback; manually fix if needed

When NOT to Rollback

  1. Data pipeline is running: Wait for current ETL runs to complete
  2. Rollback would cause data loss: Evaluate if forward fix is safer
  3. Issue is non-critical: Consider hotfix deployment instead

Rollback Testing (Dry-Run)

Test Rollback Without Affecting Production

Use case: Verify rollback procedure works before emergency

Procedure:

  1. Use separate AWS account (staging/test)
  2. Deploy infrastructure
  3. Intentionally break configuration (e.g., wrong IAM policy)
  4. Run the manual rollback (rollback_terraform.sh with --dry-run first, then --apply)
  5. Verify infrastructure restored correctly

Frequency: Quarterly rollback drill (practice makes perfect)


Rollback Metrics & Monitoring

Key Metrics to Track

  1. Rollback Frequency: How often rollbacks occur

    • Target: <1 rollback per quarter
    • Alert: >2 rollbacks per month indicates CI/CD issues
  2. Rollback Duration: Time from failure detection to restored state

    • Target: <30 minutes (manual); <10 minutes if automated rollback is added later
  3. Rollback Success Rate: % of rollbacks that succeed on first attempt

    • Target: >95%
  4. Mean Time to Recovery (MTTR): Time from deployment failure to fully operational

    • Target: <1 hour

Dashboards

CloudWatch Dashboard (future enhancement):

  • Deployment success/failure count (past 30 days)
  • Rollback count and duration
  • MTTR trend

Rollback Scripts Reference

Rollback Script

Location: tasks/devops_cicd/scripts/rollback_terraform.sh

Behavior: Re-applies Terraform from a previous Git revision (default: HEAD~1). Does not use Terraform state push; it checks out the target ref and runs terraform init, plan, and (if --apply) apply. Dry-run by default; use --apply to execute.

Usage:

# From repo root; requires AWS credentials and Terraform
./tasks/devops_cicd/scripts/rollback_terraform.sh [--dry-run] [--apply] [--target-version REF]
  • --dry-run (default): Print what would be done; no state or infra changes.
  • --apply: Actually checkout target ref and run terraform apply.
  • --target-version REF: Git ref (SHA, tag, or branch); default HEAD~1.

Audit: Script prints timestamp, current HEAD, target ref, and dry-run flag to stdout. For production, run with --dry-run first, then --apply when rollback is confirmed.
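The "dry-run by default" behaviour described above can be illustrated with a small argument parser. This is a sketch of the pattern, not the actual contents of rollback_terraform.sh; parse_rollback_args and the MODE and TARGET_REF variables are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: illustrates the flag handling described above, not the real script.

# Parse rollback flags into MODE (dry-run or apply) and TARGET_REF (a Git ref).
# Nothing is applied unless --apply is passed explicitly.
parse_rollback_args() {
  MODE="dry-run"
  TARGET_REF="HEAD~1"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dry-run) MODE="dry-run" ;;
      --apply) MODE="apply" ;;
      --target-version)
        if [ "$#" -lt 2 ]; then
          echo "--target-version needs a REF" >&2
          return 2
        fi
        TARGET_REF="$2"
        shift
        ;;
      *) echo "unknown argument: $1" >&2; return 2 ;;
    esac
    shift
  done
}
```

Defaulting to the safe mode means a mistyped or forgotten flag results in a printout rather than an infrastructure change.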

Smoke Tests Script

Location: tasks/devops_cicd/scripts/smoke_tests.sh

Checks performed:

  • S3 bucket existence and accessibility (ohpen-bronze, ohpen-silver, ohpen-artifacts)
  • Glue job ohpen-transaction-etl-spark existence
  • Step Functions state machine ohpen-etl-orchestration existence and ACTIVE status
  • Lambda ohpen-read-run-summary existence
  • Optional: Lambda invoke (set SMOKE_INVOKE_LAMBDA=1)

Exit codes:

  • 0: All tests passed
  • 1: One or more tests failed (after one retry)

Rollback Checklist

Pre-Rollback

  • Verify rollback is necessary (issue cannot be forward-fixed)
  • Identify target state version (previous known-good state)
  • Notify stakeholders of impending rollback
  • Stop any active ETL runs (if safe to do so)
  • Backup current Terraform state

During Rollback

  • Download target state version
  • Push target state to S3 backend
  • Run terraform plan to verify changes
  • Apply rollback configuration
  • Monitor rollback progress

Post-Rollback

  • Run smoke tests to verify infrastructure health
  • Test ETL pipeline end-to-end
  • Verify monitoring/alerting operational
  • Document rollback in incident log
  • Analyze root cause and create prevention plan
  • Communicate rollback completion to stakeholders


Summary

Key Takeaways:

  1. Smoke tests run after every deployment; a failure fails the CD workflow, and an operator decides whether to roll back
  2. Manual rollback available for issues discovered later
  3. Terraform state versioning enables safe rollbacks to any previous state
  4. ⚠️ Rollback does NOT restore deleted data (use S3 versioning for that)
  5. Test rollback procedure quarterly (practice before emergency)

Emergency Rollback Hotline: [escalation contacts]

Next Steps:

  1. Review the implemented scripts (smoke_tests.sh, rollback_terraform.sh) and run the rollback script in dry-run mode
  2. Optionally add an approval-gated CD job that calls rollback on smoke test failures
  3. Test rollback in staging environment
  4. Document rollback in team runbooks
© 2026 Stephen Adei. CC BY 4.0