© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Operational Runbooks
Overview
This page consolidates the operational procedures and troubleshooting guides that are otherwise scattered across the documentation. Each runbook gives step-by-step instructions for a common operational scenario, a failure recovery, or a troubleshooting task. The runbooks apply to the OLAP analytics platform; source systems are upstream (see Scope & Assumptions).
Extended context: Runtime Scenarios, Data Lake Architecture.
🤖 GenAI possibility: Runbook step expansion, alarm runbook suggestions, and post-mortem drafts from logs are good candidates for Bedrock. See GenAI in the Ohpen Case & Opportunities.
Available Runbooks
Production Operations
- Backfill Playbook - Step-by-step procedure for reprocessing historical data safely using run_id isolation
  - When to use: Schema fixes, bug fixes, data corrections, quarantine recovery
  - Safety guarantee: No overwrites, complete audit trail
  - See also: Runtime Scenarios - Backfill, Data Lake Architecture - Backfills
- Schema Evolution Playbook - Deploying schema changes with backward compatibility
  - When to use: Adding nullable columns, changing data types, schema versioning
  - Safety guarantee: Dual-read period, gradual migration, no breaking changes
  - See also: Runtime Scenarios - Schema Evolution, Parquet Schema Specification
- Rollback Playbook - Automated and manual rollback procedures for infrastructure deployments
  - When to use: Failed smoke tests, configuration errors, breaking changes, performance degradation
  - Safety guarantee: Terraform state versioning, precise rollback to previous known-good state
  - See also: Runtime Scenarios - Failure Recovery, CI/CD Workflow - Rollback
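
The run_id isolation that the Backfill Playbook relies on can be illustrated with a minimal Python sketch. It assumes a hypothetical prefix layout (dataset/event_date=.../run_id=.../) and hypothetical helper names (new_run_id, run_prefix); neither is taken from the platform's actual code. The idea it shows is that every backfill writes under a fresh run_id, so the output of an earlier run is never overwritten.

```python
import uuid
from datetime import datetime, timezone


def new_run_id(now=None):
    """Return a time-prefixed, unique run identifier (hypothetical format)."""
    now = now or datetime.now(timezone.utc)
    # Timestamp for readability, random suffix so two runs never collide.
    return f"{now:%Y%m%dT%H%M%SZ}-{uuid.uuid4().hex[:8]}"


def run_prefix(base, dataset, event_date, run_id):
    """Build the output prefix for one run of one partition.

    The layout is an assumption for illustration only.
    """
    return f"{base.rstrip('/')}/{dataset}/event_date={event_date}/run_id={run_id}/"


if __name__ == "__main__":
    rid = new_run_id()
    # "example-bucket" and "transactions" are placeholder names.
    print(run_prefix("s3://example-bucket/silver", "transactions", "2026-01-15", rid))
```

Because the run_id is part of the path, a rerun for the same event_date lands next to the earlier output instead of replacing it, which is what keeps the audit trail complete.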
Incident Response
- Quarantine Review Workflow - Human review and retry process for quarantined data
  - When to use: Validation failures requiring human review, quarantine rate spikes
  - Process: Review → Identify root cause → Correct source/ETL → Retry → Promote or Condemn
  - See also: Runtime Scenarios - Quarantine Retry, ETL Flow - Error Handling
- Failure Recovery - Recovering from ETL run failures and deployment failures
  - When to use: ETL job failures, Step Functions execution failures, deployment smoke test failures
  - Process: Identify failure → Review execution history → Fix root cause → Retry or rollback
  - See also: Data Lake Architecture - Failure Mode Analysis, CI/CD Workflow - Failure Handling
- Circuit Breaker Triggered - Responding when the circuit breaker halts the pipeline
  - When to use: More than 100 occurrences of the same error within one hour, pipeline halted automatically
  - Process: Investigate error pattern → Fix root cause → Clear circuit breaker → Restart pipeline
  - See also: Audit & Notifications, ETL Flow - Circuit Breaker
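
The circuit breaker threshold above (more than 100 of the same error per hour) can be sketched as a sliding-window counter. This is an illustration only: the class name, the error_key argument, and the use of plain numeric timestamps are assumptions, and the platform's real breaker may be implemented quite differently.

```python
from collections import defaultdict, deque


class CircuitBreaker:
    """Opens when one error key occurs more than `threshold` times per window."""

    def __init__(self, threshold=100, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # error_key -> recent timestamps
        self.is_open = False

    def record(self, error_key, ts):
        """Record one error at time `ts` (seconds); return True if the breaker is open."""
        events = self._events[error_key]
        events.append(ts)
        # Drop occurrences that have aged out of the window.
        while events and events[0] <= ts - self.window_seconds:
            events.popleft()
        if len(events) > self.threshold:
            self.is_open = True
        return self.is_open

    def clear(self):
        """Manual reset, corresponding to the 'Clear circuit breaker' step."""
        self._events.clear()
        self.is_open = False
```

Counting per error key rather than across all errors matches the "same errors" wording: a burst of unrelated failures does not trip the breaker, while a single repeating fault does.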
Troubleshooting
- Common Failure Scenarios - Analysis of what breaks if critical components are removed
  - Scenarios: Missing _SUCCESS marker, missing _LATEST.json, missing Glue Data Catalog, removed quarantine layer
  - Impact: Incomplete queries, promotion failures, query failures, audit trail loss
  - See also: Data Lake Architecture - Failure Mode Analysis
- Query Performance Issues - Troubleshooting slow Athena queries
  - Common causes: Missing WHERE clause (full table scan), partition pruning not working, large partition sizes
  - Solutions: Add partition filters, verify partition metadata, optimize Parquet file sizes
  - See also: SQL Breakdown - Partition Pruning, Tooling & Controls - Athena
- Deployment Failures - Troubleshooting CI/CD pipeline failures
  - Common causes: Terraform state conflicts, IAM permission errors, smoke test failures
  - Solutions: Check Terraform state, verify IAM policies, review smoke test logs
  - See also: CI/CD Workflow - Smoke Tests, Rollback Playbook
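
A missing _SUCCESS marker is listed above as a cause of incomplete queries. The sketch below shows one way a reader could guard against it, using the local filesystem as a stand-in for object storage. The directory layout (one subdirectory per run) and the function name complete_runs are hypothetical.

```python
from pathlib import Path

SUCCESS_MARKER = "_SUCCESS"


def complete_runs(partition_dir):
    """Return the names of run directories that contain a _SUCCESS marker.

    Runs without the marker are treated as incomplete and skipped.
    """
    root = Path(partition_dir)
    return sorted(
        run.name
        for run in root.iterdir()
        if run.is_dir() and (run / SUCCESS_MARKER).is_file()
    )
```

Treating the marker as the only signal of completeness means a half-written run is ignored rather than queried, at the cost of also ignoring a finished run whose marker was lost.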
Runbook Template
For future runbooks, use this standard template:
Runbook: [Title]
Use Case: [When to use this runbook]
Safety: [Safety guarantees or precautions]
Symptoms
- [Observable symptoms that indicate this scenario]
Root Causes
- [Common root causes]
Detection
- [How to detect this scenario (alerts, logs, metrics)]
Mitigation Steps
- [Step 1]
- [Step 2]
- [Step 3]
Verification
- [How to verify the issue is resolved]
Prevention
- [How to prevent this scenario in the future]
See also
- [Links to relevant architecture, implementation, or operational docs]
Operational Procedures by Category
Data Quality
- Quarantine Review: Quarantine Review Workflow - Human review and retry process
- Circuit Breaker: Circuit Breaker Response - Pipeline halt recovery
- Validation Failures: ETL Flow - Error Handling - Understanding validation errors
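
The quarantine review process (Review → Identify root cause → Correct source/ETL → Retry → Promote or Condemn) can be read as a small state machine. The sketch below encodes it that way; the Stage names and the advance helper are illustrative, and the strict linear ordering with no loop back from Retry to Review is an assumption, since the source does not say what happens when a retry fails again.

```python
from enum import Enum


class Stage(Enum):
    REVIEW = "review"
    ROOT_CAUSE = "identify_root_cause"
    CORRECT = "correct_source_or_etl"
    RETRY = "retry"
    PROMOTED = "promoted"
    CONDEMNED = "condemned"


# Allowed transitions; PROMOTED and CONDEMNED are terminal.
ALLOWED = {
    Stage.REVIEW: {Stage.ROOT_CAUSE},
    Stage.ROOT_CAUSE: {Stage.CORRECT},
    Stage.CORRECT: {Stage.RETRY},
    Stage.RETRY: {Stage.PROMOTED, Stage.CONDEMNED},
    Stage.PROMOTED: set(),
    Stage.CONDEMNED: set(),
}


def advance(current, target):
    """Move a quarantined batch to `target`, rejecting skipped steps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Rejecting out-of-order transitions makes it impossible to promote data that has not been through a retry, which is the property the human review step is meant to protect.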
Data Operations
- Backfills: Backfill Playbook - Reprocessing historical data
- Schema Evolution: Schema Evolution Strategy - Deploying schema changes
- Promotion: ETL Flow - Promotion Gate - Promoting Silver to production
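
For schema evolution, a pre-deployment check can separate additive changes (new nullable columns) from changes that need the dual-read migration path. The following is a minimal sketch under stated assumptions: schemas are represented as plain dictionaries of a hypothetical Column tuple, and the rules (removed columns, type changes, and new non-nullable columns are flagged) are a common convention rather than the platform's documented policy.

```python
from typing import NamedTuple


class Column(NamedTuple):
    dtype: str
    nullable: bool


def breaking_changes(old, new):
    """List the changes between two schemas that are not purely additive."""
    problems = []
    for name, col in old.items():
        if name not in new:
            problems.append(f"removed column: {name}")
        elif new[name].dtype != col.dtype:
            problems.append(f"type change: {name} {col.dtype} -> {new[name].dtype}")
    for name, col in new.items():
        if name not in old and not col.nullable:
            problems.append(f"new non-nullable column: {name}")
    return problems


if __name__ == "__main__":
    v1 = {"id": Column("bigint", False), "amount": Column("double", True)}
    v2 = {**v1, "currency": Column("string", True)}  # additive, nullable
    print(breaking_changes(v1, v2))  # an empty list means no flagged changes
```

An empty result suggests the change can ship without a migration; anything else is a prompt to follow the playbook's gradual migration steps.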
Infrastructure Operations
- Rollback: Rollback Playbook - Infrastructure rollback procedures
- Deployment: CI/CD Workflow - Deployment automation
- Monitoring: Audit & Notifications - CloudWatch and alerting
Troubleshooting
- Query Performance: SQL Breakdown - Partition Pruning - Query optimization
- ETL Failures: Runtime Scenarios - Failure Recovery - ETL failure recovery
- Common Failures: Data Lake Architecture - Failure Mode Analysis - Failure scenario analysis
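
The "add partition filters" advice for slow Athena queries can be made concrete with a small query builder that refuses to produce a query without a partition predicate. This is a sketch, not the project's tooling: the function name, the table and column names in the usage note, and the assumption that the partition column holds ISO-formatted date strings are all illustrative.

```python
import re
from datetime import date

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def partitioned_query(table, partition_column, start, end, columns=None):
    """Build a SELECT that always filters on the partition column.

    Assumes the partition column stores dates as 'YYYY-MM-DD' strings.
    """
    names = [table, partition_column, *(columns or [])]
    for name in names:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if start > end:
        raise ValueError("start must not be after end")
    cols = ", ".join(columns) if columns else "*"
    return (
        f"SELECT {cols} FROM {table} "
        f"WHERE {partition_column} BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )
```

For example, partitioned_query("transactions", "event_date", date(2026, 1, 1), date(2026, 1, 31)) yields a query limited to January's partitions, so the engine can prune the rest instead of scanning the full table.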
See also
- Runtime Scenarios - Operational flows and failure recovery
- Data Lake Architecture - Architecture patterns and failure modes
- ETL Flow - ETL pipeline logic and error handling
- CI/CD Workflow - Deployment automation and rollback
- Governance Diagrams - Approval workflows and ownership models