© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Operational Runbooks

Overview

This page consolidates the operational procedures and troubleshooting guides that are otherwise scattered across the documentation. Each runbook gives a step-by-step procedure for a common operational scenario, failure recovery, or troubleshooting task. The runbooks apply to the OLAP analytics platform only; source systems are upstream and out of scope (see Scope & Assumptions).

Extended context: Runtime Scenarios, Data Lake Architecture.

🤖 GenAI possibility: runbook step expansion, alarm runbook suggestions, and post-mortem drafts from logs are good candidates for Bedrock. See GenAI in the Ohpen Case & Opportunities.

Available Runbooks

Production Operations

Backfill Playbook - Step-by-step procedure for reprocessing historical data safely using run_id isolation
- When to use: Schema fixes, bug fixes, data corrections, quarantine recovery
- Safety guarantee: No overwrites, complete audit trail
- See also: Runtime Scenarios - Backfill, Data Lake Architecture - Backfills
Schema Evolution Playbook - Deploying schema changes with backward compatibility
- When to use: Adding nullable columns, changing data types, schema versioning
- Safety guarantee: Dual-read period, gradual migration, no breaking changes
- See also: Runtime Scenarios - Schema Evolution, Parquet Schema Specification
Rollback Playbook - Automated and manual rollback procedures for infrastructure deployments
- When to use: Failed smoke tests, configuration errors, breaking changes, performance degradation
- Safety guarantee: Terraform state versioning, precise rollback to previous known-good state
- See also: Runtime Scenarios - Failure Recovery, CI/CD Workflow - Rollback
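The "no overwrites" guarantee of the Backfill Playbook follows from run_id isolation: every run writes under its own prefix, so a backfill never touches earlier output. A minimal sketch of that idea is shown below. The timestamp format, the `run_id=` directory naming, and the helper names `new_run_id`, `run_prefix`, and `safe_to_write` are assumptions made for illustration, not the platform's actual layout.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Set


def new_run_id(now: datetime) -> str:
    """Build a sortable run identifier from a UTC timestamp (hypothetical format)."""
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def run_prefix(base: str, partition: str, run_id: str) -> str:
    """Return the output prefix for one run; each run_id gets its own directory."""
    return str(PurePosixPath(base) / partition / f"run_id={run_id}")


def safe_to_write(existing_prefixes: Set[str], target: str) -> bool:
    """A backfill may only write to a prefix that no earlier run has used."""
    return target not in existing_prefixes
```

Because earlier run prefixes stay in place, the audit trail is simply the set of run_id directories that exist for a partition.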

Incident Response

Quarantine Review Workflow - Human review and retry process for quarantined data
- When to use: Validation failures requiring human review, quarantine rate spikes
- Process: Review → Identify root cause → Correct source/ETL → Retry → Promote or Condemn
- See also: Runtime Scenarios - Quarantine Retry, ETL Flow - Error Handling
Failure Recovery - Recovering from ETL run failures and deployment failures
- When to use: ETL job failures, Step Functions execution failures, deployment smoke test failures
- Process: Identify failure → Review execution history → Fix root cause → Retry or rollback
- See also: Data Lake Architecture - Failure Mode Analysis, CI/CD Workflow - Failure Handling
Circuit Breaker Triggered - Responding when circuit breaker halts pipeline
- When to use: More than 100 occurrences of the same error within one hour, which halts the pipeline automatically
- Process: Investigate error pattern → Fix root cause → Clear circuit breaker → Restart pipeline
- See also: Audit & Notifications, ETL Flow - Circuit Breaker
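The circuit breaker threshold above (more than 100 identical errors per hour) can be sketched as a sliding-window counter keyed by error type. This is an illustrative model only: the class name `ErrorCircuitBreaker`, the `error_key` grouping, and the in-memory storage are assumptions, and the real pipeline may track errors differently.

```python
from collections import defaultdict, deque


class ErrorCircuitBreaker:
    """Trips when one error key occurs more than `threshold` times per window."""

    def __init__(self, threshold: int = 100, window_seconds: float = 3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # error_key -> timestamps in window
        self.open = False  # True means the pipeline should stay halted

    def record(self, error_key: str, now: float) -> bool:
        """Record one error at time `now` (seconds); return True if halted."""
        events = self._events[error_key]
        events.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) > self.threshold:
            self.open = True
        return self.open

    def clear(self) -> None:
        """Manual reset, performed only after the root cause has been fixed."""
        self._events.clear()
        self.open = False
```

Counting per error key rather than in total means a mix of unrelated errors does not halt the pipeline, while a single repeating fault does.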

Troubleshooting

Common Failure Scenarios - Analysis of what breaks if critical components are removed
- Scenarios: Missing _SUCCESS marker, missing _LATEST.json, missing Glue Data Catalog, removed quarantine layer
- Impact: Incomplete queries, promotion failures, query failures, audit trail loss
- See also: Data Lake Architecture - Failure Mode Analysis
Query Performance Issues - Troubleshooting slow Athena queries
- Common causes: Missing WHERE clause (full table scan), partition pruning not working, large partition sizes
- Solutions: Add partition filters, verify partition metadata, optimize Parquet file sizes
- See also: SQL Breakdown - Partition Pruning, Tooling & Controls - Athena
Deployment Failures - Troubleshooting CI/CD pipeline failures
- Common causes: Terraform state conflicts, IAM permission errors, smoke test failures
- Solutions: Check Terraform state, verify IAM policies, review smoke test logs
- See also: CI/CD Workflow - Smoke Tests, Rollback Playbook
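For the missing `_SUCCESS` marker and missing `_LATEST.json` scenarios listed under Common Failure Scenarios, a quick check can confirm whether a partition points at a complete run. The sketch below works on a local directory for simplicity and rests on two assumptions: a run counts as complete only once `_SUCCESS` exists in its directory, and `_LATEST.json` is a small pointer file with a `run_id` field. The actual file contents and layout may differ.

```python
import json
from pathlib import Path
from typing import Optional


def run_is_complete(run_dir: Path) -> bool:
    """A run directory is treated as complete only if its _SUCCESS marker exists."""
    return (run_dir / "_SUCCESS").is_file()


def resolve_latest_run(partition_dir: Path) -> Optional[Path]:
    """Follow _LATEST.json to a run directory; return None if anything is missing."""
    pointer = partition_dir / "_LATEST.json"
    if not pointer.is_file():
        return None  # no pointer: promotion cannot identify the current run
    run_id = json.loads(pointer.read_text()).get("run_id")  # assumed field name
    if not run_id:
        return None
    run_dir = partition_dir / f"run_id={run_id}"  # assumed directory naming
    return run_dir if run_is_complete(run_dir) else None
```

A `None` result indicates one of the failure scenarios above and is a cue to open the relevant runbook rather than query the partition.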

Runbook Template

For future runbooks, use this standard template:

Runbook: [Title]

Use Case: [When to use this runbook]
Safety: [Safety guarantees or precautions]

Symptoms

[Observable symptoms that indicate this scenario]

Root Causes

[Common root causes]

Detection

[How to detect this scenario (alerts, logs, metrics)]

Mitigation Steps

[Step 1]
[Step 2]
[Step 3]

Verification

[How to verify the issue is resolved]

Prevention

[How to prevent this scenario in the future]

Operational Procedures by Category

Data Quality

Quarantine Review: Quarantine Review Workflow - Human review and retry process
Circuit Breaker: Circuit Breaker Response - Pipeline halt recovery
Validation Failures: ETL Flow - Error Handling - Understanding validation errors

Data Operations

Backfills: Backfill Playbook - Reprocessing historical data
Schema Evolution: Schema Evolution Strategy - Deploying schema changes
Promotion: ETL Flow - Promotion Gate - Promoting Silver to production

Infrastructure Operations

Rollback: Rollback Playbook - Infrastructure rollback procedures
Deployment: CI/CD Workflow - Deployment automation
Monitoring: Audit & Notifications - CloudWatch and alerting

Troubleshooting

Query Performance: SQL Breakdown - Partition Pruning - Query optimization
ETL Failures: Runtime Scenarios - Failure Recovery - ETL failure recovery
Common Failures: Data Lake Architecture - Failure Mode Analysis - Failure scenario analysis
