CI/CD Workflow
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document is the main CI/CD deliverable for the case study: workflow design, failure handling, Step Functions orchestration, IAM, monitoring, and governance. For implementation artifacts (YAML, Terraform), see CI/CD Artifacts and CI/CD Complete Reference.
Failure Scenarios
Critical Rule: Failed runs never update _LATEST.json or the current/ prefix.
Failure Types:
- ETL Job Failure: Non-zero exit, no _SUCCESS, no data written → Alert triggers, safe rerun
- Partial Write: Job crashes mid-execution → Partial files ignored, new run_id on rerun
- Validation Failure: Quarantine rate > threshold → Data Quality Team reviews, fixes source, reruns
- Circuit Breaker: More than 100 identical errors per hour → Pipeline halts, Platform Team investigates
- Critical errors: Business duplicate or circuit-breaker in quarantine → Promotion blocked; review and rerun (promotion gate does not check schema drift)
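The circuit-breaker rule above can be sketched as a sliding-window counter. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the class name `CircuitBreaker`, the `record` method, and the in-memory storage are hypothetical. Only the threshold (more than 100 identical errors within one hour) comes from the list above.

```python
from collections import defaultdict, deque


class CircuitBreaker:
    """Hypothetical sliding-window breaker: trips when one error type
    occurs more than `threshold` times within `window_seconds`."""

    def __init__(self, threshold=100, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # error_type -> timestamps (seconds)

    def record(self, error_type, now):
        """Record one error at time `now`; return True if the breaker trips."""
        events = self._events[error_type]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events) > self.threshold
```

Counting per error type keeps a burst of unrelated errors from halting the pipeline; only a repeated, identical failure trips the breaker.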
Safe Rerun: Each rerun uses a new run_id; failed runs are preserved for audit; only successful runs are promoted.
Promotion Workflow: ETL writes to isolated run_id path → _SUCCESS marker → CloudWatch alarm → Human review (Domain Analyst + Platform Team) → Approval → Promote to production.
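The promotion gate can be sketched as follows, using a local directory as a stand-in for the S3 bucket so the example runs anywhere. The `runs/<run_id>/` layout, the pointer's JSON shape, the UUID-based run_id format, and the function names are all assumptions made for illustration. The behavior shown is the one described above: the pointer moves only for a run that has a _SUCCESS marker and has been approved.

```python
import json
import uuid
from pathlib import Path


def new_run_id():
    """Return a fresh, unique run identifier (the format is an assumption)."""
    return uuid.uuid4().hex


def promote(dataset_root, run_id, approved):
    """Point _LATEST.json at run_id only if the run succeeded and was approved.

    Returns True when the pointer was updated, False otherwise.
    """
    root = Path(dataset_root)
    run_dir = root / "runs" / run_id  # hypothetical isolated run path
    if not (run_dir / "_SUCCESS").exists():
        return False  # failed or partial run: pointer stays untouched
    if not approved:
        return False  # human review has not signed off yet
    pointer = root / "_LATEST.json"
    tmp = pointer.with_name(pointer.name + ".tmp")
    tmp.write_text(json.dumps({"run_id": run_id}))
    tmp.replace(pointer)  # swap in the new pointer in one step (local filesystem)
    return True
```

Because a failed run never reaches the pointer write, its files remain under its own run_id for audit and a rerun simply starts from `new_run_id()`.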
Step Functions State Machine
See Infrastructure Details below for orchestration behavior.
Failure Handling
See Failure Scenarios above for failure types and safe rerun behavior.
Rollback Procedures
Rollback is achieved by not updating _LATEST.json or current/ on failure; only successful runs are promoted. See Failure Scenarios and Governance Details.
Smoke Tests
Smoke tests (e.g. post-deploy validation of S3, Glue, Step Functions) are documented in the CI/CD Testing Guide.
Safety-First Pipeline Design
CI authenticates via OIDC; CD runs only after CI succeeds; production is guarded by a staging pointer and a manual approval gate. See Governance Details and IAM Security.
Automated Rollback
Failed runs never update production pointers; only successful runs are promoted. See Failure Scenarios and Rollback Procedures above.
Infrastructure Details
Step Functions Orchestration:
- RunETL State: Invokes Glue job synchronously, auto-retries (≤3 attempts, exponential backoff)
- ValidateOutput State: Checks _SUCCESS marker, retries on eventual consistency
- Error Handling: Catches failures, publishes CloudWatch metrics, logs execution details
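The orchestration described above might look like the following Amazon States Language definition, built here as a Python dictionary. This is a sketch, not the deployed state machine: the Glue job name, the Lambda ARN, the `SuccessMarkerNotFound` error name, the retry intervals, and the `HandleFailure` state are placeholders or assumptions. Note that `MaxAttempts` in a Retry block counts retries, so 3 allows up to three retries after the first attempt.

```python
import json

# Placeholder names; substitute real resources.
GLUE_JOB_NAME = "example-etl-job"
VALIDATE_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate-output"

state_machine = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": GLUE_JOB_NAME},
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,  # assumed initial delay
                "MaxAttempts": 3,
                "BackoffRate": 2.0,     # delay doubles on each retry
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "ValidateOutput",
        },
        "ValidateOutput": {
            "Type": "Task",
            "Resource": VALIDATE_LAMBDA_ARN,
            "Retry": [{
                # Hypothetical error raised when _SUCCESS is not yet visible.
                "ErrorEquals": ["SuccessMarkerNotFound"],
                "IntervalSeconds": 10,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        # Metric publishing and logging are omitted from this sketch.
        "HandleFailure": {
            "Type": "Fail",
            "Error": "ETLFailed",
            "Cause": "ETL run or output validation failed",
        },
    },
}

definition_json = json.dumps(state_machine, indent=2)
```

The resulting `definition_json` string is the form a state machine definition takes when passed to Terraform or the Step Functions API.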
IAM Prefix-Scoped Permissions (bucket names: ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine):
- ETL Job: Bronze bucket (read), Silver bucket (write), Quarantine bucket (write)
- Platform Team: Bronze, Silver, Gold, Quarantine (read/write)
- Domain Teams: Silver bucket {domain}/* (write), Gold bucket {domain}/* (read)
- Business/Analysts: Gold bucket (read-only via Athena)
- Compliance: Bronze, Quarantine (read-only for audit)
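As an illustration of the prefix scoping above, the following sketch builds IAM policy documents for the ETL job and for a domain team. The helper names and the example domain are hypothetical, and the action lists are a simplification: a real policy would likely also need bucket-level permissions such as `s3:ListBucket`, which are omitted here.

```python
import json


def s3_statement(actions, bucket, prefix=""):
    """Build one Allow statement for objects under bucket/prefix."""
    return {
        "Effect": "Allow",
        "Action": actions,
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }


def etl_job_policy():
    """ETL job: read Bronze, write Silver and Quarantine."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            s3_statement(["s3:GetObject"], "ohpen-bronze"),
            s3_statement(["s3:PutObject"], "ohpen-silver"),
            s3_statement(["s3:PutObject"], "ohpen-quarantine"),
        ],
    }


def domain_team_policy(domain):
    """Domain team: write Silver and read Gold, only under its own prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            s3_statement(["s3:PutObject"], "ohpen-silver", f"{domain}/"),
            s3_statement(["s3:GetObject"], "ohpen-gold", f"{domain}/"),
        ],
    }
```

For example, `json.dumps(domain_team_policy("payments"), indent=2)` renders a policy whose Silver resource is `arn:aws:s3:::ohpen-silver/payments/*` (the domain name `payments` is made up for this example).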
Monitoring Details
Volume Metrics: run_id, input_rows, valid_rows_count, quarantined_rows_count, condemned_rows_count
Quality Metrics: quarantine_rate, validation_failure_rate, error_type_distribution
Loop Prevention: avg_attempt_count, duplicate_detection_rate, auto_condemnation_rate, circuit_breaker_triggers
Performance: rows_processed_per_run, duration_seconds, missing_partitions, runtime_anomalies
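The volume metrics can be carried in a small record from which a rate is derived. The sketch below assumes that quarantine_rate is quarantined_rows_count divided by input_rows; the source names the metric but does not define it, so treat the formula and the `RunMetrics` class as illustrative.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Per-run volume metrics, using the field names listed above."""
    run_id: str
    input_rows: int
    valid_rows_count: int
    quarantined_rows_count: int
    condemned_rows_count: int

    @property
    def quarantine_rate(self):
        """Assumed definition: quarantined rows as a fraction of input rows."""
        if self.input_rows == 0:
            return 0.0  # avoid division by zero on empty runs
        return self.quarantined_rows_count / self.input_rows
```

A run with 1,000 input rows and 50 quarantined rows would report a quarantine_rate of 0.05 under this definition.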
Alert Ownership:
- P1 (Immediate): Job failures, infrastructure errors, circuit breaker, SLA breaches → Data Platform Team
- P2 (2-4 hours): Quarantine spikes, validation failures, high attempt counts → Data Quality Team
- P3 (8 hours): Volume anomalies → Domain Teams
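The ownership list above amounts to a lookup from alert type to priority and owning team. A minimal sketch follows; the alert-type keys and the `route_alert` function are hypothetical names, while the priorities and teams mirror the list.

```python
# Keys are made-up identifiers; priorities and owners follow the list above.
ALERT_ROUTES = {
    "job_failure": ("P1", "Data Platform Team"),
    "infrastructure_error": ("P1", "Data Platform Team"),
    "circuit_breaker": ("P1", "Data Platform Team"),
    "sla_breach": ("P1", "Data Platform Team"),
    "quarantine_spike": ("P2", "Data Quality Team"),
    "validation_failure": ("P2", "Data Quality Team"),
    "high_attempt_count": ("P2", "Data Quality Team"),
    "volume_anomaly": ("P3", "Domain Teams"),
}


def route_alert(alert_type):
    """Return (priority, owner) for an alert type; reject unknown types."""
    try:
        return ALERT_ROUTES[alert_type]
    except KeyError:
        raise ValueError(f"unknown alert type: {alert_type!r}") from None
```

Rejecting unknown alert types, rather than silently defaulting, is a design choice in this sketch so that an unrouted alert is noticed early.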
Governance Details
Ownership Matrix (abbreviated):
- Pipeline/CI/CD/Infrastructure: Data Platform Team
- Validation Rules: Domain Teams (Silver) / Business (Gold)
- Data Quality: Data Quality Team
- Schema: Domain Teams (Silver) / Business (Gold) approve; Platform implements
- Backfill: Platform executes; Domain/Business approves
Governance Workflows:
- Schema Change: Request → Layer-based review (Domain/Business) → Platform feasibility → Approval → Implementation → Versioning → Validation → Promotion
- Quality Issue: Alert → Data Quality triage → Source/Validation/Platform issue → Fix → Backfill approval → Reprocess → Validate → Promote
- Backfill: Request → Layer-based approval → Platform assessment → Schedule → Execute → Validate → Promote
Key Rules:
- Infrastructure changes via Terraform IaC and CI/CD only
- Failed runs never update _LATEST.json or current/
- Run isolation via run_id mandatory
- Human approval required for Silver promotion and condemned data deletion
- Quarantine rate thresholds configurable per dataset (default: 5%)
- Schema changes versioned via schema_v for backward compatibility
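The per-dataset quarantine threshold rule can be sketched as a lookup with a 5% default. The override table, the dataset names, and the function names below are hypothetical; only the default value and the "rate greater than threshold" comparison come from this document.

```python
DEFAULT_QUARANTINE_THRESHOLD = 0.05  # 5% default from the rule above

# Hypothetical per-dataset overrides.
THRESHOLD_OVERRIDES = {"transactions": 0.02}


def quarantine_threshold(dataset):
    """Return the configured threshold for a dataset, or the default."""
    return THRESHOLD_OVERRIDES.get(dataset, DEFAULT_QUARANTINE_THRESHOLD)


def needs_quality_review(dataset, quarantine_rate):
    """True when the run's quarantine rate exceeds the dataset's threshold."""
    return quarantine_rate > quarantine_threshold(dataset)
```

With these example values, a 3% quarantine rate would trigger review for `transactions` but not for a dataset that falls back to the default.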
See also
- CI/CD Artifacts - Workflow YAML and Terraform
- CI/CD Infrastructure Design - Complete CI/CD workflow and infrastructure details
- CI/CD Complete Reference - Testing guides and workflow details
- CI/CD Testing Guide - Testing strategies and local development workflows
- Architecture Boundaries - CI/CD assumptions and edge cases