Deployment & Orchestration Boundaries
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document captures the explicit assumptions and edge cases considered for the CI/CD workflow and infrastructure deployment.
CI/CD Workflow Assumptions
Deployment Frequency
- Scheduled Runs: ETL pipeline runs daily at 2 AM UTC (EventBridge cron schedule)
- Deployment Cadence: Assumes regular but not continuous deployments (PR-based workflow)
- Backfill Support: Infrastructure supports on-demand backfills with versioned artifacts
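As a rough illustration of the daily schedule above: a 2 AM UTC run would typically be written in EventBridge as the cron expression shown below. The helper `next_scheduled_run` is a hypothetical name, not part of the pipeline. It only sketches how the next trigger time follows from that schedule.

```python
from datetime import datetime, time, timedelta, timezone

# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
# Assumed expression for "daily at 2 AM UTC"; the real rule may differ.
DAILY_2AM_UTC = "cron(0 2 * * ? *)"


def next_scheduled_run(now: datetime) -> datetime:
    """Return the next 02:00 UTC instant strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(2, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's run has already fired, so the next one is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

A backfill does not use this schedule; it is started on demand with an explicit artifact version.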
Workflow Stages
- Validation (CI): PR-triggered linting (ruff) and unit tests (pytest)
- Artifact Build: ETL code packaged and tagged with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip)
- Deployment (CD): Upload to S3, Terraform plan/apply, update Glue Job configuration
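The artifact naming convention from the build stage can be sketched as a small helper. The function name `artifact_name` and the validation rules are assumptions for illustration; only the resulting file name pattern comes from the example above.

```python
import re


def artifact_name(version: str, git_sha: str) -> str:
    """Build an artifact file name such as etl-v1.0.0-a1b2c3d.zip."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"expected a hex Git SHA, got {git_sha!r}")
    # Use the 7-character short SHA, matching the example file name.
    return f"etl-v{version}-{git_sha[:7]}.zip"
```

Because the SHA is part of the name, two builds of different commits never overwrite each other in the artifacts bucket.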
Safety Features
- Determinism: Same input → same output (idempotent ETL)
- Partitioning: Correct year=YYYY/month=MM mapping enforced
- Quarantine: Invalid rows preserved (never dropped)
- Failure Handling: Failed runs never update _LATEST.json or the current/ prefix
- Human Approval: Required before promoting Silver layer data to production
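A minimal sketch of the partitioning and quarantine rules listed above. The helper names `partition_prefix` and `split_rows` and the `is_valid` callback are hypothetical; the actual ETL code may structure this differently.

```python
from datetime import date


def partition_prefix(d: date) -> str:
    """Map a date to its partition prefix, e.g. year=2026/month=03."""
    return f"year={d.year:04d}/month={d.month:02d}"


def split_rows(rows, is_valid):
    """Route each row to either the valid or the quarantine list.

    Every input row ends up in exactly one list, so nothing is dropped.
    """
    valid, quarantined = [], []
    for row in rows:
        (valid if is_valid(row) else quarantined).append(row)
    return valid, quarantined
```

Both functions are pure, which supports the determinism assumption: the same input always yields the same prefix and the same split.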
Infrastructure Assumptions
AWS Services
- S3: Object storage for data lake layers in medallion layout: Bronze = ohpen-bronze, Silver = ohpen-silver, Gold = ohpen-gold, Quarantine = ohpen-quarantine, Artifacts = ohpen-artifacts (build outputs, not data)
- Glue: ETL execution engine (Python Shell for development, PySpark for production)
- Athena: SQL query engine for analytics
- Step Functions: ETL orchestration workflow
- EventBridge: Scheduled trigger for daily ETL runs
- Terraform: Infrastructure as Code (IaC) for provisioning
Infrastructure Capacity
- Glue Workers: 2 G.1X workers (1 DPU each, 2 DPUs total) assumed sufficient for current volume
- Auto-Scaling: Assumes Glue Auto Scaling is enabled for the job, so workers scale up to the configured maximum without manual capacity planning
- Storage: S3 scales to exabytes (no practical limit)
- Cost Optimization: Lifecycle policies, partition pruning, Parquet compression
Multi-Environment Support
- Environment Isolation: Assumes separate AWS accounts or resource tagging for dev/staging/prod
- Terraform State: Assumes remote state management (S3 backend) for team collaboration
- Secrets Management: Assumes AWS Secrets Manager or Parameter Store for sensitive configuration
Operational Assumptions
Deployment Process
- Terraform Plan: Always run before apply (prevents unexpected infrastructure changes)
- Rollback Strategy: Failed deployments do not corrupt previous infrastructure state
- Version Control: All infrastructure changes tracked in Git (Terraform files)
- Artifact Versioning: Git SHA tags enable reproducible deployments and safe rollbacks
Monitoring & Alerting
- CloudWatch Metrics: ETL job metrics (input_rows, valid_rows, quarantined_rows, duration)
- Alerts: Job failure, quarantine rate spike (>5% threshold), volume anomalies
- Logging: Structured logs to CloudWatch Logs for debugging and audit
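The quarantine rate alert can be illustrated with the metrics named above. This is a sketch under the assumption that the rate is quarantined_rows divided by input_rows; the function names are hypothetical.

```python
QUARANTINE_RATE_THRESHOLD = 0.05  # alert when more than 5% of rows are quarantined


def quarantine_rate(input_rows: int, quarantined_rows: int) -> float:
    """Fraction of input rows that were quarantined (0.0 for an empty run)."""
    if input_rows == 0:
        return 0.0
    return quarantined_rows / input_rows


def should_alert(input_rows: int, quarantined_rows: int,
                 threshold: float = QUARANTINE_RATE_THRESHOLD) -> bool:
    """Return True when the quarantine rate strictly exceeds the threshold."""
    return quarantine_rate(input_rows, quarantined_rows) > threshold
```

An empty run returns a rate of 0.0 here and is left to the separate volume anomaly alert.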
Failure Recovery
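- ETL Job Failure: Rerun with new run_id (safe, no data loss)
- Infrastructure Failure: Handled by AWS managed services (multi-AZ; S3 Standard is designed for 99.99% availability)
- Deployment Failure: Terraform rollback (re-apply of the previous Git-tracked configuration) or manual intervention required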
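The rerun behaviour can be sketched as follows: each attempt writes under a fresh run_id, and the _LATEST.json pointer is only replaced after the job succeeds. This sketch uses a local directory as a stand-in for the S3 prefix. The `run_id=<id>` directory layout, the pointer contents, and the name `execute_run` are assumptions.

```python
import json
import uuid
from pathlib import Path


def execute_run(root: Path, job) -> str:
    """Run `job` into a fresh run directory; publish the pointer only on success."""
    run_id = uuid.uuid4().hex
    run_dir = root / f"run_id={run_id}"
    run_dir.mkdir(parents=True)

    # If the job raises, the exception propagates before the pointer is touched,
    # so the previously published run stays current.
    job(run_dir)

    # Write to a temporary file, then rename it over the pointer in one step.
    tmp = root / "_LATEST.json.tmp"
    tmp.write_text(json.dumps({"run_id": run_id}))
    tmp.replace(root / "_LATEST.json")
    return run_id
```

On S3 there is no rename; a single PutObject of the pointer object plays the same role, since readers see either the old or the new object.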
Edge Cases & Failure Scenarios
CI/CD Pipeline Failures
- Lint Failures: PR blocked until code quality issues resolved
- Test Failures: PR blocked until tests pass
- Terraform Plan Failures: Deployment blocked if terraform plan fails or reports unexpected infrastructure changes
- Artifact Build Failures: Deployment blocked if packaging fails
Deployment Edge Cases
- Concurrent Deployments: Multiple PRs deploying simultaneously may cause conflicts (mitigated by PR approval workflow)
- Infrastructure Drift: Manual changes to infrastructure may cause Terraform plan failures
- State Lock: Concurrent Terraform operations may cause state lock conflicts
Runtime Edge Cases
- ETL Job Timeout: Long-running jobs may exceed the configured Glue job timeout (requires more workers or a longer timeout)
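- S3 Throttling: High-volume writes may hit S3 request rate limits (3,500 PUT/COPY/POST/DELETE requests per second per prefix) and receive 503 Slow Down responses (requires retry logic with backoff)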
- Glue Catalog Drift: Manual table schema changes may break Athena queries
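The retry logic mentioned for S3 throttling is commonly implemented as exponential backoff with jitter. The sketch below is generic: `ThrottledError` is a hypothetical stand-in for whatever exception the S3 client raises on throttling, and the default attempt count and delay are illustrative values, not tuned settings.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as 503 Slow Down."""


def with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ThrottledError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The `sleep` parameter exists so tests can inject a fake and run without real delays. AWS SDKs also ship their own retry handling, which may make a wrapper like this unnecessary.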
Rollback Scenarios
- Failed ETL Run: New run_id written, previous run remains intact (no rollback needed)
- Infrastructure Rollback: Re-applying the previous Git-tracked Terraform configuration restores the previous infrastructure version
- Artifact Rollback: Previous artifact versions available in S3 for rollback
Security Assumptions
Access Control
- IAM Roles: Least-privilege access for Glue, Step Functions, EventBridge
- S3 Bucket Policies: No public access, encryption at rest (AES256)
- Secrets: Sensitive configuration stored in AWS Secrets Manager or Parameter Store
Compliance
- Data Retention: Perpetual retention for financial/audit data; deletion only with legal/compliance approval
- Audit Trail: CloudTrail logs all infrastructure changes and data access
- Encryption: S3 encryption at rest, TLS in transit
See also
- CI/CD Workflow - Complete workflow design and rationale
- Data Lake Architecture - Architecture that this CI/CD deploys
- ETL Flow - Application code being deployed
- IAM Security Design - OIDC and least-privilege policies
- Runtime Scenarios - Deployment and rollback scenarios