Deployment & Orchestration Boundaries

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document captures the explicit assumptions and edge cases considered for the CI/CD workflow and infrastructure deployment.


CI/CD Workflow Assumptions

Deployment Frequency

  • Scheduled Runs: ETL pipeline runs daily at 2 AM UTC (EventBridge cron schedule)
  • Deployment Cadence: Assumes regular but not continuous deployments (PR-based workflow)
  • Backfill Support: Infrastructure supports on-demand backfills with versioned artifacts
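
As a rough illustration of the daily schedule, the sketch below pairs an EventBridge cron expression for 02:00 UTC with a helper that computes the next run time. The names DAILY_SCHEDULE and next_scheduled_run are hypothetical and not taken from the project.

```python
from datetime import datetime, time, timedelta, timezone

# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
# This expression fires every day at 02:00 UTC.
DAILY_SCHEDULE = "cron(0 2 * * ? *)"


def next_scheduled_run(now: datetime) -> datetime:
    """Return the next 02:00 UTC occurrence strictly after `now`.

    `now` is expected to be timezone-aware.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(2, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's run has already started or passed, so use tomorrow's slot.
        candidate += timedelta(days=1)
    return candidate
```

A helper like this could be useful when deciding whether an on-demand backfill would overlap with the next scheduled run.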

Workflow Stages

  • Validation (CI): PR-triggered linting (ruff) and unit tests (pytest)
  • Artifact Build: ETL code packaged and tagged with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip)
  • Deployment (CD): Upload to S3, Terraform plan/apply, update Glue Job configuration
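
The artifact naming convention can be sketched as follows. The pattern is inferred only from the example name etl-v1.0.0-a1b2c3d.zip; the function names and the assumption of a seven-character short SHA are illustrative, not the project's actual build script.

```python
import re

# Hypothetical pattern inferred from the example "etl-v1.0.0-a1b2c3d.zip".
ARTIFACT_RE = re.compile(r"^etl-v(?P<version>\d+\.\d+\.\d+)-(?P<sha>[0-9a-f]{7})\.zip$")


def artifact_name(version: str, git_sha: str) -> str:
    """Build the artifact file name from a semantic version and a Git SHA."""
    short_sha = git_sha[:7].lower()
    name = f"etl-v{version}-{short_sha}.zip"
    if not ARTIFACT_RE.match(name):
        raise ValueError(f"invalid version or SHA: {version!r}, {git_sha!r}")
    return name


def parse_artifact_name(name: str) -> tuple[str, str]:
    """Return (version, short_sha) for a well-formed artifact name."""
    match = ARTIFACT_RE.match(name)
    if match is None:
        raise ValueError(f"not an ETL artifact name: {name!r}")
    return match.group("version"), match.group("sha")
```

Validating the name at build time means a malformed tag fails the Artifact Build stage rather than surfacing later during deployment.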

Safety Features

  • Determinism: Same input → same output (idempotent ETL)
  • Partitioning: Correct year=YYYY/month=MM mapping enforced
  • Quarantine: Invalid rows preserved (never dropped)
  • Failure Handling: Failed runs never update _LATEST.json or current/ prefix
  • Human Approval: Required before promoting Silver layer data to production
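
The partitioning and quarantine rules above can be sketched in a few lines. The row shape (a "date" and an "amount" field) and the validation rules are assumptions made for illustration; the real ETL schema is not described in this document.

```python
from datetime import date, datetime


def partition_prefix(day: date) -> str:
    """Map a date to the year=YYYY/month=MM partition path."""
    return f"year={day.year:04d}/month={day.month:02d}"


def split_rows(rows: list[dict]) -> tuple[dict[str, list[dict]], list[dict]]:
    """Route valid rows to their partition and keep invalid rows in quarantine."""
    valid: dict[str, list[dict]] = {}
    quarantined: list[dict] = []
    for row in rows:
        try:
            # Hypothetical validation: an ISO date and a numeric amount.
            day = datetime.strptime(row["date"], "%Y-%m-%d").date()
            float(row["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            # Invalid rows are preserved together with the reason, never dropped.
            quarantined.append({**row, "_error": repr(exc)})
            continue
        valid.setdefault(partition_prefix(day), []).append(row)
    return valid, quarantined
```

Because the function has no hidden state, the same input always produces the same split, and the count of valid plus quarantined rows always equals the input count.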

Infrastructure Assumptions

AWS Services

  • S3: Object storage for data lake layers in medallion layout: Bronze = ohpen-bronze, Silver = ohpen-silver, Gold = ohpen-gold, Quarantine = ohpen-quarantine, Artifacts = ohpen-artifacts (build outputs, not data).
  • Glue: ETL execution engine (Python Shell for development, PySpark for production)
  • Athena: SQL query engine for analytics
  • Step Functions: ETL orchestration workflow
  • EventBridge: Scheduled trigger for daily ETL runs
  • Terraform: Infrastructure as Code (IaC) for provisioning

Infrastructure Capacity

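  • Glue Workers: 2 G.1X workers (1 DPU each, 2 DPUs total) assumed sufficient for current volume
  • Auto-Scaling: Assumes Glue Auto Scaling is enabled for the production PySpark job (available on Glue 3.0 and later), so workers scale up to a configured maximum; Python Shell jobs run at a fixed capacity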
  • Storage: S3 scales to exabytes (no practical limit)
  • Cost Optimization: Lifecycle policies, partition pruning, Parquet compression

Multi-Environment Support

  • Environment Isolation: Assumes separate AWS accounts or resource tagging for dev/staging/prod
  • Terraform State: Assumes remote state management (S3 backend) for team collaboration
  • Secrets Management: Assumes AWS Secrets Manager or Parameter Store for sensitive configuration

Operational Assumptions

Deployment Process

  • Terraform Plan: Always run before apply (prevents unexpected infrastructure changes)
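  • Rollback Strategy: A failed deployment must not corrupt the previous infrastructure state; Terraform does not roll back automatically, so recovery means re-applying the last known-good configuration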
  • Version Control: All infrastructure changes tracked in Git (Terraform files)
  • Artifact Versioning: Git SHA tags enable reproducible deployments and safe rollbacks

Monitoring & Alerting

  • CloudWatch Metrics: ETL job metrics (input_rows, valid_rows, quarantined_rows, duration)
  • Alerts: Job failure, quarantine rate spike (>5% threshold), volume anomalies
  • Logging: Structured logs to CloudWatch Logs for debugging and audit
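
A minimal sketch of the quarantine rate check, using the 5% threshold stated above. The function names are hypothetical, and treating an empty input as a 0% rate is an assumption; a real alert would more likely be a CloudWatch alarm on the published metrics.

```python
QUARANTINE_RATE_THRESHOLD = 0.05  # the 5% alert threshold stated above


def quarantine_rate(input_rows: int, quarantined_rows: int) -> float:
    """Fraction of input rows that ended up in quarantine."""
    if input_rows < 0 or quarantined_rows < 0 or quarantined_rows > input_rows:
        raise ValueError("row counts are inconsistent")
    # Assumption: an empty input counts as a 0% quarantine rate.
    return quarantined_rows / input_rows if input_rows else 0.0


def should_alert(input_rows: int, quarantined_rows: int) -> bool:
    """True when the quarantine rate is strictly above the threshold."""
    return quarantine_rate(input_rows, quarantined_rows) > QUARANTINE_RATE_THRESHOLD
```

An empty input is better caught by the separate volume anomaly alert than by the quarantine rate.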

Failure Recovery

  • ETL Job Failure: Rerun with new run_id (safe, no data loss)
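  • Infrastructure Failure: Relies on the resilience of AWS managed services (multi-AZ; S3 Standard is designed for 99.99% availability, and SLA commitments vary by service)
  • Deployment Failure: Terraform has no automatic rollback; re-apply the previous Git revision or intervene manually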
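
The rerun behaviour can be illustrated with a small sketch in which every attempt writes under a fresh run_id and the _LATEST.json pointer moves only after the job succeeds. A local directory stands in for the S3 prefix here, and the run_id format, directory layout, and function names are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def new_run_id() -> str:
    """Hypothetical run_id format: UTC timestamp plus a short random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def run_and_publish(root: Path, job: Callable[[Path], None]) -> str:
    """Run `job` in a fresh run directory; update _LATEST.json only on success."""
    run_id = new_run_id()
    run_dir = root / "runs" / f"run_id={run_id}"
    run_dir.mkdir(parents=True)
    job(run_dir)  # an exception here leaves _LATEST.json untouched
    pointer = root / "_LATEST.json"
    tmp = root / "_LATEST.json.tmp"
    tmp.write_text(json.dumps({"run_id": run_id}))
    tmp.replace(pointer)  # swap the pointer in a single rename
    return run_id
```

On S3 the pointer update would be a single object write rather than a rename, but the ordering is the same: data first, pointer last.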

Edge Cases & Failure Scenarios

CI/CD Pipeline Failures

  • Lint Failures: PR blocked until code quality issues resolved
  • Test Failures: PR blocked until tests pass
  • Terraform Plan Failures: Deployment blocked if terraform plan errors out or reports unexpected infrastructure changes
  • Artifact Build Failures: Deployment blocked if packaging fails

Deployment Edge Cases

  • Concurrent Deployments: Multiple PRs deploying simultaneously may cause conflicts (mitigated by PR approval workflow)
  • Infrastructure Drift: Manual changes to infrastructure may cause unexpected Terraform plan diffs or apply failures
  • State Lock: Concurrent Terraform operations may cause state lock conflicts

Runtime Edge Cases

  • ETL Job Timeout: Long-running jobs may exceed the configured Glue job timeout (requires raising the timeout or scaling workers)
  • S3 Throttling: High-volume writes may hit S3 per-prefix request rate limits and receive 503 Slow Down responses (requires retry logic with backoff)
  • Glue Catalog Drift: Manual table schema changes may break Athena queries
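
The retry logic mentioned for S3 throttling is commonly implemented as capped exponential backoff with jitter. The sketch below is generic and uses a hypothetical ThrottledError in place of a real SDK exception; the defaults are illustrative, not tuned values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for an S3 503 Slow Down response."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ThrottledError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # The delay ceiling doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

AWS SDKs generally ship their own retry handling, so a wrapper like this would mainly apply to higher-level steps that the SDK does not retry on its own.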

Rollback Scenarios

  • Failed ETL Run: New run_id written, previous run remains intact (no rollback needed)
  • Infrastructure Rollback: Re-applying a previous Git revision of the Terraform configuration against the tracked state restores the earlier infrastructure version
  • Artifact Rollback: Previous artifact versions available in S3 for rollback
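
Choosing an artifact to roll back to could look like the sketch below. It assumes each artifact is known by its key and upload time (for example from a bucket listing); the Artifact type and rollback_target function are hypothetical, and only the first key in the usage note comes from this document.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Artifact(NamedTuple):
    key: str               # e.g. "etl-v1.0.0-a1b2c3d.zip"
    uploaded_at: datetime  # hypothetical: taken from the object's upload time


def rollback_target(artifacts: list[Artifact], current_key: str) -> Optional[Artifact]:
    """Return the newest artifact uploaded before the current one, if any."""
    current = next((a for a in artifacts if a.key == current_key), None)
    if current is None:
        raise ValueError(f"current artifact not found: {current_key!r}")
    older = [a for a in artifacts if a.uploaded_at < current.uploaded_at]
    return max(older, key=lambda a: a.uploaded_at, default=None)
```

With etl-v1.0.0-a1b2c3d.zip uploaded first and a hypothetical etl-v1.0.1-b2c3d4e.zip uploaded afterwards, rolling back from the second returns the first, and rolling back from the first returns None.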

Security Assumptions

Access Control

  • IAM Roles: Least-privilege access for Glue, Step Functions, EventBridge
  • S3 Bucket Policies: No public access, encryption at rest (SSE-S3, AES-256)
  • Secrets: Sensitive configuration stored in AWS Secrets Manager or Parameter Store

Compliance

  • Data Retention: Perpetual retention for financial/audit data; deletion only with legal/compliance approval
  • Audit Trail: CloudTrail logs infrastructure changes (management events); logging of S3 data access assumes data events are enabled for the relevant buckets
  • Encryption: S3 encryption at rest, TLS in transit

© 2026 Stephen Adei. CC BY 4.0