© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
AWS Services Analysis & Opportunities
Based on the Ohpen Data Engineer job posting requirements:
"Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization." 📋 Terraform Code Examples: Complete Terraform configurations for all services discussed below are available in
appendix/APPENDIX_AWS_TERRAFORM_EXAMPLES.md.
AWS Services Usage Overview
Current AWS Service Usage
Currently Implemented
| Service | Usage | Location |
|---|---|---|
| S3 | Primary storage for Bronze/Silver/Gold layers | tasks/devops_cicd/infra/terraform/main.tf - 5 buckets (ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine, ohpen-artifacts) with lifecycle policies (including Glacier transitions), versioning, encryption |
| Glue | ETL jobs (Python Shell + Spark), Data Catalog database & table | tasks/devops_cicd/infra/terraform/main.tf - 3 jobs: aws_glue_job.transaction_etl, transaction_etl_spark, cleanup_condemned; aws_glue_catalog_database.ohpen_data_lake, aws_glue_catalog_table.silver_transactions |
| Athena | Workgroup configured | tasks/devops_cicd/infra/terraform/main.tf - aws_athena_workgroup.ohpen_analytics with result bucket |
| Step Functions | ETL orchestration | tasks/devops_cicd/infra/terraform/main.tf - aws_sfn_state_machine.etl_orchestration |
| EventBridge | Scheduled ETL runs (daily 2 AM UTC) | tasks/devops_cicd/infra/terraform/main.tf - aws_cloudwatch_event_rule.daily_etl_schedule |
| CloudWatch | Monitoring, alarms, log groups | tasks/devops_cicd/infra/terraform/main.tf - aws_cloudwatch_metric_alarm, aws_cloudwatch_log_group |
| IAM | Roles and policies for all services | tasks/devops_cicd/infra/terraform/main.tf - Multiple roles: Glue service, Step Functions, EventBridge; policies for S3, CloudWatch, Data Catalog, Athena access |
| CloudTrail | Audit trail (management events and selective data events) | tasks/devops_cicd/infra/terraform/main.tf - aws_cloudtrail.main, S3 bucket for logs; selective data events on ohpen-gold and ohpen-quarantine buckets |
| SNS | ETL failure + quarantine alerts | tasks/devops_cicd/infra/terraform/main.tf - aws_sns_topic.etl_failures, aws_sns_topic.quarantine_alerts |
| SQS | ✅ Decoupling + DLQ for failure events | tasks/devops_cicd/infra/terraform/main.tf - aws_sqs_queue.etl_failures, aws_sqs_queue.etl_failures_dlq; EventBridge rule Step Functions FAILED → SNS |
See AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS for who gets alerts (platform team, not analysts) and usage.
GenAI (Amazon Bedrock)
| Service | Usage | Location |
|---|---|---|
| Bedrock | Quarantine explanations, report narrative, catalog descriptions, SQL docs | BEDROCK_IMPLEMENTATIONS — tasks/01.../src/etl/bedrock_*.py, tasks/05.../scripts/bedrock_*.py |
GenAI: Bedrock is in use here; additional opportunities (catalog, runbooks, NL-to-SQL, ops narratives) are listed in GenAI in the Ohpen Case & Opportunities.
Architecture Decisions & Service Selection Rationale
This section documents AWS services that were evaluated but intentionally not included in the solution, with architectural rationale.
AWS Lambda — Partially Used
Implementation summary
- Pandas ETL: When data is small (< 10M rows or < 500MB),
ingest_transactions.py(Pandas) runs on Lambda. Same validation and quarantine logic as the Glue path. - Orchestration:
promote_silverLambda (Silver promotion after validation gate);read_run_summaryLambda (validation gate for Step Functions). - Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used both for ETL (Pandas) and for orchestration steps.
Not Used For:
- ❌ PySpark ETL — Large/batch ETL (≥ 10M rows or ≥ 500MB) runs on Glue; Lambda is single-instance and unsuitable for Spark workloads.
- ❌ S3 event triggers (scheduling is via EventBridge; Step Functions invokes Glue (PySpark) or Lambda (Pandas)).
Rationale:
- Workload Match: Lambda fits Pandas ETL (small batches, < 15 min) and orchestration tasks. PySpark ETL runs on Glue for distributed compute.
- Processing Model: PySpark on Glue handles distributed processing across multiple workers. Lambda runs Pandas (single-instance).
- Cost: For small batches, Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.
For detailed cost analysis and serverless vs always-on compute trade-offs, see Design Decisions.
When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, and other orchestration tasks (< 15 min runtime).
DynamoDB — Not Implemented
Rationale:
- State Management: ETL run metadata is already captured in S3
_SUCCESSfiles and CloudWatch logs. Adding DynamoDB would be redundant. - Query Patterns: No need for low-latency key-value lookups. Analytics queries use Athena on S3/Parquet.
- Cost: S3 metadata + CloudWatch Logs cost ~$0.50/month. DynamoDB would add $1-5/month for minimal benefit.
- Complexity: Additional service to manage, backup, and monitor.
When DynamoDB Would Fit: High-throughput transactional workloads, real-time dashboards, session state management, schema registry with frequent updates.
Aurora/RDS — Not Implemented
Rationale:
- Workload Type: This is an analytical workload (append-only, time-partitioned, batch processing), not transactional (CRUD operations).
- Storage Model: S3 + Parquet is optimized for analytical queries (columnar, compressed, partitioned). Aurora is row-based, designed for OLTP.
- Cost: S3 storage costs ~$0.023/GB/month. Aurora RDS costs ~$0.10/GB/month + instance costs ($50-500/month minimum).
- Query Engine: Athena is serverless and designed for S3 analytics. Aurora requires managing DB instances, backups, and scaling.
When Aurora Would Fit: Reference data (FX rates, country codes), operational metadata, transactional workloads, real-time dashboards requiring < 1 second response times.
Glue Crawlers — Not Implemented
Rationale:
- Schema Governance: Financial data requires strict schema control. Manual Glue Catalog table definition in Terraform ensures no schema drift.
- Validation: Schema is validated at ingestion (null checks, currency codes, data types). Crawlers would auto-discover schema changes, bypassing validation.
- Cost: Crawlers cost $0.44/DPU-hour. Manual schema definition is one-time Terraform configuration.
- Partition Discovery: Partitions follow fixed
year/monthstructure. No need for dynamic discovery.
When Crawlers Would Fit: Semi-structured data (JSON logs), unknown schemas, multi-format data sources, dynamic partition structures.
Athena Engine v3 — Implicit Default
Athena workgroup: Default engine version (likely v2 or v3 depending on AWS region/creation date).
Recommendation: Explicitly set engine version to v3 in Terraform:
resource "aws_athena_workgroup" "ohpen_analytics" {
engine_version {
selected_engine_version = "Athena engine version 3"
}
# ... rest of config
}
Benefits of v3:
- ~3x faster on complex queries (improved query optimizer, aggregate pushdown)
- Better cost efficiency (less data scanned)
- Enhanced Parquet column pruning
Impact for This Solution: Moderate. Queries are relatively simple (balance history aggregations). V3 would improve performance but is not critical.