Skip to main content

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

AWS Services Analysis & Opportunities

Based on the Ohpen Data Engineer job posting requirements:

"Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization." 📋 Terraform Code Examples: Complete Terraform configurations for all services discussed below are available in appendix/APPENDIX_AWS_TERRAFORM_EXAMPLES.md.

AWS Services Usage Overview

Current AWS Service Usage

Currently Implemented

ServiceUsageLocation
S3Primary storage for Bronze/Silver/Gold layerstasks/devops_cicd/infra/terraform/main.tf - 5 buckets (ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine, ohpen-artifacts) with lifecycle policies (including Glacier transitions), versioning, encryption
GlueETL jobs (Python Shell + Spark), Data Catalog database & tabletasks/devops_cicd/infra/terraform/main.tf - 3 jobs: aws_glue_job.transaction_etl, transaction_etl_spark, cleanup_condemned; aws_glue_catalog_database.ohpen_data_lake, aws_glue_catalog_table.silver_transactions
AthenaWorkgroup configuredtasks/devops_cicd/infra/terraform/main.tf - aws_athena_workgroup.ohpen_analytics with result bucket
Step FunctionsETL orchestrationtasks/devops_cicd/infra/terraform/main.tf - aws_sfn_state_machine.etl_orchestration
EventBridgeScheduled ETL runs (daily 2 AM UTC)tasks/devops_cicd/infra/terraform/main.tf - aws_cloudwatch_event_rule.daily_etl_schedule
CloudWatchMonitoring, alarms, log groupstasks/devops_cicd/infra/terraform/main.tf - aws_cloudwatch_metric_alarm, aws_cloudwatch_log_group
IAMRoles and policies for all servicestasks/devops_cicd/infra/terraform/main.tf - Multiple roles: Glue service, Step Functions, EventBridge; policies for S3, CloudWatch, Data Catalog, Athena access
CloudTrailAudit trail (management events and selective data events)tasks/devops_cicd/infra/terraform/main.tf - aws_cloudtrail.main, S3 bucket for logs; selective data events on ohpen-gold and ohpen-quarantine buckets
SNSETL failure + quarantine alertstasks/devops_cicd/infra/terraform/main.tf - aws_sns_topic.etl_failures, aws_sns_topic.quarantine_alerts
SQS✅ Decoupling + DLQ for failure eventstasks/devops_cicd/infra/terraform/main.tf - aws_sqs_queue.etl_failures, aws_sqs_queue.etl_failures_dlq; EventBridge rule Step Functions FAILED → SNS

See AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS for who gets alerts (platform team, not analysts) and usage.

GenAI (Amazon Bedrock)

ServiceUsageLocation
BedrockQuarantine explanations, report narrative, catalog descriptions, SQL docsBEDROCK_IMPLEMENTATIONStasks/01.../src/etl/bedrock_*.py, tasks/05.../scripts/bedrock_*.py

GenAI: Bedrock is in use here; additional opportunities (catalog, runbooks, NL-to-SQL, ops narratives) are listed in GenAI in the Ohpen Case & Opportunities.


Architecture Decisions & Service Selection Rationale

This section documents AWS services that were evaluated but intentionally not included in the solution, with architectural rationale.

AWS Lambda — Partially Used

Implementation summary

  • Pandas ETL: When data is small (< 10M rows or < 500MB), ingest_transactions.py (Pandas) runs on Lambda. Same validation and quarantine logic as the Glue path.
  • Orchestration: promote_silver Lambda (Silver promotion after validation gate); read_run_summary Lambda (validation gate for Step Functions).
  • Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used both for ETL (Pandas) and for orchestration steps.

Not Used For:

  • PySpark ETL — Large/batch ETL (≥ 10M rows or ≥ 500MB) runs on Glue; Lambda is single-instance and unsuitable for Spark workloads.
  • ❌ S3 event triggers (scheduling is via EventBridge; Step Functions invokes Glue (PySpark) or Lambda (Pandas)).

Rationale:

  • Workload Match: Lambda fits Pandas ETL (small batches, < 15 min) and orchestration tasks. PySpark ETL runs on Glue for distributed compute.
  • Processing Model: PySpark on Glue handles distributed processing across multiple workers. Lambda runs Pandas (single-instance).
  • Cost: For small batches, Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.

For detailed cost analysis and serverless vs always-on compute trade-offs, see Design Decisions.

When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, and other orchestration tasks (< 15 min runtime).


DynamoDB — Not Implemented

Rationale:

  • State Management: ETL run metadata is already captured in S3 _SUCCESS files and CloudWatch logs. Adding DynamoDB would be redundant.
  • Query Patterns: No need for low-latency key-value lookups. Analytics queries use Athena on S3/Parquet.
  • Cost: S3 metadata + CloudWatch Logs cost ~$0.50/month. DynamoDB would add $1-5/month for minimal benefit.
  • Complexity: Additional service to manage, backup, and monitor.

When DynamoDB Would Fit: High-throughput transactional workloads, real-time dashboards, session state management, schema registry with frequent updates.


Aurora/RDS — Not Implemented

Rationale:

  • Workload Type: This is an analytical workload (append-only, time-partitioned, batch processing), not transactional (CRUD operations).
  • Storage Model: S3 + Parquet is optimized for analytical queries (columnar, compressed, partitioned). Aurora is row-based, designed for OLTP.
  • Cost: S3 storage costs ~$0.023/GB/month. Aurora RDS costs ~$0.10/GB/month + instance costs ($50-500/month minimum).
  • Query Engine: Athena is serverless and designed for S3 analytics. Aurora requires managing DB instances, backups, and scaling.

When Aurora Would Fit: Reference data (FX rates, country codes), operational metadata, transactional workloads, real-time dashboards requiring < 1 second response times.


Glue Crawlers — Not Implemented

Rationale:

  • Schema Governance: Financial data requires strict schema control. Manual Glue Catalog table definition in Terraform ensures no schema drift.
  • Validation: Schema is validated at ingestion (null checks, currency codes, data types). Crawlers would auto-discover schema changes, bypassing validation.
  • Cost: Crawlers cost $0.44/DPU-hour. Manual schema definition is one-time Terraform configuration.
  • Partition Discovery: Partitions follow fixed year/month structure. No need for dynamic discovery.

When Crawlers Would Fit: Semi-structured data (JSON logs), unknown schemas, multi-format data sources, dynamic partition structures.


Athena Engine v3 — Implicit Default

Athena workgroup: Default engine version (likely v2 or v3 depending on AWS region/creation date).

Recommendation: Explicitly set engine version to v3 in Terraform:

resource "aws_athena_workgroup" "ohpen_analytics" {
engine_version {
selected_engine_version = "Athena engine version 3"
}
# ... rest of config
}

Benefits of v3:

  • ~3x faster on complex queries (improved query optimizer, aggregate pushdown)
  • Better cost efficiency (less data scanned)
  • Enhanced Parquet column pruning

Impact for This Solution: Moderate. Queries are relatively simple (balance history aggregations). V3 would improve performance but is not critical.

© 2026 Stephen AdeiCC BY 4.0