© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

AWS Services Analysis & Opportunities

Based on the Ohpen Data Engineer job posting requirements:

"Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization." 📋 Terraform Code Examples: Complete Terraform configurations for all services discussed below are available in appendix/APPENDIX_AWS_TERRAFORM_EXAMPLES.md.

AWS Services Usage Overview

Current AWS Service Usage

Currently Implemented

Service	Usage	Location
S3	Primary storage for Bronze/Silver/Gold layers	`tasks/devops_cicd/infra/terraform/main.tf` - 5 buckets (ohpen-bronze, ohpen-silver, ohpen-gold, ohpen-quarantine, ohpen-artifacts) with lifecycle policies (including Glacier transitions), versioning, encryption
Glue	ETL jobs (Python Shell + Spark), Data Catalog database & table	`tasks/devops_cicd/infra/terraform/main.tf` - 3 jobs: `aws_glue_job.transaction_etl`, `transaction_etl_spark`, `cleanup_condemned`; `aws_glue_catalog_database.ohpen_data_lake`, `aws_glue_catalog_table.silver_transactions`
Athena	Workgroup configured	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_athena_workgroup.ohpen_analytics` with result bucket
Step Functions	ETL orchestration	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_sfn_state_machine.etl_orchestration`
EventBridge	Scheduled ETL runs (daily 2 AM UTC)	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_cloudwatch_event_rule.daily_etl_schedule`
CloudWatch	Monitoring, alarms, log groups	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_cloudwatch_metric_alarm`, `aws_cloudwatch_log_group`
IAM	Roles and policies for all services	`tasks/devops_cicd/infra/terraform/main.tf` - Multiple roles: Glue service, Step Functions, EventBridge; policies for S3, CloudWatch, Data Catalog, Athena access
CloudTrail	Audit trail (management events and selective data events)	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_cloudtrail.main`, S3 bucket for logs; selective data events on ohpen-gold and ohpen-quarantine buckets
SNS	ETL failure + quarantine alerts	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_sns_topic.etl_failures`, `aws_sns_topic.quarantine_alerts`
SQS	✅ Decoupling + DLQ for failure events	`tasks/devops_cicd/infra/terraform/main.tf` - `aws_sqs_queue.etl_failures`, `aws_sqs_queue.etl_failures_dlq`; EventBridge rule Step Functions FAILED → SNS

See AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS for who gets alerts (platform team, not analysts) and usage.

GenAI (Amazon Bedrock)

Service	Usage	Location
Bedrock	Quarantine explanations, report narrative, catalog descriptions, SQL docs	BEDROCK_IMPLEMENTATIONS — `tasks/01.../src/etl/bedrock_.py`, `tasks/05.../scripts/bedrock_.py`

GenAI: Bedrock is in use here; additional opportunities (catalog, runbooks, NL-to-SQL, ops narratives) are listed in GenAI in the Ohpen Case & Opportunities.

Architecture Decisions & Service Selection Rationale

This section documents AWS services that were evaluated but intentionally not included in the solution, with architectural rationale.

AWS Lambda — Partially Used

Implementation summary

Pandas ETL: When data is small (< 10M rows or < 500MB), ingest_transactions.py (Pandas) runs on Lambda. Same validation and quarantine logic as the Glue path.
Orchestration: promote_silver Lambda (Silver promotion after validation gate); read_run_summary Lambda (validation gate for Step Functions).
Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used both for ETL (Pandas) and for orchestration steps.

Not Used For:

❌ PySpark ETL — Large/batch ETL (≥ 10M rows or ≥ 500MB) runs on Glue; Lambda is single-instance and unsuitable for Spark workloads.
❌ S3 event triggers (scheduling is via EventBridge; Step Functions invokes Glue (PySpark) or Lambda (Pandas)).

Rationale:

Workload Match: Lambda fits Pandas ETL (small batches, < 15 min) and orchestration tasks. PySpark ETL runs on Glue for distributed compute.
Processing Model: PySpark on Glue handles distributed processing across multiple workers. Lambda runs Pandas (single-instance).
Cost: For small batches, Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.

For detailed cost analysis and serverless vs always-on compute trade-offs, see Design Decisions.

When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, and other orchestration tasks (< 15 min runtime).

DynamoDB — Not Implemented

Rationale:

State Management: ETL run metadata is already captured in S3 _SUCCESS files and CloudWatch logs. Adding DynamoDB would be redundant.
Query Patterns: No need for low-latency key-value lookups. Analytics queries use Athena on S3/Parquet.
Cost: S3 metadata + CloudWatch Logs cost ~$0.50/month. DynamoDB would add $1-5/month for minimal benefit.
Complexity: Additional service to manage, backup, and monitor.

When DynamoDB Would Fit: High-throughput transactional workloads, real-time dashboards, session state management, schema registry with frequent updates.

Aurora/RDS — Not Implemented

Rationale:

Workload Type: This is an analytical workload (append-only, time-partitioned, batch processing), not transactional (CRUD operations).
Storage Model: S3 + Parquet is optimized for analytical queries (columnar, compressed, partitioned). Aurora is row-based, designed for OLTP.
Cost: S3 storage costs ~$0.023/GB/month. Aurora RDS costs ~$0.10/GB/month + instance costs ($50-500/month minimum).
Query Engine: Athena is serverless and designed for S3 analytics. Aurora requires managing DB instances, backups, and scaling.

When Aurora Would Fit: Reference data (FX rates, country codes), operational metadata, transactional workloads, real-time dashboards requiring < 1 second response times.

Glue Crawlers — Not Implemented

Rationale:

Schema Governance: Financial data requires strict schema control. Manual Glue Catalog table definition in Terraform ensures no schema drift.
Validation: Schema is validated at ingestion (null checks, currency codes, data types). Crawlers would auto-discover schema changes, bypassing validation.
Cost: Crawlers cost $0.44/DPU-hour. Manual schema definition is one-time Terraform configuration.
Partition Discovery: Partitions follow fixed year/month structure. No need for dynamic discovery.

When Crawlers Would Fit: Semi-structured data (JSON logs), unknown schemas, multi-format data sources, dynamic partition structures.

Athena Engine v3 — Implicit Default

Athena workgroup: Default engine version (likely v2 or v3 depending on AWS region/creation date).

Recommendation: Explicitly set engine version to v3 in Terraform:

resource "aws_athena_workgroup" "ohpen_analytics" {
  engine_version {
    selected_engine_version = "Athena engine version 3"
  }
  # ... rest of config
}

Benefits of v3:

~3x faster on complex queries (improved query optimizer, aggregate pushdown)
Better cost efficiency (less data scanned)
Enhanced Parquet column pruning

Impact for This Solution: Moderate. Queries are relatively simple (balance history aggregations). V3 would improve performance but is not critical.

AWS Services Usage Overview​

Current AWS Service Usage​

Currently Implemented​

GenAI (Amazon Bedrock)​

Architecture Decisions & Service Selection Rationale​

AWS Lambda — Partially Used​

DynamoDB — Not Implemented​

Aurora/RDS — Not Implemented​

Glue Crawlers — Not Implemented​

Athena Engine v3 — Implicit Default​

AWS Services Usage Overview

Current AWS Service Usage

Currently Implemented

GenAI (Amazon Bedrock)

Architecture Decisions & Service Selection Rationale

AWS Lambda — Partially Used

DynamoDB — Not Implemented

Aurora/RDS — Not Implemented

Glue Crawlers — Not Implemented

Athena Engine v3 — Implicit Default