Tooling & Controls
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document provides a comprehensive inventory of AWS services and tools used in the OLAP analytics data lake platform, explicitly separating Implemented vs Recommended enhancements, with architectural rationale for each decision. Ohpen core banking (OLTP) is upstream and out of scope (Scope & Assumptions).
1. Implemented Services (Production-Ready)
Data Lake & Query
| Service | Usage | Location | Rationale |
|---|---|---|---|
| S3 | Primary storage for Bronze/Silver/Gold/Quarantine/Artifacts layers | tasks/devops_cicd/infra/terraform/main.tf | Object storage optimized for analytical workloads (OLAP), scales to exabytes, cost-effective lifecycle policies |
| Glue | ETL jobs (Python Shell + Spark), Data Catalog database & table | tasks/devops_cicd/infra/terraform/main.tf | Managed Spark runtime for distributed batch processing, purpose-built for ETL workloads |
| Athena | Workgroup configured for SQL querying | tasks/devops_cicd/infra/terraform/main.tf | Serverless SQL engine for analytical queries, pay-per-query, seamless S3/Glue Catalog integration |
Orchestration & Triggers
| Service | Usage | Location | Rationale |
|---|---|---|---|
| Step Functions | ETL orchestration state machine | tasks/devops_cicd/infra/terraform/main.tf | Multi-service orchestration (S3, Glue, SNS), built-in retry/error handling, visual workflow. Provides run identity propagation for observability. |
| EventBridge | Scheduled ETL (2 AM UTC) + S3 event triggers | tasks/devops_cicd/infra/terraform/main.tf | Event-driven triggers, decoupled architecture, cost-effective scheduling |
Event Flow Patterns:
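Both patterns above (the 2 AM UTC schedule and the S3 object-created trigger) start the same Step Functions state machine; the orchestration table notes that a run identity is propagated for observability. A minimal sketch of building that execution input, with all field names hypothetical rather than taken from the repository:

```python
import uuid
from datetime import datetime, timezone

def build_execution_input(trigger: str, bucket: str = "", key: str = "") -> dict:
    """Build a Step Functions execution input that carries a run identity.

    trigger is "schedule" (EventBridge cron rule) or "s3_event"
    (object created in the bronze bucket).
    """
    if trigger not in ("schedule", "s3_event"):
        raise ValueError(f"unknown trigger: {trigger}")
    payload = {
        # run_id is propagated through Glue jobs, metrics, and SNS alerts
        "run_id": str(uuid.uuid4()),
        "trigger": trigger,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    if trigger == "s3_event":
        payload["source"] = {"bucket": bucket, "key": key}
    return payload
```

The same payload shape for both triggers keeps downstream states agnostic of how the run was started.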
Notifications & Failure Handling
| Service | Usage | Location | Rationale |
|---|---|---|---|
| SNS | ETL failure + quarantine alerts | tasks/devops_cicd/infra/terraform/main.tf | Pub/sub messaging for alerts, integrates with SQS for decoupling |
| SQS + DLQ | Decoupling + poison message handling | tasks/devops_cicd/infra/terraform/main.tf | Asynchronous processing, dead-letter queue for failed messages, prevents message loss |
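SQS itself moves a message to the dead-letter queue once its receive count exceeds the redrive policy's maxReceiveCount; a consumer can mirror that decision to log poison messages early. A small sketch using the real SQS message attribute ApproximateReceiveCount; the threshold of 3 is an assumed redrive-policy value, not taken from the Terraform config:

```python
MAX_RECEIVE_COUNT = 3  # assumed maxReceiveCount in the queue's redrive policy

def is_poison(message: dict, max_receive_count: int = MAX_RECEIVE_COUNT) -> bool:
    """Mirror the server-side redrive decision for a received SQS message.

    SQS includes ApproximateReceiveCount in the message attributes when
    requested; a count at or above the redrive threshold means this
    delivery is the message's last chance before the DLQ.
    """
    count = int(message.get("Attributes", {}).get("ApproximateReceiveCount", "1"))
    return count >= max_receive_count
```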
Observability & Audit
| Service | Usage | Location | Rationale |
|---|---|---|---|
| CloudWatch | Logs, metrics, alarms, custom metrics | tasks/devops_cicd/infra/terraform/main.tf | Centralized monitoring, custom ETL metrics (quarantine rate, job duration), alerting |
| CloudTrail | Audit trail (management events + selective data events) | tasks/devops_cicd/infra/terraform/main.tf | Compliance/audit logging, tracks infrastructure changes and sensitive data access |
| KMS | Customer-managed keys (CMK) for sensitive bucket encryption | tasks/devops_cicd/infra/terraform/main.tf | SSE-KMS encryption for gold and quarantine buckets, automatic key rotation, fine-grained access control |
CloudTrail Configuration:
- Management Events: Enabled for all infrastructure changes (IAM, S3 policies, etc.)
- Data Events: Selectively enabled for sensitive buckets only (ohpen-gold, ohpen-quarantine)
- Rationale: Cost-aware approach; data events can generate millions of events for high-volume buckets
- Fintech Positioning: "CloudTrail org/account trail for management events is enabled; for high-risk buckets/prefixes S3 data events are selectively enabled and retained per policy."
KMS Configuration:
- Customer-Managed Keys: CMK with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine)
- Key Policy: Configured to allow Glue and Athena services to decrypt/encrypt data
- Encryption: SSE-S3 (AES256) for standard buckets (bronze, silver, artifacts); SSE-KMS for sensitive buckets (gold, quarantine)
KMS Encryption Flow:
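The split described above (SSE-S3 for standard buckets, SSE-KMS with a CMK for gold and quarantine) can be sketched as a helper that selects the PutObject encryption arguments per bucket; the key alias is hypothetical:

```python
SENSITIVE_BUCKETS = {"ohpen-gold", "ohpen-quarantine"}  # SSE-KMS with CMK

def encryption_args(bucket: str, kms_key_id: str = "alias/ohpen-data") -> dict:
    """Return the extra S3 PutObject arguments selecting the bucket's
    encryption mode: SSE-KMS for sensitive buckets, SSE-S3 (AES256)
    for bronze, silver, and artifacts.
    """
    if bucket in SENSITIVE_BUCKETS:
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}
    return {"ServerSideEncryption": "AES256"}

# Usage with boto3 (not executed here):
# s3.put_object(Bucket=bucket, Key=key, Body=data, **encryption_args(bucket))
```

Note that in practice the bucket-level default encryption configured in Terraform already enforces this; the helper only makes the intent explicit at write time.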
DevOps & Quality Gates
| Service | Usage | Location | Rationale |
|---|---|---|---|
| Terraform | Infrastructure as Code (IaC) | tasks/devops_cicd/infra/terraform/main.tf | Reproducible infrastructure, version control, state management |
| GitHub Actions | CI/CD pipeline (OIDC authentication) | .github/workflows/ (repo root) | Keyless authentication via OIDC, developer-friendly, Git-integrated |
| Docker + docker-compose | Test infrastructure (CI-focused) | tasks/data_ingestion_transformation/Dockerfile.test | Containerized test environment, local development, CI validation |
| ruff + pytest | Quality gates (linting, unit tests) | .github/workflows/ci.yml | Code quality enforcement, automated testing, prevents regressions |
Data Processing Libraries
| Tool | Usage | Location | Rationale |
|---|---|---|---|
| PySpark | Distributed ETL processing | tasks/data_ingestion_transformation/src/etl/*_spark.py | Handles large files (100GB+), distributed processing, optimized for batch workloads |
| pandas | Single-file/small batch processing | tasks/data_ingestion_transformation/src/etl/validation.py | Development/testing, small datasets, simpler debugging |
| pyarrow, boto3, s3fs/fsspec | S3 operations, Parquet I/O | tasks/data_ingestion_transformation/src/etl/ | Efficient S3 access, Parquet format support, optimized I/O |
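The Parquet output written by these libraries lands in a time-partitioned layout so Athena can prune partitions. A stdlib-only sketch of building such a key; the prefix layout and file-naming scheme are illustrative, not copied from the repository:

```python
from datetime import datetime, timezone

def silver_partition_key(ts: datetime, run_id: str) -> str:
    """Hive-style time-partitioned key for a Parquet part file in the
    silver bucket; Athena prunes queries on the year/month/day partitions.
    """
    ts = ts.astimezone(timezone.utc)  # partition on UTC event time
    return (
        f"transactions/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        f"part-{run_id}.parquet"
    )
```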
Validation & Data Quality
Validation Tool: Python in ETL (pandas or PySpark)
Validation Rules Location:
- Shared Configuration: tasks/data_ingestion_transformation/src/etl/config.py (ALLOWED_CURRENCIES, REQUIRED_COLUMNS, ERROR_* constants)
- Pandas Implementation: tasks/data_ingestion_transformation/src/etl/validation.py
- PySpark Implementation: tasks/data_ingestion_transformation/src/etl/validation_spark.py
Validation Modes:
- Schema Validation: Enforce columns + types (validate_schema())
- Domain Validation: Null rules, ranges, referential checks (apply_validation_rules())
- Quarantine: Write failed rows/files to s3://ohpen-quarantine/... + publish metrics + notifications
Both Spark + pandas paths share rule definitions via config.py, ensuring consistency.
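A plain-Python sketch of the shared-rules pattern, using the function names above; the constant values and ERROR_* codes shown here are illustrative, not copied from config.py:

```python
# Shared rule definitions, mirroring the constants in config.py.
REQUIRED_COLUMNS = ["TransactionID", "CustomerID", "Amount", "Currency", "Timestamp"]
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}

def validate_schema(columns: list[str]) -> list[str]:
    """Schema check: report required columns missing from the input."""
    return [c for c in REQUIRED_COLUMNS if c not in columns]

def apply_validation_rules(row: dict) -> list[str]:
    """Domain checks for a single row; a non-empty result sends the row
    to the quarantine bucket with its error codes attached."""
    errors = []
    if row.get("Amount") is None:
        errors.append("ERROR_NULL_AMOUNT")
    if row.get("Currency") not in ALLOWED_CURRENCIES:
        errors.append("ERROR_CURRENCY_NOT_ALLOWED")
    return errors
```

Because both engines import the same constants, a rule change (for example adding a currency) updates the pandas and PySpark paths in one place.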
2. Recommended Enhancements (Not Implemented)
High Priority (Fintech Compliance)
| Service | Recommendation | Rationale | Implementation Effort |
|---|---|---|---|
| Secrets Manager / SSM Parameter Store | Store runtime config if external APIs/credentials needed | Centralized secret management, rotation, audit trail | Low (if needed) |
Encryption: SSE-S3 (AES-256) for bronze, silver, and artifacts buckets; SSE-KMS with customer-managed keys (CMK) for ohpen-gold and ohpen-quarantine.
Medium Priority (Operational Improvements)
| Service | Recommendation | Rationale | Implementation Effort |
|---|---|---|---|
| Lambda | Pre-processing (file validation), post-processing (promotion) | Event-driven automation, cost-effective for lightweight tasks | Medium |
| DynamoDB | ETL run metadata, data quality metrics, schema registry | Queryable metadata store, sub-millisecond lookups for operational dashboards | Medium |
| Glue Data Quality | Standardize checks + scorecards | Automated quality monitoring, integration with Glue Catalog | Medium |
Low Priority (Future Enhancements)
| Service | Recommendation | Rationale | Implementation Effort |
|---|---|---|---|
| Aurora/RDS | Reference data (FX rates, customer master) | Only if OLTP requirements emerge (real-time lookups, ACID transactions) | High |
| Glue Crawlers | Auto-detect partitions and schema changes | Only if schema evolves frequently or unknown schemas | Low |
| Iceberg | Schema evolution, time travel queries | Only if advanced schema evolution features needed | High |
3. Architectural Rationale: Why Certain Tools Are Not Used
AWS Lambda
Used For:
- Pandas ETL: Small batches (< 10M rows or < 500MB): ingest_transactions.py runs on Lambda. Same validation and quarantine logic as the Glue path.
- Orchestration: promote_silver and read_run_summary (validation gate). Step Functions invokes these Lambdas.
Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used for both Pandas ETL and orchestration steps.
Why Not for PySpark ETL:
- Workload: PySpark ETL (large files, 100GB+) requires distributed compute; Lambda is single-instance, 15-minute timeout.
- Processing Model: PySpark on Glue handles multi-worker processing; Lambda runs Pandas only.
- Cost: For small batches Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.
When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, orchestration tasks (< 15 min runtime).
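The routing rule above (pandas on Lambda below the size thresholds, PySpark on Glue otherwise) can be sketched as a simple selector; the function name and return labels are hypothetical:

```python
MAX_LAMBDA_ROWS = 10_000_000        # < 10M rows -> pandas on Lambda
MAX_LAMBDA_BYTES = 500 * 1024**2    # < 500MB

def choose_engine(row_count: int, size_bytes: int) -> str:
    """Route a batch: small files take the pandas/Lambda path; large
    files take the PySpark/Glue path (distributed workers, no
    15-minute Lambda timeout)."""
    if row_count < MAX_LAMBDA_ROWS and size_bytes < MAX_LAMBDA_BYTES:
        return "lambda-pandas"
    return "glue-pyspark"
```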
DynamoDB
Why Not:
- Access Pattern Mismatch: DynamoDB targets point lookups (get item by key). This workload uses batch analytical queries (scan large datasets, aggregations, time-range filters).
- Data Model: Append-only, immutable, time-partitioned data. DynamoDB is optimized for mutable, key-value operations.
- Query Pattern: Athena queries scan partitions, aggregate, and join. DynamoDB does not support SQL aggregations or analytical queries.
When It Fits: Operational metadata (run status, job tracking), real-time dashboards, point lookups. Not for analytical data storage.
Aurora/RDS
Why Not:
- OLAP vs OLTP: Aurora is row-based, transactional (OLTP). This is analytical (OLAP) — columnar storage, aggregations, time-series analysis.
- Storage Model: Append-only, immutable, partitioned by time. Relational databases are optimized for mutable, normalized data with ACID transactions.
- Scale: S3 scales to exabytes; Aurora has practical limits (64TB per instance, scaling complexity).
- Cost: Aurora charges per instance-hour + storage + I/O. S3 + Athena is pay-per-query, no infrastructure to manage.
When It Fits: Reference data (FX rates, customer master), operational metadata, transactional workloads. Not for analytical data lake storage.
Secrets Manager / SSM Parameter Store
Why Not:
- No Secrets Exist: All authentication is IAM role-based (Glue service role, Step Functions execution role). No database passwords, API keys, or credentials to store.
- OIDC Authentication: GitHub Actions uses OIDC federation (temporary credentials), not static access keys.
- Configuration vs Secrets: Bucket names, prefixes, job names are configuration (Terraform variables), not secrets.
When It Fits: External API keys, database credentials, third-party service tokens. Not needed when everything is IAM role-based.
KMS (Customer-Managed Keys)
Implementation summary
- SSE-KMS: Customer-managed keys (CMK) with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine).
- Key policy: Configured to allow Glue and Athena services to decrypt and encrypt data.
- Cost: SSE-S3 is used for standard buckets (bronze, silver, artifacts) to manage cost.
Why Not For All Buckets:
- Cost/Complexity: KMS adds $1/month per key + $0.03 per 10K requests. For high-volume ETL (millions of S3 operations), this adds significant cost; at 10 million KMS-encrypted requests per month, request charges alone come to roughly $30.
- Operational Overhead: Key rotation, key policies, CloudTrail logging for key usage. SSE-S3 requires no key management.
When It Fits: Compliance requiring customer-managed keys (CMK), multi-account key sharing, fine-grained access control via key policies. Implemented for sensitive buckets (gold, quarantine).
Compliance escalation: If compliance requirements for the lower layers (bronze, silver, artifacts) prove higher than assumed, SSE-KMS may be adopted for all buckets to provide a single, uniform encryption model.
Glue Data Quality
Why Not:
- Custom Validation Logic: Business rules (currency allowlist, timestamp parsing, loop prevention) are implemented in Python. Glue Data Quality is rule-based and may not cover all custom logic.
- Quarantine Pattern: Failed rows go to S3 quarantine bucket with error metadata. Glue Data Quality focuses on metrics/scorecards, not custom quarantine workflows.
- Circuit Breaker: Custom logic halts pipeline if >100 same errors/hour. Glue Data Quality does not provide this pattern.
When It Fits: Standardized checks (null rates, value ranges), automated quality monitoring, integration with Glue Catalog scorecards. Not a replacement for custom validation logic.
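The circuit-breaker pattern cited above (halt the pipeline when the same error exceeds 100 occurrences per hour) can be sketched with a sliding-window counter; the class and method names are hypothetical, not the repository's implementation:

```python
from collections import deque

class CircuitBreaker:
    """Trip when one error code occurs more than `threshold` times
    within a sliding window (one hour by default)."""

    def __init__(self, threshold: int = 100, window_seconds: int = 3600):
        self.threshold = threshold
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def record(self, error_code: str, now: float) -> bool:
        """Record one occurrence at epoch-seconds `now`; return True
        when the breaker trips and the pipeline should halt."""
        q = self.events.setdefault(error_code, deque())
        q.append(now)
        while q and q[0] <= now - self.window:  # evict events outside the window
            q.popleft()
        return len(q) > self.threshold
```

Keeping the counter per error code means a burst of one failure mode (for example a bad currency feed) halts the run without masking unrelated, low-rate errors.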
Glue Crawlers
Why Not:
- Schema Is Known and Stable: Transaction schema (TransactionID, CustomerID, Amount, Currency, Timestamp) is fixed. Crawlers are for discovering unknown or evolving schemas.
- Explicit Control: Manual table definition in Terraform provides version control, explicit schema documentation, prevents schema drift.
- Performance: Manual definition is faster (no crawling overhead) and more predictable.
When It Fits: Unknown schemas, frequent schema evolution, auto-partition discovery. Not needed when schema is stable and known.
Glue Workflows
Why Not:
- Multi-Service Orchestration: Pipeline involves S3, EventBridge, Step Functions, Glue, SNS, SQS. Step Functions orchestrates across services; Glue Workflows are Glue-native.
- Better Error Handling: Step Functions provides retry logic, parallel execution, integration with non-Glue services.
When It Fits: Pure Glue pipelines (Crawlers → ETL → Catalog), simpler workflows, cost optimization. Step Functions is better for multi-service orchestration.
ECR/ECS (Container Deployment)
Why Not:
- Native Execution: Glue provides managed Spark runtime. Containers add complexity (building, pushing, managing) without benefit.
- Workload Fit: ETL runs in Glue's optimized Spark environment. Containers are for custom runtimes or microservices, not managed ETL.
When It Fits: Custom Spark/Python versions, containerized microservices, multi-cloud portability. Not needed when Glue's native runtime suffices.
4. Summary: Architectural Fit Matrix
| Tool | Fundamental Mismatch | When It Fits |
|---|---|---|
| Lambda | Batch ETL vs event-driven microservices | Pre/post-processing, orchestration glue |
| DynamoDB | Analytical queries vs point lookups | Operational metadata, real-time dashboards |
| Aurora | OLAP vs OLTP, append-only vs mutable | Reference data, transactional workloads |
| Secrets Manager | No secrets exist (IAM roles only) | External API keys, database credentials |
| KMS | SSE-S3 sufficient, cost/complexity | Compliance requiring CMK, multi-account |
| Glue Data Quality | Custom validation logic required | Standardized checks, quality scorecards |
| Glue Crawlers | Schema is known and stable | Unknown schemas, frequent evolution |
| Glue Workflows | Multi-service orchestration needed | Pure Glue pipelines only |
| ECR/ECS | Native Glue runtime sufficient | Custom runtimes, microservices |
5. Fintech Compliance Positioning
Controls in place
CloudTrail management events are enabled for all infrastructure changes; data events are selectively enabled for sensitive buckets (gold, quarantine). Encryption: SSE-S3 (AES-256) for bronze, silver, and artifacts; SSE-KMS with customer-managed keys for gold and quarantine. IAM role-based authentication is used throughout; OIDC federation is used for CI/CD. S3 versioning is enabled for audit trail; lifecycle policies are configured for cost optimization (Glacier transitions).
Mature Fintech Positioning:
"CloudTrail org/account trail for management events is enabled; for high-risk buckets/prefixes S3 data events are selectively enabled and retained per policy. Encryption uses SSE-S3 for standard data and SSE-KMS with customer-managed keys for sensitive financial data (gold layer, quarantine). All authentication is IAM role-based with OIDC federation for CI/CD, eliminating static credentials."
See also
- Data Lake Architecture - Complete data lake architecture
- ETL Flow - Validation and data quality logic using AWS services
- CI/CD Workflow - OIDC authentication and deployment pipeline
- Traceability Design - AWS-native identifiers and observability
- Audit & Monitoring - CloudWatch, CloudTrail integration