Tooling & Controls

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document inventories the AWS services and tools used in the OLAP analytics data lake platform, separates implemented services from recommended enhancements, and gives the architectural rationale for each decision. Ohpen core banking (OLTP) is upstream and out of scope (see Scope & Assumptions).

1. Implemented Services (Production-Ready)

Data Lake & Query

Service	Usage	Location	Rationale
S3	Primary storage for Bronze/Silver/Gold/Quarantine/Artifacts layers	`tasks/devops_cicd/infra/terraform/main.tf`	Object storage optimized for analytical workloads (OLAP), scales to exabytes, cost-effective lifecycle policies
Glue	ETL jobs (Python Shell + Spark), Data Catalog database & table	`tasks/devops_cicd/infra/terraform/main.tf`	Managed Spark runtime for distributed batch processing, purpose-built for ETL workloads
Athena	Workgroup configured for SQL querying	`tasks/devops_cicd/infra/terraform/main.tf`	Serverless SQL engine for analytical queries, pay-per-query, seamless S3/Glue Catalog integration

Orchestration & Triggers

Service	Usage	Location	Rationale
Step Functions	ETL orchestration state machine	`tasks/devops_cicd/infra/terraform/main.tf`	Multi-service orchestration (S3, Glue, SNS), built-in retry/error handling, visual workflow. Provides run identity propagation for observability.
EventBridge	Scheduled ETL (2 AM UTC) + S3 event triggers	`tasks/devops_cicd/infra/terraform/main.tf`	Event-driven triggers, decoupled architecture, cost-effective scheduling
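
The two EventBridge triggers in the table above could be expressed roughly as follows. This is a minimal sketch, not the configuration in main.tf: the bucket name `ohpen-bronze`, the `raw/` prefix, and the helper `s3_object_created_pattern` are assumptions made for illustration.

```python
import json

# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
# EventBridge rules evaluate schedule expressions in UTC, so this is 02:00 UTC daily.
NIGHTLY_SCHEDULE = "cron(0 2 * * ? *)"


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """Build an EventBridge event pattern matching new objects under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


# Hypothetical bucket and prefix, used only for illustration.
pattern = s3_object_created_pattern("ohpen-bronze", "raw/")
print(json.dumps(pattern, indent=2))
```

The pattern only matches if EventBridge notifications are turned on for the bucket in question.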

Notifications & Failure Handling

Service	Usage	Location	Rationale
SNS	ETL failure + quarantine alerts	`tasks/devops_cicd/infra/terraform/main.tf`	Pub/sub messaging for alerts, integrates with SQS for decoupling
SQS + DLQ	Decoupling + poison message handling	`tasks/devops_cicd/infra/terraform/main.tf`	Asynchronous processing, dead-letter queue for failed messages, prevents message loss
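
Poison-message handling with a dead-letter queue comes down to a redrive policy on the source queue. The sketch below builds the JSON string that SQS expects for the `RedrivePolicy` queue attribute. The ARN and the receive count of 3 are hypothetical; the actual values are defined in the Terraform configuration.

```python
import json


def redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> str:
    """Return the JSON string SQS expects for the RedrivePolicy queue attribute."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


# Hypothetical ARN; a message moves to the DLQ after 3 failed receives.
attributes = {
    "RedrivePolicy": redrive_policy(
        "arn:aws:sqs:eu-west-1:123456789012:etl-events-dlq", max_receive_count=3
    )
}
```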

Observability & Audit

Service	Usage	Location	Rationale
CloudWatch	Logs, metrics, alarms, custom metrics	`tasks/devops_cicd/infra/terraform/main.tf`	Centralized monitoring, custom ETL metrics (quarantine rate, job duration), alerting
CloudTrail	Audit trail (management events + selective data events)	`tasks/devops_cicd/infra/terraform/main.tf`	Compliance/audit logging, tracks infrastructure changes and sensitive data access
KMS	Customer-managed keys (CMK) for sensitive bucket encryption	`tasks/devops_cicd/infra/terraform/main.tf`	SSE-KMS encryption for gold and quarantine buckets, automatic key rotation, fine-grained access control

CloudTrail Configuration:

Management Events: Enabled for all infrastructure changes (IAM, S3 policies, etc.)
Data Events: Selectively enabled for sensitive buckets only (ohpen-gold, ohpen-quarantine)
- Rationale: Cost-aware approach; data events can generate millions of events for high-volume buckets
- Fintech Positioning: "A CloudTrail org/account trail is enabled for management events; S3 data events are selectively enabled for high-risk buckets/prefixes and retained per policy."
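
As an illustration of the selective approach, the following sketch builds an event selector list in the shape accepted by the `EventSelectors` parameter of the CloudTrail `put_event_selectors` API. The function name `build_event_selectors` is an assumption; the trail itself is managed in Terraform, so this shows only the structure.

```python
SENSITIVE_BUCKETS = ["ohpen-gold", "ohpen-quarantine"]


def build_event_selectors(buckets: list) -> list:
    """Management events for everything, S3 data events only for the listed buckets."""
    return [
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # A trailing slash scopes logging to all objects in the bucket.
                    "Values": [f"arn:aws:s3:::{name}/" for name in buckets],
                }
            ],
        }
    ]


selectors = build_event_selectors(SENSITIVE_BUCKETS)
```

Buckets that are not listed (bronze, silver, artifacts) produce no data events, which is what keeps the event volume and cost bounded.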

KMS Configuration:

Customer-Managed Keys: CMK with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine)
Key Policy: Configured to allow Glue and Athena services to decrypt/encrypt data
Encryption: SSE-S3 (AES256) for standard buckets (bronze, silver, artifacts); SSE-KMS for sensitive buckets (gold, quarantine)
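
The per-layer encryption split can be sketched as a small helper that returns a default-encryption configuration in the shape used by the S3 `put_bucket_encryption` API (`ServerSideEncryptionConfiguration`). The helper name, the layer sets, and the example key ARN are assumptions for illustration; the real settings are declared in Terraform.

```python
from typing import Optional

SENSITIVE_LAYERS = {"gold", "quarantine"}
STANDARD_LAYERS = {"bronze", "silver", "artifacts"}


def encryption_config(layer: str, kms_key_arn: Optional[str] = None) -> dict:
    """Return the default-encryption configuration for a data lake layer."""
    if layer in SENSITIVE_LAYERS:
        if not kms_key_arn:
            raise ValueError(f"layer '{layer}' requires a customer-managed key ARN")
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_arn}
    elif layer in STANDARD_LAYERS:
        default = {"SSEAlgorithm": "AES256"}  # SSE-S3
    else:
        raise ValueError(f"unknown layer '{layer}'")
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}


# Hypothetical key ARN, for illustration only.
gold = encryption_config("gold", "arn:aws:kms:eu-west-1:123456789012:key/example")
bronze = encryption_config("bronze")
```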

DevOps & Quality Gates

Service	Usage	Location	Rationale
Terraform	Infrastructure as Code (IaC)	`tasks/devops_cicd/infra/terraform/main.tf`	Reproducible infrastructure, version control, state management
GitHub Actions	CI/CD pipeline (OIDC authentication)	`.github/workflows/` (repo root)	Keyless authentication via OIDC, developer-friendly, Git-integrated
Docker + docker-compose	Test infrastructure (CI-focused)	`tasks/data_ingestion_transformation/Dockerfile.test`	Containerized test environment, local development, CI validation
ruff + pytest	Quality gates (linting, unit tests)	`.github/workflows/ci.yml`	Code quality enforcement, automated testing, prevents regressions

Data Processing Libraries

Tool	Usage	Location	Rationale
PySpark	Distributed ETL processing	`tasks/data_ingestion_transformation/src/etl/*_spark.py`	Handles large files (100GB+), distributed processing, optimized for batch workloads
pandas	Single-file/small batch processing	`tasks/data_ingestion_transformation/src/etl/validation.py`	Development/testing, small datasets, simpler debugging
pyarrow, boto3, s3fs/fsspec	S3 operations, Parquet I/O	`tasks/data_ingestion_transformation/src/etl/`	Efficient S3 access, Parquet format support, optimized I/O

Validation & Data Quality

Validation Tool: Python in ETL (pandas or PySpark)

Validation Rules Location:

Shared Configuration: tasks/data_ingestion_transformation/src/etl/config.py
- ALLOWED_CURRENCIES, REQUIRED_COLUMNS, ERROR_* constants
Pandas Implementation: tasks/data_ingestion_transformation/src/etl/validation.py
PySpark Implementation: tasks/data_ingestion_transformation/src/etl/validation_spark.py

Validation Modes:

Schema Validation: Enforce columns + types (validate_schema())
Domain Validation: Null rules, ranges, referential checks (apply_validation_rules())
Quarantine: Write failed rows/files to s3://ohpen-quarantine/... + publish metrics + notifications

Both the Spark and pandas paths read their rule definitions from config.py, so the two implementations apply the same rules.
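
To make the shared-rules idea concrete, here is a dependency-free sketch that operates on plain dict rows rather than pandas or Spark DataFrames. The constant values and the function signatures are assumptions: only the names `ALLOWED_CURRENCIES`, `REQUIRED_COLUMNS`, `validate_schema()` and `apply_validation_rules()` come from the project, and this `validate_schema` checks column presence only, not types.

```python
# Hypothetical stand-ins for the constants defined in config.py.
REQUIRED_COLUMNS = ("TransactionID", "CustomerID", "Amount", "Currency", "Timestamp")
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}
ERROR_NULL_VALUE = "NULL_VALUE"
ERROR_INVALID_CURRENCY = "INVALID_CURRENCY"


def validate_schema(columns):
    """Return the required columns that are missing from the input."""
    return [c for c in REQUIRED_COLUMNS if c not in columns]


def apply_validation_rules(rows):
    """Split rows into (valid, quarantined); quarantined rows carry an error code."""
    valid, quarantined = [], []
    for row in rows:
        if any(row.get(c) in (None, "") for c in REQUIRED_COLUMNS):
            quarantined.append({**row, "error": ERROR_NULL_VALUE})
        elif row["Currency"] not in ALLOWED_CURRENCIES:
            quarantined.append({**row, "error": ERROR_INVALID_CURRENCY})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"TransactionID": "t1", "CustomerID": "c1", "Amount": 10.0,
     "Currency": "EUR", "Timestamp": "2026-01-01T00:00:00Z"},
    {"TransactionID": "t2", "CustomerID": "c2", "Amount": 5.0,
     "Currency": "XXX", "Timestamp": "2026-01-01T00:00:00Z"},
]
valid, quarantined = apply_validation_rules(rows)
```

Because both functions depend only on the shared constants, a pandas or PySpark implementation can apply the same rules by importing the same module.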

2. Recommended Enhancements (Not Implemented)

High Priority (Fintech Compliance)

Service	Recommendation	Rationale	Implementation Effort
Secrets Manager / SSM Parameter Store	Store runtime config if external APIs/credentials needed	Centralized secret management, rotation, audit trail	Low (if needed)

Encryption is already implemented and is not an open recommendation: SSE-S3 (AES-256) for the bronze, silver, and artifacts buckets; SSE-KMS with customer-managed keys (CMK) for ohpen-gold and ohpen-quarantine.

Medium Priority (Operational Improvements)

Service	Recommendation	Rationale	Implementation Effort
Lambda	Pre-processing (file validation), post-processing (promotion)	Event-driven automation, cost-effective for lightweight tasks	Medium
DynamoDB	ETL run metadata, data quality metrics, schema registry	Queryable metadata store, single-digit-millisecond key lookups for operational dashboards	Medium
Glue Data Quality	Standardize checks + scorecards	Automated quality monitoring, integration with Glue Catalog	Medium

Low Priority (Future Enhancements)

Service	Recommendation	Rationale	Implementation Effort
Aurora/RDS	Reference data (FX rates, customer master)	Only if OLTP requirements emerge (real-time lookups, ACID transactions)	High
Glue Crawlers	Auto-detect partitions and schema changes	Only if schema evolves frequently or unknown schemas	Low
Iceberg	Schema evolution, time travel queries	Only if advanced schema evolution features needed	High

3. Architectural Rationale: Why Certain Tools Are Not Used

AWS Lambda

Used For:

Pandas ETL — Small batches (< 10M rows or < 500MB): ingest_transactions.py runs on Lambda. Same validation and quarantine logic as the Glue path.
Orchestration — promote_silver, read_run_summary (validation gate). Step Functions invokes these Lambdas.

Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used for both Pandas ETL and orchestration steps.

Why Not for PySpark ETL:

Workload: PySpark ETL (large files, 100GB+) requires distributed compute; each Lambda invocation runs on a single instance and is capped at a 15-minute timeout.
Processing Model: PySpark on Glue handles multi-worker processing; Lambda runs Pandas only.
Cost: For small batches Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.

When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, orchestration tasks (< 15 min runtime).
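
The Lambda-versus-Glue split can be sketched as a simple routing function. This is an assumption-laden illustration: the function name `choose_runtime` is hypothetical, "500MB" is interpreted as 500 MiB, and the sketch takes the conservative reading that a batch must be under both thresholds to go to Lambda.

```python
MAX_LAMBDA_ROWS = 10_000_000          # "< 10M rows" threshold from the text
MAX_LAMBDA_BYTES = 500 * 1024 * 1024  # "< 500MB", interpreted here as MiB


def choose_runtime(row_count: int, size_bytes: int) -> str:
    """Return "lambda" for small batches (pandas path), otherwise "glue" (PySpark path)."""
    if row_count < MAX_LAMBDA_ROWS and size_bytes < MAX_LAMBDA_BYTES:
        return "lambda"
    return "glue"


small = choose_runtime(row_count=250_000, size_bytes=40 * 1024 * 1024)
large = choose_runtime(row_count=900_000_000, size_bytes=100 * 1024**3)
```

Routing on both size and row count guards against wide rows that fit the row limit but would exceed Lambda memory or the 15-minute timeout.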

DynamoDB

Why Not:

Access Pattern Mismatch: DynamoDB targets point lookups (get item by key). This workload uses batch analytical queries (scan large datasets, aggregations, time-range filters).
Data Model: Append-only, immutable, time-partitioned data. DynamoDB is optimized for mutable, key-value operations.
Query Pattern: Athena queries scan partitions, aggregate, and join. DynamoDB does not support SQL aggregations or analytical queries.

When It Fits: Operational metadata (run status, job tracking), real-time dashboards, point lookups. Not for analytical data storage.

Aurora/RDS

Why Not:

OLAP vs OLTP: Aurora is row-based, transactional (OLTP). This is analytical (OLAP) — columnar storage, aggregations, time-series analysis.
Storage Model: Append-only, immutable, partitioned by time. Relational databases are optimized for mutable, normalized data with ACID transactions.
Scale: S3 scales to exabytes; an Aurora cluster volume has a hard size cap (on the order of 128 TiB, depending on engine version), and scaling beyond a single cluster adds operational complexity.
Cost: Aurora charges per instance-hour + storage + I/O. S3 + Athena is pay-per-query, no infrastructure to manage.

When It Fits: Reference data (FX rates, customer master), operational metadata, transactional workloads. Not for analytical data lake storage.

Secrets Manager / SSM Parameter Store

Why Not:

No Secrets Exist: All authentication is IAM role-based (Glue service role, Step Functions execution role). No database passwords, API keys, or credentials to store.
OIDC Authentication: GitHub Actions uses OIDC federation (temporary credentials), not static access keys.
Configuration vs Secrets: Bucket names, prefixes, job names are configuration (Terraform variables), not secrets.

When It Fits: External API keys, database credentials, third-party service tokens. Not needed when everything is IAM role-based.

KMS (Customer-Managed Keys)

Implementation summary

SSE-KMS: Customer-managed keys (CMK) with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine).
Key policy: Configured to allow Glue and Athena services to decrypt and encrypt data.
Cost: SSE-S3 is used for standard buckets (bronze, silver, artifacts) to manage cost.

Why Not For All Buckets:

Cost/Complexity: KMS adds $1/month per key + $0.03 per 10K requests. For high-volume ETL (millions of S3 operations), this adds significant cost.
Operational Overhead: Key rotation, key policies, CloudTrail logging for key usage. SSE-S3 requires no key management.
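
The cost argument is easy to check with the prices quoted above. The sketch below is a worked calculation only; the request volume is a hypothetical figure, and it assumes every S3 operation results in one KMS request (features such as S3 Bucket Keys reduce that number).

```python
KEY_MONTHLY_USD = 1.00        # per customer-managed key, per month
USD_PER_10K_REQUESTS = 0.03   # per 10,000 KMS API requests


def monthly_kms_cost(num_keys: int, requests_per_month: int) -> float:
    """Estimate monthly KMS cost from key count and request volume."""
    request_cost = requests_per_month / 10_000 * USD_PER_10K_REQUESTS
    return num_keys * KEY_MONTHLY_USD + request_cost


# Hypothetical volume: 2 keys and 50 million KMS requests per month.
estimate = monthly_kms_cost(num_keys=2, requests_per_month=50_000_000)
print(f"${estimate:.2f}")  # $152.00
```

At this volume the request charge ($150) dwarfs the key charge ($2), which is why SSE-KMS is limited to the buckets that need it.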

When It Fits: Compliance requiring customer-managed keys (CMK), multi-account key sharing, fine-grained access control via key policies. Implemented for sensitive buckets (gold, quarantine).

Compliance escalation: If compliance requirements for the lower layers (bronze, silver, artifacts) prove higher than assumed, SSE-KMS may be adopted for all buckets to provide a single, uniform encryption model.

Glue Data Quality

Why Not:

Custom Validation Logic: Business rules (currency allowlist, timestamp parsing, loop prevention) are implemented in Python. Glue Data Quality is rule-based and may not cover all custom logic.
Quarantine Pattern: Failed rows go to S3 quarantine bucket with error metadata. Glue Data Quality focuses on metrics/scorecards, not custom quarantine workflows.
Circuit Breaker: Custom logic halts the pipeline when the same error occurs more than 100 times within an hour. Glue Data Quality does not provide this pattern.

When It Fits: Standardized checks (null rates, value ranges), automated quality monitoring, integration with Glue Catalog scorecards. Not a replacement for custom validation logic.
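
The circuit-breaker rule (more than 100 occurrences of the same error within an hour) can be sketched with a sliding window per error code. The class below is a hypothetical illustration, not the project's implementation; timestamps are passed in explicitly as seconds so the logic stays deterministic.

```python
from collections import defaultdict, deque


class ErrorCircuitBreaker:
    """Trips when one error code exceeds `threshold` occurrences within `window_seconds`."""

    def __init__(self, threshold: int = 100, window_seconds: int = 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # error code -> timestamps inside the window

    def record(self, error_code: str, timestamp: float) -> bool:
        """Record one error; return True if the pipeline should halt."""
        events = self._events[error_code]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


breaker = ErrorCircuitBreaker()
# 101 identical errors within the same hour: only the last one trips the breaker.
results = [breaker.record("INVALID_CURRENCY", t) for t in range(101)]
```

Tracking each error code separately means a mix of unrelated errors does not halt the pipeline, while a single systematic failure does.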

Glue Crawlers

Why Not:

Schema Is Known and Stable: Transaction schema (TransactionID, CustomerID, Amount, Currency, Timestamp) is fixed. Crawlers are for discovering unknown or evolving schemas.
Explicit Control: Manual table definition in Terraform provides version control, explicit schema documentation, prevents schema drift.
Performance: Manual definition is faster (no crawling overhead) and more predictable.

When It Fits: Unknown schemas, frequent schema evolution, auto-partition discovery. Not needed when schema is stable and known.

Glue Workflows

Why Not:

Multi-Service Orchestration: Pipeline involves S3, EventBridge, Step Functions, Glue, SNS, SQS. Step Functions orchestrates across services; Glue Workflows are Glue-native.
Better Error Handling: Step Functions provides retry logic, parallel execution, integration with non-Glue services.

When It Fits: Pure Glue pipelines (Crawlers → ETL → Catalog), simpler workflows, cost optimization. Step Functions is better for multi-service orchestration.
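
For context, this is roughly what retry and error handling look like in a Step Functions task that runs a Glue job, written here as a Python dict in Amazon States Language shape. The state names, job name, and retry values are assumptions for illustration and are not taken from the project's state machine.

```python
import json


def glue_task_state(job_name: str, next_state: str, failure_state: str) -> dict:
    """Build an Amazon States Language Task state that runs a Glue job with retries."""
    return {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for the Glue job to finish.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        # Any error left after retries is routed to a failure-handling state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": failure_state}],
        "Next": next_state,
    }


# Hypothetical job and state names.
state = glue_task_state("transactions-etl", "ValidationGate", "NotifyFailure")
print(json.dumps(state, indent=2))
```

The `Catch` target can be a state that publishes to SNS, which is the kind of cross-service step that Glue Workflows cannot express natively.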

ECR/ECS (Container Deployment)

Why Not:

Native Execution: Glue provides managed Spark runtime. Containers add complexity (building, pushing, managing) without benefit.
Workload Fit: ETL runs in Glue's optimized Spark environment. Containers are for custom runtimes or microservices, not managed ETL.

When It Fits: Custom Spark/Python versions, containerized microservices, multi-cloud portability. Not needed when Glue's native runtime suffices.

4. Summary: Architectural Fit Matrix

Tool	Fundamental Mismatch	When It Fits
Lambda	Batch ETL vs event-driven microservices	Pre/post-processing, orchestration glue
DynamoDB	Analytical queries vs point lookups	Operational metadata, real-time dashboards
Aurora	OLAP vs OLTP, append-only vs mutable	Reference data, transactional workloads
Secrets Manager	No secrets exist (IAM roles only)	External API keys, database credentials
KMS	SSE-S3 sufficient for standard buckets; CMK cost/complexity	Compliance requiring CMK, multi-account key sharing (implemented for gold and quarantine)
Glue Data Quality	Custom validation logic required	Standardized checks, quality scorecards
Glue Crawlers	Schema is known and stable	Unknown schemas, frequent evolution
Glue Workflows	Multi-service orchestration needed	Pure Glue pipelines only
ECR/ECS	Native Glue runtime sufficient	Custom runtimes, microservices

5. Fintech Compliance Positioning

Controls in place

CloudTrail: Management events are enabled for all infrastructure changes; data events are selectively enabled for sensitive buckets (gold, quarantine).
Encryption: SSE-S3 (AES-256) for bronze, silver, and artifacts; SSE-KMS with customer-managed keys for gold and quarantine.
Authentication: IAM role-based throughout, with OIDC federation for CI/CD.
S3: Versioning is enabled to preserve an audit trail of object changes; lifecycle policies (Glacier transitions) are configured for cost optimization.

Mature Fintech Positioning:

"A CloudTrail org/account trail is enabled for management events; S3 data events are selectively enabled for high-risk buckets/prefixes and retained per policy. Encryption uses SSE-S3 for standard data and SSE-KMS with customer-managed keys for sensitive financial data (gold layer, quarantine). All authentication is IAM role-based with OIDC federation for CI/CD, eliminating static credentials."
