Tooling & Controls

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document inventories the AWS services and tools used in the OLAP analytics data lake platform. It separates Implemented services from Recommended enhancements and gives the architectural rationale for each decision. Ohpen core banking (OLTP) is upstream and out of scope (see Scope & Assumptions).


1. Implemented Services (Production-Ready)

Data Lake & Query

| Service | Usage | Location | Rationale |
| --- | --- | --- | --- |
| S3 | Primary storage for Bronze/Silver/Gold/Quarantine/Artifacts layers | tasks/devops_cicd/infra/terraform/main.tf | Object storage optimized for analytical workloads (OLAP), scales to exabytes, cost-effective lifecycle policies |
| Glue | ETL jobs (Python Shell + Spark), Data Catalog database & table | tasks/devops_cicd/infra/terraform/main.tf | Managed Spark runtime for distributed batch processing, purpose-built for ETL workloads |
| Athena | Workgroup configured for SQL querying | tasks/devops_cicd/infra/terraform/main.tf | Serverless SQL engine for analytical queries, pay-per-query, seamless S3/Glue Catalog integration |

Orchestration & Triggers

| Service | Usage | Location | Rationale |
| --- | --- | --- | --- |
| Step Functions | ETL orchestration state machine | tasks/devops_cicd/infra/terraform/main.tf | Multi-service orchestration (S3, Glue, SNS), built-in retry/error handling, visual workflow. Provides run identity propagation for observability. |
| EventBridge | Scheduled ETL (2 AM UTC) + S3 event triggers | tasks/devops_cicd/infra/terraform/main.tf | Event-driven triggers, decoupled architecture, cost-effective scheduling |
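
The two trigger types in the table can be pictured as a schedule expression plus an event pattern. The following is a minimal sketch, not the project's Terraform: the bucket name `ohpen-bronze` is an assumption, and `matches` is a simplified stand-in for EventBridge pattern matching that only supports exact-value lists.

```python
# Sketch only: the bucket name below is an assumption, not taken from main.tf.
SCHEDULE_EXPRESSION = "cron(0 2 * * ? *)"  # EventBridge cron syntax: 02:00 UTC daily

S3_OBJECT_CREATED_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["ohpen-bronze"]}},  # hypothetical bucket name
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern leaf is a list of allowed exact values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "ohpen-bronze"}, "object": {"key": "raw/file.csv"}},
}
print(matches(S3_OBJECT_CREATED_PATTERN, sample_event))
```

Fields present in the event but absent from the pattern (here, `object`) are ignored, which is why one pattern can cover every object key in the bucket.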

Event Flow Patterns:

Notifications & Failure Handling

| Service | Usage | Location | Rationale |
| --- | --- | --- | --- |
| SNS | ETL failure + quarantine alerts | tasks/devops_cicd/infra/terraform/main.tf | Pub/sub messaging for alerts, integrates with SQS for decoupling |
| SQS + DLQ | Decoupling + poison message handling | tasks/devops_cicd/infra/terraform/main.tf | Asynchronous processing, dead-letter queue for failed messages, prevents message loss |
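
To illustrate what "poison message handling" means here, the sketch below imitates a queue with a dead-letter queue in memory. It is not the SQS API: `drain`, `handler`, and the `MAX_RECEIVE_COUNT` value of 3 are hypothetical, and the real redrive policy in Terraform may use a different count.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # assumed value; the real redrive policy may differ


def drain(messages, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Process messages; move any that fail max_receive_count times to a DLQ list."""
    queue = deque((msg, 0) for msg in messages)
    processed, dead_letters = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if receives >= max_receive_count:
                dead_letters.append(msg)  # poison message: stop retrying
            else:
                queue.append((msg, receives))  # goes back on the queue for a retry
    return processed, dead_letters


def handler(msg):
    """Hypothetical consumer that cannot parse one specific message."""
    if msg == "corrupt":
        raise ValueError("cannot parse message")


processed, dead_letters = drain(["a", "corrupt", "b"], handler)
print(processed, dead_letters)
```

The healthy messages are processed while the failing one is retried a bounded number of times and then set aside, so it neither blocks the queue nor gets lost.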

Observability & Audit

| Service | Usage | Location | Rationale |
| --- | --- | --- | --- |
| CloudWatch | Logs, metrics, alarms, custom metrics | tasks/devops_cicd/infra/terraform/main.tf | Centralized monitoring, custom ETL metrics (quarantine rate, job duration), alerting |
| CloudTrail | Audit trail (management events + selective data events) | tasks/devops_cicd/infra/terraform/main.tf | Compliance/audit logging, tracks infrastructure changes and sensitive data access |
| KMS | Customer-managed keys (CMK) for sensitive bucket encryption | tasks/devops_cicd/infra/terraform/main.tf | SSE-KMS encryption for gold and quarantine buckets, automatic key rotation, fine-grained access control |

CloudTrail Configuration:

  • Management Events: Enabled for all infrastructure changes (IAM, S3 policies, etc.)
  • Data Events: Selectively enabled for sensitive buckets only (ohpen-gold, ohpen-quarantine)
    • Rationale: Cost-aware approach; data events can generate millions of events for high-volume buckets
    • Fintech Positioning: "A CloudTrail org/account trail is enabled for management events; for high-risk buckets/prefixes, S3 data events are selectively enabled and retained per policy."
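
The selective data-event setup above can be sketched as a selector that keeps management events on and scopes S3 data events to the two sensitive buckets. This is an illustration under assumptions: the dictionary follows the shape of CloudTrail's basic event selectors, but `build_event_selectors` is a hypothetical helper and the actual Terraform may use advanced event selectors instead.

```python
SENSITIVE_BUCKETS = ["ohpen-gold", "ohpen-quarantine"]  # bucket names from this document


def build_event_selectors(buckets):
    """Return a list shaped like CloudTrail's basic EventSelectors."""
    return [
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # A trailing slash scopes data events to all objects in the bucket.
                    "Values": [f"arn:aws:s3:::{name}/" for name in buckets],
                }
            ],
        }
    ]


selectors = build_event_selectors(SENSITIVE_BUCKETS)
print(selectors[0]["DataResources"][0]["Values"])
```

Buckets left out of `Values` generate no data events, which is where the cost saving for high-volume buckets comes from.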

KMS Configuration:

  • Customer-Managed Keys: CMK with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine)
  • Key Policy: Configured to allow Glue and Athena services to decrypt/encrypt data
  • Encryption: SSE-S3 (AES256) for standard buckets (bronze, silver, artifacts); SSE-KMS for sensitive buckets (gold, quarantine)
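
As a rough illustration of the split above, the sketch below picks a default-encryption rule per bucket. The returned dictionary follows the shape of S3's server-side encryption configuration; `default_encryption`, the `ohpen-bronze` bucket name, and the example key ARN are assumptions for illustration only.

```python
KMS_BUCKETS = {"ohpen-gold", "ohpen-quarantine"}  # sensitive buckets named in this document


def default_encryption(bucket: str, kms_key_arn: str) -> dict:
    """Return SSE-KMS for sensitive buckets and SSE-S3 (AES256) for the rest."""
    if bucket in KMS_BUCKETS:
        rule = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_arn}
    else:
        rule = {"SSEAlgorithm": "AES256"}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


# Placeholder ARN, not a real key.
EXAMPLE_KEY_ARN = "arn:aws:kms:eu-west-1:111122223333:key/example"
print(default_encryption("ohpen-gold", EXAMPLE_KEY_ARN))
print(default_encryption("ohpen-bronze", EXAMPLE_KEY_ARN))  # assumed bucket name
```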

KMS Encryption Flow:

DevOps & Quality Gates

| Service | Usage | Location | Rationale |
| --- | --- | --- | --- |
| Terraform | Infrastructure as Code (IaC) | tasks/devops_cicd/infra/terraform/main.tf | Reproducible infrastructure, version control, state management |
| GitHub Actions | CI/CD pipeline (OIDC authentication) | .github/workflows/ (repo root) | Keyless authentication via OIDC, developer-friendly, Git-integrated |
| Docker + docker-compose | Test infrastructure (CI-focused) | tasks/data_ingestion_transformation/Dockerfile.test | Containerized test environment, local development, CI validation |
| ruff + pytest | Quality gates (linting, unit tests) | .github/workflows/ci.yml | Code quality enforcement, automated testing, prevents regressions |

Data Processing Libraries

| Tool | Usage | Location | Rationale |
| --- | --- | --- | --- |
| PySpark | Distributed ETL processing | tasks/data_ingestion_transformation/src/etl/*_spark.py | Handles large files (100GB+), distributed processing, optimized for batch workloads |
| pandas | Single-file/small batch processing | tasks/data_ingestion_transformation/src/etl/validation.py | Development/testing, small datasets, simpler debugging |
| pyarrow, boto3, s3fs/fsspec | S3 operations, Parquet I/O | tasks/data_ingestion_transformation/src/etl/ | Efficient S3 access, Parquet format support, optimized I/O |

Validation & Data Quality

Validation Tool: Python in ETL (pandas or PySpark)

Validation Rules Location:

  • Shared Configuration: tasks/data_ingestion_transformation/src/etl/config.py
    • ALLOWED_CURRENCIES, REQUIRED_COLUMNS, ERROR_* constants
  • Pandas Implementation: tasks/data_ingestion_transformation/src/etl/validation.py
  • PySpark Implementation: tasks/data_ingestion_transformation/src/etl/validation_spark.py

Validation Modes:

  • Schema Validation: Enforce columns + types (validate_schema())
  • Domain Validation: Null rules, ranges, referential checks (apply_validation_rules())
  • Quarantine: Write failed rows/files to s3://ohpen-quarantine/... + publish metrics + notifications

Both Spark + pandas paths share rule definitions via config.py, ensuring consistency.
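
The shared-rules idea can be sketched in plain Python. The function and constant names `validate_schema`, `apply_validation_rules`, `ALLOWED_CURRENCIES`, and `REQUIRED_COLUMNS` come from this document, but their signatures, the currency values, and the two `ERROR_*` names below are assumptions. The real implementations operate on pandas or PySpark DataFrames rather than lists of dicts.

```python
# Assumed values; the real constants live in config.py and may differ.
REQUIRED_COLUMNS = ["TransactionID", "CustomerID", "Amount", "Currency", "Timestamp"]
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}
ERROR_NULL_FIELD = "NULL_FIELD"  # hypothetical ERROR_* names
ERROR_BAD_CURRENCY = "INVALID_CURRENCY"


def validate_schema(rows):
    """Fail fast if any row lacks a required column."""
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")


def apply_validation_rules(rows):
    """Split rows into (valid, quarantined); quarantined rows carry an error code."""
    valid, quarantined = [], []
    for row in rows:
        if any(row[col] is None for col in REQUIRED_COLUMNS):
            quarantined.append({**row, "error": ERROR_NULL_FIELD})
        elif row["Currency"] not in ALLOWED_CURRENCIES:
            quarantined.append({**row, "error": ERROR_BAD_CURRENCY})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"TransactionID": "t1", "CustomerID": "c1", "Amount": 10.0,
     "Currency": "EUR", "Timestamp": "2026-01-01T00:00:00Z"},
    {"TransactionID": "t2", "CustomerID": "c2", "Amount": 5.0,
     "Currency": "XXX", "Timestamp": "2026-01-01T00:01:00Z"},
    {"TransactionID": "t3", "CustomerID": None, "Amount": 7.5,
     "Currency": "USD", "Timestamp": "2026-01-01T00:02:00Z"},
]
validate_schema(rows)
valid, quarantined = apply_validation_rules(rows)
print(len(valid), [row["error"] for row in quarantined])
```

Because both engines would read the same constants, a change to the currency allowlist takes effect in the Lambda and Glue paths at the same time.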


2. Recommended Enhancements

High Priority (Fintech Compliance)

| Service | Recommendation | Rationale | Implementation Effort |
| --- | --- | --- | --- |
| Secrets Manager / SSM Parameter Store | Store runtime config if external APIs/credentials needed | Centralized secret management, rotation, audit trail | Low (if needed) |

Encryption: SSE-S3 (AES-256) for bronze, silver, and artifacts buckets; SSE-KMS with customer-managed keys (CMK) for ohpen-gold and ohpen-quarantine.

Medium Priority (Operational Improvements)

| Service | Recommendation | Rationale | Implementation Effort |
| --- | --- | --- | --- |
| Lambda | Pre-processing (file validation), post-processing (promotion) | Event-driven automation, cost-effective for lightweight tasks | Medium |
| DynamoDB | ETL run metadata, data quality metrics, schema registry | Queryable metadata store, sub-millisecond lookups for operational dashboards | Medium |
| Glue Data Quality | Standardize checks + scorecards | Automated quality monitoring, integration with Glue Catalog | Medium |

Low Priority (Future Enhancements)

| Service | Recommendation | Rationale | Implementation Effort |
| --- | --- | --- | --- |
| Aurora/RDS | Reference data (FX rates, customer master) | Only if OLTP requirements emerge (real-time lookups, ACID transactions) | High |
| Glue Crawlers | Auto-detect partitions and schema changes | Only if schema evolves frequently or unknown schemas | Low |
| Iceberg | Schema evolution, time travel queries | Only if advanced schema evolution features needed | High |

3. Architectural Rationale: Why Certain Tools Are Not Used

AWS Lambda

Used For:

  • Pandas ETL: Small batches (< 10M rows or < 500MB). ingest_transactions.py runs on Lambda, with the same validation and quarantine logic as the Glue path.
  • Orchestration: promote_silver and read_run_summary (validation gate). Step Functions invokes these Lambdas.

Summary: Pandas path → Lambda; PySpark path → Glue. Lambda is used for both Pandas ETL and orchestration steps.

Why Not for PySpark ETL:

  • Workload: PySpark ETL (large files, 100GB+) requires distributed compute; Lambda is single-instance, 15-minute timeout.
  • Processing Model: PySpark on Glue handles multi-worker processing; Lambda runs Pandas only.
  • Cost: For small batches Lambda is cost-effective; for millions of rows daily, Glue's DPU-based pricing is more predictable.

When Lambda Is Used: Pandas ETL (small batches), validation gate, Silver promotion, orchestration tasks (< 15 min runtime).
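
The routing rule above (Pandas path on Lambda, PySpark path on Glue) could be expressed as a small size check. This is a sketch under assumptions: `choose_engine` and the returned labels are hypothetical, 500MB is taken as 500 MiB, and the document's "< 10M rows or < 500MB" is read conservatively here as requiring both limits to hold before choosing Lambda.

```python
MAX_LAMBDA_ROWS = 10_000_000  # "< 10M rows" from the text
MAX_LAMBDA_BYTES = 500 * 1024 * 1024  # "< 500MB", interpreted as 500 MiB (assumption)


def choose_engine(row_count: int, size_bytes: int) -> str:
    """Route small batches to the Pandas/Lambda path, everything else to Glue."""
    # Conservative reading (assumption): both limits must hold for the Lambda path.
    small = row_count < MAX_LAMBDA_ROWS and size_bytes < MAX_LAMBDA_BYTES
    return "lambda-pandas" if small else "glue-pyspark"


print(choose_engine(row_count=1_000_000, size_bytes=50 * 1024 * 1024))
print(choose_engine(row_count=50_000_000, size_bytes=100 * 1024**3))
```

Requiring both limits keeps a file with few rows but a very large byte size off Lambda, where memory and the 15-minute timeout would be the constraint.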


DynamoDB

Why Not:

  • Access Pattern Mismatch: DynamoDB targets point lookups (get item by key). This workload uses batch analytical queries (scan large datasets, aggregations, time-range filters).
  • Data Model: Append-only, immutable, time-partitioned data. DynamoDB is optimized for mutable, key-value operations.
  • Query Pattern: Athena queries scan partitions, aggregate, and join. DynamoDB does not support SQL aggregations or analytical queries.

When It Fits: Operational metadata (run status, job tracking), real-time dashboards, point lookups. Not for analytical data storage.


Aurora/RDS

Why Not:

  • OLAP vs OLTP: Aurora is row-based, transactional (OLTP). This is analytical (OLAP) — columnar storage, aggregations, time-series analysis.
  • Storage Model: Append-only, immutable, partitioned by time. Relational databases are optimized for mutable, normalized data with ACID transactions.
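  • Scale: S3 scales to exabytes; relational storage has hard ceilings (roughly 64 TiB per RDS instance and 128 TiB or more per Aurora cluster volume, depending on engine and version) and adds scaling complexity.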
  • Cost: Aurora charges per instance-hour + storage + I/O. S3 + Athena is pay-per-query, no infrastructure to manage.

When It Fits: Reference data (FX rates, customer master), operational metadata, transactional workloads. Not for analytical data lake storage.


Secrets Manager / SSM Parameter Store

Why Not:

  • No Secrets Exist: All authentication is IAM role-based (Glue service role, Step Functions execution role). No database passwords, API keys, or credentials to store.
  • OIDC Authentication: GitHub Actions uses OIDC federation (temporary credentials), not static access keys.
  • Configuration vs Secrets: Bucket names, prefixes, job names are configuration (Terraform variables), not secrets.

When It Fits: External API keys, database credentials, third-party service tokens. Not needed when everything is IAM role-based.


KMS (Customer-Managed Keys)

Implementation summary

  • SSE-KMS: Customer-managed keys (CMK) with automatic rotation enabled for sensitive buckets (ohpen-gold, ohpen-quarantine).
  • Key policy: Configured to allow Glue and Athena services to decrypt and encrypt data.
  • Cost: SSE-S3 is used for standard buckets (bronze, silver, artifacts) to manage cost.

Why Not For All Buckets:

  • Cost/Complexity: KMS adds $1/month per key + $0.03 per 10K requests. For high-volume ETL (millions of S3 operations), this adds significant cost.
  • Operational Overhead: Key rotation, key policies, CloudTrail logging for key usage. SSE-S3 requires no key management.
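
The cost bullet can be turned into a quick estimate. The sketch below uses only the two prices quoted above ($1 per key per month and $0.03 per 10K requests); `monthly_kms_cost` is a hypothetical helper, and the request volume in the example is illustrative rather than measured.

```python
KEY_MONTHLY_USD = 1.00  # per customer-managed key, from the text
USD_PER_10K_REQUESTS = 0.03  # from the text


def monthly_kms_cost(keys: int, requests: int) -> float:
    """Estimate monthly KMS cost: flat key fee plus per-request charges."""
    return round(keys * KEY_MONTHLY_USD + (requests / 10_000) * USD_PER_10K_REQUESTS, 2)


# Illustrative volume: one key and 10 million KMS requests in a month.
print(monthly_kms_cost(keys=1, requests=10_000_000))
```

At that volume the request charges (10,000,000 / 10,000 × $0.03 = $30) dominate the $1 key fee, which is why request-heavy ETL on every bucket raises the bill faster than the number of keys does.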

When It Fits: Compliance requiring customer-managed keys (CMK), multi-account key sharing, fine-grained access control via key policies. Implemented for sensitive buckets (gold, quarantine).

Compliance escalation: If compliance requirements for the lower layers (bronze, silver, artifacts) prove higher than assumed, SSE-KMS may be adopted for all buckets to provide a single, uniform encryption model.


Glue Data Quality

Why Not:

  • Custom Validation Logic: Business rules (currency allowlist, timestamp parsing, loop prevention) are implemented in Python. Glue Data Quality is rule-based and may not cover all custom logic.
  • Quarantine Pattern: Failed rows go to S3 quarantine bucket with error metadata. Glue Data Quality focuses on metrics/scorecards, not custom quarantine workflows.
  • Circuit Breaker: Custom logic halts the pipeline when more than 100 identical errors occur within an hour. Glue Data Quality does not provide this pattern.

When It Fits: Standardized checks (null rates, value ranges), automated quality monitoring, integration with Glue Catalog scorecards. Not a replacement for custom validation logic.
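
One way to picture the circuit breaker described above is a sliding-window counter per error code. This is a minimal sketch, not the project's implementation: the `CircuitBreaker` class and its method names are hypothetical, and only the threshold (more than 100 identical errors) and the one-hour window come from the text.

```python
from collections import defaultdict, deque

THRESHOLD = 100  # from the text: more than 100 identical errors
WINDOW_SECONDS = 3600  # per hour


class CircuitBreaker:
    """Tracks error timestamps per error code within a sliding time window."""

    def __init__(self, threshold=THRESHOLD, window_seconds=WINDOW_SECONDS):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # error code -> timestamps in window

    def record(self, error_code: str, now: float) -> bool:
        """Record one error; return True if the pipeline should halt."""
        stamps = self._seen[error_code]
        stamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while stamps and stamps[0] <= now - self.window_seconds:
            stamps.popleft()
        return len(stamps) > self.threshold


# Small threshold and window so the behaviour is easy to follow.
breaker = CircuitBreaker(threshold=3, window_seconds=60)
results = [breaker.record("INVALID_CURRENCY", t) for t in (0, 1, 2, 3)]
print(results)
```

Counting per error code means a burst of one systematic failure trips the breaker, while the same number of unrelated errors spread across codes does not.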


Glue Crawlers

Why Not:

  • Schema Is Known and Stable: Transaction schema (TransactionID, CustomerID, Amount, Currency, Timestamp) is fixed. Crawlers are for discovering unknown or evolving schemas.
  • Explicit Control: Manual table definition in Terraform provides version control, explicit schema documentation, prevents schema drift.
  • Performance: Manual definition is faster (no crawling overhead) and more predictable.

When It Fits: Unknown schemas, frequent schema evolution, auto-partition discovery. Not needed when schema is stable and known.
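
With an explicit schema, drift can be detected by comparing incoming columns against the declared ones. The sketch below is illustrative only: the column names come from this section, while the column types and the `schema_drift` helper are assumptions.

```python
EXPECTED_SCHEMA = {  # column names from the text; types are assumptions
    "TransactionID": "string",
    "CustomerID": "string",
    "Amount": "decimal(18,2)",
    "Currency": "string",
    "Timestamp": "timestamp",
}


def schema_drift(actual: dict) -> dict:
    """Report columns that are missing, unexpected, or have a changed type."""
    shared = EXPECTED_SCHEMA.keys() & actual.keys()
    return {
        "missing": sorted(set(EXPECTED_SCHEMA) - set(actual)),
        "unexpected": sorted(set(actual) - set(EXPECTED_SCHEMA)),
        "type_changed": sorted(c for c in shared if EXPECTED_SCHEMA[c] != actual[c]),
    }


incoming = {
    "TransactionID": "string",
    "CustomerID": "string",
    "Amount": "double",
    "Currency": "string",
    "Channel": "string",
}
print(schema_drift(incoming))
```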


Glue Workflows

Why Not:

  • Multi-Service Orchestration: Pipeline involves S3, EventBridge, Step Functions, Glue, SNS, SQS. Step Functions orchestrates across services; Glue Workflows are Glue-native.
  • Better Error Handling: Step Functions provides retry logic, parallel execution, integration with non-Glue services.

When It Fits: Pure Glue pipelines (Crawlers → ETL → Catalog), simpler workflows, cost optimization. Step Functions is better for multi-service orchestration.
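
The retry and error-handling behaviour attributed to Step Functions can be sketched as a single task state in Amazon States Language, built here as a Python dictionary. The structure uses standard ASL fields and the Glue `startJobRun.sync` integration; the job name, the `--run_id` argument, the retry numbers, and the `NotifyFailure` state name are assumptions, not values from the project's state machine.

```python
import json


def glue_task_state(job_name: str, failure_state: str) -> dict:
    """Build an ASL Task state that runs a Glue job with retry and a failure route."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {
            "JobName": job_name,
            # Pass the execution name to the job as a run identifier (assumed argument name).
            "Arguments": {"--run_id.$": "$$.Execution.Name"},
        },
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,  # assumed retry settings
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": failure_state}],
        "End": True,
    }


state = glue_task_state("transactions-etl", "NotifyFailure")  # hypothetical names
print(json.dumps(state, indent=2))
```

The `Catch` route is where a non-Glue step such as an SNS publish would sit, which is the kind of cross-service hand-off that Glue Workflows do not cover.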


ECR/ECS (Container Deployment)

Why Not:

  • Native Execution: Glue provides managed Spark runtime. Containers add complexity (building, pushing, managing) without benefit.
  • Workload Fit: ETL runs in Glue's optimized Spark environment. Containers are for custom runtimes or microservices, not managed ETL.

When It Fits: Custom Spark/Python versions, containerized microservices, multi-cloud portability. Not needed when Glue's native runtime suffices.


4. Summary: Architectural Fit Matrix

| Tool | Fundamental Mismatch | When It Fits |
| --- | --- | --- |
| Lambda | Batch ETL vs event-driven microservices | Pre/post-processing, orchestration glue |
| DynamoDB | Analytical queries vs point lookups | Operational metadata, real-time dashboards |
| Aurora | OLAP vs OLTP, append-only vs mutable | Reference data, transactional workloads |
| Secrets Manager | No secrets exist (IAM roles only) | External API keys, database credentials |
| KMS | SSE-S3 sufficient, cost/complexity | Compliance requiring CMK, multi-account |
| Glue Data Quality | Custom validation logic required | Standardized checks, quality scorecards |
| Glue Crawlers | Schema is known and stable | Unknown schemas, frequent evolution |
| Glue Workflows | Multi-service orchestration needed | Pure Glue pipelines only |
| ECR/ECS | Native Glue runtime sufficient | Custom runtimes, microservices |

5. Fintech Compliance Positioning

Controls in place

  • Audit logging: CloudTrail management events are enabled for all infrastructure changes; data events are selectively enabled for sensitive buckets (gold, quarantine).
  • Encryption: SSE-S3 (AES-256) for bronze, silver, and artifacts; SSE-KMS with customer-managed keys for gold and quarantine.
  • Authentication: IAM role-based authentication is used throughout; OIDC federation is used for CI/CD.
  • Retention: S3 versioning is enabled for audit trail; lifecycle policies are configured for cost optimization (Glacier transitions).

Mature Fintech Positioning:

"A CloudTrail org/account trail is enabled for management events; for high-risk buckets/prefixes, S3 data events are selectively enabled and retained per policy. Encryption uses SSE-S3 for standard data and SSE-KMS with customer-managed keys for sensitive financial data (gold layer, quarantine). All authentication is IAM role-based with OIDC federation for CI/CD, eliminating static credentials."


© 2026 Stephen Adei. Licensed under CC BY 4.0.