
CTO Communication: Financial Data Platform - Technical Feasibility & Architecture

Subject: Financial Data Platform - Technical Architecture & Build vs. Buy Analysis
To: CTO
From: Data Platform Team
Date: January 2026


Purpose

Requesting your approval of our technical approach for the financial transaction data platform. This covers architecture decisions, build vs. buy analysis, scalability considerations, and engineering delivery plan.


Build vs. Buy Analysis

Option A: Build (Custom ETL)

Approach

Build a custom serverless ETL platform on AWS (S3, Glue, Athena).

Pros

  • Full control: Custom validation rules, business logic, schema evolution
  • Cost-effective: Serverless architecture, pay-per-use (~$27.81/month vs. estimated $2,000-5,000/month for vendor solution)
  • Vendor lock-in: Minimal (AWS-native, portable patterns)
  • Compliance fit: Designed for financial audit requirements (immutability, reproducibility)
  • Team capability: AWS expertise available in-house

Cons

  • Engineering time: 2 FTE × 3 months (one-time build)
  • Maintenance: Ongoing platform evolution (estimated 0.2 FTE)

Cost (Year 1)

  • AWS infrastructure: ~$27.81/month ≈ $334/year
  • Engineering: 2 FTE × 3 months (one-time build cost)

Option B: Buy (Vendor Solution)

Approach

Use vendor-managed data platform solution (e.g., Databricks, Snowflake, or similar).

Pros

  • Faster time-to-market: 1-2 months vs. 3 months
  • Managed service: Less operational overhead

Cons

  • Cost: Estimated $2,000-5,000/month = $24,000-60,000/year (70-180x the build option's ~$334/year infrastructure run-rate)
  • Vendor lock-in: Proprietary formats, limited portability
  • Compliance fit: May not support immutable audit trail requirements
  • Customization: Limited ability to customize validation rules

Cost (Year 1)

  • Vendor License: $2,000-5,000/month = $24,000-60,000/year
  • Implementation: $10,000-20,000 (one-time)
  • Total Year 1: $34,000-80,000

Recommendation: Build (Option A)

Rationale

  1. Cost: Roughly 10x cheaper in Year 1 even after build labor, with the gap widening over 3 years
  2. Compliance: Custom architecture ensures audit requirements are met
  3. Team Growth: Builds internal capability for future data initiatives
  4. Flexibility: Can evolve with business needs

Technical Architecture

High-Level Design


Raw CSV (S3) → ETL (AWS Glue) → Validated Parquet (S3) → Analytics (Athena)
                    │
                    └──→ Quarantine (Invalid Data)

Key Architectural Decisions

  1. Bronze/Silver/Gold Medallion Architecture
  • Rationale: Industry-standard pattern for data lakes, clear separation of concerns
  • Bronze: Immutable raw data (audit trail)
  • Silver: Validated, analytics-ready data
  • Gold: Business contracts (reporting-ready)
  2. Run Isolation via run_id
  • Rationale: Enables safe backfills, prevents data corruption, full audit trail
  • Pattern: Each ETL run writes to a unique path (run_id=YYYYMMDDTHHMMSSZ)
  • Benefit: Can reprocess any historical period without risk
  3. Schema Versioning (schema_v)
  • Rationale: Enables schema evolution without breaking consumers
  • Pattern: All Silver/Gold paths include schema_v=v1/, schema_v=v2/, etc.
  • Benefit: Backward compatibility, safe migrations
  4. Serverless Architecture (AWS Glue + Athena)
  • Rationale: No infrastructure to manage, auto-scales, cost-effective
  • Alternative Considered: EMR clusters (rejected: higher cost, operational overhead)
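To make the run-isolation and schema-versioning patterns concrete, here is a minimal Python sketch of how a run's Silver-layer prefix could be composed. The bucket and dataset names are illustrative assumptions, not production values:

```python
from datetime import datetime, timezone

def make_run_id(now=None):
    """Unique, sortable run identifier in the run_id=YYYYMMDDTHHMMSSZ pattern."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")

def silver_prefix(bucket, dataset, schema_v, run_id, year, month):
    """Compose an isolated Silver-layer prefix; each run writes only to its own path."""
    return (f"s3://{bucket}/silver/{dataset}/schema_v={schema_v}/"
            f"run_id={run_id}/year={year:04d}/month={month:02d}/")

# Hypothetical example: reprocessing December 2025 on 15 January 2026
run_id = make_run_id(datetime(2026, 1, 15, 8, 30, 0, tzinfo=timezone.utc))
prefix = silver_prefix("finance-lake", "transactions", "v1", run_id, 2025, 12)
# → s3://finance-lake/silver/transactions/schema_v=v1/run_id=20260115T083000Z/year=2025/month=12/
```

Because the prefix embeds both run_id and schema_v, a failed or repeated backfill can never overwrite a previous run's output.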

Technology Stack

Component     | Technology                        | Rationale
Storage       | AWS S3                            | Object storage, cost-effective, scalable
ETL Engine    | AWS Glue (Spark job, version 4.0) | Serverless, distributed processing, scales automatically; PySpark recommended for production, Pandas (Python Shell) for development/testing
Data Format   | Parquet (Snappy)                  | Columnar, compressed, optimized for analytics
Query Engine  | Amazon Athena                     | Serverless SQL, no infrastructure, pay-per-query
Orchestration | AWS Step Functions                | Event-driven, serverless workflow
IaC           | Terraform                         | Version-controlled infrastructure, reproducible

Scalability & Performance

Current Scale

  • Transaction Volume: ~1.5M transactions/month
  • Data Size: ~500MB/month raw CSV → ~50MB/month Parquet (10x compression)
  • Query Pattern: Monthly reporting, ad-hoc analytics

Scalability Design

Storage

  • S3 scales to exabytes (no limit)
  • Partitioning by year/month enables efficient queries
  • Lifecycle policies move old data to Infrequent Access (cost optimization)
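As a sketch of the lifecycle-policy idea, the rule below transitions Bronze objects to S3 Standard-IA. The 90-day cutoff and `bronze/` prefix are assumptions; in practice the policy would be applied via Terraform or boto3's `put_bucket_lifecycle_configuration`:

```python
def bronze_lifecycle_rule(prefix="bronze/", ia_after_days=90):
    """Build an S3 lifecycle configuration that moves cold raw data to
    STANDARD_IA after the given number of days (values are illustrative)."""
    return {
        "Rules": [{
            "ID": f"{prefix.rstrip('/')}-to-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            ],
        }]
    }

policy = bronze_lifecycle_rule()
```

The transition only changes storage class; the Bronze data itself stays immutable and queryable, so the audit trail is unaffected.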

Compute

  • AWS Glue auto-scales (no manual capacity planning)
  • Production runs use Glue Spark (PySpark) for distributed processing, which absorbs 10x growth without redesign
  • Glue Python Shell (Pandas) jobs cover development/testing at current volume

Query Performance

  • Athena partition pruning (only scans relevant partitions)
  • Parquet columnar format (only reads needed columns)
  • Estimated query time: < 30 seconds for month-end reports (100M row table) - Note: Estimated based on partition pruning analysis. Actual performance depends on query complexity and data distribution.

Scalability Test

  • Tested with 100M row simulation (Task 3 SQL design)
  • Partition pruning reduces scan by 95%+
  • Architecture supports 10x growth without redesign
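A back-of-the-envelope illustration of partition pruning: with three years of monthly partitions, a month-end query touches 1 of 36 partitions, a ~97% scan reduction, consistent with the 95%+ figure above. The partition counts are illustrative:

```python
def prune_partitions(partitions, year, month):
    """Return only the (year, month) partitions a filtered query must scan."""
    return [p for p in partitions if p == (year, month)]

# 36 monthly partitions across three years; a month-end report needs one
all_parts = [(y, m) for y in (2024, 2025, 2026) for m in range(1, 13)]
scanned = prune_partitions(all_parts, 2025, 12)
reduction = 1 - len(scanned) / len(all_parts)  # 1 - 1/36 ≈ 0.97 scan reduction
```

Athena performs this pruning automatically when the WHERE clause filters on the partition columns (year, month), so both query time and per-query cost scale with the slice queried, not the table size.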

Reliability & Resilience

Failure Modes & Handling

  1. ETL Job Failure
  • Detection: CloudWatch alarms (job failure, missing _SUCCESS marker)
  • Recovery: Rerun with a new run_id (safe, no data loss)
  • Impact: Delayed reporting (same-day → next-day)
  2. Data Quality Issues
  • Detection: Quarantine-rate alerts (> 5% threshold)
  • Recovery: Investigate quarantine, fix source data, backfill
  • Impact: Some records delayed until resolution
  3. Infrastructure Failure (AWS)
  • Detection: CloudWatch alarms, S3/Glue service health
  • Recovery: Handled by AWS (multi-AZ, 99.99% SLA)
  • Impact: Rare; AWS recovers automatically
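The first two detection checks reduce to simple predicates. This is a minimal sketch; the key names, counts, and 5% threshold below are illustrative:

```python
def run_succeeded(object_keys, run_prefix):
    """A run is complete only if its _SUCCESS marker exists under its prefix;
    otherwise the run is rerun with a fresh run_id."""
    return f"{run_prefix}_SUCCESS" in set(object_keys)

def quarantine_alert(valid_count, quarantined_count, threshold=0.05):
    """Alert when quarantined records exceed the threshold share of the run."""
    total = valid_count + quarantined_count
    return total > 0 and quarantined_count / total > threshold

# Hypothetical run: data written but no _SUCCESS marker, and a 6% quarantine rate
keys = ["silver/run_id=20260115T083000Z/part-0000.parquet"]
ok = run_succeeded(keys, "silver/run_id=20260115T083000Z/")       # False → rerun
alert = quarantine_alert(valid_count=94_000, quarantined_count=6_000)  # 0.06 > 0.05 → True
```

In production the same logic would sit behind CloudWatch alarms rather than inline checks, with the object listing coming from S3.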

Data Loss Prevention

  • Immutable Bronze: Raw data never overwritten
  • Run Isolation: Failed runs don't corrupt previous runs
  • S3 Versioning: Can recover deleted files (if enabled)
  • Audit Trail: Full lineage via run_id + CloudWatch logs

Reliability Target: 99.5% uptime (batch processing, not real-time)


Developer Impact

Engineering Team Requirements

Build Phase (3 months)

  • 2 FTE Data Engineers: ETL development, testing, deployment
  • Skills Required: Python, AWS (Glue, S3, Athena), SQL, Terraform
  • Current team capability: Required skills are available in-house

Ongoing (post-build)

  • 0.2 FTE Data Engineer: Platform evolution, schema changes, backfills
  • 0.1 FTE DevOps: Infrastructure updates, monitoring

Developer Experience

Positive

  • Infrastructure as Code (Terraform) — reproducible, version-controlled
  • CI/CD pipeline (GitHub Actions) — automated testing, safe deployments
  • Local testing support (MinIO for S3 simulation)
  • Comprehensive documentation (architecture, runbooks)

Challenges

  • AWS learning curve (mitigated by existing AWS expertise)
  • Schema evolution requires coordination (governance workflows in place)

Delivery Plan

Phase 1: MVP (Month 1-2)

Deliverables

  • Bronze layer (raw data ingestion)
  • Silver layer (validated Parquet)
  • Basic ETL pipeline
  • CloudWatch monitoring

Success Criteria

  • Process 1 month of historical data
  • Quarantine invalid records
  • Basic reporting queries work

Phase 2: Production (Month 3)

Deliverables

  • Gold layer structure design (Task 2) + SQL aggregation pattern (Task 3)
  • Schema versioning
  • Governance workflows
  • Production deployment

Success Criteria

  • Automated monthly reporting
  • Full audit trail
  • Schema evolution process

Phase 3: Optimization (Month 4+)

Deliverables

  • Performance tuning
  • Cost optimization
  • Advanced analytics

Technical Risks & Mitigation

Risk                        | Likelihood | Impact | Mitigation
AWS Service Limits          | Low        | Medium | Monitor usage, request limit increases proactively
Schema Evolution Complexity | Medium     | Medium | Governance workflows, versioning strategy
Performance at Scale        | Low        | High   | Partitioning, tested with 100M rows
Team Learning Curve         | Low        | Low    | Existing AWS expertise, comprehensive docs

Overall technical risk: Low — Architecture is proven; team has required skills.


Approval Request

We request your approval for:

  1. Technical Approach: Build (custom ETL) rather than buy (vendor solution)
  2. Architecture: Approve Bronze/Silver/Gold medallion architecture
  3. Technology Stack: Approve AWS serverless stack (Glue, S3, Athena)
  4. Team Commitment: Approve 2 FTE × 3 months for build phase
  5. Delivery Timeline: Approve 3-month delivery plan

Next Steps

If approved:

  1. Week 1: Kickoff, architecture deep-dive
  2. Week 2-4: Bronze/Silver layer development
  3. Week 5-8: Gold layer structure design (Task 2) + SQL aggregation pattern (Task 3) + governance
  4. Week 9-12: Testing, deployment, production rollout

Weekly Status Updates: Every Friday, technical progress + blockers

Decision needed by: February 1, 2026, to enable Q1 2026 build phase kickoff.


Questions or Technical Concerns

Please reach out to discuss:

  • Architecture alternatives
  • Scalability concerns
  • Technology choices
  • Team capacity

Best regards,

[Name]
Data Engineering Leadership


Attachments

  • Task 5 Documentation
  • Task Documentation
  • Technical Documentation

© 2026 Stephen Adei · CC BY 4.0