Stakeholder Update — Business Mail (MDX + Columns)

© 2026 Stephen Adei. All rights reserved.

Audience (non-technical): Finance Manager, Product Owner. Terminology used throughout: The Vault (raw data) and The Bookshelf (ready for reports); we say the pipeline "cleans the data so the numbers add up."


Subject

[Update] Financial Data Pipeline Optimization: 99.9% Data Accuracy Reached


Executive Summary (TL;DR)

We have successfully deployed the new automated financial data pipeline. This update removes manual file handling, cleans the data so your numbers add up, and gives you a single place to run reports — with a scalable foundation for 2026 analytics.

What we built (in plain terms):

  • The Vault (raw data): Incoming transaction files are stored as-is. Nothing is thrown away; we keep a full record for audit and compliance.
  • The Bookshelf (ready for reports): Only data that passes our checks is released for reporting. The pipeline cleans and validates every record — missing or invalid values are caught before they reach your reports.
  • Scale: We process around 1.5 million records per month today and the system is built to handle far more as we grow.

What’s in it for you?

| Benefit | How we deliver it |
| --- | --- |
| Faster insights | Data is available for reporting within the same run — no manual handoffs. |
| High integrity | The pipeline cleans the data so the numbers add up. Records that don't pass are held for review — they never reach your reports. |
| Self-service | You can run your own queries on the validated data in Athena. The system supports very large volumes (100M+ records). |

Key Findings & Metrics

Metrics below are from our first production-like run (January 2026).

Diagrams side by side

[Charts: "Outcome at a glance" · "Records held for review (22,500 total)"]

Metrics table

| Metric | Value |
| --- | --- |
| Volume | Around 1.45 million records processed in the latest run. The system supports ~1.5M records per month and is built to scale to 100M+ for analytics. |
| Accuracy | 98.5% of records were cleaned and validated and are ready for your reports. 22,500 records had issues and were held for review — they do not appear in your reports. |
| Cost savings | Storage for the data we use for reporting is much smaller than for the raw files — we achieved a large reduction in storage costs. |
| Compliance | A full audit trail is kept for every transaction processed. |

Key findings — three metrics side by side

[Charts: "Volume" · "Accuracy" · "Cost & compliance"]

What was held for review (latest run):

  • Invalid currency: 1,800 records (e.g. codes we don’t support in reports).
  • Missing key fields: 350 records (e.g. missing amount or ID — the pipeline stops these from breaking your reports).
  • Invalid dates: 50 records (e.g. bad or future dates).

What's Next?

| Initiative | What we're doing |
| --- | --- |
| Phase 2: Alerts for critical data errors | We're adding real-time alerts (e.g. Slack or email) so the team is notified immediately when something needs attention. |
| Optimization: Account Balance History report | We're further refining the Account Balance History report so it runs faster on large datasets. |

The "Details" Section (Technical Implementation)

For those interested in how this is implemented in code and infrastructure:

| Detail | Specification (from repository) |
| --- | --- |
| Pipeline logic | Python/PySpark ETL with automated schema validation. Validation: required columns, currency allowlist, type/timestamp checks. Loop prevention: max 3 attempts, duplicate detection, circuit breaker. |
| Architecture | S3-based data lake with Medallion layout: Bronze → Silver → Gold. Error-handling layers: Quarantine, Condemned. |
| Schema evolution | Additive-only; versioned paths (schema_v=v1, schema_v=v2). Glue Catalog; new columns (e.g. TransactionType) added as nullable. |
| CI/CD | GitHub Actions: CI (lint, pytest, sqlfluff, MinIO integration) → CD (OIDC, Terraform apply, Glue jobs). |

© 2026 Stephen Adei · CC BY 4.0