
Data Staff Communication: Financial Data Platform - Implementation Details

Subject

Financial Data Platform - Communication

To: Data Engineering Team, Data Analysts, Data Scientists

From

Data Platform Team

Date

January 2026


Purpose

This document outlines the specific implementation tasks, data definitions, pipeline requirements, and ownership for the financial transaction data platform. Use this as your execution guide.


Project Overview

Goal: Build an automated data pipeline for financial transaction analytics

Project Timeline

Team


Data Definitions & Requirements

Source Data

Format

Location

Schema (Required Columns)

  • TransactionID (string, required, unique identifier)
  • CustomerID (string, required, customer identifier)
  • TransactionAmount (decimal, required, can be negative for withdrawals)
  • Currency (string, required, ISO-4217 code: EUR, USD, GBP, JPY, AUD, CAD, CHF, CNY, HKD, NZD)
  • TransactionTimestamp (timestamp, required, ISO-8601 format, UTC)

Data Quality Rules

  • No NULLs in required columns
  • Currency must be in allowlist
  • TransactionAmount must be numeric
  • TransactionTimestamp must be parseable
  • Duplicate CustomerID + TransactionTimestamp (date part) combinations are flagged and quarantined
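A minimal pandas sketch of how these rules might be applied (column names come from the schema above; the quarantine split and `validation_error` column mirror the Quarantine Layer spec; the helper name and error codes are illustrative, not the final implementation):

```python
import pandas as pd

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP", "JPY", "AUD", "CAD", "CHF", "CNY", "HKD", "NZD"}
REQUIRED = ["TransactionID", "CustomerID", "TransactionAmount", "Currency", "TransactionTimestamp"]

def validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows into (valid, quarantined); quarantined rows carry a validation_error."""
    errors = pd.Series("", index=df.index)

    # No NULLs in required columns
    errors[df[REQUIRED].isna().any(axis=1)] += "null_required;"

    # Currency must be in the allowlist
    errors[~df["Currency"].isin(ALLOWED_CURRENCIES)] += "bad_currency;"

    # TransactionAmount must be numeric (negative is allowed for withdrawals)
    amount = pd.to_numeric(df["TransactionAmount"], errors="coerce")
    errors[amount.isna() & df["TransactionAmount"].notna()] += "non_numeric_amount;"

    # TransactionTimestamp must parse as ISO-8601, UTC
    ts = pd.to_datetime(df["TransactionTimestamp"], errors="coerce", utc=True)
    errors[ts.isna() & df["TransactionTimestamp"].notna()] += "bad_timestamp;"

    # Duplicate CustomerID + transaction date combinations are quarantined, never dropped
    dup = df.assign(_d=ts.dt.date).duplicated(["CustomerID", "_d"], keep=False)
    errors[dup & ts.notna()] += "duplicate_customer_date;"

    bad = errors != ""
    quarantined = df[bad].assign(validation_error=errors[bad].str.rstrip(";"))
    return df[~bad].copy(), quarantined
```

Note that both rows of a duplicate pair are quarantined (keep=False), so the reviewer sees the full collision.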

Target Data

Bronze Layer (Raw)

  • Location: s3://ohpen-bronze/bronze/mortgages/transactions/ingest_date=YYYY-MM-DD/run_id=.../file.csv
  • Format: Original CSV (verbatim copy, immutable)
  • Owner: Data Platform Team
  • Purpose: Audit trail, reprocessing source

Silver Layer (Validated)

  • Location: s3://ohpen-silver/silver/mortgages/transactions/year=YYYY/month=MM/schema_v=v1/run_id=.../part-*.parquet
  • Format: Parquet (Snappy compression), partitioned by year/month
  • Schema: Validated columns + partition columns (year, month)
  • Owner: Domain Teams (validation rules), Data Platform Team (execution)
  • Purpose: Analytics-ready data

Gold Layer (Business Reporting)

  • Location: s3://ohpen-gold/gold/finance/account_balances/schema_v=v1/current/as_of_month=YYYY-MM/part-*.parquet
  • Format: Parquet, partitioned by as_of_month
  • Schema: Business-defined metrics (month-end balances)
  • Owner: Business/Finance (schema), Data Platform Team (execution)
  • Purpose: Reporting, dashboards

Quarantine Layer

  • Location: s3://ohpen-quarantine/quarantine/mortgages/transactions/ingest_date=YYYY-MM-DD/run_id=.../invalid_rows.parquet
  • Format: Parquet with validation_error column
  • Owner: Data Quality Team (review), Data Platform Team (execution)
  • Purpose: Invalid data audit trail
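The four layer locations above follow one set of key conventions (Hive-style `key=value` partitions, run isolation via `run_id`). A small sketch of key builders enforcing those conventions, assuming bucket names are handled separately (these helper names are illustrative):

```python
from datetime import date

def bronze_key(ingest_date: date, run_id: str, filename: str) -> str:
    """Key inside s3://ohpen-bronze for the verbatim raw copy."""
    return (f"bronze/mortgages/transactions/"
            f"ingest_date={ingest_date:%Y-%m-%d}/run_id={run_id}/{filename}")

def silver_prefix(year: int, month: int, run_id: str, schema_v: str = "v1") -> str:
    """Prefix inside s3://ohpen-silver for one run's partitioned Parquet output."""
    return (f"silver/mortgages/transactions/"
            f"year={year}/month={month:02d}/schema_v={schema_v}/run_id={run_id}/")

def quarantine_key(ingest_date: date, run_id: str) -> str:
    """Key inside s3://ohpen-quarantine for the run's invalid rows."""
    return (f"quarantine/mortgages/transactions/"
            f"ingest_date={ingest_date:%Y-%m-%d}/run_id={run_id}/invalid_rows.parquet")
```

Centralizing path construction keeps the Bronze/Silver/Quarantine layouts consistent across the ETL script, tests, and Terraform.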

Pipeline Implementation Tasks

Task 1: ETL Pipeline (Bronze → Silver)

Owner: Data Engineer #1

Task 1 Timeline

Deliverables

  • Python ETL script (ingest_transactions.py)
  • Validation logic (null checks, currency, amount, timestamp, duplicates)
  • Parquet write with partitioning (year/month)
  • Quarantine write with error details
  • _SUCCESS marker with metrics JSON
  • CloudWatch metrics publishing
  • Unit tests (test_validator.py)
  • Integration tests (test_integration.py)
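One deliverable worth pinning down early is the `_SUCCESS` marker with metrics JSON. A sketch of its write logic, shown against a local path for simplicity (in production the same bytes would be uploaded to the run's Silver prefix via boto3; field names here are an assumption, not a fixed contract):

```python
import json
import time
from pathlib import Path

def write_success_marker(out_dir: Path, input_rows: int, valid_rows: int,
                         quarantined_rows: int, duration_seconds: float) -> dict:
    """Write a _SUCCESS marker whose body is the run's metrics JSON."""
    metrics = {
        "input_rows": input_rows,
        "valid_rows": valid_rows,
        "quarantined_rows": quarantined_rows,
        # guard against division by zero on an empty input file
        "quarantine_rate_pct": round(100.0 * quarantined_rows / max(input_rows, 1), 2),
        "duration_seconds": duration_seconds,
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (out_dir / "_SUCCESS").write_text(json.dumps(metrics, indent=2))
    return metrics
```

Downstream jobs can then gate on the marker's presence and read the same numbers that get published to CloudWatch.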

Technical Specs

  • Language: Python 3.10+
  • Libraries: pandas, pyarrow, boto3, s3fs
  • AWS Services: S3, Glue, CloudWatch
  • Code Location: ETL Implementation

Acceptance Criteria

  • Processes 1.5M rows in < 10 minutes
  • Quarantines invalid rows (never drops)
  • Writes partitioned Parquet correctly
  • _SUCCESS marker present after successful run
  • CloudWatch metrics published

Task 2: Data Lake Architecture

Owner: Data Engineer #2

Task 2 Timeline

Deliverables (2)

  • S3 bucket structure documentation
  • Naming conventions document
  • Partitioning strategy
  • Schema versioning approach
  • _LATEST.json pointer mechanism
  • current/ prefix promotion logic
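The `_LATEST.json` pointer and `current/` promotion are listed as deliverables without a spec, so the following is one plausible shape, not the final design. The `runs/` staging prefix is an assumption; `store` stands in for S3 (key → bytes), where the real implementation would use boto3 `copy_object` / `put_object`:

```python
import json

def promote_run(store: dict, gold_prefix: str, run_id: str, as_of_month: str) -> None:
    """Promote a completed run: copy its output under current/ and update _LATEST.json."""
    run_prefix = f"{gold_prefix}/runs/run_id={run_id}/as_of_month={as_of_month}/"
    current_prefix = f"{gold_prefix}/current/as_of_month={as_of_month}/"

    # copy the run's output under the stable current/ prefix consumers query
    for key in [k for k in store if k.startswith(run_prefix)]:
        store[current_prefix + key[len(run_prefix):]] = store[key]

    # pointer written last: readers resolve _LATEST.json before querying,
    # so a half-finished promotion is never visible through the pointer
    store[f"{gold_prefix}/_LATEST.json"] = json.dumps(
        {"run_id": run_id, "as_of_month": as_of_month, "prefix": current_prefix}
    ).encode()
```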

Technical Specs (2)

Acceptance Criteria (2)

  • Clear folder structure documented
  • Hive-style partitions (key=value)
  • Run isolation via run_id
  • Schema versioning (schema_v=v1)

Task 3: SQL Analytics (Silver → Gold Aggregation Pattern)

Owner: Data Analyst + Data Engineer #1

Scope: Architecture, SQL Breakdown.

Task 3 Timeline

Deliverables (3)

  • Month-end balance history SQL query (implements Silver → Gold aggregation)
  • Athena table definitions (Glue Catalog) for Silver layer
  • Query performance optimization
  • Test data + expected results
  • SQL unit tests

Technical Specs (3)

  • SQL File: Balance History Query
  • Engine: Amazon Athena (Presto/ANSI SQL)
  • Performance: < 30 seconds for 100M row table (with partition pruning) - Note: Estimated based on partition pruning analysis. Actual performance depends on query complexity and data distribution.

Acceptance Criteria (3)

  • Returns month-end balances for Q1 2024
  • Handles NULLs for months with no transactions
  • Partition pruning works (only scans Jan-Mar partitions)
  • Query runs in < 30 seconds
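The production query runs on Athena, but the aggregation logic can be sketched in pandas against a small frame to make the acceptance criteria concrete: months with no transactions stay NaN (the SQL NULL case) in the net column, while the month-end balance carries forward. Function and variable names are illustrative:

```python
import pandas as pd

def month_end_balances(tx: pd.DataFrame,
                       months: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (monthly_net, month_end_balance): one row per customer, one column per month."""
    period = tx["TransactionTimestamp"].dt.to_period("M")
    net = (tx.groupby([tx["CustomerID"], period])["TransactionAmount"]
             .sum()
             .unstack("TransactionTimestamp"))
    # months with no transactions remain NaN (SQL NULL)
    net = net.reindex(columns=pd.PeriodIndex(months, freq="M"))
    # month-end balance = running sum of net amounts, carried across empty months
    balance = net.fillna(0).cumsum(axis=1)
    return net, balance
```

The SQL equivalent gets the carry-forward from a window function over month partitions, which is also what makes partition pruning on year/month effective.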

Task 4: CI/CD & Infrastructure

Owner: Data Engineer #2

Task 4 Timeline

Deliverables (4)

  • GitHub Actions CI workflow (linting, tests)
  • Terraform infrastructure code
  • Deployment automation
  • Monitoring dashboards (CloudWatch)
  • Alert configuration

Technical Specs (4)

Acceptance Criteria (4)

  • CI runs on every PR (linting + tests)
  • Terraform provisions all AWS resources
  • Alerts trigger on job failures
  • Monitoring dashboards show key metrics

Task 2 (Continued): Gold Layer Architecture Design

Owner: Data Engineer #1 + Data Analyst

Scope: Architecture, SQL Breakdown, ETL Flow.

Task 2 Gold Layer Design

Deliverables (Gold Layer Structure)

  • Gold layer folder structure design
  • _LATEST.json pointer mechanism design
  • current/ prefix pattern design
  • Governance and ownership model
  • Schema versioning strategy

Acceptance Criteria (Gold Layer Design)

  • Gold layer structure documented
  • _LATEST.json pattern designed
  • current/ prefix pattern designed
  • Governance workflows defined
  • Task 3 SQL implements aggregation pattern

Metrics & Monitoring

Key Metrics (CloudWatch)

Volume Metrics

  • InputRows (Count): Total rows read from Bronze
  • ValidRows (Count): Rows written to Silver
  • QuarantinedRows (Count): Rows quarantined

Quality Metrics

  • QuarantineRate (Percent): QuarantinedRows / InputRows * 100
  • QuarantineByReason (Count by reason): Breakdown of validation errors

Performance Metrics

  • DurationSeconds (Seconds): ETL execution time
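A sketch of how these metrics could be assembled into the payload that `cloudwatch.put_metric_data(Namespace=..., MetricData=...)` expects. The metric names and units match the lists above; the function name is illustrative, and the actual boto3 call is left to the ETL script:

```python
def etl_metric_payload(input_rows: int, valid_rows: int, quarantined_rows: int,
                       duration_seconds: float) -> list[dict]:
    """Build the MetricData list for one ETL run."""
    counts = {
        "InputRows": input_rows,
        "ValidRows": valid_rows,
        "QuarantinedRows": quarantined_rows,
    }
    data = [{"MetricName": name, "Value": float(v), "Unit": "Count"}
            for name, v in counts.items()]
    # QuarantineRate = QuarantinedRows / InputRows * 100, guarded for empty input
    data.append({"MetricName": "QuarantineRate",
                 "Value": 100.0 * quarantined_rows / max(input_rows, 1),
                 "Unit": "Percent"})
    data.append({"MetricName": "DurationSeconds",
                 "Value": duration_seconds, "Unit": "Seconds"})
    return data
```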

Alert Thresholds

  • Job failure: Immediate alert (P1)
  • Quarantine rate > 5%: Alert within 4 hours (P2)
  • Missing partitions: Alert within 4 hours (P2)
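The quarantine-rate threshold could map onto a CloudWatch alarm roughly as follows. This builds the keyword arguments for `cloudwatch.put_metric_alarm`; the namespace and SNS topic are placeholders, and reading "within 4 hours" as four 1-hour evaluation periods is an assumption the team should confirm:

```python
def quarantine_rate_alarm(namespace: str, sns_topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm implementing the P2 rule."""
    return {
        "AlarmName": "etl-quarantine-rate-high",
        "Namespace": namespace,
        "MetricName": "QuarantineRate",
        "Statistic": "Average",
        "Period": 3600,              # 1-hour datapoints
        "EvaluationPeriods": 4,      # sustained breach over 4 hours (P2 window)
        "Threshold": 5.0,            # quarantine rate > 5%
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

In practice this would live in the Terraform code from Task 4 rather than a boto3 script; the parameter shape is the same either way.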

Monitoring Dashboards

CloudWatch Dashboard

  • ETL run status (success/failure)
  • Row counts (input/valid/quarantined)
  • Quarantine rate trend
  • Execution time trend

Ownership & Responsibilities

Data Platform Team

Responsibilities

  • ETL pipeline development & maintenance
  • Infrastructure provisioning (Terraform)
  • Monitoring & alerting setup
  • Run execution & troubleshooting
  • Schema implementation (after approval)

On-Call

Domain Teams

Responsibilities (2)

  • Validation rule definition (Silver layer)
  • Business metric definition (Gold layer - Task 2: Structure Design, Task 3: SQL Aggregation)
  • Schema change requests
  • Data quality review (quarantine investigation)

Contact

Data Quality Team

Responsibilities (3)

  • Quarantine review & resolution
  • Quality metric interpretation
  • Source data issue triage
  • Quality threshold monitoring

Contact (2)


Timeline & Milestones

Week      | Milestone                         | Owner                          | Deliverable
Week 1    | Architecture Design               | Data Engineer #2               | Architecture doc, Terraform stub
Week 2    | ETL Development Start             | Data Engineer #1               | ETL script skeleton, validation logic
Week 3    | Infrastructure Setup              | Data Engineer #2               | Terraform, CI/CD, S3 buckets
Week 4    | ETL Complete                      | Data Engineer #1               | Working ETL, tests, CloudWatch metrics
Week 5    | SQL Development                   | Data Analyst + DE #1           | SQL query, Athena tables
Week 6    | SQL Testing                       | Data Analyst                   | Test results, performance validation
Week 7    | Gold Layer Design + SQL Aggregation | Data Engineer #1 + Data Analyst | Task 2: Gold structure design, Task 3: SQL aggregation pattern
Week 8    | Integration Testing               | All                            | End-to-end test, production deployment
Week 9-12 | Production Rollout                | All                            | Go-live, monitoring, documentation

Tickets & Tracking

JIRA Epic

Sub-tasks

  • DATA-001-1: ETL Pipeline (Bronze → Silver)
  • DATA-001-2: Data Lake Architecture (incl. Gold layer structure design)
  • DATA-001-3: SQL Analytics (Silver → Gold aggregation pattern)
  • DATA-001-4: CI/CD & Infrastructure

Sprint Planning


Questions & Support

Technical Questions

Blockers

  • Escalate to Data Platform Lead
  • Daily standup: 10am, discuss blockers

Code Reviews

  • All PRs require 1 approval
  • Data Engineer #1 reviews DE #2's work, and vice versa

Next Steps

  1. This Week:

    • Review architecture document
    • Set up local development environment
    • Clone repo, install dependencies
    • Run existing tests to verify setup
  2. Next Week:

    • Start ETL development (Data Engineer #1)
    • Start infrastructure setup (Data Engineer #2)
    • Domain Teams: Define validation rules
  3. Standup:

    • Daily 10am standup (15 min)
    • Share progress, blockers, next steps

Best regards, [Name] Data Platform Lead


Resources


Task 5 Documentation

Task Documentation

Technical Documentation

© 2026 Stephen Adei. CC BY 4.0