Task 3: SQL Query Breakdown & Flow
Task Deliverables
This document addresses the required deliverable for Task 3:
✅ Deliverable: SQL Query for Account Balance History
Location:
- Production query:
balance_history_2024_q1.sql - Query breakdown: This document (SQL_BREAKDOWN.md)
- Detailed diagrams: SQL Complete Reference
Content:
- SQL query that shows account balance history at the end of each month for each account
- Covers first three months of 2024 (January, February, March)
- Handles accounts with no activity (returns NULL for months with no transactions)
- Optimized for large tables (100M+ records) with partition pruning
- Complete query flow explanation with visual diagrams
Query Output Format:
Acct month balance
ACC001 2024-01 326.00
ACC001 2024-02 326.00
ACC001 2024-03 326.00
ACC002 2024-01 null
ACC002 2024-02 -25.50
ACC002 2024-03 -125.50
This document provides a high-level overview of the balance_history_2024_q1.sql query, including visual diagrams and pseudocode. It breaks down the query from the most inner table working outwards, explaining each CTE and the final SELECT statement.
Note: For detailed, comprehensive diagrams showing all steps and decision points, see Appendix G: Complete SQL Query Diagrams.
Query Flow Overview
This simplified diagram shows the high-level query flow:
Detailed flow diagrams: See SQL Complete Reference.
Data Model (Entity Relationship)
This ER diagram shows the data relationships and schema structure:
Textual Flow:
transactions (base table)
↓
months (CTE 1 - innermost)
↓
unique_accounts (CTE 2)
↓
account_months (CTE 3)
↓
monthly_last_tx (CTE 4)
↓
Final SELECT (outermost)
High-Level Algorithm
The query follows a systematic approach to generate month-end balances:
FUNCTION GenerateMonthEndBalances(): RESULT_SET;
BEGIN
{ 1. Generate Month Spine }
months := ['2024-01-31', '2024-02-29', '2024-03-31'];
{ Hardcoded month-end dates for Q1 2024 }
{ 2. Get Unique Accounts }
unique_accounts := SELECT DISTINCT account_id
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
{ Extract all unique account IDs from Q1 2024 transactions }
{ 3. Generate Account×Month Spine (Cartesian Product) }
account_months := CROSS_JOIN(unique_accounts, months);
{ Create all combinations of (account_id, month_end) }
{ Result: N accounts × 3 months = complete spine }
{ 4. Find Last Transaction per Account per Month }
monthly_last_tx := SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS row_number
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
{ Window function identifies last transaction per account per month }
{ 5. Filter to Only Last Transaction }
last_tx_per_month := FILTER monthly_last_tx WHERE row_number = 1;
{ Keep only row_number = 1 (the last transaction) }
{ 6. Join Spine with Data }
result := SELECT
account_months.account_id AS acct,
last_tx_per_month.new_balance AS balance,
TO_CHAR(account_months.month_end, 'YYYY-MM') AS month
FROM account_months
LEFT JOIN last_tx_per_month
ON account_months.account_id = last_tx_per_month.account_id
AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start
ORDER BY account_months.account_id, account_months.month_end;
{ LEFT JOIN ensures all account×month combinations appear }
{ NULL balance for months with no transactions }
RETURN result;
END;
Query Objective
Generate month-end account balances for Q1 2024 (January, February, March), showing NULL for months with no transactions.
Key Design Decisions
- Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit
- Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans
- LEFT JOIN: Preserves all account×month combinations, showing NULL for months with no activity
- Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables
- Deterministic Ordering: id DESC as tie-breaker ensures consistent results
Performance Optimizations
For 100M row tables:
-
Partitioning Strategy:
- Table partitioned by tx_date (month/day)
- WHERE clause enables partition pruning
- Only scans Jan-Mar 2024 partitions
-
Parquet Optimization (Athena):
- Parquet column statistics (min/max) enable efficient filtering
- Partition metadata enables partition pruning
- Speeds up DISTINCT and window function operations
- Reduces data scanned
-
Window Function Efficiency:
- Single pass over data (no multiple scans)
- Partitioned by account_id and month
- Ordered by tx_date DESC for last transaction
-
Join Strategy:
- LEFT JOIN preserves spine (no data loss)
- Join on account_id and month (optimized via partition pruning and Parquet stats)
Breakdown: Inner to Outer
1. Base Table: transactions (Innermost - Data Source)
Code Reference: See Base Table Filter in SQL Complete Reference.
What it does:
- Reads from the
transactionstable (as defined in Appendix A of the business case) - Filters to Q1 2024 (January, February, March)
- Partition pruning: If table is partitioned by
tx_date, only scans relevant partitions
Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on Task 1 output, a schema transformation/mapping step is required:
- TransactionID → id (or generate sequential id)
- CustomerID → account_id
- TransactionAmount → amount
- TransactionTimestamp → tx_date
- new_balance would need to be calculated from TransactionAmount (cumulative sum per account)
Columns used:
account_id- Account identifiernew_balance- Balance after transaction (per Appendix A spec)tx_date- Transaction dateid- Transaction ID (for tie-breaking)
2. CTE 1: months (First CTE - Month Spine)
Code Reference: See CTE 1: months in SQL Complete Reference.
What it does:
- Creates a hardcoded list of month-end dates for Q1 2024
- Purpose: Defines the reporting period explicitly
- Result: 3 rows (one per month)
Why month-end dates?
- January 31, February 29 (2024 is leap year), March 31
- These represent the "as-of" dates for month-end reporting
3. CTE 2: unique_accounts (Second CTE - Account Universe)
Code Reference: See CTE 2: unique_accounts in SQL Complete Reference.
What it does:
- Extracts all unique account IDs from the transactions table
- Purpose: Defines the "universe" of accounts to report on
- Result: One row per unique account
Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.
4. CTE 3: account_months (Third CTE - Complete Spine)
Code Reference: See CTE 3: account_months in SQL Complete Reference.
What it does:
- CROSS JOIN: Creates Cartesian product of all accounts × all months
- Purpose: Generates a complete "spine" of (account, month) combinations
- Result: If there are N accounts and 3 months, produces N×3 rows
Why CROSS JOIN?
- Ensures every account appears for every month
- Months with no transactions will show NULL balance (explicit gaps)
- Without this, accounts with no activity in a month would be missing from results
Example output:
account_id | month_end
-----------|----------
ACC001 | 2024-01-31
ACC001 | 2024-02-29
ACC001 | 2024-03-31
ACC002 | 2024-01-31
ACC002 | 2024-02-29
ACC002 | 2024-03-31
...
5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)
Code Reference: See CTE 4: monthly_last_tx in SQL Complete Reference.
Detailed diagram: See CTE 4: Window Function Detailed Flow in SQL Complete Reference.
What it does:
- Window Function:
ROW_NUMBER()partitions by account and month - Ordering:
tx_date DESC, id DESC- most recent transaction first - Purpose: Identifies the last transaction for each account in each month
- Filter: Only rows where
rn = 1are the last transactions
How it works:
- Partitions transactions by
(account_id, month) - Within each partition, orders by
tx_date DESC(newest first) - Assigns row number 1 to the most recent transaction
- If multiple transactions on same day, uses
id DESCas tie-breaker
Example:
Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction
Result: One row per (account, month) combination that has transactions
6. Final SELECT (Outermost - Final Report)
Code Reference: See Final SELECT in SQL Complete Reference.
What it does:
- LEFT JOIN: Joins the complete spine (
account_months) with the last transactions (monthly_last_tx) - Join conditions:
account_idmatches- Month matches (using
DATE_TRUNCto align month-end with month-start) - Only last transaction (
rn = 1)
- Result: One row per (account, month) combination
- If transaction exists: shows
new_balance - If no transaction: shows
NULL(from LEFT JOIN)
- If transaction exists: shows
Why LEFT JOIN?
- Preserves all rows from
account_months(the spine) - Accounts with no transactions in a month get NULL balance
- Without LEFT JOIN, those months would be missing from results
Final output format:
acct | balance | month
-------|---------|--------
ACC001 | 326.00 | 2024-01
ACC001 | NULL | 2024-02
ACC001 | NULL | 2024-03
ACC002 | NULL | 2024-01
ACC002 | -25.50 | 2024-02
ACC002 | -125.50 | 2024-03
...
Execution Order (How SQL Engine Processes It)
Detailed execution diagram: See Execution Order (Detailed) in Appendix G.
Step-by-Step:
- Base table scan:
transactionstable (with partition pruning) - CTE 1:
months- Hardcoded values (instant) - CTE 2:
unique_accounts- DISTINCT on transactions (requires scan) - CTE 3:
account_months- CROSS JOIN (fast, small result set) - CTE 4:
monthly_last_tx- Window function on transactions (requires full scan + sort) - Final SELECT: LEFT JOIN between spine and last transactions
Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.
Performance Optimization Strategy
This simplified diagram shows the optimization approach:
Detailed optimization diagrams: See Performance Optimization Strategy in Appendix G.
Key Design Patterns
1. Spine Generation Pattern
- Problem: Need to show all account×month combinations, even when no data exists
- Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
- Result: Gaps are explicit (NULL) rather than missing rows
2. Window Function Pattern
- Problem: Need last transaction per account per month efficiently
- Solution:
ROW_NUMBER()with partitioning and ordering - Result: Single pass over data (O(n log n) vs O(n²) for nested loops)
3. Partition Pruning Pattern
- Problem: 100M row table, but only need Q1 2024
- Solution: WHERE clause on partitioned column (
tx_date) - Result: Only scans relevant partitions (Jan, Feb, Mar 2024)
Why This Approach?
Alternative 1: Simple GROUP BY (Wrong)
Code Reference: See Alternative 1: Simple GROUP BY in SQL Complete Reference.
Problem: Doesn't show months with no transactions (missing rows)
Alternative 2: SUM(amount) (Wrong per spec)
Code Reference: See Alternative 2: SUM(amount) in Appendix F.
Problem: Spec says to use new_balance from last transaction, not sum amounts
Our Approach: Spine + Window Function (Correct)
- ✅ Shows all account×month combinations (spine)
- ✅ Uses
new_balancefrom last transaction (per spec) - ✅ Efficient for 100M rows (window function, partition pruning)
- ✅ Handles NULL months correctly
Gold Layer Aggregation
This query demonstrates the Silver → Gold aggregation pattern. The Gold layer structure, governance, and ownership model are best described in Task 2 (Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.
Storing Results in Gold Layer
The query output represents month-end balance snapshots that can be persisted.
Code Reference: See Gold Layer Storage Example in SQL Complete Reference.
Benefits of Gold Layer Storage
- Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
- Cost: Reduces Athena query costs (scan less data)
- Freshness: Can be refreshed monthly after month-end close
- Governance: Gold layer owned by Business/Finance (Gold layer structure, governance, and ownership model are best described in Task 2 architecture)
- Stability: Stable schema and business contracts for reporting
Monthly Refresh Pattern
For monthly reporting, this query would run:
- After month-end close (e.g., first day of next month)
- Compute month-end balances from Silver layer (this query)
- Store results in Gold layer (
s3://aggregated/monthly_balances/year=YYYY/month=MM/) - Update Glue Data Catalog for Athena queries
This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.
See Also
Task 3 Documentation
- Full SQL Code - Complete SQL query implementation
- SQL Complete Reference - All SQL diagrams, code, and examples
- Isolated Testing Guide - How to test the SQL query
- Testing Documentation - SQL testing overview
Related Tasks
- Data Lake Architecture - Architecture this query runs on
- ETL Pipeline - What creates the data this query reads
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- Unified Testing Convention - Testing standards
- Test Results Overview - Current test execution results