Skip to main content

SQL Query Breakdown & Flow

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Location:

  • Production query: balance_history_2024_q1.sql
  • Query breakdown: This document (SQL_BREAKDOWN.md)
  • Detailed diagrams: SQL Complete Reference

Content:

  • SQL query that shows account balance history at the end of each month for each account
  • Covers first three months of 2024 (January, February, March)
  • Carry-forward: months with no transactions show the last known balance; NULL only if the account has no prior history in the period
  • Optimized for large tables (100M+ records) with partition pruning (for cost optimization strategy, see Design Decisions)
  • Complete query flow explanation with visual diagrams

Scope: Analytical workload; source systems are upstream (Scope & Assumptions).

Query Output Format:

Acct    month     balance
ACC001 2024-01 326.00
ACC001 2024-02 326.00
ACC001 2024-03 326.00
ACC002 2024-01 null
ACC002 2024-02 -25.50
ACC002 2024-03 -125.50

🤖 GenAI possibility — This query (and any Athena SQL) is a great fit for Bedrock: “what this query does” in plain language, NL-to-SQL for Gold, optimization hints, report commentary. See GenAI in the Ohpen Case & Opportunities.


Gold Layer Output Schema

The query output becomes the Gold layer table account_balances. This implements the 1:N transformation pattern (Silver → Gold) where one validated Silver source produces multiple business aggregations.

Gold Layer Schema: gold/finance/account_balances/

ColumnTypePartition KeyNullableDescription
account_idVARCHARNoNOT NULLAccount identifier
monthVARCHAR(7)NoNOT NULLMonth in YYYY-MM format (e.g., "2024-01")
balanceDECIMAL(16,2)NoNULLABLEMonth-end balance (NULL if no prior history)
as_of_monthVARCHAR(7)YesNOT NULLPartition key: reporting month (same as month column)
schema_vVARCHARYesNOT NULLPartition key: schema version (e.g., "v1")
run_idVARCHARYesNOT NULLPartition key: aggregation run identifier

Additional columns (optional):

  • transaction_count (INT): Number of transactions in the month
  • first_tx_date (TIMESTAMP): Date of first transaction in month
  • last_tx_date (TIMESTAMP): Date of last transaction in month

Partition Strategy:

  • as_of_month: Enables month-level partition pruning for reporting queries (e.g., Q1 2024 scans only 3 partitions)
  • schema_v: Supports schema evolution (new columns can be added with schema_v=v2)
  • run_id: Isolates aggregation runs (safe backfills, idempotent reruns)

Gold Layer Path Example:

s3://ohpen-gold/gold/finance/account_balances/
schema_v=v1/
run_id=20240121T120000Z/
as_of_month=2024-01/
part-00000.parquet
as_of_month=2024-02/
part-00001.parquet
as_of_month=2024-03/
part-00002.parquet
_SUCCESS
current/ (promoted aggregations only)
_LATEST.json (pointer to current run)

Transformation Pattern: This query reads from Silver (silver/mortgages/transactions/) and writes to Gold (gold/finance/account_balances/), implementing the 1:N cardinality model (one Silver source → one Gold aggregation; other aggregations can be added without changing Silver).


Extended documentation: SQL Complete Reference. Code: SQL Implementation Code.

This document provides a high-level overview of the Analytical Query Implementation query, including visual diagrams and pseudocode. It breaks down the query from the most inner table working outwards, explaining each CTE and the final SELECT statement.

Note: For detailed, comprehensive diagrams showing all steps and decision points, see SQL Query Diagrams.


Query Flow Overview

For how this query leverages the partition strategy and Silver layer schema, see the referenced documentation. This simplified diagram shows the high-level query flow:

Detailed flow diagrams: See SQL Complete Reference.


Data Model (Entity Relationship)

For the complete schema specification, see Parquet Schema Specification. For data model design rationale, see Data Model Design Rationale. This ER diagram shows the data relationships and schema structure:

Textual Flow:

transactions (base table)

months (CTE 1 - innermost)

unique_accounts (CTE 2)

account_months (CTE 3)

monthly_last_tx (CTE 4)

Final SELECT (outermost)

High-Level Algorithm

For complete SQL implementation code, see SQL Implementation Code. For detailed algorithm explanation, see SQL Complete Reference - Algorithm. The query follows a systematic approach to generate month-end balances:

FUNCTION GenerateMonthEndBalances(): RESULT_SET;
BEGIN
{ 1. Generate Month Spine }
months := ['2024-01-31', '2024-02-29', '2024-03-31'];
{ Hardcoded month-end dates for Q1 2024 }

{ 2. Get Unique Accounts }
unique_accounts := SELECT DISTINCT account_id
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
{ Extract all unique account IDs from Q1 2024 transactions }

{ 3. Generate Account×Month Spine (Cartesian Product) }
account_months := CROSS_JOIN(unique_accounts, months);
{ Create all combinations of (account_id, month_end) }
{ Result: N accounts × 3 months = complete spine }

{ 4. Find Last Transaction per Account per Month }
monthly_last_tx := SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS row_number
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
{ Window function identifies last transaction per account per month }

{ 5. Filter to Only Last Transaction }
last_tx_per_month := FILTER monthly_last_tx WHERE row_number = 1;
{ Keep only row_number = 1 (the last transaction) }

{ 6. Join Spine with Data and Carry Forward }
raw_balances := SELECT
account_months.account_id,
account_months.month_end,
last_tx_per_month.new_balance
FROM account_months
LEFT JOIN last_tx_per_month
ON account_months.account_id = last_tx_per_month.account_id
AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start;

result := SELECT
account_id AS acct,
LAST_VALUE(new_balance) IGNORE NULLS OVER (
PARTITION BY account_id ORDER BY month_end
) AS balance,
TO_CHAR(month_end, 'YYYY-MM') AS month
FROM raw_balances
ORDER BY account_id, month_end;
{ LEFT JOIN ensures all account×month combinations appear }
{ LAST_VALUE(IGNORE NULLS): Carries forward last known balance for gaps }
{ Result: NULL only if no previous history exists }

RETURN result;
END;

Query Objective

Generate month-end account balances for Q1 2024 (January, February, March) with carry-forward: months with no transactions show the last known balance; NULL only when the account has no prior history.

Key Design Decisions

  1. Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit (for cardinality pattern, see Data Lake Architecture - Transformation Cardinality)
  2. Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans (optimized for 100M+ row tables)
  3. LEFT JOIN: Preserves all account×month combinations; carry-forward (LAST_VALUE IGNORE NULLS) fills months with no activity with the last known balance
  4. Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables (see Silver layer schema)
  5. Deterministic Ordering: id DESC as tie-breaker ensures consistent results (see Data Model Design Rationale)

Performance Optimizations

For 100M row tables (see Data Lake Architecture - Performance Benchmarks):

  1. Partitioning Strategy (see Parquet Schema Specification):

  2. Parquet Optimization (Athena) (see Tooling & Controls - Athena):

    • Parquet column statistics (min/max) enable efficient filtering
    • Partition metadata enables partition pruning
    • Speeds up DISTINCT and window function operations
    • Reduces data scanned (for cost optimization, see Design Decisions)
  3. Window Function Efficiency:

    • Single pass over data (no multiple scans)
    • Partitioned by account_id and month
    • Ordered by tx_date DESC for last transaction (see Data Model Design Rationale)
  4. Join Strategy:

    • LEFT JOIN preserves spine (no data loss)
    • Join on account_id and month (optimized via partition pruning and Parquet stats)

Breakdown: Inner to Outer

For complete SQL code breakdown, see SQL Complete Reference - Breakdown. For how this query reads from the Silver layer, see the referenced documentation.

1. Base Table: transactions (Innermost - Data Source)

Code Reference: See Base Table Filter in SQL Complete Reference.

What it does:

  • Reads from the transactions table (as defined in Appendix A of the business case)
  • Filters to Q1 2024 (January, February, March)
  • Partition pruning: If table is partitioned by tx_date, only scans relevant partitions (see Partitioning Strategy)

Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). The ETL Pipeline produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on that output, a schema transformation/mapping step is required:

  • TransactionID → id (or generate sequential id)
  • CustomerID → account_id
  • TransactionAmount → amount
  • TransactionTimestamp → tx_date
  • new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Columns used:

  • account_id - Account identifier
  • new_balance - Balance after transaction (per Appendix A spec)
  • tx_date - Transaction date
  • id - Transaction ID (for tie-breaking)

2. CTE 1: months (First CTE - Month Spine)

Code Reference: See CTE 1: months in SQL Complete Reference.

What it does:

  • Creates a hardcoded list of month-end dates for Q1 2024
  • Purpose: Defines the reporting period explicitly
  • Result: 3 rows (one per month)

Why month-end dates?

  • January 31, February 29 (2024 is leap year), March 31
  • These represent the "as-of" dates for month-end reporting
  • January 31, February 29 (2024 is leap year), March 31
  • These represent the "as-of" dates for month-end reporting

3. CTE 2: unique_accounts (Second CTE - Account Universe)

Code Reference: See CTE 2: unique_accounts in the SQL Complete Reference.

What it does:

  • Extracts all unique account IDs from the transactions table
  • Purpose: Defines the "universe" of accounts to report on
  • Result: One row per unique account

Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity the filter is applied in the next CTE.


4. CTE 3: account_months (Third CTE - Complete Spine)

Code Reference: See CTE 3: account_months in SQL Complete Reference.

What it does:

  • CROSS JOIN: Creates Cartesian product of all accounts × all months
  • Purpose: Generates a complete "spine" of (account, month) combinations
  • Result: If there are N accounts and 3 months, produces N×3 rows

Why CROSS JOIN?

  • Ensures every account appears for every month
  • Months with no transactions will show NULL balance (explicit gaps)
  • Without this, accounts with no activity in a month would be missing from results

Example output:

| account_id | month_end  |
| ---------- | ---------- |
| ACC001 | 2024-01-31 |
| ACC001 | 2024-02-29 |
| ACC001 | 2024-03-31 |
| ACC002 | 2024-01-31 |
| ACC002 | 2024-02-29 |
| ACC002 | 2024-03-31 |
...

5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)

Code Reference: See CTE 4: monthly_last_tx in SQL Complete Reference.

Detailed diagram: See Window Function Detailed Flow in SQL Complete Reference.

What it does:

  • Window Function: ROW_NUMBER() partitions by account and month
  • Ordering: tx_date DESC, id DESC - most recent transaction first
  • Purpose: Identifies the last transaction for each account in each month
  • Filter: Only rows where rn = 1 are the last transactions

How it works:

  1. Partitions transactions by (account_id, month)
  2. Within each partition, orders by tx_date DESC (newest first)
  3. Assigns row number 1 to the most recent transaction
  4. If multiple transactions on same day, uses id DESC as tie-breaker

Example:

Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction

Result: One row per (account, month) combination that has transactions


6. Final SELECT (Outermost - Final Report)

Code Reference: See Final SELECT in SQL Complete Reference.

What it does:

  • LEFT JOIN: Joins the complete spine (account_months) with the last transactions (monthly_last_tx)
  • Join conditions:
    1. account_id matches
    2. Month matches (using DATE_TRUNC to align month-end with month-start)
    3. Only last transaction (rn = 1)
  • Result: One row per (account, month) combination
    • If transaction exists: shows new_balance
    • If no transaction: carries forward last known balance (or NULL if no history)

Why LEFT JOIN?

  • Preserves all rows from account_months (the spine)
  • Months with no transaction get carry-forward (last known balance); NULL only when the account has no prior history
  • Without LEFT JOIN, those months would be missing from results

Final output format:

| acct   | balance | month   |
| ------ | ------- | ------- |
| ACC001 | 326.00 | 2024-01 |
| ACC001 | 326.00 | 2024-02 |
| ACC001 | 326.00 | 2024-03 |
| ACC002 | NULL | 2024-01 |
| ACC002 | -25.50 | 2024-02 |
| ACC002 | -125.50 | 2024-03 |
...

Execution Order (How SQL Engine Processes It)

Detailed execution diagram: See Execution Order (Detailed).

Step-by-Step:

  1. Base table scan: transactions table (with partition pruning)
  2. CTE 1: months - Hardcoded values (instant)
  3. CTE 2: unique_accounts - DISTINCT on transactions (requires scan)
  4. CTE 3: account_months - CROSS JOIN (fast, small result set)
  5. CTE 4: monthly_last_tx - Window function on transactions (requires full scan + sort)
  6. Final SELECT: LEFT JOIN between spine and last transactions

Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it is necessary to identify the last transaction efficiently.


Performance Optimization Strategy

For how partition pruning achieves 95% cost reduction, see Design Decisions Summary. For performance benchmarks, see Data Lake Architecture - Performance Benchmarks. This simplified diagram shows the optimization approach:

Detailed optimization diagrams: See Performance Optimization Strategy.


Key Design Patterns

For how these patterns enable Gold layer aggregation, see Data Lake Architecture. For testing these patterns, see Testing Guide.

1. Spine Generation Pattern

  • Problem: Need to show all account×month combinations, even when no data exists
  • Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
  • Result: Gaps are explicit (NULL) rather than missing rows

2. Window Function Pattern

  • Problem: Need last transaction per account per month efficiently
  • Solution: ROW_NUMBER() with partitioning and ordering
  • Result: Single pass over data (O(n log n) vs O(n²) for nested loops)

3. Partition Pruning Pattern

  • Problem: 100M row table, but only need Q1 2024
  • Solution: WHERE clause on partitioned column (tx_date)
  • Result: Only scans relevant partitions (Jan, Feb, Mar 2024)

Why This Approach?

For design rationale and trade-offs, see Design Decisions Summary - Partition Strategy. For alternative approaches considered, see SQL Examples - Alternatives.

Alternative 1: Simple GROUP BY (Wrong)

Code Reference: See Alternative 1: Simple GROUP BY in SQL Complete Reference.

Problem: Doesn't show months with no transactions (missing rows)

Alternative 2: SUM(amount) (Wrong per spec)

Code Reference: See Alternative 2: SUM(amount) in SQL Examples.

Problem: Spec says to use new_balance from last transaction, not sum amounts

Approach: Spine + Window Function (Correct)

Verdict: Sufficient for spec: spine covers all account×month combinations, last-transaction balance per spec, efficient for 100M rows (window function and partition pruning), and NULL months handled correctly.


Gold Layer Aggregation

For Gold layer architecture, see Data Lake Architecture - Gold Layer and Layer Transformation Cardinality.

This query implements the Silver → Gold aggregation pattern (1:N relationship). Gold structure and governance: Architecture.

Storing Results in Gold Layer

The query output represents month-end balance snapshots that can be persisted.

Code Reference: See Gold Layer Storage Example in SQL Complete Reference.

Benefits of Gold Layer Storage

  • Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
  • Cost: Reduces Athena query costs (scan less data)
  • Freshness: Can be refreshed monthly after month-end close
  • Governance: Gold layer owned by Business/Finance (see Gold layer structure, governance, and ownership model)
  • Stability: Stable schema and business contracts for reporting

Monthly Refresh Pattern

For monthly reporting, this query would run:

  1. After month-end close (e.g., first day of next month)
  2. Compute month-end balances from Silver layer (this query)
  3. Store results in Gold layer (gold/finance/account_balances/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet)
  4. Update Glue Data Catalog for Athena queries

This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups (one of multiple Gold aggregations in the 1:N pattern) for business reporting.


See also

For high-level context, see Executive Summary or System Architecture Overview.

© 2026 Stephen AdeiCC BY 4.0