SQL Query Breakdown & Flow

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Location:

Production query: balance_history_2024_q1.sql
Query breakdown: This document (SQL_BREAKDOWN.md)
Detailed diagrams: SQL Complete Reference

Content:

SQL query that shows account balance history at the end of each month for each account
Covers first three months of 2024 (January, February, March)
Carry-forward: months with no transactions show the last known balance; NULL only if the account has no prior history in the period
Optimized for large tables (100M+ records) with partition pruning (for cost optimization strategy, see Design Decisions)
Complete query flow explanation with visual diagrams

Scope: Analytical workload; source systems are upstream (Scope & Assumptions).

Query Output Format:

Acct    month     balance
ACC001  2024-01   326.00
ACC001  2024-02   326.00
ACC001  2024-03   326.00
ACC002  2024-01   null
ACC002  2024-02   -25.50
ACC002  2024-03   -125.50

🤖 GenAI possibility — This query (and any Athena SQL) is a great fit for Bedrock: “what this query does” in plain language, NL-to-SQL for Gold, optimization hints, report commentary. See GenAI in the Ohpen Case & Opportunities.

Gold Layer Output Schema

The query output becomes the Gold layer table account_balances. This implements the 1:N transformation pattern (Silver → Gold) where one validated Silver source produces multiple business aggregations.

Gold Layer Schema: gold/finance/account_balances/

Column	Type	Partition Key	Nullable	Description
`account_id`	VARCHAR	No	NOT NULL	Account identifier
`month`	VARCHAR(7)	No	NOT NULL	Month in YYYY-MM format (e.g., "2024-01")
`balance`	DECIMAL(16,2)	No	NULLABLE	Month-end balance (NULL if no prior history)
`as_of_month`	VARCHAR(7)	Yes	NOT NULL	Partition key: reporting month (same as `month` column)
`schema_v`	VARCHAR	Yes	NOT NULL	Partition key: schema version (e.g., "v1")
`run_id`	VARCHAR	Yes	NOT NULL	Partition key: aggregation run identifier

Additional columns (optional):

transaction_count (INT): Number of transactions in the month
first_tx_date (TIMESTAMP): Date of first transaction in month
last_tx_date (TIMESTAMP): Date of last transaction in month

Partition Strategy:

as_of_month: Enables month-level partition pruning for reporting queries (e.g., Q1 2024 scans only 3 partitions)
schema_v: Supports schema evolution (new columns can be added with schema_v=v2)
run_id: Isolates aggregation runs (safe backfills, idempotent reruns)

Gold Layer Path Example:

s3://ohpen-gold/gold/finance/account_balances/
  schema_v=v1/
    run_id=20240121T120000Z/
      as_of_month=2024-01/
        part-00000.parquet
      as_of_month=2024-02/
        part-00001.parquet
      as_of_month=2024-03/
        part-00002.parquet
    _SUCCESS
  current/  (promoted aggregations only)
  _LATEST.json  (pointer to current run)

Transformation Pattern: This query reads from Silver (silver/mortgages/transactions/) and writes to Gold (gold/finance/account_balances/), implementing the 1:N cardinality model (one Silver source → one Gold aggregation; other aggregations can be added without changing Silver).

Extended documentation: SQL Complete Reference. Code: SQL Implementation Code.

This document provides a high-level overview of the Analytical Query Implementation query, including visual diagrams and pseudocode. It breaks down the query from the most inner table working outwards, explaining each CTE and the final SELECT statement.

Note: For detailed, comprehensive diagrams showing all steps and decision points, see SQL Query Diagrams.

Query Flow Overview

For how this query leverages the partition strategy and Silver layer schema, see the referenced documentation. This simplified diagram shows the high-level query flow:

Detailed flow diagrams: See SQL Complete Reference.

Data Model (Entity Relationship)

For the complete schema specification, see Parquet Schema Specification. For data model design rationale, see Data Model Design Rationale. This ER diagram shows the data relationships and schema structure:

Textual Flow:

transactions (base table)
    ↓
months (CTE 1 - innermost)
    ↓
unique_accounts (CTE 2)
    ↓
account_months (CTE 3)
    ↓
monthly_last_tx (CTE 4)
    ↓
Final SELECT (outermost)

High-Level Algorithm

For complete SQL implementation code, see SQL Implementation Code. For detailed algorithm explanation, see SQL Complete Reference - Algorithm. The query follows a systematic approach to generate month-end balances:

FUNCTION GenerateMonthEndBalances(): RESULT_SET;
BEGIN
    { 1. Generate Month Spine }
    months := ['2024-01-31', '2024-02-29', '2024-03-31'];
    { Hardcoded month-end dates for Q1 2024 }

    { 2. Get Unique Accounts }
    unique_accounts := SELECT DISTINCT account_id
                       FROM transactions
                       WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
    { Extract all unique account IDs from Q1 2024 transactions }

    { 3. Generate Account×Month Spine (Cartesian Product) }
    account_months := CROSS_JOIN(unique_accounts, months);
    { Create all combinations of (account_id, month_end) }
    { Result: N accounts × 3 months = complete spine }

    { 4. Find Last Transaction per Account per Month }
    monthly_last_tx := SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC
        ) AS row_number
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01';
    { Window function identifies last transaction per account per month }

    { 5. Filter to Only Last Transaction }
    last_tx_per_month := FILTER monthly_last_tx WHERE row_number = 1;
    { Keep only row_number = 1 (the last transaction) }

    { 6. Join Spine with Data and Carry Forward }
    raw_balances := SELECT
        account_months.account_id,
        account_months.month_end,
        last_tx_per_month.new_balance
    FROM account_months
    LEFT JOIN last_tx_per_month
        ON account_months.account_id = last_tx_per_month.account_id
        AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start;

    result := SELECT
        account_id AS acct,
        LAST_VALUE(new_balance) IGNORE NULLS OVER (
            PARTITION BY account_id ORDER BY month_end
        ) AS balance,
        TO_CHAR(month_end, 'YYYY-MM') AS month
    FROM raw_balances
    ORDER BY account_id, month_end;
    { LEFT JOIN ensures all account×month combinations appear }
    { LAST_VALUE(IGNORE NULLS): Carries forward last known balance for gaps }
    { Result: NULL only if no previous history exists }

    RETURN result;
END;

Query Objective

Generate month-end account balances for Q1 2024 (January, February, March) with carry-forward: months with no transactions show the last known balance; NULL only when the account has no prior history.

Key Design Decisions

Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit (for cardinality pattern, see Data Lake Architecture - Transformation Cardinality)
Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans (optimized for 100M+ row tables)
LEFT JOIN: Preserves all account×month combinations; carry-forward (LAST_VALUE IGNORE NULLS) fills months with no activity with the last known balance
Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables (see Silver layer schema)
Deterministic Ordering: id DESC as tie-breaker ensures consistent results (see Data Model Design Rationale)

Performance Optimizations

For 100M row tables (see Data Lake Architecture - Performance Benchmarks):

Partitioning Strategy (see Parquet Schema Specification):
- Table partitioned by tx_date (month/day) as defined in Silver layer schema
- WHERE clause enables partition pruning (95% scan reduction)
- Only scans Jan-Mar 2024 partitions (see Data Lake Architecture - Key Design Choices)
Parquet Optimization (Athena) (see Tooling & Controls - Athena):
- Parquet column statistics (min/max) enable efficient filtering
- Partition metadata enables partition pruning
- Speeds up DISTINCT and window function operations
- Reduces data scanned (for cost optimization, see Design Decisions)
Window Function Efficiency:
- Single pass over data (no multiple scans)
- Partitioned by account_id and month
- Ordered by tx_date DESC for last transaction (see Data Model Design Rationale)
Join Strategy:
- LEFT JOIN preserves spine (no data loss)
- Join on account_id and month (optimized via partition pruning and Parquet stats)

Breakdown: Inner to Outer

For complete SQL code breakdown, see SQL Complete Reference - Breakdown. For how this query reads from the Silver layer, see the referenced documentation.

1. Base Table: `transactions` (Innermost - Data Source)

Code Reference: See Base Table Filter in SQL Complete Reference.

What it does:

Reads from the transactions table (as defined in Appendix A of the business case)
Filters to Q1 2024 (January, February, March)
Partition pruning: If table is partitioned by tx_date, only scans relevant partitions (see Partitioning Strategy)

Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). The ETL Pipeline produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on that output, a schema transformation/mapping step is required:

TransactionID → id (or generate sequential id)
CustomerID → account_id
TransactionAmount → amount
TransactionTimestamp → tx_date
new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Columns used:

account_id - Account identifier
new_balance - Balance after transaction (per Appendix A spec)
tx_date - Transaction date
id - Transaction ID (for tie-breaking)

2. CTE 1: `months` (First CTE - Month Spine)

Code Reference: See CTE 1: months in SQL Complete Reference.

What it does:

Creates a hardcoded list of month-end dates for Q1 2024
Purpose: Defines the reporting period explicitly
Result: 3 rows (one per month)

Why month-end dates?

January 31, February 29 (2024 is leap year), March 31
These represent the "as-of" dates for month-end reporting
January 31, February 29 (2024 is leap year), March 31
These represent the "as-of" dates for month-end reporting

3. CTE 2: `unique_accounts` (Second CTE - Account Universe)

Code Reference: See CTE 2: unique_accounts in the SQL Complete Reference.

What it does:

Extracts all unique account IDs from the transactions table
Purpose: Defines the "universe" of accounts to report on
Result: One row per unique account

Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity the filter is applied in the next CTE.

4. CTE 3: `account_months` (Third CTE - Complete Spine)

Code Reference: See CTE 3: account_months in SQL Complete Reference.

What it does:

CROSS JOIN: Creates Cartesian product of all accounts × all months
Purpose: Generates a complete "spine" of (account, month) combinations
Result: If there are N accounts and 3 months, produces N×3 rows

Why CROSS JOIN?

Ensures every account appears for every month
Months with no transactions will show NULL balance (explicit gaps)
Without this, accounts with no activity in a month would be missing from results

Example output:

| account_id | month_end  |
| ---------- | ---------- |
| ACC001     | 2024-01-31 |
| ACC001     | 2024-02-29 |
| ACC001     | 2024-03-31 |
| ACC002     | 2024-01-31 |
| ACC002     | 2024-02-29 |
| ACC002     | 2024-03-31 |
...

5. CTE 4: `monthly_last_tx` (Fourth CTE - Last Transaction per Month)

Code Reference: See CTE 4: monthly_last_tx in SQL Complete Reference.

Detailed diagram: See Window Function Detailed Flow in SQL Complete Reference.

What it does:

Window Function: ROW_NUMBER() partitions by account and month
Ordering: tx_date DESC, id DESC - most recent transaction first
Purpose: Identifies the last transaction for each account in each month
Filter: Only rows where rn = 1 are the last transactions

How it works:

Partitions transactions by (account_id, month)
Within each partition, orders by tx_date DESC (newest first)
Assigns row number 1 to the most recent transaction
If multiple transactions on same day, uses id DESC as tie-breaker

Example:

Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction

Result: One row per (account, month) combination that has transactions

6. Final SELECT (Outermost - Final Report)

Code Reference: See Final SELECT in SQL Complete Reference.

What it does:

LEFT JOIN: Joins the complete spine (account_months) with the last transactions (monthly_last_tx)
Join conditions:
1. account_id matches
2. Month matches (using DATE_TRUNC to align month-end with month-start)
3. Only last transaction (rn = 1)
Result: One row per (account, month) combination
- If transaction exists: shows new_balance
- If no transaction: carries forward last known balance (or NULL if no history)

Why LEFT JOIN?

Preserves all rows from account_months (the spine)
Months with no transaction get carry-forward (last known balance); NULL only when the account has no prior history
Without LEFT JOIN, those months would be missing from results

Final output format:

| acct   | balance | month   |
| ------ | ------- | ------- |
| ACC001 | 326.00  | 2024-01 |
| ACC001 | 326.00  | 2024-02 |
| ACC001 | 326.00  | 2024-03 |
| ACC002 | NULL    | 2024-01 |
| ACC002 | -25.50  | 2024-02 |
| ACC002 | -125.50 | 2024-03 |
...

Execution Order (How SQL Engine Processes It)

Detailed execution diagram: See Execution Order (Detailed).

Step-by-Step:

Base table scan: transactions table (with partition pruning)
CTE 1: months - Hardcoded values (instant)
CTE 2: unique_accounts - DISTINCT on transactions (requires scan)
CTE 3: account_months - CROSS JOIN (fast, small result set)
CTE 4: monthly_last_tx - Window function on transactions (requires full scan + sort)
Final SELECT: LEFT JOIN between spine and last transactions

Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it is necessary to identify the last transaction efficiently.

Performance Optimization Strategy

For how partition pruning achieves 95% cost reduction, see Design Decisions Summary. For performance benchmarks, see Data Lake Architecture - Performance Benchmarks. This simplified diagram shows the optimization approach:

Detailed optimization diagrams: See Performance Optimization Strategy.

Key Design Patterns

For how these patterns enable Gold layer aggregation, see Data Lake Architecture. For testing these patterns, see Testing Guide.

1. Spine Generation Pattern

Problem: Need to show all account×month combinations, even when no data exists
Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
Result: Gaps are explicit (NULL) rather than missing rows

2. Window Function Pattern

Problem: Need last transaction per account per month efficiently
Solution: ROW_NUMBER() with partitioning and ordering
Result: Single pass over data (O(n log n) vs O(n²) for nested loops)

3. Partition Pruning Pattern

Problem: 100M row table, but only need Q1 2024
Solution: WHERE clause on partitioned column (tx_date)
Result: Only scans relevant partitions (Jan, Feb, Mar 2024)

Why This Approach?

For design rationale and trade-offs, see Design Decisions Summary - Partition Strategy. For alternative approaches considered, see SQL Examples - Alternatives.

Alternative 1: Simple GROUP BY (Wrong)

Code Reference: See Alternative 1: Simple GROUP BY in SQL Complete Reference.

Problem: Doesn't show months with no transactions (missing rows)

Alternative 2: SUM(amount) (Wrong per spec)

Code Reference: See Alternative 2: SUM(amount) in SQL Examples.

Problem: Spec says to use new_balance from last transaction, not sum amounts

Approach: Spine + Window Function (Correct)

Verdict: Sufficient for spec: spine covers all account×month combinations, last-transaction balance per spec, efficient for 100M rows (window function and partition pruning), and NULL months handled correctly.

Gold Layer Aggregation

For Gold layer architecture, see Data Lake Architecture - Gold Layer and Layer Transformation Cardinality.

This query implements the Silver → Gold aggregation pattern (1:N relationship). Gold structure and governance: Architecture.

Storing Results in Gold Layer

The query output represents month-end balance snapshots that can be persisted.

Code Reference: See Gold Layer Storage Example in SQL Complete Reference.

Benefits of Gold Layer Storage

Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
Cost: Reduces Athena query costs (scan less data)
Freshness: Can be refreshed monthly after month-end close
Governance: Gold layer owned by Business/Finance (see Gold layer structure, governance, and ownership model)
Stability: Stable schema and business contracts for reporting

Monthly Refresh Pattern

For monthly reporting, this query would run:

After month-end close (e.g., first day of next month)
Compute month-end balances from Silver layer (this query)
Store results in Gold layer (gold/finance/account_balances/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet)
Update Glue Data Catalog for Athena queries

This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups (one of multiple Gold aggregations in the 1:N pattern) for business reporting.

Gold Layer Output Schema​

Query Flow Overview​

Data Model (Entity Relationship)​

High-Level Algorithm​

Query Objective​

Key Design Decisions​

Performance Optimizations​

Breakdown: Inner to Outer​

1. Base Table: transactions (Innermost - Data Source)​

2. CTE 1: months (First CTE - Month Spine)​

3. CTE 2: unique_accounts (Second CTE - Account Universe)​

4. CTE 3: account_months (Third CTE - Complete Spine)​

5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)​

6. Final SELECT (Outermost - Final Report)​

Execution Order (How SQL Engine Processes It)​

Performance Optimization Strategy​

Key Design Patterns​

1. Spine Generation Pattern​

2. Window Function Pattern​

3. Partition Pruning Pattern​

Why This Approach?​

Alternative 1: Simple GROUP BY (Wrong)​

Alternative 2: SUM(amount) (Wrong per spec)​

Approach: Spine + Window Function (Correct)​

Gold Layer Aggregation​

Storing Results in Gold Layer​

Benefits of Gold Layer Storage​

Monthly Refresh Pattern​

See also​

Gold Layer Output Schema

Query Flow Overview

Data Model (Entity Relationship)

High-Level Algorithm

Query Objective

Key Design Decisions

Performance Optimizations

Breakdown: Inner to Outer

1. Base Table: `transactions` (Innermost - Data Source)

2. CTE 1: `months` (First CTE - Month Spine)

3. CTE 2: `unique_accounts` (Second CTE - Account Universe)

4. CTE 3: `account_months` (Third CTE - Complete Spine)

5. CTE 4: `monthly_last_tx` (Fourth CTE - Last Transaction per Month)

6. Final SELECT (Outermost - Final Report)

Execution Order (How SQL Engine Processes It)

Performance Optimization Strategy

Key Design Patterns

1. Spine Generation Pattern

2. Window Function Pattern

3. Partition Pruning Pattern

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

Alternative 2: SUM(amount) (Wrong per spec)

Approach: Spine + Window Function (Correct)

Gold Layer Aggregation

Storing Results in Gold Layer

Benefits of Gold Layer Storage

Monthly Refresh Pattern

See also