Skip to main content

Full SQL Implementation Code

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This section contains the complete SQL implementation for generating month-end balance history.

File: balance_history_2024_q1.sql


/*
Task 3: Month-End Balance History

Goal: Show account balance history at the end of each month for Jan-Mar 2024.
Constraint: 100M records (requires efficient indexing/partitioning logic).
History Logic: Month-end balance is the last `new_balance` observed within a month (ordered by `tx_date`, tie-broken by `id`).
Carry-Forward Logic: If an account has no transactions in a month, the last known balance is carried forward from the previous month (forward-fill).
This implementation uses `LAST_VALUE(...) IGNORE NULLS` to fill gaps in the account×month spine.
Time semantics note: this query uses `tx_date` as the ordering/reporting timestamp; if the domain distinguishes transaction vs posting date,
month-end reporting should use the posting/cutoff timestamp.

Schema Note: This query uses the schema from the business case (id, account_id, amount, new_balance, tx_date). See [Parquet Schema Specification](/docs/PARQUET_SCHEMA_SPECIFICATION) for full details.
Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp).
If this query needs to run on Task 1 output, a schema transformation/mapping step is required at the beginning or end of the pipeline,
depending on the actual raw data structure and business goals. The transformation would map:
- TransactionID → id (or generate sequential id)
- CustomerID → account_id
- TransactionAmount → amount
- TransactionTimestamp → tx_date
- new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Engine Note: syntax is standard ANSI/Presto SQL.
*/

-- 1. Generate the spine of all (Account, Month) combinations to ensure gaps are visible
WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
),

unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
),

account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
),

-- 2. Find the last transaction for each account in each month
-- Efficient strategy: Window function partitioning by account/month
monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC -- id tie-breaker ensures deterministic result
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
),

-- 3. Join spine with data and fill gaps (Carry-Forward)
raw_balances AS (
SELECT
am.account_id,
am.month_end,
tx.new_balance
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
)

-- 4. Apply forward-fill logic to carry over the last known balance
SELECT
account_id AS acct,
-- LAST_VALUE(IGNORE NULLS) fills NULLs with the most recent non-NULL value
LAST_VALUE(new_balance IGNORE NULLS) OVER (
PARTITION BY account_id
ORDER BY month_end
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS balance,
TO_CHAR(month_end, 'YYYY-MM') AS month
FROM raw_balances
ORDER BY account_id, month_end;

/*
Performance Considerations for 100M rows:

1. Partitioning: The 'transactions' table should be partitioned by tx_date (month/day).

This allows the query engine to prune partitions outside Jan-Mar 2024 immediately.

2. Parquet Optimization (Athena): Parquet column statistics (min/max) for account_id enable efficient filtering. Athena uses partition metadata and Parquet statistics to optimize operations without requiring explicit clustering or indexing.

This makes the 'DISTINCT account_id' and window function operations much faster (avoids heavy shuffle).
*/

/*
Query Explanation:

This query produces month-end balance history for Q1 2024 (January, February, March).

1. Month-end balance definition:
- The balance at the end of a month is defined as the last transaction's `new_balance`

observed within that month (ordered by `tx_date` descending, tie-broken by `id`).

2. Account×month spine with Carry-Forward:
- The query generates a complete spine of all (account, month) combinations using
CROSS JOIN to ensure all reporting months are present.
- If an account has no transactions in a month, the `LAST_VALUE(... IGNORE NULLS)`
function carries forward the balance from the most recent previous month.
- A `NULL` is only returned if an account has no transactions *up to and including*
that reporting month.

3. Window function approach:
- Uses ROW_NUMBER() partitioned by account and month to identify the last transaction.
- Efficient for large datasets (100M rows) as it avoids multiple scans.

4. Partition pruning:
- The WHERE clause filters to Jan-Mar 2024, enabling partition pruning on `tx_date`.
- This minimizes data scanned and improves query performance.

5. Engine portability:
- Uses standard ANSI/Presto SQL syntax.
- Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines.
- Avoids engine-specific features for maximum portability.

Ownership & Governance (Task 3):

| Aspect | Owner | Responsibility |
| -------------------------------------------------------------------------------------- | ------------------ | ---------------------------------------------------- |
| **Gold Tables** | Business (Finance) | Analysts query Gold tables only for reporting |
| **Silver Tables** | Domain Teams | Technical users may query Silver for ad-hoc analysis |
| **Bronze Layer** Data Platform Team Not queried for analytics (raw audit trail only) |

Rules:

- Analysts query Gold tables only (business contracts, stable schemas).
- Silver tables used by technical users for ad-hoc analysis and debugging.
- Bronze is not queried for analytics (immutable raw data, platform team access only).

*/

See also

© 2026 Stephen AdeiCC BY 4.0