SQL Analytics Boundaries
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This document captures the explicit assumptions and edge cases considered for the SQL query implementation.
Query Assumptions
Data Volume & Performance
- Table Size: Designed for 100M+ row tables
- Partition Pruning: Assumes table is partitioned by
tx_date(year/month) for efficient scanning - Query Performance: Estimated < 30 seconds for month-end reports (based on partition pruning analysis)
- Scan Reduction: Partition pruning reduces data scanned by ~95% for Q1 2024 queries (approximation)
- Calculation example: For 5 years of retained data (60 months total), querying Q1 2024 (3 months: Jan, Feb, Mar) scans only 3/60 = 5% of partitions, thus 95% reduction
- Note: Actual reduction depends on data retention policy and query date range. Production systems may retain 3-7 years of data, resulting in 91-97% reduction for quarterly queries
Schema Assumptions
- Input Schema: Query uses schema from Appendix A (id, account_id, amount, new_balance, tx_date)
- Schema Mismatch: Task 1 ETL produces different schema (TransactionID, CustomerID, etc.)
- Assumption: Schema transformation/mapping step required if query runs on Task 1 output
- Mapping: TransactionID → id, CustomerID → account_id, TransactionTimestamp → tx_date
- Calculated Field:
new_balancewould need to be calculated from TransactionAmount (cumulative sum)
- Logical data model: For a single, ledger-ready Parquet schema and explicit mapping/assumptions bridging Task 1 and Task 3, see Logical Data Model: Task 1 Parquet ↔ Task 3 SQL Ledger.
Query Logic Assumptions
- Month-End Balance Definition: Last transaction's
new_balancewithin the month (ordered bytx_dateDESC, tie-broken byidDESC) - Missing Months (carry-forward): Months with no transactions show the last known balance; NULL only when the account has no prior history in the period
- Reporting Period: Hardcoded to Q1 2024 (January, February, March)
- Leap Year Handling: February 29, 2024 correctly handled (2024 is a leap year)
Engine Compatibility
- SQL Dialect: Standard ANSI/Presto SQL syntax
- Compatibility: Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines
- Portability: Avoids engine-specific features for maximum portability
Edge Cases Handled
Missing Data
- No Transactions in Month: Carry-forward applies; balance is the last known balance (NULL only if no prior history)
- No Transactions for Account: Account appears in output with NULL balances for all months (no prior history)
- Missing Partition: If partition for a month does not exist, query still returns NULL (does not fail)
Data Quality Edge Cases
- Duplicate Transactions: Window function with ROW_NUMBER() ensures deterministic selection (tie-breaker by
id) - NULL Values: NULL
new_balancevalues are preserved (not filtered out) - Invalid Dates: Query filters to
tx_date >= '2024-01-01' AND tx_date < '2024-04-01'(invalid dates outside range ignored)
Performance Edge Cases
- Large Account Count: DISTINCT on
account_idmay be expensive on very large tables (mitigated by partition pruning) - Skewed Data: Uneven distribution of transactions across accounts may impact window function performance
- Cross-Partition Scans: If data spans many partitions, query may scan more data than expected
Business Logic Edge Cases
- Tie-Breaking: Multiple transactions with same
tx_dateuseidDESC as tie-breaker (deterministic but may not match business intent) - Time Semantics: Query uses
tx_dateas ordering timestamp (if domain distinguishes transaction vs posting date, month-end reporting should use posting/cutoff timestamp) - Carry-Forward Balance: Query applies forward-fill (LAST_VALUE IGNORE NULLS) so months with no activity show the last known balance; NULL only when the account has no prior history
Query Limitations
Hardcoded Values
- Reporting Period: Q1 2024 hardcoded in VALUES clause (not parameterized)
- Month-End Dates: Hardcoded month-end dates (2024-01-31, 2024-02-29, 2024-03-31)
- Extension Required: To support other quarters/years, VALUES clause must be updated
Schema Dependencies
- Column Names: Query assumes specific column names (account_id, new_balance, tx_date, id)
- Data Types: Assumes
tx_dateis TIMESTAMP,new_balanceis DECIMAL,idis sortable - Partition Structure: Assumes partitioning by
tx_date(year/month) for optimal performance
Performance Assumptions
- Partition Pruning: Query performance depends on effective partition pruning
- Parquet Optimization: Assumes Parquet column statistics (min/max) enable efficient filtering
- Athena-Specific: Performance estimates based on Athena behavior (may vary on other engines)
See also
- SQL Query Breakdown - Detailed query explanation and design rationale
- SQL Implementation Code - Complete SQL query code
- Data Lake Architecture - Architecture that this query reads from
- Parquet Schema Specification - Silver layer schema contract
- SQL Complete Reference - Extended SQL documentation with diagrams