Skip to main content

SQL Analytics Boundaries

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This document captures the explicit assumptions and edge cases considered for the SQL query implementation.


Query Assumptions

Data Volume & Performance

  • Table Size: Designed for 100M+ row tables
  • Partition Pruning: Assumes table is partitioned by tx_date (year/month) for efficient scanning
  • Query Performance: Estimated < 30 seconds for month-end reports (based on partition pruning analysis)
  • Scan Reduction: Partition pruning reduces data scanned by ~95% for Q1 2024 queries (approximation)
    • Calculation example: For 5 years of retained data (60 months total), querying Q1 2024 (3 months: Jan, Feb, Mar) scans only 3/60 = 5% of partitions, thus 95% reduction
    • Note: Actual reduction depends on data retention policy and query date range. Production systems may retain 3-7 years of data, resulting in 91-97% reduction for quarterly queries

Schema Assumptions

  • Input Schema: Query uses schema from Appendix A (id, account_id, amount, new_balance, tx_date)
  • Schema Mismatch: Task 1 ETL produces different schema (TransactionID, CustomerID, etc.)
    • Assumption: Schema transformation/mapping step required if query runs on Task 1 output
    • Mapping: TransactionID → id, CustomerID → account_id, TransactionTimestamp → tx_date
    • Calculated Field: new_balance would need to be calculated from TransactionAmount (cumulative sum)
  • Logical data model: For a single, ledger-ready Parquet schema and explicit mapping/assumptions bridging Task 1 and Task 3, see Logical Data Model: Task 1 Parquet ↔ Task 3 SQL Ledger.

Query Logic Assumptions

  • Month-End Balance Definition: Last transaction's new_balance within the month (ordered by tx_date DESC, tie-broken by id DESC)
  • Missing Months (carry-forward): Months with no transactions show the last known balance; NULL only when the account has no prior history in the period
  • Reporting Period: Hardcoded to Q1 2024 (January, February, March)
  • Leap Year Handling: February 29, 2024 correctly handled (2024 is a leap year)

Engine Compatibility

  • SQL Dialect: Standard ANSI/Presto SQL syntax
  • Compatibility: Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines
  • Portability: Avoids engine-specific features for maximum portability

Edge Cases Handled

Missing Data

  • No Transactions in Month: Carry-forward applies; balance is the last known balance (NULL only if no prior history)
  • No Transactions for Account: Account appears in output with NULL balances for all months (no prior history)
  • Missing Partition: If partition for a month does not exist, query still returns NULL (does not fail)

Data Quality Edge Cases

  • Duplicate Transactions: Window function with ROW_NUMBER() ensures deterministic selection (tie-breaker by id)
  • NULL Values: NULL new_balance values are preserved (not filtered out)
  • Invalid Dates: Query filters to tx_date >= '2024-01-01' AND tx_date < '2024-04-01' (invalid dates outside range ignored)

Performance Edge Cases

  • Large Account Count: DISTINCT on account_id may be expensive on very large tables (mitigated by partition pruning)
  • Skewed Data: Uneven distribution of transactions across accounts may impact window function performance
  • Cross-Partition Scans: If data spans many partitions, query may scan more data than expected

Business Logic Edge Cases

  • Tie-Breaking: Multiple transactions with same tx_date use id DESC as tie-breaker (deterministic but may not match business intent)
  • Time Semantics: Query uses tx_date as ordering timestamp (if domain distinguishes transaction vs posting date, month-end reporting should use posting/cutoff timestamp)
  • Carry-Forward Balance: Query applies forward-fill (LAST_VALUE IGNORE NULLS) so months with no activity show the last known balance; NULL only when the account has no prior history

Query Limitations

Hardcoded Values

  • Reporting Period: Q1 2024 hardcoded in VALUES clause (not parameterized)
  • Month-End Dates: Hardcoded month-end dates (2024-01-31, 2024-02-29, 2024-03-31)
  • Extension Required: To support other quarters/years, VALUES clause must be updated

Schema Dependencies

  • Column Names: Query assumes specific column names (account_id, new_balance, tx_date, id)
  • Data Types: Assumes tx_date is TIMESTAMP, new_balance is DECIMAL, id is sortable
  • Partition Structure: Assumes partitioning by tx_date (year/month) for optimal performance

Performance Assumptions

  • Partition Pruning: Query performance depends on effective partition pruning
  • Parquet Optimization: Assumes Parquet column statistics (min/max) enable efficient filtering
  • Athena-Specific: Performance estimates based on Athena behavior (may vary on other engines)

See also

© 2026 Stephen AdeiCC BY 4.0