© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Data Model Design Rationale
Document Type: Explanatory Documentation
Audience: Architecture Review, Engineering, Interviewers
Status: Active
For architecture context: See Data Lake Architecture - Schema Evolution. For the normative schema specification, see Parquet Schema Specification.
Purpose
This document explains the "why" behind the Parquet schema design decisions. For the normative schema specification, see Parquet Schema Specification.
Context: Task 1 (ETL) ↔ Task 3 (SQL)
The Silver-layer Parquet schema bridges two business case requirements:
- Task 1 (Data Ingestion): Transform raw CSV → validated Parquet
- Task 3 (SQL Analytics): Query month-end balances from validated data
Design Goal: Create a schema that serves both requirements without duplication or ambiguity.
Key Design Decisions
1. Transaction Log Pattern (No Materialized Balance)
Decision: Store transactions only (amounts), derive balances at query time.
Rationale:
Running balance depends on full, ordered history per account. Materializing new_balance at ingestion would require:
Option A: Read existing Silver per account to carry balance forward
- Problem: Couples ingestion to prior state; adds complexity and coupling
- Problem: Requires per-account lookups during batch processing
- Problem: Breaks if historical data is reprocessed (balance drift)
Option B: Accept per-batch balances (only correct after full historical recompute)
- Problem: Misleading for ad-hoc queries (balance is wrong until full recompute)
- Problem: Requires full reprocessing to fix any historical error
Chosen Approach: Derive new_balance in Gold layer (SQL query or batch job)
- Benefit: Keeps Silver as append-only transaction log (standard ledger practice)
- Benefit: Avoids ordering/state-carry logic at ingestion
- Benefit: Balance derivation is deterministic and replayable
- Benefit: Separates concerns: ingestion = store facts, aggregation = derive state
Trade-off: Month-end balance queries must compute running sum (window function). Acceptable for batch OLAP workload.
2. Account vs Customer Semantics
Decision: CustomerID in CSV maps to account_id in Parquet (treated as account identifier).
Rationale:
In the business case, "customer" and "account" are aligned: one ledger per customer/account.
Future-proofing:
- If the business later distinguishes customer (party) vs account (contract), the same Parquet can be used with
account_idrepresenting the ledger key - One customer may have multiple accounts →
account_idremains unique per ledger - No schema change needed; semantic interpretation changes only
Terminology choice: Use account_id (not customer_id) in Parquet to align with financial ledger terminology.
3. Opening Balance Handling
Decision: No opening balance column in Parquet; opening balance is implicit or applied separately.
Rationale:
Opening balance is either:
- Option A: First transaction implies starting balance of 0 before that transaction
- Option B: Opening balance is applied in a separate process (e.g., Gold snapshot or query parameter)
Both options are consistent with transaction log pattern. Task 3's "last new_balance in the month" can be computed as:
- "Balance after last transaction in month" (cumulative sum from opening)
- "Opening + sum(amount) up to that point"
Convention: "No prior history" → NULL balance in month-end report (see SQL_BREAKDOWN.md carry-forward logic).
4. Currency Handling
Decision: Store currency in Parquet; validate at ingestion (ISO-4217 allowlist).
Rationale:
Why store currency?
- Multi-currency reporting: Future queries may need to filter/group by currency
- Audit trail: Preserves original currency for debugging and compliance
- Validation evidence: Storing validated currency confirms ingestion validated ISO-4217
Why not convert to single currency?
- FX conversion requires exchange rates, valuation time, and consistent rounding policy
- Not specified in business case requirements
- Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer)
Storage overhead: 3 bytes per row (VARCHAR(3)) - negligible for analytical workload.
5. Transaction Ordering Guarantees
Decision: Ordering for balance derivation is (account_id, tx_date, id).
Rationale:
Why not rely on ingestion order?
- Ingestion does not guarantee global order across files or runs
- Backfills and late-arriving data may arrive out-of-order
Why tx_date (transaction time)?
- Balance should reflect transaction occurrence time, not ingestion time
- Month-end balances are defined by transaction date, not when the data was received
Why id as tie-breaker?
- Same-
tx_dateties must be broken deterministically idis stable, unique, and sortable- SQL:
ORDER BY tx_date, idensures consistent results across runs
Deduplication: (TransactionID, event_date) uniqueness enforced by ETL (loop prevention) ensures at most one row per business event in Silver.
6. Identifier Casting and Stability
Decision: id column is BIGINT (stable, unique, sortable).
Rationale:
Why BIGINT?
- Task 3 tie-breaker uses
id DESC- requires sortable type - BIGINT is standard for numeric identifiers (supports 64-bit integers)
What if TransactionID is non-numeric?
- Specification allows VARCHAR(64) for
idif TransactionID is non-numeric - Task 3 query uses string ordering (still deterministic and stable)
Stability requirement: id must be stable across runs (same transaction = same ID). Casting rule must be deterministic.
7. Partitioning Strategy
Decision: Partition by (year, month, schema_v, run_id).
Rationale:
Why year/month (transaction-time)?
- Queries filter by transaction dates (e.g., Q1 2024)
- Enables 95% scan reduction via partition pruning
- Aligns with business reporting period (month-end balances)
Why not ingest_date (ingestion-time)?
- Bronze layer uses
ingest_date(when file arrived) - Silver layer uses
year/monthfromtx_date(when transaction occurred) - Analytical queries care about transaction time, not ingestion time
Why schema_v?
- Supports schema evolution with backward compatibility
- Multiple schema versions can coexist (v1, v2) in same table
- Athena queries can union versions or query specific version
Why run_id?
- Run isolation: Each ETL run writes to unique path
- Safe backfills: Reruns do not overwrite existing data
- Idempotent: Same input produces same output (deterministic)
- Audit trail: Full history of all runs preserved
Comparison: Silver (Transaction Log) vs Gold (Balance Report)
| Concern | Silver Layer | Gold Layer (Task 3) |
|---|---|---|
| Schema | id, account_id, amount, currency, tx_date, partitions | account_id, month, balance, partitions |
| Granularity | Transaction-level (one row per transaction) | Month-level (one row per account per month) |
| Balance | NOT stored (derive at query time) | Stored (materialized via SQL query or batch job) |
| Cardinality | 1:1 from Bronze (one validation contract) | 1:N from Silver (multiple aggregations) |
| Query pattern | Append-only log (immutable) | Aggregation from Silver |
| Ownership | Platform Team → Domain Team | Domain Team → Business (Finance) |
Transformation: SQL query (Task 3) reads Silver → derives running balance → writes Gold.
Interview Defense: Common Questions
Q: Why not store new_balance in Parquet?
A: Running balance requires full ordered history per account. Storing balance at ingestion couples ingestion to prior state and breaks replayability. Deriving balance in Gold layer (SQL/batch job) keeps Silver as append-only transaction log (standard ledger practice) and makes balance computation deterministic and replayable.
Q: What if transactions arrive out-of-order?
A: Silver is partitioned by transaction-time (year/month from tx_date), not ingestion-time. Late-arriving data is written to the correct transaction-time partition. Balance queries use ORDER BY tx_date, id for deterministic ordering. Backfills are safe due to run_id isolation (each run writes to unique path).
Q: How do you handle opening balance?
A: Opening balance is either implicit (first transaction = balance 0 before) or applied separately (Gold snapshot or query parameter). Task 3 carry-forward logic: NULL if no prior history, otherwise last known balance in period.
Q: Why partition by year/month instead of ingest_date?
A: Analytical queries filter by transaction dates (e.g., "Q1 2024 month-end balances"), not ingestion dates. Transaction-time partitioning enables 95% scan reduction via partition pruning. Bronze layer uses ingest_date (arrival time); Silver layer uses year/month from tx_date (occurrence time).
Q: What if currency conversion is needed?
A: Currency is stored but not converted. FX conversion requires exchange rates, valuation time, and rounding policy (not specified in requirements). Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer). Silver preserves original currency for audit trail and future multi-currency reporting.
See also
- Normative Schema: Parquet Schema Specification
- Data Lake Architecture: Data Lake Architecture
- SQL Analytics (Task 3): SQL Breakdown
Version: 1.0
Last Updated: 2024-01-30
Owner: Data Platform Team