© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Data Model Design Rationale

Document Type: Explanatory Documentation
Audience: Architecture Review, Engineering, Interviewers
Status: Active

For architecture context: See Data Lake Architecture - Schema Evolution. For the normative schema specification, see Parquet Schema Specification.

Purpose

This document explains the "why" behind the Parquet schema design decisions. For the normative schema specification, see Parquet Schema Specification.

Context: Task 1 (ETL) ↔ Task 3 (SQL)

The Silver-layer Parquet schema bridges two business case requirements:

Task 1 (Data Ingestion): Transform raw CSV → validated Parquet
Task 3 (SQL Analytics): Query month-end balances from validated data

Design Goal: Create a schema that serves both requirements without duplication or ambiguity.

Key Design Decisions

1. Transaction Log Pattern (No Materialized Balance)

Decision: Store transactions only (amounts), derive balances at query time.

Rationale:

Running balance depends on full, ordered history per account. Materializing new_balance at ingestion would require:

Option A: Read existing Silver per account to carry balance forward

Problem: Couples ingestion to prior state; adds complexity and coupling
Problem: Requires per-account lookups during batch processing
Problem: Breaks if historical data is reprocessed (balance drift)

Option B: Accept per-batch balances (only correct after full historical recompute)

Problem: Misleading for ad-hoc queries (balance is wrong until full recompute)
Problem: Requires full reprocessing to fix any historical error

Chosen Approach: Derive new_balance in Gold layer (SQL query or batch job)

Benefit: Keeps Silver as append-only transaction log (standard ledger practice)
Benefit: Avoids ordering/state-carry logic at ingestion
Benefit: Balance derivation is deterministic and replayable
Benefit: Separates concerns: ingestion = store facts, aggregation = derive state

Trade-off: Month-end balance queries must compute running sum (window function). Acceptable for batch OLAP workload.

2. Account vs Customer Semantics

Decision: CustomerID in CSV maps to account_id in Parquet (treated as account identifier).

Rationale:

In the business case, "customer" and "account" are aligned: one ledger per customer/account.

Future-proofing:

If the business later distinguishes customer (party) vs account (contract), the same Parquet can be used with account_id representing the ledger key
One customer may have multiple accounts → account_id remains unique per ledger
No schema change needed; semantic interpretation changes only

Terminology choice: Use account_id (not customer_id) in Parquet to align with financial ledger terminology.

3. Opening Balance Handling

Decision: No opening balance column in Parquet; opening balance is implicit or applied separately.

Rationale:

Opening balance is either:

Option A: First transaction implies starting balance of 0 before that transaction
Option B: Opening balance is applied in a separate process (e.g., Gold snapshot or query parameter)

Both options are consistent with transaction log pattern. Task 3's "last new_balance in the month" can be computed as:

"Balance after last transaction in month" (cumulative sum from opening)
"Opening + sum(amount) up to that point"

Convention: "No prior history" → NULL balance in month-end report (see SQL_BREAKDOWN.md carry-forward logic).

4. Currency Handling

Decision: Store currency in Parquet; validate at ingestion (ISO-4217 allowlist).

Rationale:

Why store currency?

Multi-currency reporting: Future queries may need to filter/group by currency
Audit trail: Preserves original currency for debugging and compliance
Validation evidence: Storing validated currency confirms ingestion validated ISO-4217

Why not convert to single currency?

FX conversion requires exchange rates, valuation time, and consistent rounding policy
Not specified in business case requirements
Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer)

Storage overhead: 3 bytes per row (VARCHAR(3)) - negligible for analytical workload.

5. Transaction Ordering Guarantees

Decision: Ordering for balance derivation is (account_id, tx_date, id).

Rationale:

Why not rely on ingestion order?

Ingestion does not guarantee global order across files or runs
Backfills and late-arriving data may arrive out-of-order

Why tx_date (transaction time)?

Balance should reflect transaction occurrence time, not ingestion time
Month-end balances are defined by transaction date, not when the data was received

Why id as tie-breaker?

Same-tx_date ties must be broken deterministically
id is stable, unique, and sortable
SQL: ORDER BY tx_date, id ensures consistent results across runs

Deduplication: (TransactionID, event_date) uniqueness enforced by ETL (loop prevention) ensures at most one row per business event in Silver.

6. Identifier Casting and Stability

Decision: id column is BIGINT (stable, unique, sortable).

Rationale:

Why BIGINT?

Task 3 tie-breaker uses id DESC - requires sortable type
BIGINT is standard for numeric identifiers (supports 64-bit integers)

What if TransactionID is non-numeric?

Specification allows VARCHAR(64) for id if TransactionID is non-numeric
Task 3 query uses string ordering (still deterministic and stable)

Stability requirement: id must be stable across runs (same transaction = same ID). Casting rule must be deterministic.

7. Partitioning Strategy

Decision: Partition by (year, month, schema_v, run_id).

Rationale:

Why year/month (transaction-time)?

Queries filter by transaction dates (e.g., Q1 2024)
Enables 95% scan reduction via partition pruning
Aligns with business reporting period (month-end balances)

Why not ingest_date (ingestion-time)?

Bronze layer uses ingest_date (when file arrived)
Silver layer uses year/month from tx_date (when transaction occurred)
Analytical queries care about transaction time, not ingestion time

Why schema_v?

Supports schema evolution with backward compatibility
Multiple schema versions can coexist (v1, v2) in same table
Athena queries can union versions or query specific version

Why run_id?

Run isolation: Each ETL run writes to unique path
Safe backfills: Reruns do not overwrite existing data
Idempotent: Same input produces same output (deterministic)
Audit trail: Full history of all runs preserved

Comparison: Silver (Transaction Log) vs Gold (Balance Report)

Concern	Silver Layer	Gold Layer (Task 3)
Schema	`id`, `account_id`, `amount`, `currency`, `tx_date`, partitions	`account_id`, `month`, `balance`, partitions
Granularity	Transaction-level (one row per transaction)	Month-level (one row per account per month)
Balance	NOT stored (derive at query time)	Stored (materialized via SQL query or batch job)
Cardinality	1:1 from Bronze (one validation contract)	1:N from Silver (multiple aggregations)
Query pattern	Append-only log (immutable)	Aggregation from Silver
Ownership	Platform Team → Domain Team	Domain Team → Business (Finance)

Transformation: SQL query (Task 3) reads Silver → derives running balance → writes Gold.

Interview Defense: Common Questions

Q: Why not store `new_balance` in Parquet?

A: Running balance requires full ordered history per account. Storing balance at ingestion couples ingestion to prior state and breaks replayability. Deriving balance in Gold layer (SQL/batch job) keeps Silver as append-only transaction log (standard ledger practice) and makes balance computation deterministic and replayable.

Q: What if transactions arrive out-of-order?

A: Silver is partitioned by transaction-time (year/month from tx_date), not ingestion-time. Late-arriving data is written to the correct transaction-time partition. Balance queries use ORDER BY tx_date, id for deterministic ordering. Backfills are safe due to run_id isolation (each run writes to unique path).

Q: How do you handle opening balance?

A: Opening balance is either implicit (first transaction = balance 0 before) or applied separately (Gold snapshot or query parameter). Task 3 carry-forward logic: NULL if no prior history, otherwise last known balance in period.

Q: Why partition by `year`/`month` instead of `ingest_date`?

A: Analytical queries filter by transaction dates (e.g., "Q1 2024 month-end balances"), not ingestion dates. Transaction-time partitioning enables 95% scan reduction via partition pruning. Bronze layer uses ingest_date (arrival time); Silver layer uses year/month from tx_date (occurrence time).

Q: What if currency conversion is needed?

A: Currency is stored but not converted. FX conversion requires exchange rates, valuation time, and rounding policy (not specified in requirements). Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer). Silver preserves original currency for audit trail and future multi-currency reporting.

Purpose​

Context: Task 1 (ETL) ↔ Task 3 (SQL)​

Key Design Decisions​

1. Transaction Log Pattern (No Materialized Balance)​

2. Account vs Customer Semantics​

3. Opening Balance Handling​

4. Currency Handling​

5. Transaction Ordering Guarantees​

6. Identifier Casting and Stability​

7. Partitioning Strategy​

Comparison: Silver (Transaction Log) vs Gold (Balance Report)​

Interview Defense: Common Questions​

Q: Why not store new_balance in Parquet?​

Q: What if transactions arrive out-of-order?​

Q: How do you handle opening balance?​

Q: Why partition by year/month instead of ingest_date?​

Q: What if currency conversion is needed?​

See also​

Purpose

Context: Task 1 (ETL) ↔ Task 3 (SQL)

Key Design Decisions

1. Transaction Log Pattern (No Materialized Balance)

2. Account vs Customer Semantics

3. Opening Balance Handling

4. Currency Handling

5. Transaction Ordering Guarantees

6. Identifier Casting and Stability

7. Partitioning Strategy

Comparison: Silver (Transaction Log) vs Gold (Balance Report)

Interview Defense: Common Questions

Q: Why not store `new_balance` in Parquet?

Q: What if transactions arrive out-of-order?

Q: How do you handle opening balance?

Q: Why partition by `year`/`month` instead of `ingest_date`?

Q: What if currency conversion is needed?

See also