Skip to main content

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Data Model Design Rationale

Document Type: Explanatory Documentation
Audience: Architecture Review, Engineering, Interviewers
Status: Active

For architecture context: See Data Lake Architecture - Schema Evolution. For the normative schema specification, see Parquet Schema Specification.


Purpose

This document explains the "why" behind the Parquet schema design decisions. For the normative schema specification, see Parquet Schema Specification.


Context: Task 1 (ETL) ↔ Task 3 (SQL)

The Silver-layer Parquet schema bridges two business case requirements:

  1. Task 1 (Data Ingestion): Transform raw CSV → validated Parquet
  2. Task 3 (SQL Analytics): Query month-end balances from validated data

Design Goal: Create a schema that serves both requirements without duplication or ambiguity.


Key Design Decisions

1. Transaction Log Pattern (No Materialized Balance)

Decision: Store transactions only (amounts), derive balances at query time.

Rationale:

Running balance depends on full, ordered history per account. Materializing new_balance at ingestion would require:

Option A: Read existing Silver per account to carry balance forward

  • Problem: Couples ingestion to prior state; adds complexity and coupling
  • Problem: Requires per-account lookups during batch processing
  • Problem: Breaks if historical data is reprocessed (balance drift)

Option B: Accept per-batch balances (only correct after full historical recompute)

  • Problem: Misleading for ad-hoc queries (balance is wrong until full recompute)
  • Problem: Requires full reprocessing to fix any historical error

Chosen Approach: Derive new_balance in Gold layer (SQL query or batch job)

  • Benefit: Keeps Silver as append-only transaction log (standard ledger practice)
  • Benefit: Avoids ordering/state-carry logic at ingestion
  • Benefit: Balance derivation is deterministic and replayable
  • Benefit: Separates concerns: ingestion = store facts, aggregation = derive state

Trade-off: Month-end balance queries must compute running sum (window function). Acceptable for batch OLAP workload.


2. Account vs Customer Semantics

Decision: CustomerID in CSV maps to account_id in Parquet (treated as account identifier).

Rationale:

In the business case, "customer" and "account" are aligned: one ledger per customer/account.

Future-proofing:

  • If the business later distinguishes customer (party) vs account (contract), the same Parquet can be used with account_id representing the ledger key
  • One customer may have multiple accounts → account_id remains unique per ledger
  • No schema change needed; semantic interpretation changes only

Terminology choice: Use account_id (not customer_id) in Parquet to align with financial ledger terminology.


3. Opening Balance Handling

Decision: No opening balance column in Parquet; opening balance is implicit or applied separately.

Rationale:

Opening balance is either:

  • Option A: First transaction implies starting balance of 0 before that transaction
  • Option B: Opening balance is applied in a separate process (e.g., Gold snapshot or query parameter)

Both options are consistent with transaction log pattern. Task 3's "last new_balance in the month" can be computed as:

  • "Balance after last transaction in month" (cumulative sum from opening)
  • "Opening + sum(amount) up to that point"

Convention: "No prior history" → NULL balance in month-end report (see SQL_BREAKDOWN.md carry-forward logic).


4. Currency Handling

Decision: Store currency in Parquet; validate at ingestion (ISO-4217 allowlist).

Rationale:

Why store currency?

  • Multi-currency reporting: Future queries may need to filter/group by currency
  • Audit trail: Preserves original currency for debugging and compliance
  • Validation evidence: Storing validated currency confirms ingestion validated ISO-4217

Why not convert to single currency?

  • FX conversion requires exchange rates, valuation time, and consistent rounding policy
  • Not specified in business case requirements
  • Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer)

Storage overhead: 3 bytes per row (VARCHAR(3)) - negligible for analytical workload.


5. Transaction Ordering Guarantees

Decision: Ordering for balance derivation is (account_id, tx_date, id).

Rationale:

Why not rely on ingestion order?

  • Ingestion does not guarantee global order across files or runs
  • Backfills and late-arriving data may arrive out-of-order

Why tx_date (transaction time)?

  • Balance should reflect transaction occurrence time, not ingestion time
  • Month-end balances are defined by transaction date, not when the data was received

Why id as tie-breaker?

  • Same-tx_date ties must be broken deterministically
  • id is stable, unique, and sortable
  • SQL: ORDER BY tx_date, id ensures consistent results across runs

Deduplication: (TransactionID, event_date) uniqueness enforced by ETL (loop prevention) ensures at most one row per business event in Silver.


6. Identifier Casting and Stability

Decision: id column is BIGINT (stable, unique, sortable).

Rationale:

Why BIGINT?

  • Task 3 tie-breaker uses id DESC - requires sortable type
  • BIGINT is standard for numeric identifiers (supports 64-bit integers)

What if TransactionID is non-numeric?

  • Specification allows VARCHAR(64) for id if TransactionID is non-numeric
  • Task 3 query uses string ordering (still deterministic and stable)

Stability requirement: id must be stable across runs (same transaction = same ID). Casting rule must be deterministic.


7. Partitioning Strategy

Decision: Partition by (year, month, schema_v, run_id).

Rationale:

Why year/month (transaction-time)?

  • Queries filter by transaction dates (e.g., Q1 2024)
  • Enables 95% scan reduction via partition pruning
  • Aligns with business reporting period (month-end balances)

Why not ingest_date (ingestion-time)?

  • Bronze layer uses ingest_date (when file arrived)
  • Silver layer uses year/month from tx_date (when transaction occurred)
  • Analytical queries care about transaction time, not ingestion time

Why schema_v?

  • Supports schema evolution with backward compatibility
  • Multiple schema versions can coexist (v1, v2) in same table
  • Athena queries can union versions or query specific version

Why run_id?

  • Run isolation: Each ETL run writes to unique path
  • Safe backfills: Reruns do not overwrite existing data
  • Idempotent: Same input produces same output (deterministic)
  • Audit trail: Full history of all runs preserved

Comparison: Silver (Transaction Log) vs Gold (Balance Report)

ConcernSilver LayerGold Layer (Task 3)
Schemaid, account_id, amount, currency, tx_date, partitionsaccount_id, month, balance, partitions
GranularityTransaction-level (one row per transaction)Month-level (one row per account per month)
BalanceNOT stored (derive at query time)Stored (materialized via SQL query or batch job)
Cardinality1:1 from Bronze (one validation contract)1:N from Silver (multiple aggregations)
Query patternAppend-only log (immutable)Aggregation from Silver
OwnershipPlatform Team → Domain TeamDomain Team → Business (Finance)

Transformation: SQL query (Task 3) reads Silver → derives running balance → writes Gold.


Interview Defense: Common Questions

Q: Why not store new_balance in Parquet?

A: Running balance requires full ordered history per account. Storing balance at ingestion couples ingestion to prior state and breaks replayability. Deriving balance in Gold layer (SQL/batch job) keeps Silver as append-only transaction log (standard ledger practice) and makes balance computation deterministic and replayable.

Q: What if transactions arrive out-of-order?

A: Silver is partitioned by transaction-time (year/month from tx_date), not ingestion-time. Late-arriving data is written to the correct transaction-time partition. Balance queries use ORDER BY tx_date, id for deterministic ordering. Backfills are safe due to run_id isolation (each run writes to unique path).

Q: How do you handle opening balance?

A: Opening balance is either implicit (first transaction = balance 0 before) or applied separately (Gold snapshot or query parameter). Task 3 carry-forward logic: NULL if no prior history, otherwise last known balance in period.

Q: Why partition by year/month instead of ingest_date?

A: Analytical queries filter by transaction dates (e.g., "Q1 2024 month-end balances"), not ingestion dates. Transaction-time partitioning enables 95% scan reduction via partition pruning. Bronze layer uses ingest_date (arrival time); Silver layer uses year/month from tx_date (occurrence time).

Q: What if currency conversion is needed?

A: Currency is stored but not converted. FX conversion requires exchange rates, valuation time, and rounding policy (not specified in requirements). Conversion is aggregation concern (Gold layer), not ingestion concern (Silver layer). Silver preserves original currency for audit trail and future multi-currency reporting.


See also


Version: 1.0
Last Updated: 2024-01-30
Owner: Data Platform Team

© 2026 Stephen AdeiCC BY 4.0