© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Parquet Schema Specification - Silver Layer

Document Type: Normative Specification
Audience: Engineering, Platform Team
Status: Active

For architecture context: See Data Lake Architecture - Silver Layer Schema. For design rationale, see Data Model Design Rationale.

Purpose

This document defines the canonical Parquet schema for the Silver layer (validated transactions). This schema serves as the contract between ETL ingestion and analytical consumption.

Silver Layer Schema

Table Name

Logical table: silver.transactions
Physical location: s3://ohpen-silver/silver/mortgages/transactions/

Column Specification

Column	Athena Type	Nullable	Semantic Meaning
`id`	`BIGINT`	NOT NULL	Unique transaction identifier; stable and sortable for tie-breaking (source: TransactionID per casting rule).
`account_id`	`VARCHAR`	NOT NULL	Account (ledger) identifier; one row per account per transaction (source: CustomerID).
`amount`	`DECIMAL(16,2)`	NOT NULL	Signed transaction amount in account currency; credits positive, debits negative.
`currency`	`VARCHAR(3)`	NOT NULL	ISO-4217 currency code of the transaction; preserved for multi-currency reporting and validation.
`tx_date`	`TIMESTAMP`	NOT NULL	Business timestamp of the transaction (UTC); used for ordering and month-end logic.
`year`	`INT`	NOT NULL	Partition key: year of `tx_date`.
`month`	`VARCHAR(2)`	NOT NULL	Partition key: zero-padded month of `tx_date` (e.g. `01`, `12`).
`schema_v`	`VARCHAR`	NOT NULL	Partition key: schema version (e.g., `v1`) for evolution support.
`run_id`	`VARCHAR`	NOT NULL	Partition key: run isolation identifier (unique per ETL execution).

Partitioning Strategy

Partition Keys: year, month, schema_v, run_id

Partition Scheme:

s3://ohpen-silver/silver/mortgages/transactions/
  year=2024/
    month=01/
      schema_v=v1/
        run_id=20240121T120000Z/
          part-00000.parquet
          _SUCCESS

Rationale:

Transaction-time partitioning (year/month from tx_date): Queries filter by transaction dates, not ingestion dates
Schema versioning (schema_v): Supports additive schema evolution with backward compatibility
Run isolation (run_id): Each run writes to unique path, enabling safe backfills and idempotent reruns

Data Type Casting Rules

From CSV to Parquet

CSV Column	CSV Type	Parquet Column	Parquet Type	Transformation
`TransactionID`	String	`id`	BIGINT	Cast to numeric (or deterministic surrogate if non-numeric)
`CustomerID`	String	`account_id`	VARCHAR	1:1; type VARCHAR
`TransactionAmount`	String/Numeric	`amount`	DECIMAL(16,2)	Cast to DECIMAL(16,2)
`Currency`	String	`currency`	VARCHAR(3)	1:1; validated against ISO-4217 allowlist
`TransactionTimestamp`	String	`tx_date`	TIMESTAMP	Parse to TIMESTAMP (UTC)
—	—	`year`	INT	Extract from `tx_date`
—	—	`month`	VARCHAR(2)	Extract from `tx_date` (zero-padded)

Schema Evolution Rules

Additive-Only Changes

Allowed:

Add new nullable columns
Add new partition keys (requires new schema_v)
Rename schema version (e.g., schema_v=v1 → schema_v=v2)

Not Allowed:

Remove columns
Change column types (breaking change)
Rename existing columns (breaking change)
Add NOT NULL columns (breaks backward compatibility)

Schema Versioning

When schema changes are needed:

Create new schema_v (e.g., v2)
Write new data to schema_v=v2/ partition
Existing data remains in schema_v=v1/
Athena queries can union both versions or query specific version

Example path coexistence:

s3://ohpen-silver/silver/mortgages/transactions/
  year=2024/
    month=01/
      schema_v=v1/  (original schema)
      schema_v=v2/  (new schema with additional columns)

Schema change impact checklist (e.g. adding TransactionType)

Before adding a new column (e.g. TransactionType) or changing the schema:

Notify consumers – Identify downstream consumers (Athena workgroups, BI tools, reports). Notify owners of the planned change and effective date.
Additive change – New columns MUST be nullable (or have a default that does not change semantics). Example: add transaction_type VARCHAR(32) NULL in schema_v=v2; treat missing as 'unknown'.
Dual-read period – Consumers that need the new column can migrate to schema_v=v2; existing queries on schema_v=v1 (or union v1/v2) continue to work. Avoid breaking existing SELECTs.
Update Glue Data Catalog – Register new table/partition definition for schema_v=v2 in Terraform or Glue; document in Data Lake Architecture - Schema Evolution.
Runbooks – Update any runbooks or SQL examples that assume the old column set. Backfill playbook applies when historical partitions need the new column.

Consumer registry (recommended): Maintain a list of consumers of Silver/Gold (e.g. "Month-end balance report", "Athena workgroup X", "Dashboard Y") and include them in the notification step. See Data Lake Architecture - Ownership.

Nullability Contract

NOT NULL Columns

All core data columns are NOT NULL:

id - Every transaction must have unique identifier
account_id - Every transaction must belong to an account
amount - Every transaction must have an amount (can be 0.00)
currency - Every transaction must have currency code
tx_date - Every transaction must have timestamp

Nullable Columns

No columns are nullable in the current schema (v1).

Future additions: New columns added in schema_v=v2 or later MUST be nullable to maintain backward compatibility with v1 data.

File Format Specifications

Parquet Configuration

Compression: SNAPPY (balance between compression ratio and speed)
File size target: ~128MB per file (optimal for Athena/Spark)
Row group size: 128MB (Parquet default)
Column encoding: Dictionary encoding where applicable

Success Markers

Each run_id partition MUST include a _SUCCESS marker file:

{
  "run_id": "20240121T120000Z",
  "execution_arn": "arn:aws:states:...",
  "completed_at": "2024-01-21T12:05:30Z",
  "metrics": {
    "input_rows": 1000000,
    "valid_rows": 995000,
    "quarantined_rows": 5000,
    "condemned_rows": 0
  }
}

Purpose​

Silver Layer Schema​

Table Name​

Column Specification​

Partitioning Strategy​

Data Type Casting Rules​

From CSV to Parquet​

Schema Evolution Rules​

Additive-Only Changes​

Schema Versioning​

Schema change impact checklist (e.g. adding TransactionType)​

Nullability Contract​

NOT NULL Columns​

Nullable Columns​

File Format Specifications​

Parquet Configuration​

Success Markers​

See also​