Skip to main content

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Parquet Schema Specification - Silver Layer

Document Type: Normative Specification
Audience: Engineering, Platform Team
Status: Active

For architecture context: See Data Lake Architecture - Silver Layer Schema. For design rationale, see Data Model Design Rationale.


Purpose

This document defines the canonical Parquet schema for the Silver layer (validated transactions). This schema serves as the contract between ETL ingestion and analytical consumption.


Silver Layer Schema

Table Name

Logical table: silver.transactions
Physical location: s3://ohpen-silver/silver/mortgages/transactions/

Column Specification

ColumnAthena TypeNullableSemantic Meaning
idBIGINTNOT NULLUnique transaction identifier; stable and sortable for tie-breaking (source: TransactionID per casting rule).
account_idVARCHARNOT NULLAccount (ledger) identifier; one row per account per transaction (source: CustomerID).
amountDECIMAL(16,2)NOT NULLSigned transaction amount in account currency; credits positive, debits negative.
currencyVARCHAR(3)NOT NULLISO-4217 currency code of the transaction; preserved for multi-currency reporting and validation.
tx_dateTIMESTAMPNOT NULLBusiness timestamp of the transaction (UTC); used for ordering and month-end logic.
yearINTNOT NULLPartition key: year of tx_date.
monthVARCHAR(2)NOT NULLPartition key: zero-padded month of tx_date (e.g. 01, 12).
schema_vVARCHARNOT NULLPartition key: schema version (e.g., v1) for evolution support.
run_idVARCHARNOT NULLPartition key: run isolation identifier (unique per ETL execution).

Partitioning Strategy

Partition Keys: year, month, schema_v, run_id

Partition Scheme:

s3://ohpen-silver/silver/mortgages/transactions/
year=2024/
month=01/
schema_v=v1/
run_id=20240121T120000Z/
part-00000.parquet
_SUCCESS

Rationale:

  • Transaction-time partitioning (year/month from tx_date): Queries filter by transaction dates, not ingestion dates
  • Schema versioning (schema_v): Supports additive schema evolution with backward compatibility
  • Run isolation (run_id): Each run writes to unique path, enabling safe backfills and idempotent reruns

Data Type Casting Rules

From CSV to Parquet

CSV ColumnCSV TypeParquet ColumnParquet TypeTransformation
TransactionIDStringidBIGINTCast to numeric (or deterministic surrogate if non-numeric)
CustomerIDStringaccount_idVARCHAR1:1; type VARCHAR
TransactionAmountString/NumericamountDECIMAL(16,2)Cast to DECIMAL(16,2)
CurrencyStringcurrencyVARCHAR(3)1:1; validated against ISO-4217 allowlist
TransactionTimestampStringtx_dateTIMESTAMPParse to TIMESTAMP (UTC)
yearINTExtract from tx_date
monthVARCHAR(2)Extract from tx_date (zero-padded)

Schema Evolution Rules

Additive-Only Changes

Allowed:

  • Add new nullable columns
  • Add new partition keys (requires new schema_v)
  • Rename schema version (e.g., schema_v=v1schema_v=v2)

Not Allowed:

  • Remove columns
  • Change column types (breaking change)
  • Rename existing columns (breaking change)
  • Add NOT NULL columns (breaks backward compatibility)

Schema Versioning

When schema changes are needed:

  1. Create new schema_v (e.g., v2)
  2. Write new data to schema_v=v2/ partition
  3. Existing data remains in schema_v=v1/
  4. Athena queries can union both versions or query specific version

Example path coexistence:

s3://ohpen-silver/silver/mortgages/transactions/
year=2024/
month=01/
schema_v=v1/ (original schema)
schema_v=v2/ (new schema with additional columns)

Schema change impact checklist (e.g. adding TransactionType)

Before adding a new column (e.g. TransactionType) or changing the schema:

  1. Notify consumers – Identify downstream consumers (Athena workgroups, BI tools, reports). Notify owners of the planned change and effective date.
  2. Additive change – New columns MUST be nullable (or have a default that does not change semantics). Example: add transaction_type VARCHAR(32) NULL in schema_v=v2; treat missing as 'unknown'.
  3. Dual-read period – Consumers that need the new column can migrate to schema_v=v2; existing queries on schema_v=v1 (or union v1/v2) continue to work. Avoid breaking existing SELECTs.
  4. Update Glue Data Catalog – Register new table/partition definition for schema_v=v2 in Terraform or Glue; document in Data Lake Architecture - Schema Evolution.
  5. Runbooks – Update any runbooks or SQL examples that assume the old column set. Backfill playbook applies when historical partitions need the new column.

Consumer registry (recommended): Maintain a list of consumers of Silver/Gold (e.g. "Month-end balance report", "Athena workgroup X", "Dashboard Y") and include them in the notification step. See Data Lake Architecture - Ownership.


Nullability Contract

NOT NULL Columns

All core data columns are NOT NULL:

  • id - Every transaction must have unique identifier
  • account_id - Every transaction must belong to an account
  • amount - Every transaction must have an amount (can be 0.00)
  • currency - Every transaction must have currency code
  • tx_date - Every transaction must have timestamp

Nullable Columns

No columns are nullable in the current schema (v1).

Future additions: New columns added in schema_v=v2 or later MUST be nullable to maintain backward compatibility with v1 data.


File Format Specifications

Parquet Configuration

  • Compression: SNAPPY (balance between compression ratio and speed)
  • File size target: ~128MB per file (optimal for Athena/Spark)
  • Row group size: 128MB (Parquet default)
  • Column encoding: Dictionary encoding where applicable

Success Markers

Each run_id partition MUST include a _SUCCESS marker file:

{
"run_id": "20240121T120000Z",
"execution_arn": "arn:aws:states:...",
"completed_at": "2024-01-21T12:05:30Z",
"metrics": {
"input_rows": 1000000,
"valid_rows": 995000,
"quarantined_rows": 5000,
"condemned_rows": 0
}
}

See also


Version: 1.0
Last Updated: 2024-01-30
Owner: Data Platform Team

© 2026 Stephen AdeiCC BY 4.0