© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
Parquet Schema Specification - Silver Layer
Document Type: Normative Specification
Audience: Engineering, Platform Team
Status: Active
For architecture context: See Data Lake Architecture - Silver Layer Schema. For design rationale, see Data Model Design Rationale.
Purpose
This document defines the canonical Parquet schema for the Silver layer (validated transactions). This schema serves as the contract between ETL ingestion and analytical consumption.
Silver Layer Schema
Table Name
Logical table: silver.transactions
Physical location: s3://ohpen-silver/silver/mortgages/transactions/
Column Specification
| Column | Athena Type | Nullable | Semantic Meaning |
|---|---|---|---|
id | BIGINT | NOT NULL | Unique transaction identifier; stable and sortable for tie-breaking (source: TransactionID per casting rule). |
account_id | VARCHAR | NOT NULL | Account (ledger) identifier; one row per account per transaction (source: CustomerID). |
amount | DECIMAL(16,2) | NOT NULL | Signed transaction amount in account currency; credits positive, debits negative. |
currency | VARCHAR(3) | NOT NULL | ISO-4217 currency code of the transaction; preserved for multi-currency reporting and validation. |
tx_date | TIMESTAMP | NOT NULL | Business timestamp of the transaction (UTC); used for ordering and month-end logic. |
year | INT | NOT NULL | Partition key: year of tx_date. |
month | VARCHAR(2) | NOT NULL | Partition key: zero-padded month of tx_date (e.g. 01, 12). |
schema_v | VARCHAR | NOT NULL | Partition key: schema version (e.g., v1) for evolution support. |
run_id | VARCHAR | NOT NULL | Partition key: run isolation identifier (unique per ETL execution). |
Partitioning Strategy
Partition Keys: year, month, schema_v, run_id
Partition Scheme:
s3://ohpen-silver/silver/mortgages/transactions/
year=2024/
month=01/
schema_v=v1/
run_id=20240121T120000Z/
part-00000.parquet
_SUCCESS
Rationale:
- Transaction-time partitioning (year/month from
tx_date): Queries filter by transaction dates, not ingestion dates - Schema versioning (
schema_v): Supports additive schema evolution with backward compatibility - Run isolation (
run_id): Each run writes to unique path, enabling safe backfills and idempotent reruns
Data Type Casting Rules
From CSV to Parquet
| CSV Column | CSV Type | Parquet Column | Parquet Type | Transformation |
|---|---|---|---|---|
TransactionID | String | id | BIGINT | Cast to numeric (or deterministic surrogate if non-numeric) |
CustomerID | String | account_id | VARCHAR | 1:1; type VARCHAR |
TransactionAmount | String/Numeric | amount | DECIMAL(16,2) | Cast to DECIMAL(16,2) |
Currency | String | currency | VARCHAR(3) | 1:1; validated against ISO-4217 allowlist |
TransactionTimestamp | String | tx_date | TIMESTAMP | Parse to TIMESTAMP (UTC) |
| — | — | year | INT | Extract from tx_date |
| — | — | month | VARCHAR(2) | Extract from tx_date (zero-padded) |
Schema Evolution Rules
Additive-Only Changes
Allowed:
- Add new nullable columns
- Add new partition keys (requires new
schema_v) - Rename schema version (e.g.,
schema_v=v1→schema_v=v2)
Not Allowed:
- Remove columns
- Change column types (breaking change)
- Rename existing columns (breaking change)
- Add NOT NULL columns (breaks backward compatibility)
Schema Versioning
When schema changes are needed:
- Create new
schema_v(e.g.,v2) - Write new data to
schema_v=v2/partition - Existing data remains in
schema_v=v1/ - Athena queries can union both versions or query specific version
Example path coexistence:
s3://ohpen-silver/silver/mortgages/transactions/
year=2024/
month=01/
schema_v=v1/ (original schema)
schema_v=v2/ (new schema with additional columns)
Schema change impact checklist (e.g. adding TransactionType)
Before adding a new column (e.g. TransactionType) or changing the schema:
- Notify consumers – Identify downstream consumers (Athena workgroups, BI tools, reports). Notify owners of the planned change and effective date.
- Additive change – New columns MUST be nullable (or have a default that does not change semantics). Example: add
transaction_type VARCHAR(32) NULLinschema_v=v2; treat missing as'unknown'. - Dual-read period – Consumers that need the new column can migrate to
schema_v=v2; existing queries onschema_v=v1(or union v1/v2) continue to work. Avoid breaking existing SELECTs. - Update Glue Data Catalog – Register new table/partition definition for
schema_v=v2in Terraform or Glue; document in Data Lake Architecture - Schema Evolution. - Runbooks – Update any runbooks or SQL examples that assume the old column set. Backfill playbook applies when historical partitions need the new column.
Consumer registry (recommended): Maintain a list of consumers of Silver/Gold (e.g. "Month-end balance report", "Athena workgroup X", "Dashboard Y") and include them in the notification step. See Data Lake Architecture - Ownership.
Nullability Contract
NOT NULL Columns
All core data columns are NOT NULL:
id- Every transaction must have unique identifieraccount_id- Every transaction must belong to an accountamount- Every transaction must have an amount (can be 0.00)currency- Every transaction must have currency codetx_date- Every transaction must have timestamp
Nullable Columns
No columns are nullable in the current schema (v1).
Future additions: New columns added in schema_v=v2 or later MUST be nullable to maintain backward compatibility with v1 data.
File Format Specifications
Parquet Configuration
- Compression: SNAPPY (balance between compression ratio and speed)
- File size target: ~128MB per file (optimal for Athena/Spark)
- Row group size: 128MB (Parquet default)
- Column encoding: Dictionary encoding where applicable
Success Markers
Each run_id partition MUST include a _SUCCESS marker file:
{
"run_id": "20240121T120000Z",
"execution_arn": "arn:aws:states:...",
"completed_at": "2024-01-21T12:05:30Z",
"metrics": {
"input_rows": 1000000,
"valid_rows": 995000,
"quarantined_rows": 5000,
"condemned_rows": 0
}
}
See also
- ETL Implementation: ETL Flow, ETL Scripts
- Data Lake Architecture: Data Lake Architecture
- Logical Data Model Rationale: Data Model Design Rationale
Version: 1.0
Last Updated: 2024-01-30
Owner: Data Platform Team