ADR-002: Year/Month Partitioning Strategy

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

Status

Accepted

Context

The system needs to support efficient month-end reporting queries on large datasets (100M+ rows). Query patterns focus on specific time ranges (e.g., Q1 2024: January, February, March).

The following options were considered:

Year/month partitioning (chosen)
Year/month/day partitioning (more granular)
Hash partitioning (even distribution)
No partitioning (full table scans)

Decision

Partition Silver layer data by year and month (year=YYYY/month=MM) based on transaction timestamp (tx_date).

Rationale

Query alignment: Month-end reporting queries filter by year and month, enabling 95% scan reduction
Partition pruning: WHERE clauses like year=2024 AND month IN ('01', '02', '03') scan only 3 partitions out of 60 (5-year retention)
Query performance: <30 seconds for 100M row queries with partition pruning
Cost optimization: 95% reduction in Athena scan costs for Q1 queries
Balance: Granular enough for query optimization, not too granular to cause small file problems

Consequences

Positive:

95% scan reduction for Q1 queries (3 partitions out of 60 scanned)
Query performance: <30 seconds for 100M row queries
Cost optimization: <$1 per query vs $20+ for full table scan
Aligned with business reporting patterns (month-end reports)

Negative:

Write amplification: Small files if low volume per month
Manual partition management: Must ensure partition columns are extracted correctly
Not optimal for day-level queries: Would require scanning entire month partition

Alternatives Considered

Year/Month/Day Partitioning

Why rejected: Too granular for month-end reporting. Small files problem if low daily volume. Query patterns do not require day-level filtering.

Hash Partitioning

Why rejected: Doesn't align with query patterns (time-based queries). No partition pruning benefit.

No Partitioning

Why rejected: Full table scans too expensive (100M+ rows). Query performance unacceptable (>5 minutes).

Design Decisions Summary - Complete trade-off analysis for this decision
ADR-001: Parquet-only Format - Parquet format enables partition pruning
ADR-003: Serverless Architecture - Athena benefits from partition pruning

Implementation Evidence

Code: ETL adds year and month partition columns from transaction timestamp
Documentation: SQL Breakdown - Partition Pruning - Query optimization patterns
Architecture: Data Lake Architecture - Partition Strategy - Partition design rationale

Status​

Context​

Decision​

Rationale​

Consequences​

Alternatives Considered​

Year/Month/Day Partitioning​

Hash Partitioning​

No Partitioning​

Related Decisions​

Implementation Evidence​