© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
ADR-001: Parquet-only Format (not Iceberg/Delta)
Status
Accepted
Context
The case study requirements specify storing validated data in Parquet format for analytical queries. The following options were considered:
- Plain Parquet (chosen)
- Apache Iceberg (table format with ACID transactions, time travel, partition evolution)
- Delta Lake (similar to Iceberg, with ACID transactions and time travel)
The workload is batch OLAP (analytical processing), not real-time streaming. Month-end reporting queries can tolerate 30-second execution times.
Decision
Use Parquet-only format without Iceberg or Delta Lake table formats.
Rationale
- Case requirements met: Parquet format satisfies all case study requirements
- Simplicity: No additional table format layer reduces operational complexity
- Batch workload: OLAP workload does not require ACID transactions or time travel
- Schema evolution: Additive-only schema changes, versioned via schema_v, provide sufficient backward compatibility
- Cost: No additional compute or storage overhead from table format metadata
Consequences
Positive:
- Simpler architecture (no table format layer)
- Lower operational complexity (no Iceberg/Delta metadata management)
- Meets case study requirements without over-engineering
- Faster development (no table format learning curve)
Negative:
- No ACID transactions (not needed for batch OLAP)
- No time travel queries (not required for case study)
- No automatic partition evolution (manual schema_v versioning required)
- No automatic small file compaction (manual optimization required)
Alternatives Considered
Apache Iceberg
- Why rejected: Adds complexity without immediate benefit for batch OLAP workload. ACID transactions and time travel not required for month-end reporting.
Delta Lake
- Why rejected: Similar to Iceberg, adds complexity without clear benefit. Requires Spark runtime, adds metadata overhead.
Plain Parquet with Schema Registry
- Why rejected: Schema evolution via schema_v versioning in paths is sufficient. No need for separate schema registry infrastructure.
Related Decisions
- Design Decisions Summary - Complete trade-off analysis for this decision
- ADR-002: Year/Month Partitioning - Partition strategy works with Parquet
- ADR-004: Quarantine + Condemned Layers - Error handling layers use Parquet format
Implementation Evidence
- Code: All ETL writes use Parquet format (write_parquet_to_s3 functions)
- Documentation: Parquet Schema Specification - Schema contract
- Architecture: Data Lake Architecture - Storage Format - Parquet format rationale