© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
ADR-005: Dual Pandas + PySpark Implementations
Status
Accepted
Context
The system needs to handle varying data volumes: small batches (<10M rows) for development/testing and large batches (10M+ rows) for production.
The following options were considered:
- Dual implementations (Pandas + PySpark) (chosen)
- PySpark only (rejected)
- Pandas only (rejected)
Decision
Maintain both Pandas and PySpark implementations:
- Pandas: For <10M records (fast iteration, simpler debugging, local development)
- PySpark: For 10M+ records (horizontal scalability, distributed processing)
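The runtime split above can be sketched as a simple selection rule. This is an illustrative sketch only: the names `PANDAS_ROW_LIMIT` and `select_runtime` are hypothetical and do not come from the codebase; only the 10M-row threshold is taken from this ADR.

```python
# Hypothetical runtime-selection sketch. The threshold comes from this ADR;
# the names below are illustrative, not the project's actual identifiers.
PANDAS_ROW_LIMIT = 10_000_000  # <10M rows -> Pandas, otherwise PySpark


def select_runtime(estimated_row_count: int) -> str:
    """Pick the implementation based on the estimated batch size."""
    if estimated_row_count < PANDAS_ROW_LIMIT:
        return "pandas"   # fast iteration, single node
    return "pyspark"      # distributed, horizontally scalable
```

In practice the row-count estimate would come from batch metadata (e.g. file size or a manifest) before the Glue job is launched.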
Rationale
- Development speed: Pandas enables fast iteration and simpler debugging for small datasets
- Scalability: PySpark handles large datasets (10M+ rows) with a 10-100x speedup over the single-node Pandas path
- Cost efficiency: Pandas uses 1 DPU (cheaper for small batches), PySpark scales to 100+ DPUs for large batches
- Flexibility: Runtime selection based on data volume optimizes cost and performance
- Migration path: Gradual migration from Pandas to PySpark as data volumes grow
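The cost-efficiency argument can be made concrete with a back-of-the-envelope Glue cost model. The $0.44 per DPU-hour rate and the job durations below are assumptions for illustration, not measured figures from this project.

```python
# Illustrative AWS Glue cost comparison. The rate and durations are
# assumptions; actual pricing varies by region and Glue version.
DPU_HOUR_RATE = 0.44  # assumed USD per DPU-hour


def job_cost(dpus: int, hours: float) -> float:
    """Estimated cost of one Glue run at a given DPU count and duration."""
    return dpus * hours * DPU_HOUR_RATE


small_batch = job_cost(1, 0.1)    # Pandas path: 1 DPU, short run
large_batch = job_cost(100, 0.5)  # PySpark path: scaled to 100 DPUs
```

The point is the asymmetry: forcing every small batch through a many-DPU Spark cluster would multiply the per-run cost for no benefit, which is why runtime selection by volume matters.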
Consequences
Positive:
- Development speed: Fast iteration with Pandas for small datasets
- Scalability: PySpark handles 100M+ row datasets
- Cost optimization: Right-sized compute for each workload
- Performance: 10-100x speedup with PySpark for large datasets
Negative:
- Development time: Two implementations to maintain
- Code duplication: Similar logic in both implementations
- Testing overhead: Both implementations must be tested
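One common way to contain the testing overhead is a shared test suite run against both implementations. The sketch below uses placeholder functions (`pandas_total`, `spark_total`) standing in for the real Pandas and PySpark entry points, which this ADR does not name; the pattern, not the functions, is the point.

```python
# Sketch of a shared conformance suite: the same cases must pass for
# both implementations. pandas_total and spark_total are stand-ins for
# the real entry points in this project.
SHARED_CASES = [
    ([1, 2, 3], 6),
    ([10], 10),
    ([], 0),
]


def pandas_total(values):  # stand-in for the Pandas implementation
    return sum(values)


def spark_total(values):   # stand-in for the PySpark implementation
    return sum(values)


def run_shared_suite() -> bool:
    """Both implementations must agree on every shared case."""
    for values, expected in SHARED_CASES:
        assert pandas_total(values) == expected
        assert spark_total(values) == expected
    return True
```

With a single table of cases, adding a scenario tests both code paths at once, so the duplication cost is paid in implementation code but not in test fixtures.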
Alternatives Considered
PySpark Only
- Why rejected: Slower development iteration, overkill for small datasets (<10M rows), and higher cost for small batches.
Pandas Only
- Why rejected: Doesn't scale beyond ~40MB input files, fails on large datasets (100M+ rows), and offers no horizontal scalability.
Related Decisions
- Design Decisions Summary - Complete trade-off analysis for this decision
- ADR-003: Serverless Architecture - Both implementations run on Glue serverless
- ADR-006: run_id Isolation - Both implementations use run_id isolation
Implementation Evidence
- Code: ingest_transactions.py (Pandas) and ingest_transactions_spark.py (PySpark)
- Documentation: PySpark Implementation Summary - Performance optimizations
- ETL Flow: ETL Flow - Runtime Selection - Selection criteria