© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
ADR-005: Dual Pandas + PySpark Implementations
Status
Accepted
Context
The system needs to handle varying data volumes: small batches (<10M rows) for development/testing and large batches (10M+ rows) for production.
The following options were considered:
- Dual implementations (Pandas + PySpark) (chosen)
- PySpark only (rejected)
- Pandas only (rejected)
Decision
Maintain both Pandas and PySpark implementations:
- Pandas: For <10M records (fast iteration, simpler debugging, local development)
- PySpark: For 10M+ records (horizontal scalability, distributed processing)
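The runtime split above can be sketched as a simple selection rule. This is an illustrative sketch only: the names `PANDAS_ROW_LIMIT` and `select_runtime` are hypothetical and do not come from the codebase; only the 10M-row threshold is taken from this ADR.

```python
# Hypothetical runtime-selection sketch. The threshold comes from this ADR;
# the names below are illustrative, not the project's actual identifiers.
PANDAS_ROW_LIMIT = 10_000_000  # <10M rows -> Pandas, otherwise PySpark


def select_runtime(estimated_row_count: int) -> str:
    """Pick the implementation based on the estimated batch size."""
    if estimated_row_count < PANDAS_ROW_LIMIT:
        return "pandas"   # fast iteration, single node
    return "pyspark"      # distributed, horizontally scalable
```

In practice the row-count estimate would come from batch metadata (e.g. file size or a manifest) before the Glue job is launched.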
Rationale
- Development speed: Pandas enables fast iteration and simpler debugging for small datasets
- Scalability: PySpark handles large datasets (10M+ rows) with a 10-100x speedup over the single-node Pandas path
- Cost efficiency: Pandas uses 1 DPU (cheaper for small batches), PySpark scales to 100+ DPUs for large batches
- Flexibility: Runtime selection based on data volume optimizes cost and performance
- Migration path: Gradual migration from Pandas to PySpark as data volumes grow
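The cost-efficiency argument can be made concrete with a back-of-the-envelope Glue cost model. The $0.44 per DPU-hour rate and the job durations below are assumptions for illustration, not measured figures from this project.

```python
# Illustrative AWS Glue cost comparison. The rate and durations are
# assumptions; actual pricing varies by region and Glue version.
DPU_HOUR_RATE = 0.44  # assumed USD per DPU-hour


def job_cost(dpus: int, hours: float) -> float:
    """Estimated cost of one Glue run at a given DPU count and duration."""
    return dpus * hours * DPU_HOUR_RATE


small_batch = job_cost(1, 0.1)    # Pandas path: 1 DPU, short run
large_batch = job_cost(100, 0.5)  # PySpark path: scaled to 100 DPUs
```

The point is the asymmetry: forcing every small batch through a many-DPU Spark cluster would multiply the per-run cost for no benefit, which is why runtime selection by volume matters.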
Consequences
Positive:
- Development speed: Fast iteration with Pandas for small datasets
- Scalability: PySpark handles 100M+ row datasets
- Cost optimization: Right-sized compute for each workload
- Performance: 10-100x speedup with PySpark for large datasets
Negative:
- Development time: Two implementations to maintain
- Code duplication: Similar logic in both implementations
- Testing overhead: Both implementations must be tested
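One common way to contain the testing overhead is a shared test suite run against both implementations. The sketch below uses placeholder functions (`pandas_total`, `spark_total`) standing in for the real Pandas and PySpark entry points, which this ADR does not name; the pattern, not the functions, is the point.

```python
# Sketch of a shared conformance suite: the same cases must pass for
# both implementations. pandas_total and spark_total are stand-ins for
# the real entry points in this project.
SHARED_CASES = [
    ([1, 2, 3], 6),
    ([10], 10),
    ([], 0),
]


def pandas_total(values):  # stand-in for the Pandas implementation
    return sum(values)


def spark_total(values):   # stand-in for the PySpark implementation
    return sum(values)


def run_shared_suite() -> bool:
    """Both implementations must agree on every shared case."""
    for values, expected in SHARED_CASES:
        assert pandas_total(values) == expected
        assert spark_total(values) == expected
    return True
```

With a single table of cases, adding a scenario tests both code paths at once, so the duplication cost is paid in implementation code but not in test fixtures.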
Alternatives Considered
PySpark Only
- Why rejected: Slower development iteration, overkill for small datasets (<10M rows), and higher cost for small batches.
Pandas Only
- Why rejected: Doesn't scale beyond ~40MB input files, fails on large datasets (100M+ rows), and offers no horizontal scalability.
Related Decisions
- Design Decisions Summary - Complete trade-off analysis for this decision
- ADR-003: Serverless Architecture - Both implementations run on Glue serverless
- ADR-006: run_id Isolation - Both implementations use run_id isolation
Implementation Evidence
- Code: ingest_transactions.py (Pandas) and ingest_transactions_spark.py (PySpark)
- Documentation: PySpark Implementation Summary - Performance optimizations
- ETL Flow: ETL Flow - Runtime Selection - Selection criteria