
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

ADR-005: Dual Pandas + PySpark Implementations

Status

Accepted

Context

The system needs to handle varying data volumes: small batches (<10M rows) for development/testing and large batches (10M+ rows) for production.

The following options were considered:

  1. Dual implementations (Pandas + PySpark) (chosen)
  2. PySpark only (rejected)
  3. Pandas only (rejected)

Decision

Maintain both Pandas and PySpark implementations:

  • Pandas: For <10M records (fast iteration, simpler debugging, local development)
  • PySpark: For 10M+ records (horizontal scalability, distributed processing)
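The volume-based choice between the two implementations can be sketched as a small dispatcher. This is a minimal illustration, not the project's actual code: `Engine`, `select_engine`, and `ROW_THRESHOLD` are hypothetical names, and the only value taken from this ADR is the 10M-row boundary. How the row count is estimated is left to the caller.

```python
from enum import Enum

# Boundary taken from the decision above: fewer than 10M rows -> Pandas.
ROW_THRESHOLD = 10_000_000


class Engine(Enum):
    """Hypothetical labels for the two implementations."""
    PANDAS = "pandas"
    PYSPARK = "pyspark"


def select_engine(row_count: int, threshold: int = ROW_THRESHOLD) -> Engine:
    """Pick an implementation from an estimated row count (illustrative helper)."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    # Strictly below the threshold stays on Pandas; 10M and above goes to PySpark.
    return Engine.PANDAS if row_count < threshold else Engine.PYSPARK
```

Keeping the threshold as a parameter lets the cut-over point be tuned per job without touching either implementation.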

Rationale

  1. Development speed: Pandas enables fast iteration and simpler debugging for small datasets
  2. Scalability: PySpark handles large datasets (10M+ rows) with a 10-100x speedup over the Pandas implementation
  3. Cost efficiency: Pandas uses 1 DPU (cheaper for small batches), PySpark scales to 100+ DPUs for large batches
  4. Flexibility: Runtime selection based on data volume optimizes cost and performance
  5. Migration path: Gradual migration from Pandas to PySpark as data volumes grow

Consequences

Positive:

  • Development speed: Fast iteration with Pandas for small datasets
  • Scalability: PySpark handles 100M+ row datasets
  • Cost optimization: Right-sized compute for each workload
  • Performance: 10-100x speedup over Pandas when PySpark is used for large datasets

Negative:

  • Development time: Two implementations to maintain
  • Code duplication: Similar logic in both implementations
  • Testing overhead: Both implementations must be tested
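One way to contain the duplication and testing cost is a parity check that runs both implementations on the same small input and compares their results. The sketch below is an assumption about how such a check might look, not the project's test suite. `outputs_match` is a hypothetical helper that works on plain lists of dicts, which could be obtained with `df.to_dict("records")` in Pandas and `[row.asDict() for row in sdf.collect()]` in PySpark.

```python
import math
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def _normalize_value(value: Any, ndigits: int) -> Any:
    """Make values comparable across engines."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Pandas often represents missing values as NaN; treat NaN like None.
        if isinstance(value, float) and math.isnan(value):
            return None
        # Ignore int/float differences and tiny floating-point drift.
        return round(float(value), ndigits)
    return value


def _normalize(rows: List[Row], ndigits: int) -> List[Tuple]:
    """Turn rows into sorted tuples so row order and column order do not matter."""
    normalized = [
        tuple(sorted((key, _normalize_value(val, ndigits)) for key, val in row.items()))
        for row in rows
    ]
    return sorted(normalized, key=repr)


def outputs_match(pandas_rows: List[Row], spark_rows: List[Row], ndigits: int = 6) -> bool:
    """Return True when both engines produced the same set of rows."""
    return _normalize(pandas_rows, ndigits) == _normalize(spark_rows, ndigits)
```

A check like this only needs a handful of rows, so it can run in the same fast local loop that motivates the Pandas implementation.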

Alternatives Considered

PySpark Only

  • Why rejected: Slower development iteration, overkill for small datasets (<10M rows), and higher cost for small batches.

Pandas Only

  • Why rejected: Does not scale beyond ~40MB files, fails on large datasets (100M+ rows), and offers no horizontal scalability.

Implementation Evidence

© 2026 Stephen Adei. CC BY 4.0