Test Environment PySpark Support

Implementation Complete

The test environment has been fully updated to support PySpark testing alongside the existing Pandas tests.

Changes Made

1. Docker Configuration

Dockerfile.test:

  • Added Java 17 JDK (required for PySpark)
  • Set JAVA_HOME environment variable
  • Added PySpark installation (pyspark==3.5.0)
  • Updated to include requirements-spark.txt
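The Dockerfile changes above might look roughly like the following sketch; the base image, package name, and file layout are assumptions, not the project's actual Dockerfile.test:

```dockerfile
# Sketch of the relevant Dockerfile.test additions (assumed Debian-based image)
FROM python:3.11-slim

# Java 17 JDK: required by PySpark's JVM backend
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-17-jdk-headless && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

WORKDIR /app
COPY requirements-spark.txt .
RUN pip install --no-cache-dir pyspark==3.5.0 -r requirements-spark.txt
```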

docker-compose.test.yml:

  • Added requirements-spark.txt volume mount
  • Added JAVA_HOME and SPARK_LOCAL_IP environment variables
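A minimal sketch of the compose additions; the service name and mount path are assumptions:

```yaml
# Sketch of docker-compose.test.yml additions (service name assumed)
services:
  tests:
    environment:
      JAVA_HOME: /usr/lib/jvm/java-17-openjdk-amd64
      SPARK_LOCAL_IP: 127.0.0.1   # avoids hostname-resolution issues inside containers
    volumes:
      - ./requirements-spark.txt:/app/requirements-spark.txt:ro
```

Setting `SPARK_LOCAL_IP` explicitly keeps the Spark driver from trying to resolve the container's hostname, a common source of startup failures in Docker.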

2. Dependencies

requirements-dev.txt:

  • Added pyspark==3.5.0 for local testing

3. Test Infrastructure

conftest.py:

  • Added spark_session fixture (session-scoped SparkSession)
  • Added spark_df_from_dict helper fixture
  • Configured Spark for local testing (local[2], 2GB memory)
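The fixtures above could be sketched as follows; the app name, config keys, and helper shape are assumptions about conftest.py, not its literal contents. The pyspark import is deferred into the fixture so collecting tests does not require a JVM:

```python
# conftest.py sketch: session-scoped SparkSession plus a DataFrame helper
import pytest

# Local test configuration: 2 cores, 2GB driver/executor memory (assumed keys)
SPARK_CONF = {
    "spark.master": "local[2]",
    "spark.driver.memory": "2g",
    "spark.executor.memory": "2g",
}

@pytest.fixture(scope="session")
def spark_session():
    # Deferred import: needs Java 17 and pyspark installed
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("etl-tests")
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    yield spark
    spark.stop()

@pytest.fixture
def spark_df_from_dict(spark_session):
    """Build a Spark DataFrame from a column-oriented dict (hypothetical helper)."""
    def _make(data: dict):
        from pyspark.sql import Row
        # Transpose {"col": [v1, v2]} into one Row per record
        rows = [Row(**dict(zip(data, vals))) for vals in zip(*data.values())]
        return spark_session.createDataFrame(rows)
    return _make
```

A session-scoped fixture matters here: starting a JVM-backed SparkSession takes several seconds, so sharing one session across the whole run keeps the suite fast.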

test_etl_spark.py:

  • Complete PySpark test suite (mirrors test_etl.py)
  • 15+ test cases covering all functionality
  • Performance test for vectorized operations
  • All tests use Spark DataFrames instead of Pandas

4. CI/CD Pipeline

GitHub Actions (.github/workflows/ci.yml):

  • Split into two jobs:
    • python-validation: Pandas tests
    • pyspark-validation: PySpark tests
  • Added Java 17 setup for PySpark job
  • Separate linting and testing for each implementation
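The two-job split could look like this sketch; step details, action versions, and Python versions are assumptions about the actual workflow file:

```yaml
# Sketch of the split jobs in .github/workflows/ci.yml
jobs:
  python-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/test_etl.py -v

  pyspark-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4   # PySpark needs a JDK on the runner
        with:
          distribution: temurin
          java-version: "17"
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/test_etl_spark.py -v
```

Keeping the jobs separate means a Spark-environment failure (e.g. a JVM issue) does not mask the result of the pure-Pandas suite.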

5. Documentation

tests/README_TESTING.md:

  • Complete testing guide
  • Instructions for running Pandas and PySpark tests
  • Troubleshooting guide
  • Best practices

Test Coverage

Pandas Tests (test_etl.py)

  • 15+ test cases
  • All validation scenarios
  • Loop prevention
  • Circuit breaker

PySpark Tests (test_etl_spark.py)

  • 15+ test cases (mirrors Pandas)
  • All validation scenarios
  • Loop prevention
  • Circuit breaker
  • Vectorized operations performance
  • Large dataset handling

Running Tests

Local (Pandas)

pytest tests/test_etl.py -v

Local (PySpark)

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
pytest tests/test_etl_spark.py -v

Docker (Both)

docker-compose -f docker-compose.test.yml up --build

CI/CD

  • Automatically runs both test suites on every push/PR
  • Both must pass for CI to succeed

Verification

To verify the test environment works:

  1. Build Docker image:

    docker build -f Dockerfile.test -t etl-tests .
  2. Run PySpark tests:

    docker run --rm etl-tests pytest tests/test_etl_spark.py -v
  3. Run all tests:

    docker run --rm etl-tests pytest tests/ -v

Troubleshooting

Java Not Found

  • Ensure Java 17 or later is installed
  • Set the JAVA_HOME environment variable to the JDK path
  • The Docker image installs and configures Java automatically, so this only affects local runs

PySpark Import Errors

  • Check that pyspark is installed: pip list | grep pyspark
  • Rebuild Docker image if needed

Memory Issues

  • Reduce Spark memory in conftest.py if needed
  • Default: 2GB driver, 2GB executor

Next Steps

  1. ✅ Test environment supports PySpark (COMPLETE)
  2. 🔄 Run tests locally to verify
  3. 🔄 Push to GitHub to verify CI/CD
  4. 🔄 Add integration tests for Spark modules

Status: ✅ TEST ENVIRONMENT READY FOR PYSPARK

All test infrastructure has been updated and is ready to use.

© 2026 Stephen Adei · CC BY 4.0