# Test Environment PySpark Support

## Implementation Complete
The test environment has been fully updated to support PySpark testing alongside the existing Pandas tests.
## Changes Made

### 1. Docker Configuration

**Dockerfile.test:**
- Added Java 17 JDK (required for PySpark)
- Set JAVA_HOME environment variable
- Added PySpark installation (pyspark==3.5.0)
- Updated to include requirements-spark.txt
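The Dockerfile changes above might look like the following sketch. The base image, package names, and paths are assumptions for illustration, not copied from the actual `Dockerfile.test`:

```dockerfile
# Hypothetical Dockerfile.test fragment; base image and JDK path are assumed.
FROM python:3.11-slim

# Java 17 JDK is required by PySpark's JVM backend
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jdk-headless && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Install Python dependencies, including pyspark==3.5.0
COPY requirements-spark.txt .
RUN pip install --no-cache-dir -r requirements-spark.txt
```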
**docker-compose.test.yml:**
- Added requirements-spark.txt volume mount
- Added JAVA_HOME and SPARK_LOCAL_IP environment variables
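A compose sketch matching those changes could look like this (the service name and mount target are assumptions):

```yaml
# Hypothetical docker-compose.test.yml fragment; service name is assumed.
services:
  tests:
    build:
      context: .
      dockerfile: Dockerfile.test
    volumes:
      - ./requirements-spark.txt:/app/requirements-spark.txt
    environment:
      JAVA_HOME: /usr/lib/jvm/java-17-openjdk-amd64
      SPARK_LOCAL_IP: 127.0.0.1   # avoids hostname-resolution issues in containers
```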
### 2. Dependencies

**requirements-dev.txt:**
- Added pyspark==3.5.0 for local testing
### 3. Test Infrastructure

**conftest.py:**
- Added `spark_session` fixture (session-scoped SparkSession)
- Added `spark_df_from_dict` helper fixture
- Configured Spark for local testing (`local[2]`, 2GB memory)
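A minimal sketch of what those fixtures might look like. The exact settings beyond `local[2]` and 2GB memory (app name, shuffle partitions, the shape the helper accepts) are assumptions, not taken from the real `conftest.py`:

```python
# Hypothetical conftest.py sketch for the fixtures described above.
import pytest

def local_spark_config():
    """Spark settings used for tests: 2 local cores, 2 GB driver/executor memory."""
    return {
        "spark.master": "local[2]",
        "spark.driver.memory": "2g",
        "spark.executor.memory": "2g",
        "spark.sql.shuffle.partitions": "2",  # assumed: keep test shuffles small
    }

@pytest.fixture(scope="session")
def spark_session():
    # Imported lazily so Pandas-only test runs don't require Java/PySpark
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("etl-tests")
    for key, value in local_spark_config().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    yield spark
    spark.stop()

@pytest.fixture
def spark_df_from_dict(spark_session):
    def _make(data: dict):
        # dict of column -> list of values, converted row-wise
        columns = list(data)
        rows = list(zip(*data.values()))
        return spark_session.createDataFrame(rows, columns)
    return _make
```

Session scope matters here: starting a JVM-backed SparkSession per test would dominate the suite's runtime.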
**test_etl_spark.py:**
- Complete PySpark test suite (mirrors test_etl.py)
- 15+ test cases covering all functionality
- Performance test for vectorized operations
- All tests use Spark DataFrames instead of Pandas
### 4. CI/CD Pipeline

**GitHub Actions (.github/workflows/ci.yml):**
- Split into two jobs:
  - `python-validation`: Pandas tests
  - `pyspark-validation`: PySpark tests
- Added Java 17 setup for PySpark job
- Separate linting and testing for each implementation
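A workflow sketch for the two-job split described above (runner image, Python version, and step details are assumptions, not copied from the actual `ci.yml`):

```yaml
# Hypothetical ci.yml fragment; versions and step details are assumed.
jobs:
  python-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/test_etl.py -v

  pyspark-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4   # Java 17 required for PySpark
        with:
          distribution: temurin
          java-version: "17"
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/test_etl_spark.py -v
```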
### 5. Documentation

**tests/README_TESTING.md:**
- Complete testing guide
- Instructions for running Pandas and PySpark tests
- Troubleshooting guide
- Best practices
## Test Coverage

### Pandas Tests (test_etl.py)
- 15+ test cases
- All validation scenarios
- Loop prevention
- Circuit breaker
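For context, a Pandas validation test in this style might look like the sketch below. The function and column names are illustrative assumptions, not the real contents of `test_etl.py`:

```python
# Hypothetical example of a Pandas validation test case.
import pandas as pd

def drop_null_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for one validation step: drop rows with a missing amount."""
    return df.dropna(subset=["amount"]).reset_index(drop=True)

def test_null_amounts_are_dropped():
    df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, None, 5.0]})
    out = drop_null_amounts(df)
    assert list(out["id"]) == [1, 3]
```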
### PySpark Tests (test_etl_spark.py)
- 15+ test cases (mirrors Pandas)
- All validation scenarios
- Loop prevention
- Circuit breaker
- Vectorized operations performance
- Large dataset handling
## Running Tests

### Local (Pandas)

```bash
pytest tests/test_etl.py -v
```
### Local (PySpark)

```bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
pytest tests/test_etl_spark.py -v
```
### Docker (Both)

```bash
docker-compose -f docker-compose.test.yml up --build
```
### CI/CD
- Automatically runs both test suites on every push/PR
- Both must pass for CI to succeed
## Verification

To verify the test environment works:

1. Build the Docker image:

   ```bash
   docker build -f Dockerfile.test -t etl-tests .
   ```

2. Run the PySpark tests:

   ```bash
   docker run --rm etl-tests pytest tests/test_etl_spark.py -v
   ```

3. Run all tests:

   ```bash
   docker run --rm etl-tests pytest tests/ -v
   ```
## Troubleshooting

### Java Not Found
- Ensure Java 17+ is installed
- Set the JAVA_HOME environment variable
- The Docker image includes Java automatically

### PySpark Import Errors
- Check that pyspark is installed: `pip list | grep pyspark`
- Rebuild the Docker image if needed
### Memory Issues
- Reduce Spark memory in conftest.py if needed
- Default: 2GB driver, 2GB executor
## Next Steps
- ✅ Test environment supports PySpark (COMPLETE)
- 🔄 Run tests locally to verify
- 🔄 Push to GitHub to verify CI/CD
- 🔄 Add integration tests for Spark modules
**Status:** ✅ TEST ENVIRONMENT READY FOR PYSPARK
All test infrastructure has been updated and is ready to use.