
Dockerized Testing Guide

Context: This guide describes the Docker test setup for Task 1 (ETL). From the repo root, use make test-task1 or make test; see TESTING_MANUAL.md for root-level commands and combined report locations.

Overview

All tests are fully dockerized and run in isolated containers with all dependencies (MinIO, Spark, Java) pre-configured.

Quick Start

Run All Tests

make test-docker

Run Individual Test Files

# Using the test runner script
./scripts/run_test_solo.sh tests/test_s3_integration.py -v

# Using Make
make test-solo FILE=tests/test_s3_integration.py

Run Specific Test Suites

make test-s3-integration    # S3 integration tests
make test-spark-integration # Spark integration tests
make test-idempotency       # Idempotency tests
make test-failure-modes     # Failure mode tests
make test-performance       # Performance tests
make test-resilience        # Resilience tests
make test-scenarios         # Bronze→Silver→Promotion scenario tests (MinIO)
make test-contract          # StoragePort + IngestStrategy contract tests
make test-invariants        # Row-conservation and no-duplicate-ID invariants (MinIO)
make test-golden            # Golden-file comparison for a1/a2 (MinIO)

No venv required: all of the above run inside the etl-tests container; Docker is the primary way to run tests.
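The invariant suite above checks properties such as row conservation (every bronze row lands in silver or quarantine) and the absence of duplicate IDs in silver. A minimal sketch of such a check in plain Python — the `id` field name and the function are illustrative, not the actual test code:

```python
def check_invariants(bronze_rows, silver_rows, quarantined_rows):
    """Row conservation: every bronze row ends up in silver or quarantine.
    No-duplicate-ID: silver must not contain the same id twice."""
    assert len(bronze_rows) == len(silver_rows) + len(quarantined_rows), \
        "row conservation violated"
    silver_ids = [r["id"] for r in silver_rows]
    assert len(silver_ids) == len(set(silver_ids)), "duplicate ids in silver"
    return True

# Example: 3 bronze rows -> 2 promoted, 1 quarantined, silver ids unique
bronze = [{"id": 1}, {"id": 2}, {"id": 2}]
silver = [{"id": 1}, {"id": 2}]
quarantine = [{"id": 2}]
print(check_invariants(bronze, silver, quarantine))  # True
```

The real suites assert the same properties against actual MinIO bucket contents rather than in-memory lists.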

Docker Architecture

Services

  1. MinIO (minio)

    • S3-compatible storage for integration tests
    • Automatically initialized with test buckets
    • Health checks ensure readiness
  2. MinIO Init (minio-init)

    • One-time bucket initialization
    • Creates: test-bronze, test-silver, test-quarantine, test-condemned
  3. ETL Test Runner (etl-tests)

    • Python 3.11 with all dependencies
    • Java 17 for PySpark
    • All test dependencies pre-installed
    • Connected to MinIO via network

Network Configuration

All services run on the test-network bridge network:

  • minio:9000 - MinIO API endpoint
  • Services can communicate via service names

Environment Variables

The test runner automatically sets:

S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_DEFAULT_REGION=us-east-1
TEST_BRONZE_BUCKET=test-bronze
TEST_SILVER_BUCKET=test-silver
TEST_QUARANTINE_BUCKET=test-quarantine
TEST_CONDEMNED_BUCKET=test-condemned
JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
PYTHONPATH=/app
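Because the runner exports these variables, test code can read them with local-dev fallbacks and work both inside and outside the container. A sketch of that pattern (the helper name is illustrative):

```python
import os

def s3_test_config(env=os.environ):
    """Collect the S3 test settings the runner exports, with local fallbacks."""
    return {
        "endpoint_url": env.get("S3_ENDPOINT_URL", "http://localhost:9000"),
        "access_key": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        "region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "bronze_bucket": env.get("TEST_BRONZE_BUCKET", "test-bronze"),
        "silver_bucket": env.get("TEST_SILVER_BUCKET", "test-silver"),
    }

# Inside the container S3_ENDPOINT_URL is set; unset keys fall back.
cfg = s3_test_config({"S3_ENDPOINT_URL": "http://minio:9000"})
print(cfg["endpoint_url"])   # http://minio:9000
print(cfg["bronze_bucket"])  # test-bronze
```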

Test File Organization

New Test Files (All Dockerized)

  1. test_spark_integration.py - Real Spark execution tests

    • Requires: Java 17, PySpark
    • Marker: @pytest.mark.real_spark
  2. test_idempotency.py - Deduplication verification

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  3. test_failure_modes.py - Failure scenario tests

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  4. test_performance_metrics.py - Performance benchmarks

    • Requires: None (local execution)
    • Marker: @pytest.mark.performance
  5. test_resilience.py - Recovery and resilience tests

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  6. test_s3_integration.py - Enhanced S3 integration tests

    • Requires: MinIO
    • Marker: @pytest.mark.real_s3, @pytest.mark.integration
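Tests carrying the integration markers only make sense when MinIO is reachable, so a conftest can probe the endpoint and skip them otherwise. A minimal reachability check — host and port match the compose defaults; the function name is illustrative:

```python
import socket

def minio_reachable(host="minio", port=9000, timeout=0.5):
    """Return True if a TCP connection to the MinIO endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Outside the Docker network the service name won't resolve, so this
# returns False; inside test-network it returns True once MinIO is up.
print(minio_reachable())
```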

Enhanced Existing Files

  1. test_performance.py - Enhanced with 6 new tests
  2. test_data_quality.py - Enhanced with 6 new tests

Running Tests Individually

Method 1: Using Test Runner Script

# Run a single test file
./scripts/run_test_solo.sh tests/test_s3_integration.py -v

# Run with specific pytest markers
./scripts/run_test_solo.sh tests/test_idempotency.py -v -m "integration"

# Run specific test function
./scripts/run_test_solo.sh tests/test_s3_integration.py::test_s3_partition_structure_validation -v

Method 2: Using Docker Compose Directly

# Start services
docker-compose -f docker-compose.test.yml up -d minio minio-init

# Run specific test file
docker-compose -f docker-compose.test.yml run --rm etl-tests \
  pytest tests/test_s3_integration.py -v

# Run with markers
docker-compose -f docker-compose.test.yml run --rm etl-tests \
  pytest tests/ -v -m "integration"

Method 3: Using Make

# Run specific test suite
make test-s3-integration

# Run individual test file
make test-solo FILE=tests/test_idempotency.py ARGS="-v -k test_reprocess"

Test Execution Flow

  1. Service Startup

    • MinIO starts and waits for health check
    • MinIO-init creates buckets
    • Test runner waits for dependencies
  2. Test Execution

    • Tests run in isolated container
    • All environment variables set automatically
    • Network connectivity to MinIO via service name
  3. Cleanup

    • Container removed after test (--rm flag)
    • MinIO data persists in volume
    • Reports saved to ./reports/

Troubleshooting

MinIO Port Already in Use

If you see "port is already allocated", you have three options:

  1. Reuse the existing MinIO container (the script detects it automatically)
  2. Stop the existing MinIO: docker stop ohpen-etl-test-minio
  3. Map a different host port in docker-compose.test.yml
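Before remapping the port, you can check whether 9000 is actually taken. A quick stdlib probe (the function name is illustrative):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

if port_in_use(9000):
    print("9000 busy: reuse the existing MinIO or map another host port")
else:
    print("9000 free: docker-compose can bind MinIO here")
```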

Tests Cannot Connect to MinIO

  1. Check MinIO is running: docker ps | grep minio
  2. Check network: docker network inspect test-network
  3. Verify environment variables in test container

Java/PySpark Issues

  1. Verify Java is installed: docker-compose -f docker-compose.test.yml run --rm etl-tests java -version
  2. Check JAVA_HOME: docker-compose -f docker-compose.test.yml run --rm etl-tests sh -c 'echo $JAVA_HOME' (single quotes ensure the variable expands inside the container, not on the host)

Build Cache Issues

# Rebuild without cache
docker-compose -f docker-compose.test.yml build --no-cache etl-tests

CI/CD Integration

The same Docker setup is used in CI/CD:

# .github/workflows/ci.yml
integration-tests:
  services:
    minio:
      image: minio/minio:latest
      # ... same configuration

Best Practices

  1. Always use Docker for integration tests - Ensures consistent environment
  2. Run tests individually first - Catch issues early
  3. Use markers for test selection - -m "integration" for integration tests
  4. Check service health - Scripts wait for MinIO readiness
  5. Clean up between runs - Use --rm flag or make clean

File Structure

tasks/data_ingestion_transformation/
├── Dockerfile.test               # Test container definition
├── docker-compose.test.yml       # Test services orchestration
├── scripts/
│   └── run_test_solo.sh          # Individual test runner
├── tests/
│   ├── test_s3_integration.py    # S3 tests (dockerized)
│   ├── test_spark_integration.py # Spark tests (dockerized)
│   ├── test_idempotency.py       # Idempotency tests (dockerized)
│   ├── test_failure_modes.py     # Failure tests (dockerized)
│   ├── test_resilience.py        # Resilience tests (dockerized)
│   └── ...
└── Makefile                      # Docker test commands

Summary

✅ All tests are dockerized
✅ MinIO automatically configured
✅ Environment variables set automatically
✅ Network isolation via Docker networks
✅ Easy individual test execution
✅ CI/CD ready


© 2026 Stephen Adei · CC BY 4.0