
Dockerized Testing Guide

Context: This guide describes the Docker test setup for Task 1 (ETL). From the repo root, use make test-task1 or make test; see TESTING_MANUAL.md for root-level commands and combined report locations.

Overview

All tests are fully dockerized and run in isolated containers with all dependencies (MinIO, Spark, Java) pre-configured.

Quick Start

Run All Tests

make test-docker

Run Individual Test Files

# Using the test runner script
./scripts/run_test_solo.sh tests/test_s3_integration.py -v

# Using Make
make test-solo FILE=tests/test_s3_integration.py

Run Specific Test Suites

make test-s3-integration    # S3 integration tests
make test-spark-integration # Spark integration tests
make test-idempotency       # Idempotency tests
make test-failure-modes     # Failure mode tests
make test-performance       # Performance tests
make test-resilience        # Resilience tests
make test-scenarios         # Bronze→Silver→Promotion scenario tests (MinIO)
make test-contract          # StoragePort + IngestStrategy contract tests
make test-invariants        # Row-conservation and no-duplicate-ID invariants (MinIO)
make test-golden            # Golden-file comparison for a1/a2 (MinIO)

No venv required: all of the above run inside the etl-tests container; Docker is the primary way to run tests.
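The invariant suite above checks properties such as row conservation (every bronze row lands in silver or quarantine) and the absence of duplicate IDs in silver. A minimal sketch of such a check in plain Python — the `id` field name and the function are illustrative, not the actual test code:

```python
def check_invariants(bronze_rows, silver_rows, quarantined_rows):
    """Row conservation: every bronze row ends up in silver or quarantine.
    No-duplicate-ID: silver must not contain the same id twice."""
    assert len(bronze_rows) == len(silver_rows) + len(quarantined_rows), \
        "row conservation violated"
    silver_ids = [r["id"] for r in silver_rows]
    assert len(silver_ids) == len(set(silver_ids)), "duplicate ids in silver"
    return True

# Example: 3 bronze rows -> 2 promoted, 1 quarantined, silver ids unique
bronze = [{"id": 1}, {"id": 2}, {"id": 2}]
silver = [{"id": 1}, {"id": 2}]
quarantine = [{"id": 2}]
print(check_invariants(bronze, silver, quarantine))  # True
```

The real suites assert the same properties against actual MinIO bucket contents rather than in-memory lists.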

Docker Architecture

Services

  1. MinIO (minio)

    • S3-compatible storage for integration tests
    • Automatically initialized with test buckets
    • Health checks ensure readiness
  2. MinIO Init (minio-init)

    • One-time bucket initialization
    • Creates: test-bronze, test-silver, test-quarantine, test-condemned
  3. ETL Test Runner (etl-tests)

    • Python 3.11 with all dependencies
    • Java 17 for PySpark
    • All test dependencies pre-installed
    • Connected to MinIO via network

Network Configuration

All services run on the test-network bridge network:

  • minio:9000 - MinIO API endpoint
  • Services can communicate via service names

Environment Variables

The test runner automatically sets:

S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_DEFAULT_REGION=us-east-1
TEST_BRONZE_BUCKET=test-bronze
TEST_SILVER_BUCKET=test-silver
TEST_QUARANTINE_BUCKET=test-quarantine
TEST_CONDEMNED_BUCKET=test-condemned
JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
PYTHONPATH=/app
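Because the runner exports these variables, test code can read them with local-dev fallbacks and work both inside and outside the container. A sketch of that pattern (the helper name is illustrative):

```python
import os

def s3_test_config(env=os.environ):
    """Collect the S3 test settings the runner exports, with local fallbacks."""
    return {
        "endpoint_url": env.get("S3_ENDPOINT_URL", "http://localhost:9000"),
        "access_key": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        "region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "bronze_bucket": env.get("TEST_BRONZE_BUCKET", "test-bronze"),
        "silver_bucket": env.get("TEST_SILVER_BUCKET", "test-silver"),
    }

# Inside the container S3_ENDPOINT_URL is set; unset keys fall back.
cfg = s3_test_config({"S3_ENDPOINT_URL": "http://minio:9000"})
print(cfg["endpoint_url"])   # http://minio:9000
print(cfg["bronze_bucket"])  # test-bronze
```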

Test File Organization

New Test Files (All Dockerized)

  1. test_spark_integration.py - Real Spark execution tests

    • Requires: Java 17, PySpark
    • Marker: @pytest.mark.real_spark
  2. test_idempotency.py - Deduplication verification

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  3. test_failure_modes.py - Failure scenario tests

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  4. test_performance_metrics.py - Performance benchmarks

    • Requires: None (local execution)
    • Marker: @pytest.mark.performance
  5. test_resilience.py - Recovery and resilience tests

    • Requires: MinIO
    • Marker: @pytest.mark.integration
  6. test_s3_integration.py - Enhanced S3 integration tests

    • Requires: MinIO
    • Marker: @pytest.mark.real_s3, @pytest.mark.integration
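Tests carrying the integration markers only make sense when MinIO is reachable, so a conftest can probe the endpoint and skip them otherwise. A minimal reachability check — host and port match the compose defaults; the function name is illustrative:

```python
import socket

def minio_reachable(host="minio", port=9000, timeout=0.5):
    """Return True if a TCP connection to the MinIO endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Outside the Docker network the service name won't resolve, so this
# returns False; inside test-network it returns True once MinIO is up.
print(minio_reachable())
```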

Enhanced Existing Files

  1. test_performance.py - Enhanced with 6 new tests
  2. test_data_quality.py - Enhanced with 6 new tests

Running Tests Individually

Method 1: Using Test Runner Script

# Run a single test file
./scripts/run_test_solo.sh tests/test_s3_integration.py -v

# Run with specific pytest markers
./scripts/run_test_solo.sh tests/test_idempotency.py -v -m "integration"

# Run specific test function
./scripts/run_test_solo.sh tests/test_s3_integration.py::test_s3_partition_structure_validation -v

Method 2: Using Docker Compose Directly

# Start services
docker-compose -f docker-compose.test.yml up -d minio minio-init

# Run specific test file
docker-compose -f docker-compose.test.yml run --rm etl-tests \
  pytest tests/test_s3_integration.py -v

# Run with markers
docker-compose -f docker-compose.test.yml run --rm etl-tests \
  pytest tests/ -v -m "integration"

Method 3: Using Make

# Run specific test suite
make test-s3-integration

# Run individual test file
make test-solo FILE=tests/test_idempotency.py ARGS="-v -k test_reprocess"

Test Execution Flow

  1. Service Startup

    • MinIO starts and waits for health check
    • MinIO-init creates buckets
    • Test runner waits for dependencies
  2. Test Execution

    • Tests run in isolated container
    • All environment variables set automatically
    • Network connectivity to MinIO via service name
  3. Cleanup

    • Container removed after test (--rm flag)
    • MinIO data persists in volume
    • Reports saved to ./reports/

Troubleshooting

MinIO Port Already in Use

If you see "port is already allocated", you have three options:

  1. Reuse the existing MinIO container (the script detects it automatically)
  2. Stop the existing MinIO: docker stop ohpen-etl-test-minio
  3. Map a different host port in docker-compose.test.yml
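Before remapping the port, you can check whether 9000 is actually taken. A quick stdlib probe (the function name is illustrative):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

if port_in_use(9000):
    print("9000 busy: reuse the existing MinIO or map another host port")
else:
    print("9000 free: docker-compose can bind MinIO here")
```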

Tests Cannot Connect to MinIO

  1. Check MinIO is running: docker ps | grep minio
  2. Check network: docker network inspect test-network
  3. Verify environment variables in test container

Java/PySpark Issues

  1. Verify Java is installed: docker-compose -f docker-compose.test.yml run --rm etl-tests java -version
  2. Check JAVA_HOME: docker-compose -f docker-compose.test.yml run --rm etl-tests sh -c 'echo $JAVA_HOME' (single quotes ensure the variable expands inside the container, not on the host)

Build Cache Issues

# Rebuild without cache
docker-compose -f docker-compose.test.yml build --no-cache etl-tests

CI/CD Integration

The same Docker setup is used in CI/CD:

# .github/workflows/ci.yml
integration-tests:
  services:
    minio:
      image: minio/minio:latest
      # ... same configuration

Best Practices

  1. Always use Docker for integration tests - Ensures consistent environment
  2. Run tests individually first - Catch issues early
  3. Use markers for test selection - -m "integration" for integration tests
  4. Check service health - Scripts wait for MinIO readiness
  5. Clean up between runs - Use --rm flag or make clean

File Structure

tasks/data_ingestion_transformation/
├── Dockerfile.test               # Test container definition
├── docker-compose.test.yml       # Test services orchestration
├── scripts/
│   └── run_test_solo.sh          # Individual test runner
├── tests/
│   ├── test_s3_integration.py    # S3 tests (dockerized)
│   ├── test_spark_integration.py # Spark tests (dockerized)
│   ├── test_idempotency.py       # Idempotency tests (dockerized)
│   ├── test_failure_modes.py     # Failure tests (dockerized)
│   ├── test_resilience.py        # Resilience tests (dockerized)
│   └── ...
└── Makefile                      # Docker test commands

Summary

✅ All tests are dockerized
✅ MinIO automatically configured
✅ Environment variables set automatically
✅ Network isolation via Docker networks
✅ Easy individual test execution
✅ CI/CD ready


© 2026 Stephen Adei · CC BY 4.0