
Test separation: unit vs integration vs e2e

Current state

Task 1 (ETL) – tasks/data_ingestion_transformation

| Layer | How the layer is defined | Used in CI? |
| --- | --- | --- |
| Unit | No @pytest.mark.unit used. Effectively “everything that is not integration / real_s3 / real_spark”. | Partly: “Python Unit Tests” runs specific files (test_etl.py, test_integration.py), not by marker. So some integration tests can run in that job if they live in those files. |
| Integration | @pytest.mark.integration (often with real_s3 or real_spark). Tests that need MinIO/S3 or real Spark. | Yes: “Integration Tests (MinIO)” runs selected files with -m "integration". |
| E2E | Not separated. No e2e marker. A few tests are described as “end-to-end” in docstrings (e.g. test_real_end_to_end_pipeline, test_main_full_pipeline) but are only marked integration or unmarked. | Same as integration (no distinct e2e run). |

Markers in pytest.ini: unit, integration, slow, load, real_s3, real_spark, performance.

Task 3 (SQL)

  • Single test dir; no unit/integration/e2e markers. All tests are local (DuckDB in-memory) and behave like unit tests.

Task 4 (CI/CD)

  • Markers are by concern, not by layer: syntax, structure, workflow, terraform, integration, orchestration. No unit or e2e. “Integration” here means “workflow/infra consistency”, not “run against real AWS”.

Gaps

  1. Unit – The unit marker exists in Task 1 but no test uses it. Unit tests are identified only by exclusion (not integration and not real_s3 in root Makefile). So “unit” is implicit.
  2. E2E – There is no e2e marker. True end-to-end tests (full pipeline with real S3, or full CI flow) are mixed with other integration tests.
  3. CI – Separation is inconsistent:
    • “Run Python Unit Tests” = file-based (test_etl.py, test_integration.py), so anything in those files runs there (including integration-style tests in test_integration.py).
    • “Run Integration Tests” = marker-based (-m "integration") on a fixed list of files. So unit vs integration is not defined in one place (sometimes by file, sometimes by marker).

| Layer | Meaning | Markers / how to run |
| --- | --- | --- |
| Unit | Single module/function, no real I/O (mocks or in-memory only). | @pytest.mark.unit or unmarked; exclude integration and e2e. |
| Integration | Multiple components or real infra (MinIO, Spark, etc.) but not full user journey. | @pytest.mark.integration; optionally real_s3 / real_spark. |
| E2E | Full pipeline or full flow (e.g. CSV → S3 → validate → silver, or full CI workflow). | @pytest.mark.e2e (add to pytest.ini and use on the few true e2e tests). |
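
As a rough illustration of the three layers, a test module might be marked as follows. The function clean_amount and all three test names are hypothetical and do not come from the repository; the integration and e2e bodies are reduced to a skip because they would need real infrastructure.

```python
import pytest


def clean_amount(raw: str) -> float:
    """Hypothetical transformation helper, used only for illustration."""
    return round(float(raw.strip().replace(",", "")), 2)


@pytest.mark.unit
def test_clean_amount_strips_separators():
    # Unit: a single function, no real I/O.
    assert clean_amount(" 1,234.50 ") == 1234.5


@pytest.mark.integration
@pytest.mark.real_s3
def test_upload_roundtrip_against_minio():
    # Integration: would exercise several components against MinIO.
    pytest.skip("sketch only: requires a running MinIO instance")


@pytest.mark.integration
@pytest.mark.e2e
def test_csv_to_silver_full_pipeline():
    # E2E: would run CSV -> S3 -> validate -> silver from start to finish.
    pytest.skip("sketch only: requires the full pipeline environment")
```

Stacking e2e on top of integration keeps such a test inside a plain -m "integration" run while still allowing -m "e2e" to select it on its own.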

Suggested commands:

  • Unit only: pytest -m "not integration and not e2e" (or -m "unit" once tests are marked).
  • Integration only: pytest -m "integration and not e2e".
  • E2E only: pytest -m "e2e".

CI: Run unit first (fast), then integration (with MinIO), then optionally e2e (or keep e2e as a subset of integration).
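
The ordering can be sketched as a small runner that executes one pytest selection per layer and stops at the first failing layer. This is an assumption about how the jobs could be chained, not the project's actual CI configuration; run_pytest and run_layers are hypothetical names, and treating exit code 5 (pytest's code for "no tests collected") as a pass is a design choice so that an empty e2e selection does not fail the run.

```python
import subprocess
import sys

# Marker expressions from the suggested commands, fastest layer first.
LAYERS = [
    ("unit", "not integration and not e2e"),
    ("integration", "integration and not e2e"),
    ("e2e", "e2e"),
]

NO_TESTS_COLLECTED = 5  # pytest exit code when a selection matches nothing


def run_pytest(marker_expr):
    """Run pytest for one marker expression and return its exit code."""
    cmd = [sys.executable, "-m", "pytest", "tests/", "-v", "-m", marker_expr]
    return subprocess.run(cmd).returncode


def run_layers(layers=LAYERS, run=run_pytest):
    """Run the layers in order and stop at the first one that fails."""
    for name, expr in layers:
        code = run(expr)
        if code not in (0, NO_TESTS_COLLECTED):
            print(f"{name} layer failed with exit code {code}")
            return code
    return 0
```

A script entry point could call sys.exit(run_layers()). The run parameter exists so the ordering logic can be exercised without starting pytest.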


What to change (optional)

  1. Add an e2e marker to pytest.ini in Task 1 (and Task 4 if needed), and mark the few full-pipeline tests (e.g. test_real_end_to_end_pipeline, test_main_full_pipeline) with @pytest.mark.e2e. Where such a test is already marked integration, keep that marker and add e2e alongside it.
  2. Optionally mark unit tests in Task 1 with @pytest.mark.unit so “unit” is explicit; then CI can run pytest -m "unit" for the unit job instead of a file list.
  3. CI: Have “Python Unit Tests” select by marker instead of by file list, e.g. pytest tests/ -v -m "not integration and not e2e" (and ensure no unit test depends on MinIO). Keep “Integration Tests” as -m "integration". Both jobs then separate tests the same way, by marker.
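
Changes 1 and 2 could also be approximated in a conftest.py instead of editing pytest.ini and every test by hand. The following is a minimal sketch under that assumption: it registers e2e through the pytest_configure hook and, during collection, adds the unit marker to any test that carries no layer marker. The marker description text and the LAYER_MARKERS constant are illustrative.

```python
# conftest.py (sketch)
LAYER_MARKERS = ("unit", "integration", "e2e")


def pytest_configure(config):
    # Register e2e as a known marker; an alternative to adding it to the
    # "markers" list in pytest.ini.
    config.addinivalue_line(
        "markers", "e2e: full pipeline or full flow, start to finish"
    )


def pytest_collection_modifyitems(config, items):
    # Treat any test without a layer marker as a unit test, so that
    # pytest -m "unit" can be used in place of the exclusion expression.
    for item in items:
        if not any(item.get_closest_marker(name) for name in LAYER_MARKERS):
            item.add_marker("unit")
```

With this approach a test marked only real_s3 or real_spark would be labelled unit, so those tests would still need an explicit integration marker.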
© 2026 Stephen Adei · CC BY 4.0