Test separation: unit vs integration vs e2e
Current state
Task 1 (ETL) – `tasks/data_ingestion_transformation`
| Layer | How the layer is defined | Used in CI? |
|---|---|---|
| Unit | No @pytest.mark.unit used. Effectively “everything that is not integration / real_s3 / real_spark”. | Partly: “Python Unit Tests” runs specific files (test_etl.py, test_integration.py), not by marker. So some integration tests can run in that job if they live in those files. |
| Integration | @pytest.mark.integration (often with real_s3 or real_spark). Tests that need MinIO/S3 or real Spark. | Yes: “Integration Tests (MinIO)” runs selected files with -m "integration". |
| E2E | Not separated. No e2e marker. A few tests are described as “end-to-end” in docstrings (e.g. test_real_end_to_end_pipeline, test_main_full_pipeline) but are only marked integration or unmarked. | Same as integration (no distinct e2e run). |
Markers in `pytest.ini`: `unit`, `integration`, `slow`, `load`, `real_s3`, `real_spark`, `performance`.
Task 3 (SQL)
- Single test dir; no unit/integration/e2e markers. All tests are local (DuckDB in-memory) and behave like unit tests.
Task 4 (CI/CD)
- Markers are by concern, not by layer: `syntax`, `structure`, `workflow`, `terraform`, `integration`, `orchestration`. No `unit` or `e2e`. “Integration” here means “workflow/infra consistency”, not “run against real AWS”.
Gaps
- Unit – The `unit` marker exists in Task 1, but no test uses it. Unit tests are identified only by exclusion (`not integration and not real_s3` in the root Makefile), so “unit” is implicit.
- E2E – There is no `e2e` marker. True end-to-end tests (full pipeline with real S3, or the full CI flow) are mixed in with the other integration tests.
- CI – Separation is inconsistent:
  - “Run Python Unit Tests” = file-based (`test_etl.py`, `test_integration.py`), so anything in those files runs there (including integration-style tests in `test_integration.py`).
  - “Run Integration Tests” = marker-based (`-m "integration"`) on a fixed list of files.
  - So unit vs integration is not defined in one place (sometimes by file, sometimes by marker).
Recommended taxonomy
| Layer | Meaning | Markers / how to run |
|---|---|---|
| Unit | Single module/function, no real I/O (mocks or in-memory only). | @pytest.mark.unit or unmarked; exclude integration and e2e. |
| Integration | Multiple components or real infra (MinIO, Spark, etc.) but not full user journey. | @pytest.mark.integration; optionally real_s3 / real_spark. |
| E2E | Full pipeline or full flow (e.g. CSV → S3 → validate → silver, or full CI workflow). | @pytest.mark.e2e (add to pytest.ini and use on the few true e2e tests). |
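If this taxonomy is adopted, registering the new marker alongside the existing ones in Task 1’s `pytest.ini` might look like the sketch below (the one-line descriptions are illustrative, not taken from the actual file):

```ini
[pytest]
markers =
    unit: single module/function, no real I/O (mocks or in-memory only)
    integration: multiple components or real infra (MinIO, Spark)
    e2e: full pipeline or full user-facing flow
    slow: long-running tests
    load: load tests
    real_s3: requires a real S3/MinIO endpoint
    real_spark: requires a real Spark session
    performance: performance benchmarks
```

Registering `e2e` here also keeps `--strict-markers` (if enabled) from rejecting the newly marked tests.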
Suggested commands:
- Unit only: `pytest -m "not integration and not e2e"` (or `-m "unit"` once tests are marked).
- Integration only: `pytest -m "integration and not e2e"`.
- E2E only: `pytest -m "e2e"`.
CI: Run unit first (fast), then integration (with MinIO), then optionally e2e (or keep e2e as a subset of integration).
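Assuming the CI runs on GitHub Actions (the platform is not stated here), the staged layout could be sketched as follows; job names and the MinIO bootstrap step are illustrative:

```yaml
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt   # requirements path is assumed
      - run: pytest -m "not integration and not e2e"

  integration:
    needs: unit                                # only runs after unit passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: docker compose up -d minio        # assumes a compose service named "minio"
      - run: pytest -m "integration and not e2e"

  e2e:
    needs: integration
    if: github.ref == 'refs/heads/main'        # optional: run e2e only on main
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: docker compose up -d minio
      - run: pytest -m "e2e"
```

The `needs:` chain gives the fast-fail ordering described above; dropping the `e2e` job and folding its tests into `integration` is the “e2e as a subset of integration” variant.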
What to change (optional)
- Add an `e2e` marker in Task 1 (and Task 4 if needed) and mark the few full-pipeline tests (e.g. `test_real_end_to_end_pipeline`, `test_main_full_pipeline`) with `@pytest.mark.e2e` in addition to `integration` where they are integration tests that are also e2e.
- Optionally mark unit tests in Task 1 with `@pytest.mark.unit` so “unit” is explicit; CI can then run `pytest -m "unit"` for the unit job instead of a file list.
- CI: have “Python Unit Tests” select by marker, e.g. `pytest tests/ -v -m "not integration and not e2e"` (and ensure no unit test depends on MinIO), and keep “Integration Tests” as `-m "integration"`, so separation is consistent and marker-based.