CI Test Policy

Test markers

All Task 1 ETL tests use consistent markers so CI can select by test type:

Marker	Meaning
`unit`	No S3, no Spark cluster; uses mocks or in-memory fakes (e.g. FakeStoragePort). Fast feedback.
`integration`	Requires MinIO (or real S3). Exercises the full ETL path.
`real_s3`	Same requirements as `integration`; marks tests that need an S3 endpoint (MinIO or AWS).
`real_spark`	Requires a real Spark session (e.g. local[2]).

The markers are defined in tasks/data_ingestion_transformation/tests/pytest.ini. Pass -m "unit" or -m "integration and real_s3" to pytest to run a subset.
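
To illustrate what declaring these markers amounts to, the sketch below registers the same four names from a conftest.py hook instead of pytest.ini. It is not the repository's configuration: the MARKERS dictionary and its description strings are assumptions paraphrased from the table above, and the project itself keeps the declarations in pytest.ini.

```python
# conftest.py sketch: registers the four markers from the table above.
# MARKERS and its description strings are illustrative, not copied from the repo.
MARKERS = {
    "unit": "no S3, no Spark cluster; mocks or in-memory fakes",
    "integration": "requires MinIO (or real S3); full ETL path",
    "real_s3": "needs an S3 endpoint (MinIO or AWS)",
    "real_spark": "requires a real Spark session (e.g. local[2])",
}


def pytest_configure(config):
    """pytest hook: declare each marker so -m expressions can select on it."""
    for name, description in MARKERS.items():
        # Appends one "name: description" line to the "markers" ini option,
        # the same option a pytest.ini [pytest] section would populate.
        config.addinivalue_line("markers", f"{name}: {description}")
```

Declaring markers in one place (either way) lets pytest report typos in marker names when run with --strict-markers.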

PR (pull_request): Run unit tests plus integration tests for one backend (Pandas) against MinIO when it is available. Goal: fast feedback without the full scenario × backend matrix.
Push to main / nightly: Run the full matrix, i.e. unit + Pandas integration + PySpark integration (all scenario × backend combinations when MinIO is up). Optional: a separate job for real_s3 (real AWS) if needed.
Current workflow (.github/workflows/ci.yml): The Python job runs unit tests with -m "not integration and not e2e"; integration tests may run in a separate job with MinIO. For full coverage, run pytest tests/ -m "integration" (with MinIO) on main or in a nightly job.
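
The trigger policy above can be summarized as a small decision function. This is a minimal sketch, not code from the repository: PlannedJob, plan_jobs, the event names "push_main" and "nightly", and the backend labels "pandas" and "pyspark" are all hypothetical names chosen for illustration, and how a backend is actually selected in the test suite is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedJob:
    """One pytest selection to run (hypothetical helper type)."""
    marker_expr: str                 # value passed to pytest -m
    backends: tuple[str, ...] = ()   # backends to exercise; empty for unit tests


def plan_jobs(event: str, minio_available: bool) -> list[PlannedJob]:
    """Return the pytest selections for a CI trigger, following the policy above."""
    if event not in {"pull_request", "push_main", "nightly"}:
        raise ValueError(f"unknown CI event: {event!r}")

    jobs = [PlannedJob("unit")]  # unit tests run on every trigger
    if not minio_available:
        return jobs  # integration tests need MinIO, so skip them

    if event == "pull_request":
        backends = ("pandas",)            # one backend only, for fast feedback
    else:
        backends = ("pandas", "pyspark")  # full backend matrix on main / nightly
    jobs.append(PlannedJob("integration", backends))
    return jobs
```

Under these assumptions, plan_jobs("pull_request", minio_available=True) yields a unit job and a Pandas-only integration job, while the same call with "nightly" adds PySpark to the integration job.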

Quality gates: scripts/test_report/quality_gates.yaml defines:
- total_duration_warn_seconds: warn if the total test run takes longer than this many seconds (e.g. 600).
- slow_test_warn_seconds: warn if any single test takes longer than this many seconds (e.g. 60).
- skipped_warn_threshold: warn if the number of skipped tests is above this value (e.g. 5).
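
As a rough illustration of how these three thresholds could be applied, the sketch below checks a run summary against them. It is an assumption-laden example, not the logic in scripts/test_report: RunSummary, evaluate_gates, and the warning texts are hypothetical, and GATES hard-codes the "e.g." values from the list above rather than reading quality_gates.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Minimal view of a test run (hypothetical type)."""
    durations: dict[str, float]  # test id -> duration in seconds
    skipped: int                 # number of skipped tests


# Example thresholds mirroring the "e.g." values listed above.
GATES = {
    "total_duration_warn_seconds": 600,
    "slow_test_warn_seconds": 60,
    "skipped_warn_threshold": 5,
}


def evaluate_gates(summary: RunSummary, gates: dict[str, float]) -> list[str]:
    """Return one warning string per exceeded gate; an empty list means all clear."""
    warnings = []

    total = sum(summary.durations.values())
    if total > gates["total_duration_warn_seconds"]:
        limit = gates["total_duration_warn_seconds"]
        warnings.append(f"total duration {total:.0f}s exceeds {limit}s")

    # Each slow test gets its own warning so it can be tracked individually.
    for test_id, seconds in summary.durations.items():
        if seconds > gates["slow_test_warn_seconds"]:
            warnings.append(f"slow test {test_id}: {seconds:.0f}s")

    if summary.skipped > gates["skipped_warn_threshold"]:
        warnings.append(f"{summary.skipped} tests skipped")

    return warnings
```

All comparisons are strict, so a run that lands exactly on a threshold does not warn; whether the real gates treat the boundary the same way is not stated here.
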
Parallelization: To keep feedback time bounded, consider splitting by marker, with one job for unit and one for integration, so the two run in parallel. This split is not yet specified in the workflow; add it when needed.
Flakiness: If a test is flaky, fix or quarantine it; quality gates do not yet fail on flaky reruns.