CI Test Policy
Test markers
All Task 1 ETL tests use consistent markers so CI can select by test type:
| Marker | Meaning |
|---|---|
| `unit` | No S3, no Spark cluster; mocks or in-memory (e.g. `FakeStoragePort`). Fast feedback. |
| `integration` | Requires MinIO (or real S3). Full ETL path. |
| `real_s3` | Same as integration; marks tests that need an S3 endpoint (MinIO or AWS). |
| `real_spark` | Requires a real Spark session (e.g. `local[2]`). |
Markers are defined in `tasks/data_ingestion_transformation/tests/pytest.ini`. Use `-m "unit"` or `-m "integration and real_s3"` to run subsets.
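As a rough illustration of how the markers attach to tests, the sketch below tags one in-memory test and one MinIO-dependent test. The `put`/`get` interface on `FakeStoragePort`, the object key, and the test names are assumptions made for this example; the real port in the repository may look different.

```python
import pytest


class FakeStoragePort:
    """Hypothetical in-memory stand-in for the storage port (no S3 needed)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


@pytest.mark.unit
def test_roundtrip_in_memory():
    # Runs entirely in memory, so it is selected by: pytest -m "unit"
    storage = FakeStoragePort()
    storage.put("raw/transactions.csv", b"id,amount\n1,10\n")
    assert storage.get("raw/transactions.csv").startswith(b"id,amount")


@pytest.mark.integration
@pytest.mark.real_s3
def test_full_etl_against_minio():
    # Selected by: pytest -m "integration and real_s3"
    # The body is omitted because it depends on a running MinIO or S3 endpoint.
    pytest.skip("requires a MinIO or S3 endpoint")
```

Stacking `integration` and `real_s3` on the same test is what makes the combined expression `-m "integration and real_s3"` match it.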
CI policy
- PR (pull_request): Run unit tests plus one backend (Pandas) integration with MinIO when available. Goal: fast feedback without full scenario × backend matrix.
- Push to main / nightly: Run the full matrix: unit + Pandas integration + PySpark integration (all scenario × backend combinations when MinIO is up). Optional: a separate job for `real_s3` (real AWS) if needed.
- Current workflow (`.github/workflows/ci.yml`): The Python job runs unit tests with `-m "not integration and not e2e"`; integration tests may run in a separate job with MinIO. For full coverage, run `pytest tests/ -m "integration"` (with MinIO) on main or in a nightly job.
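The policy above can be read as a mapping from the triggering event to the set of test jobs. The following is a minimal sketch of that mapping, not the actual workflow logic: `PlannedJob` and `plan_for_event` are hypothetical names, and how a job restricts itself to a given backend is deliberately left open because this document does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedJob:
    """One CI test job (hypothetical structure for illustration)."""
    name: str
    marker_expr: str      # value passed to pytest -m
    backends: tuple = ()  # backends the job is expected to cover


def plan_for_event(event, ref=""):
    """Return the jobs the policy asks for, given a GitHub Actions event name."""
    unit = PlannedJob("unit", "not integration and not e2e")
    if event == "pull_request":
        # Fast feedback: unit tests plus a single backend (Pandas).
        return [unit, PlannedJob("integration", "integration", ("pandas",))]
    if event == "schedule" or (event == "push" and ref == "refs/heads/main"):
        # Full matrix on main and nightly: both backends.
        return [unit, PlannedJob("integration", "integration", ("pandas", "pyspark"))]
    # Any other trigger: unit tests only (an assumption, not stated in the policy).
    return [unit]


for job in plan_for_event("pull_request"):
    print(job.name, "->", f'pytest -m "{job.marker_expr}"', job.backends)
```

Keeping the mapping in one place makes it easier to see at a glance which events pay for the full scenario × backend matrix.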
Duration and stability
- Quality gates: `scripts/test_report/quality_gates.yaml` defines:
  - `total_duration_warn_seconds`: warn if the total test run exceeds this value (e.g. 600s).
  - `slow_test_warn_seconds`: warn if any single test exceeds this value (e.g. 60s).
  - `skipped_warn_threshold`: warn if the skipped count is above this value (e.g. 5).
- Parallelization: To keep feedback time bounded, consider splitting by marker: one job for `unit`, one for `integration`, so they run in parallel. Not yet specified in the workflow; add when needed.
- Flakiness: If a test is flaky, fix or quarantine it; quality gates do not yet fail on flaky reruns.
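To make the three thresholds concrete, here is a small sketch of how they might be applied to a finished test run. It assumes the gates have already been loaded into a dict (reading the YAML file is left out), uses the example values quoted above, and `evaluate_gates` is a hypothetical helper rather than the reporting script's real function.

```python
# Example values taken from the thresholds listed above.
GATES = {
    "total_duration_warn_seconds": 600,
    "slow_test_warn_seconds": 60,
    "skipped_warn_threshold": 5,
}


def evaluate_gates(durations, skipped, gates=GATES):
    """Return warning messages; an empty list means every gate passed.

    durations maps test name -> seconds; skipped is the skipped-test count.
    """
    warnings = []

    total = sum(durations.values())
    if total > gates["total_duration_warn_seconds"]:
        warnings.append(
            f"total duration {total:.1f}s exceeds "
            f"{gates['total_duration_warn_seconds']}s"
        )

    for name, seconds in sorted(durations.items()):
        if seconds > gates["slow_test_warn_seconds"]:
            warnings.append(
                f"{name} took {seconds:.1f}s "
                f"(limit {gates['slow_test_warn_seconds']}s)"
            )

    if skipped > gates["skipped_warn_threshold"]:
        warnings.append(
            f"{skipped} tests skipped "
            f"(threshold {gates['skipped_warn_threshold']})"
        )

    return warnings


# Hypothetical run: one slow test and six skips, total time well under the limit.
for message in evaluate_gates({"test_fast": 1.2, "test_slow": 75.0}, skipped=6):
    print("WARN:", message)
```

All three gates only warn, consistent with the note that quality gates do not yet fail the build.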
References
- quality_gates.yaml — thresholds (or see Test Dashboard).
- TEST_REPORT_ARTIFACT — report artifact for audits and debugging.