CI Test Policy

Test markers

All Task 1 ETL tests use consistent markers so CI can select by test type:

| Marker | Meaning |
| --- | --- |
| unit | No S3, no Spark cluster; mocks or in-memory (e.g. FakeStoragePort). Fast feedback. |
| integration | Requires MinIO (or real S3). Full ETL path. |
| real_s3 | Same as integration; marks tests that need S3 endpoint (MinIO or AWS). |
| real_spark | Requires real Spark session (e.g. local[2]). |

The markers are defined in tasks/data_ingestion_transformation/tests/pytest.ini. Use -m "unit" or -m "integration and real_s3" to run a subset.
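
As a minimal sketch of how these markers might be applied in a test module: the test names and bodies below are hypothetical and not taken from the repository; only the marker names come from the table above.

```python
import pytest


@pytest.mark.unit
def test_transform_in_memory():
    # Hypothetical unit test: pure in-memory logic, no S3 and no Spark.
    rows = [{"amount": "1.50"}, {"amount": "2.25"}]
    total = sum(float(r["amount"]) for r in rows)
    assert total == 3.75


@pytest.mark.integration
@pytest.mark.real_s3
def test_full_etl_against_minio():
    # Hypothetical integration test: a real version would need a reachable
    # S3 endpoint (MinIO or AWS). This placeholder simply skips itself.
    pytest.skip("requires MinIO or S3; placeholder body")
```

With markers applied this way, -m "unit" would select only the first test and -m "integration and real_s3" only the second.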

CI policy

  • PR (pull_request): Run unit tests plus integration tests for one backend (Pandas) against MinIO when it is available. Goal: fast feedback without the full scenario × backend matrix.
  • Push to main / nightly: Run the full matrix: unit + Pandas integration + PySpark integration (all scenario × backend combinations when MinIO is up). Optional: a separate job for real_s3 (real AWS) if needed.
  • Current workflow (.github/workflows/ci.yml): The Python job runs unit tests with -m "not integration and not e2e"; integration tests may run in a separate job with MinIO. For full coverage, run pytest tests/ -m "integration" (with MinIO) on main or in a nightly job.
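
The policy above could be sketched as a small helper that maps the triggering event to the pytest invocations to run. This is only an illustration, not the contents of ci.yml: the function name pytest_runs_for is hypothetical, and selecting the Pandas backend with -k "pandas" assumes that Pandas-backed tests carry "pandas" in their test IDs, which the source does not confirm. GITHUB_EVENT_NAME is the environment variable GitHub Actions sets to the event name.

```python
import os
import shlex


def pytest_runs_for(event_name: str) -> list[list[str]]:
    """Return one pytest command per run for a CI event (hypothetical helper)."""
    unit = ["pytest", "tests/", "-m", "unit"]
    if event_name == "pull_request":
        # Assumption: Pandas integration tests have "pandas" in their test IDs,
        # so -k can restrict the integration run to that single backend.
        return [unit, ["pytest", "tests/", "-m", "integration", "-k", "pandas"]]
    # Push to main or a scheduled (nightly) run: full integration matrix.
    return [unit, ["pytest", "tests/", "-m", "integration"]]


if __name__ == "__main__":
    event = os.environ.get("GITHUB_EVENT_NAME", "pull_request")
    for command in pytest_runs_for(event):
        print(shlex.join(command))
```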

Duration and stability

  • Quality gates: scripts/test_report/quality_gates.yaml defines:
    • total_duration_warn_seconds: warn if the total test run exceeds this many seconds (e.g. 600s).
    • slow_test_warn_seconds: warn if any single test exceeds this many seconds (e.g. 60s).
    • skipped_warn_threshold: warn if the number of skipped tests is above this value (e.g. 5).
  • Parallelization: To keep feedback time bounded, consider splitting by marker: one job for unit and one for integration, so they run in parallel. This split is not yet specified in the workflow; add it when needed.
  • Flakiness: If a test is flaky, fix or quarantine it; quality gates do not yet fail on flaky reruns.
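
To make the three quality gates concrete, here is a minimal sketch of how such thresholds could be checked against test results. It is not the repository's reporting script: CaseResult and evaluate_gates are hypothetical names, and the thresholds are inlined as a dict with the example values above instead of being read from quality_gates.yaml.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical record for one test outcome.
    name: str
    duration_seconds: float
    skipped: bool = False


def evaluate_gates(results: list[CaseResult], gates: dict) -> list[str]:
    """Return warning messages only; no gate here fails the run."""
    warnings = []
    total = sum(r.duration_seconds for r in results)
    if total > gates["total_duration_warn_seconds"]:
        warnings.append(f"total duration {total:.1f}s exceeds threshold")
    for r in results:
        if r.duration_seconds > gates["slow_test_warn_seconds"]:
            warnings.append(f"slow test {r.name}: {r.duration_seconds:.1f}s")
    skipped = sum(1 for r in results if r.skipped)
    if skipped > gates["skipped_warn_threshold"]:
        warnings.append(f"{skipped} tests skipped")
    return warnings


# Example thresholds mirroring the "e.g." values in the list above.
GATES = {
    "total_duration_warn_seconds": 600,
    "slow_test_warn_seconds": 60,
    "skipped_warn_threshold": 5,
}

results = [CaseResult("test_fast", 1.0), CaseResult("test_slow", 75.0)]
for message in evaluate_gates(results, GATES):
    print(message)
```

In this example only the slow-test gate triggers: the 76 seconds total is under the 600-second threshold and no tests are skipped.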

© 2026 Stephen Adei. CC BY 4.0