AWS Testing: MinIO vs Real AWS
Summary: S3 behaviour is tested via MinIO (S3-compatible) in CI and locally. Lambda, Glue, and Step Functions are tested structurally (workflow + Terraform) and at runtime only after a real deployment via a smoke script. One ETL test is intended for real S3.
1. What is tested today
S3 (data layer)
| How | Where | What |
|---|---|---|
| MinIO | Task 1 Docker + CI (MinIO service) | All S3-style operations: read/write CSV and Parquet, success markers, quarantine/condemned writes, run_id isolation, schema versioning, error recovery. Uses same boto3 API as AWS. |
| Real S3 | Not in CI | One test, test_s3_large_file_handling, asserts Parquet compression keeps 100K rows under a size bound; it can fail on MinIO (different compression/partitioning) and is intended for real S3. |
So: S3 logic and paths are well covered by MinIO; the only gap is that one compression/size assertion.
Lambda, Glue, Step Functions
| How | Where | What |
|---|---|---|
| Static | Task 4 pytest | Workflow YAML and Terraform are validated (syntax, structure, job steps, resource names). No AWS calls. |
| Smoke (real AWS) | tasks/devops_cicd/scripts/smoke_tests.sh | Run after Terraform apply in CD. Checks S3 buckets exist, Glue job exists, Step Functions state machine ACTIVE, Lambda exists (optional invoke). Requires AWS credentials and deployed infra. |
| Pytest with creds | make test-task4-aws | When AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set, test_deployment_smoke_test() runs and only checks STS get_caller_identity (proves creds work), not deployed resources. |
So: AWS “parts” for orchestration are tested by config + post-deploy smoke, not by unit/integration tests against a live AWS account in CI.
2. Coverage of AWS components
- S3: Yes. MinIO gives high API parity; your ETL tests cover reads, writes, partitioning, and error handling. The single known difference is the large-file compression test.
- Lambda / Glue / Step Functions: Partially. Structure and naming are tested; runtime behaviour is only tested when you deploy and run smoke_tests.sh (or run the CD pipeline manually). There is no automated test in CI that invokes Lambda/Glue/Step Functions without a real deployment.
So: S3 is well tested via MinIO; orchestration is “design + smoke after deploy.”
3. Is MinIO sufficient, or is there an advantage to a free AWS account?
MinIO is sufficient for
- Daily development and CI: all ETL tests (except the one size test), scenario tests, and integration tests run against MinIO.
- Validating S3 API usage, paths, partitioning, and error handling.
- No cost, no AWS account, fast feedback in Docker/GitHub Actions.
Where a free AWS account helps
- test_s3_large_file_handling: Run the test against real S3 (no S3_ENDPOINT_URL, real credentials). It then validates Parquet compression as intended. Optional: skip or relax this test on MinIO (e.g. skip when S3_ENDPOINT_URL is set) so CI stays green while keeping the assertion for real S3.
- Post-deploy smoke: smoke_tests.sh needs a real deployment (Terraform apply) and AWS credentials. A free-tier or low-cost dev account lets you run CD → deploy → smoke and confirm that the buckets, Glue job, Step Functions state machine, and Lambda exist and are reachable.
- AWS-specific behaviour: IAM, bucket policies, lifecycle rules, multipart uploads, and region/endpoint behaviour can differ slightly from MinIO. A real account catches those differences if you run the same tests or smoke checks there.
- Interview/demo: A green run against real S3 plus a green smoke test after deploy is stronger than "testing is only with MinIO and static checks."
Practical recommendation
- Keep MinIO as the default for all S3 and ETL testing in CI and locally.
- Optional: free AWS account
  - Use it to run test_s3_large_file_handling against real S3 when you want to confirm compression.
  - Use it to deploy (e.g. to staging) and run smoke_tests.sh so Lambda/Glue/Step Functions are validated in a real environment.
- Cost: Stay within free tier (S3, Lambda, Step Functions; Glue has free tier limits). Use a separate dev account or clearly tagged resources and tear down when not needed.
4. Making the large-file test CI-friendly (optional)
To avoid the single “expected failure” in MinIO, you can skip or relax the size assertion when running against MinIO:
```python
# In test_s3_large_file_handling, after verifying total_size > 0:
if _is_minio_env():  # e.g. returns True when S3_ENDPOINT_URL is set
    pytest.skip("Compression size bound is validated on real S3; MinIO can exceed it")
else:
    assert total_size < num_rows * 200
```
Then CI stays green; run the full assertion only when testing against real S3 (e.g. locally with AWS creds and no S3_ENDPOINT_URL).
5. See also
- TESTING_MANUAL.md — How to run tests and where reports are.
- tasks/data_ingestion_transformation/tests/README_TESTING.md (source only) — Task 1 test status (including test_s3_large_file_handling).
- tasks/devops_cicd/scripts/README.md (source only) — Smoke script usage.