AWS Testing: MinIO vs Real AWS
Summary: S3 behaviour is tested via MinIO (S3-compatible) in CI and locally. Lambda, Glue, and Step Functions are tested structurally (workflow + Terraform) and at runtime only after a real deployment via a smoke script. One ETL test is intended for real S3.
1. What is tested today
S3 (data layer)
| How | Where | What |
|---|---|---|
| MinIO | Task 1 Docker + CI (MinIO service) | All S3-style operations: read/write CSV and Parquet, success markers, quarantine/condemned writes, run_id isolation, schema versioning, error recovery. Uses same boto3 API as AWS. |
| Real S3 | Not in CI | One test, test_s3_large_file_handling, asserts Parquet compression keeps 100K rows under a size bound; it can fail on MinIO (different compression/partitioning) and is intended for real S3. |
So: S3 logic and paths are well covered by MinIO; the only gap is that one compression/size assertion.
Lambda, Glue, Step Functions
| How | Where | What |
|---|---|---|
| Static | Task 4 pytest | Workflow YAML and Terraform are validated (syntax, structure, job steps, resource names). No AWS calls. |
| Smoke (real AWS) | tasks/devops_cicd/scripts/smoke_tests.sh | Run after Terraform apply in CD. Checks S3 buckets exist, Glue job exists, Step Functions state machine ACTIVE, Lambda exists (optional invoke). Requires AWS credentials and deployed infra. |
| Pytest with creds | make test-task4-aws | When AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set, test_deployment_smoke_test() runs and only checks STS get_caller_identity (proves creds work), not deployed resources. |
So: AWS “parts” for orchestration are tested by config + post-deploy smoke, not by unit/integration tests against a live AWS account in CI.
2. Coverage of AWS components
- S3: Yes. MinIO gives high API parity; your ETL tests cover reads, writes, partitioning, and error handling. The single known difference is the large-file compression test.
- Lambda / Glue / Step Functions: Partially. Structure and naming are tested; runtime behaviour is only tested when you deploy and run smoke_tests.sh (or run the CD pipeline manually). There is no automated test in CI that invokes Lambda/Glue/Step Functions without a real deployment.
So: S3 is well tested via MinIO; orchestration is “design + smoke after deploy.”
3. Is MinIO sufficient, or is there an advantage to a free AWS account?
MinIO is sufficient for
- Daily development and CI: all ETL tests (except the one size test), scenario tests, and integration tests run against MinIO.
- Validating S3 API usage, paths, partitioning, and error handling.
- No cost, no AWS account, fast feedback in Docker/GitHub Actions.
Where a free AWS account helps
- test_s3_large_file_handling: Run the test against real S3 (no S3_ENDPOINT_URL, real credentials). It then validates Parquet compression as intended. Optional: skip or relax this test on MinIO (e.g. skip when S3_ENDPOINT_URL is set) so CI stays green while keeping the assertion for real S3.
- Post-deploy smoke: smoke_tests.sh needs a real deployment (Terraform apply) and AWS credentials. A free-tier or low-cost dev account lets you run CD → deploy → smoke and confirm that the buckets, Glue job, Step Functions state machine, and Lambda exist and are reachable.
- AWS-specific behaviour: IAM, bucket policies, lifecycle rules, multipart uploads, and region/endpoint behaviour can differ slightly from MinIO. A real account catches those differences if you run the same tests or smoke checks there.
- Interview/demo: A green run against real S3 plus a green smoke test after deploy is stronger than "testing is only with MinIO and static checks."
Practical recommendation
- Keep MinIO as the default for all S3 and ETL testing in CI and locally.
- Optional: free AWS account
  - Use it to run test_s3_large_file_handling against real S3 when you want to confirm compression.
  - Use it to deploy (e.g. to staging) and run smoke_tests.sh so Lambda/Glue/Step Functions are validated in a real environment.
- Cost: Stay within free tier (S3, Lambda, Step Functions; Glue has free tier limits). Use a separate dev account or clearly tagged resources and tear down when not needed.
4. Making the large-file test CI-friendly (optional)
To avoid the single “expected failure” in MinIO, you can skip or relax the size assertion when running against MinIO:
```python
# In test_s3_large_file_handling, after verifying total_size > 0:
if _is_minio_env():  # e.g. returns True when S3_ENDPOINT_URL is set
    pytest.skip("Compression size bound is validated on real S3; MinIO can exceed it")
else:
    assert total_size < num_rows * 200
```
Then CI stays green; run the full assertion only when testing against real S3 (e.g. locally with AWS creds and no S3_ENDPOINT_URL).
5. See also
- TESTING_MANUAL.md — How to run tests and where reports are.
- tasks/data_ingestion_transformation/tests/README_TESTING.md (source only) — Task 1 test status (including test_s3_large_file_handling).
- tasks/devops_cicd/scripts/README.md (source only) — Smoke script usage.