Scale Assumptions
This document records scale and load assumptions for the data pipeline and data lake. It informs capacity reasoning and when to re-evaluate the architecture (e.g. streaming).
Ingestion (writes)
| Assumption | Value / description | Source |
|---|---|---|
| Trigger model | Batch: schedule (e.g. daily) and/or S3 object creation (file land). | Case study; ETL_FLOW; Terraform EventBridge rules |
| Runs per day | Not specified. Designed for “one or more runs per day” without assuming a fixed rate. | — |
| Files per run | Typically one or few input files per run; design supports multiple files. | Case: “raw financial data in CSV … stored in S3” |
| Streaming QPS | Not assumed. No real-time event rate (events per second) required. | Explicit: case does not specify QPS |
| Engine threshold | Pandas (Lambda): <10M rows or <500MB per file; PySpark (Glue): ≥10M rows or ≥500MB. | ETL_FLOW.md |
Storage and retention
| Assumption | Value / description | Source |
|---|---|---|
| Design target (analytics) | 100M rows for month-end / analytical queries. | Case study Task 3; QUALITY_REQUIREMENTS |
| Partitioning | Year/month for Silver; partition pruning reduces scan (e.g. 95% for Q1 query). | ARCHITECTURE; balance_history_2024_q1.sql |
| Retention | Per-layer lifecycle (e.g. Silver/Gold retention, Glacier for older tiers); perpetual retention for condemned where required by compliance. | Terraform lifecycle rules; QUALITY_REQUIREMENTS |
Queries (reads)
| Assumption | Value / description | Source |
|---|---|---|
| Query pattern | Batch and ad-hoc (e.g. month-end reports, analyst SQL). | Case study; Athena workgroup |
| Query latency target | <30 seconds for 100M-row query (with partition pruning). | QUALITY_REQUIREMENTS |
| Read QPS | Not specified. Design supports multiple concurrent Athena queries via serverless scaling. | — |
When to re-evaluate
- High write QPS: If the business requires sustained high event rate (e.g. thousands of events per second), consider a streaming path (Kinesis, Lambda/KDA) in addition to or instead of batch Glue.
- Strict latency: If sub-second or near-real-time ingestion or query is required, the design would need to change (streaming, different query layer).
- Concrete load numbers: When the business case provides files/day, runs/day, or query volume, add them here and use them for capacity and cost reasoning.