Scale Assumptions

This document records scale and load assumptions for the data pipeline and data lake. It informs capacity reasoning and indicates when to re-evaluate the architecture (e.g. to add a streaming path).

Ingestion (writes)

Assumption	Value / description	Source
Trigger model	Batch, triggered on a schedule (e.g. daily) and/or on S3 object creation (when a file lands).	Case study; ETL_FLOW; Terraform EventBridge rules
Runs per day	Not specified. Designed for “one or more runs per day” without assuming a fixed rate.	—
Files per run	Typically one or few input files per run; design supports multiple files.	Case: “raw financial data in CSV … stored in S3”
Streaming QPS	Not assumed. No real-time event rate (events per second) required.	Explicit: case does not specify QPS
Engine threshold	Pandas (Lambda): <10M rows and <500 MB per file; PySpark (Glue): ≥10M rows or ≥500 MB per file.	ETL_FLOW.md
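
The engine threshold can be read as a simple routing rule. The sketch below is illustrative only: it assumes the thresholds mean "use Glue if either limit is reached, otherwise Lambda", and it assumes "500MB" means 500 MiB. The function name and return labels are hypothetical and do not come from ETL_FLOW.md.

```python
# Thresholds as recorded in the table above (source: ETL_FLOW.md).
ROW_THRESHOLD = 10_000_000
SIZE_THRESHOLD_BYTES = 500 * 1024 * 1024  # assumption: "500MB" read as 500 MiB


def choose_engine(row_count: int, size_bytes: int) -> str:
    """Pick an engine label for one input file (hypothetical helper).

    Returns "pyspark-glue" if either threshold is reached,
    otherwise "pandas-lambda".
    """
    if row_count < 0 or size_bytes < 0:
        raise ValueError("row_count and size_bytes must be non-negative")
    if row_count >= ROW_THRESHOLD or size_bytes >= SIZE_THRESHOLD_BYTES:
        return "pyspark-glue"
    return "pandas-lambda"


# A small file stays on Lambda; a file that is large in either dimension goes to Glue.
print(choose_engine(250_000, 40 * 1024 * 1024))      # pandas-lambda
print(choose_engine(5_000_000, 600 * 1024 * 1024))   # pyspark-glue
```

Under this reading a file with few rows but a large byte size still routes to Glue, which keeps the two ranges from overlapping.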

Storage and retention

Assumption	Value / description	Source
Design target (analytics)	100M rows for month-end / analytical queries.	Case study Task 3; QUALITY_REQUIREMENTS
Partitioning	Year/month for Silver; partition pruning reduces the data scanned (e.g. by 95% for the Q1 query).	ARCHITECTURE; balance_history_2024_q1.sql
Retention	Per-layer lifecycle rules (e.g. Silver/Gold retention periods, transition of older data to Glacier); condemned data is retained perpetually where compliance requires it.	Terraform lifecycle rules; QUALITY_REQUIREMENTS
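
As a rough illustration of how a pruning figure such as 95% can arise, the sketch below computes the scan reduction from partition counts. It assumes partitions of roughly equal size and a hypothetical history of 60 monthly partitions (five years); neither number is stated in the sources above, and the actual measurement lives with balance_history_2024_q1.sql.

```python
def pruning_reduction(partitions_scanned: int, partitions_total: int) -> float:
    """Fraction of data skipped by partition pruning.

    Assumes all partitions are roughly the same size, which is a
    simplification made for this example.
    """
    if partitions_total <= 0:
        raise ValueError("partitions_total must be positive")
    if not 0 <= partitions_scanned <= partitions_total:
        raise ValueError("partitions_scanned must be between 0 and partitions_total")
    return 1.0 - partitions_scanned / partitions_total


# Hypothetical: a Q1 query touches 3 of 60 year/month partitions.
print(f"{pruning_reduction(3, 60):.0%}")  # 95%
```

With a shorter history the reduction is smaller: 3 of 12 partitions would skip 75% of the data, so the figure should be re-derived as the lake grows.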

Queries (reads)

Assumption	Value / description	Source
Query pattern	Batch and ad-hoc (e.g. month-end reports, analyst SQL).	Case study; Athena workgroup
Query latency target	<30 seconds for 100M-row query (with partition pruning).	QUALITY_REQUIREMENTS
Read QPS	Not specified. Design supports multiple concurrent Athena queries via serverless scaling.	—
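
The latency target implies a minimum effective processing rate, which is handy for sanity-checking query results. The sketch below is plain arithmetic on the numbers in the table; it assumes the 100M rows are the rows the query must cover within the 30-second budget, and the helper name is hypothetical.

```python
def required_rows_per_second(rows: int, latency_seconds: float) -> float:
    """Minimum effective rows/second needed to meet a latency target."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return rows / latency_seconds


# Target from QUALITY_REQUIREMENTS: 100M rows in under 30 seconds.
print(f"{required_rows_per_second(100_000_000, 30):,.0f} rows/s")  # 3,333,333 rows/s
```

If partition pruning removes most of the data before the scan, the rate Athena actually has to sustain is correspondingly lower.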

When to re-evaluate

High write QPS: If the business requires a sustained high event rate (e.g. thousands of events per second), consider a streaming path (Kinesis, Lambda/KDA) in addition to or instead of batch Glue.
Strict latency: If sub-second or near-real-time ingestion or querying is required, the design needs to change (a streaming path, a different query layer).
Concrete load numbers: When the business case provides files/day, runs/day, or query volume, add them here and use them for capacity and cost reasoning.
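
Once concrete load numbers exist, the triggers above could be checked mechanically. The following is a minimal sketch, not part of the pipeline: the class, the function, and both thresholds are assumptions chosen for illustration (1,000 events per second as the lower bound of "thousands", and 1 second as the boundary of "sub-second").

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed thresholds; the source gives only qualitative wording.
HIGH_EVENT_RATE_PER_SECOND = 1_000.0   # lower bound of "thousands of events per second"
SUB_SECOND_LATENCY_SECONDS = 1.0       # boundary of "sub-second"


@dataclass
class ObservedLoad:
    """Load figures supplied by the business case; None means not specified."""
    events_per_second: Optional[float] = None
    required_latency_seconds: Optional[float] = None


def reevaluation_reasons(load: ObservedLoad) -> List[str]:
    """Return which re-evaluation triggers apply to the given load."""
    reasons: List[str] = []
    if (load.events_per_second is not None
            and load.events_per_second >= HIGH_EVENT_RATE_PER_SECOND):
        reasons.append("high write QPS")
    if (load.required_latency_seconds is not None
            and load.required_latency_seconds < SUB_SECOND_LATENCY_SECONDS):
        reasons.append("strict latency")
    return reasons


# Current assumptions (nothing specified) trigger no re-evaluation.
print(reevaluation_reasons(ObservedLoad()))                          # []
print(reevaluation_reasons(ObservedLoad(events_per_second=5_000)))   # ['high write QPS']
```

Unspecified values are treated as "no trigger", matching the current state in which neither an event rate nor a strict latency is required.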
