
Scale Assumptions

This document records the scale and load assumptions for the data pipeline and data lake. It informs capacity reasoning and identifies when the architecture should be re-evaluated (e.g. to add a streaming path).

Ingestion (writes)

| Assumption | Value / description | Source |
|---|---|---|
| Trigger model | Batch: schedule (e.g. daily) and/or S3 object creation (file land). | Case study; ETL_FLOW; Terraform EventBridge rules |
| Runs per day | Not specified. Designed for “one or more runs per day” without assuming a fixed rate. | |
| Files per run | Typically one or few input files per run; design supports multiple files. | Case: “raw financial data in CSV … stored in S3” |
| Streaming QPS | Not assumed. No real-time event rate (events per second) required. | Explicit: case does not specify QPS |
| Engine threshold | Pandas (Lambda): <10M rows or <500MB per file; PySpark (Glue): ≥10M rows or ≥500MB. | ETL_FLOW.md |
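The engine threshold can be read as a simple routing rule. The sketch below is illustrative only: `InputFile`, `choose_engine`, and the returned labels are hypothetical names, not taken from ETL_FLOW.md. It also assumes that a file reaching either limit goes to Glue (the two conditions as written overlap for files that are small in one dimension and large in the other) and that "500MB" means binary megabytes.

```python
from dataclasses import dataclass

# Limits from the engine-threshold row (ETL_FLOW.md).
ROW_THRESHOLD = 10_000_000             # 10M rows
SIZE_THRESHOLD_BYTES = 500 * 1024**2   # 500MB; binary megabytes are an assumption


@dataclass(frozen=True)
class InputFile:
    """Hypothetical descriptor for one landed input file."""
    key: str
    size_bytes: int
    estimated_rows: int


def choose_engine(f: InputFile) -> str:
    """Route to PySpark (Glue) if either limit is reached, else Pandas (Lambda).

    Assumption: "either limit reached" wins, so a file is only handled by
    Pandas when it is below both the row and the size threshold.
    """
    if f.estimated_rows >= ROW_THRESHOLD or f.size_bytes >= SIZE_THRESHOLD_BYTES:
        return "glue-pyspark"
    return "lambda-pandas"


if __name__ == "__main__":
    small = InputFile("raw/small.csv", size_bytes=20 * 1024**2, estimated_rows=200_000)
    wide = InputFile("raw/wide.csv", size_bytes=600 * 1024**2, estimated_rows=2_000_000)
    print(choose_engine(small))  # lambda-pandas
    print(choose_engine(wide))   # glue-pyspark
```

Row counts are usually unknown before a file is read, so in practice `estimated_rows` would have to come from an estimate (for example, file size divided by an average row size); that estimation is outside this sketch.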

Storage and retention

| Assumption | Value / description | Source |
|---|---|---|
| Design target (analytics) | 100M rows for month-end / analytical queries. | Case study Task 3; QUALITY_REQUIREMENTS |
| Partitioning | Year/month for Silver; partition pruning reduces scan (e.g. 95% for Q1 query). | ARCHITECTURE; balance_history_2024_q1.sql |
| Retention | Per-layer lifecycle (e.g. Silver/Gold retention, Glacier for older tiers); perpetual retention for condemned where required by compliance. | Terraform lifecycle rules; QUALITY_REQUIREMENTS |
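To make the partition-pruning figure concrete, the sketch below shows one way a 95% scan reduction for a Q1 query could arise. It assumes 60 year/month partitions (five years of history) of roughly equal size, which is an illustrative assumption and not a number from the source documents; `scan_reduction` is a hypothetical helper.

```python
def scan_reduction(partitions_total: int, partitions_read: int) -> float:
    """Fraction of partitions skipped thanks to pruning.

    Equals the fraction of data skipped only if partitions are about
    the same size (an assumption of this sketch).
    """
    if partitions_total <= 0:
        raise ValueError("partitions_total must be positive")
    if not 0 <= partitions_read <= partitions_total:
        raise ValueError("partitions_read must be between 0 and partitions_total")
    return 1 - partitions_read / partitions_total


if __name__ == "__main__":
    # Assumed: 5 years x 12 months = 60 partitions; a Q1 query reads 3 of them.
    print(f"{scan_reduction(60, 3):.0%}")  # 95%
```

With less history the reduction is smaller: a single year of data gives 1 - 3/12 = 75% for the same Q1 query.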

Queries (reads)

| Assumption | Value / description | Source |
|---|---|---|
| Query pattern | Batch and ad-hoc (e.g. month-end reports, analyst SQL). | Case study; Athena workgroup |
| Query latency target | <30 seconds for 100M-row query (with partition pruning). | QUALITY_REQUIREMENTS |
| Read QPS | Not specified. Design supports multiple concurrent Athena queries via serverless scaling. | |
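The latency target implies a minimum effective scan rate. The arithmetic below is a rough sketch, not a benchmark: the source does not say whether "100M-row query" means 100M rows scanned or a 100M-row table before pruning, so both readings are shown. The 5% remaining after pruning reuses the 95% example figure from the storage table and is an assumption here; `required_rows_per_second` is a hypothetical helper.

```python
def required_rows_per_second(rows_scanned: int, latency_target_s: float) -> float:
    """Minimum average scan rate needed to finish within the latency target."""
    if latency_target_s <= 0:
        raise ValueError("latency_target_s must be positive")
    return rows_scanned / latency_target_s


TOTAL_ROWS = 100_000_000   # design target from QUALITY_REQUIREMENTS
LATENCY_TARGET_S = 30.0    # "<30 seconds"

# Reading 1: all 100M rows are scanned.
full_scan_rate = required_rows_per_second(TOTAL_ROWS, LATENCY_TARGET_S)

# Reading 2 (assumed): pruning leaves 5% of the rows to scan.
pruned_rows = TOTAL_ROWS * 5 // 100
pruned_rate = required_rows_per_second(pruned_rows, LATENCY_TARGET_S)

if __name__ == "__main__":
    print(f"{full_scan_rate:,.0f} rows/s")  # 3,333,333 rows/s
    print(f"{pruned_rate:,.0f} rows/s")     # 166,667 rows/s
```

These are lower bounds on average throughput only; they ignore query planning, queueing, and result transfer time.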

When to re-evaluate

  • High write QPS: If the business requires a sustained high event rate (e.g. thousands of events per second), consider a streaming path (Kinesis, Lambda/KDA) in addition to or instead of batch Glue.
  • Strict latency: If sub-second or near-real-time ingestion or querying is required, the design would need to change (streaming ingestion, a different query layer).
  • Concrete load numbers: When the business case provides files/day, runs/day, or query volume, add them here and use them for capacity and cost reasoning.
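The three triggers above can be expressed as a small checklist function. This is a minimal sketch under stated assumptions: `LoadProfile` and `reevaluation_reasons` are hypothetical names, and the numeric cut-offs (1,000 events per second for "thousands of events per second", 1 second for "sub-second") are illustrative choices, since the document gives no exact thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed cut-offs; the document states these only qualitatively.
HIGH_EVENT_RATE_PER_S = 1_000   # "thousands of events per second"
STRICT_LATENCY_S = 1.0          # "sub-second"


@dataclass
class LoadProfile:
    """Hypothetical record of what the business case specifies (None = unknown)."""
    sustained_events_per_s: Optional[float] = None
    required_latency_s: Optional[float] = None
    files_per_day: Optional[int] = None
    runs_per_day: Optional[int] = None
    queries_per_day: Optional[int] = None


def reevaluation_reasons(p: LoadProfile) -> List[str]:
    """Return the re-evaluation triggers that apply to the given profile."""
    reasons: List[str] = []
    if (p.sustained_events_per_s is not None
            and p.sustained_events_per_s >= HIGH_EVENT_RATE_PER_S):
        reasons.append("high write QPS: consider a streaming path")
    if p.required_latency_s is not None and p.required_latency_s < STRICT_LATENCY_S:
        reasons.append("strict latency: batch design would need to change")
    if any(v is not None for v in (p.files_per_day, p.runs_per_day, p.queries_per_day)):
        reasons.append("concrete load numbers: record them and redo capacity and cost reasoning")
    return reasons


if __name__ == "__main__":
    print(reevaluation_reasons(LoadProfile()))  # []
    print(reevaluation_reasons(LoadProfile(sustained_events_per_s=5_000, runs_per_day=24)))
```

An empty result corresponds to the current state of this document, where no write rate, latency requirement, or load numbers have been specified.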
© 2026 Stephen Adei. CC BY 4.0