ETL Pipeline Diagrams
© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.
This section contains all detailed diagrams for the ETL pipeline. For simplified, abstracted versions, see the main ETL Flow & Pseudocode document.
Note: These diagrams reflect the PySpark implementation, which is recommended for production. The Pandas implementation, intended for development and testing, leaves some features stubbed (the quarantine history check).
ETL Diagrams
High-Level Data Flow
Detailed Validation Flow
Note: This diagram reflects the PySpark implementation (production). The Pandas version has a stubbed quarantine history check.
S3 Storage Structure
Component Interaction
Error Handling & Resilience
Data Quality Metrics Flow
Code structure (module layout)
The implementation is modular so flow logic can be tested and adapters swapped (e.g. for local runs or tests).
| Layer / area | Purpose |
|---|---|
| Entry points | Pandas: ingest_transactions.py (CLI) → application/ingest_use_case.run_ingest_pandas. PySpark: ingest_transactions_spark.py orchestrates inline (no separate use-case layer). |
| Application | application/ingest_use_case.py – Pandas run orchestration (extracted from CLI for testing). Can be extended to accept injected adapters. |
| Ports | application/ports.py – Clock (e.g. utc_now() for run_id / ingest_date), MetricsPublisher (e.g. publish(metrics)). Used for testability and swapping implementations. |
| Adapters | adapters/cloudwatch.py – CloudWatch metrics; adapters/clock.py – real time; adapters/bedrock.py – GenAI (e.g. quarantine descriptions). |
| Shared modules | config.py (constants), paths.py (Silver/quarantine/condemned prefixes), run_context.py (run_id, ingest_date), partition_pruning.py (event date range), metrics.py (calculate + publish). |
| Engine-specific | Pandas: validator, loop_prevention, metadata, s3_operations. PySpark: validator_spark, loop_prevention_spark, metadata_spark, s3_operations_spark. Same validation and loop-prevention rules; only execution engine differs. |
The diagrams and pseudocode in this reference describe the flow. The code realises that flow with the structure above, so orchestration, I/O, and time can each be replaced independently in tests and across environments.
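To illustrate how the ports in the table make orchestration testable, here is a minimal Python sketch. Only the port names `Clock` and `MetricsPublisher` and their methods `utc_now()` and `publish(metrics)` come from the table. Everything else is an assumption made for illustration: `SystemClock`, `FixedClock`, `InMemoryMetricsPublisher`, `run_ingest_sketch`, the `run_id` format, the metric keys, and the placeholder "amount is present" rule. None of it is the actual contents of `application/ports.py` or `application/ingest_use_case.py`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Protocol


class Clock(Protocol):
    """Port: source of the current UTC time (used for run_id / ingest_date)."""

    def utc_now(self) -> datetime: ...


class MetricsPublisher(Protocol):
    """Port: sink for run metrics (CloudWatch in production, per the table)."""

    def publish(self, metrics: dict) -> None: ...


class SystemClock:
    """Hypothetical real-time adapter."""

    def utc_now(self) -> datetime:
        return datetime.now(timezone.utc)


@dataclass
class FixedClock:
    """Hypothetical test adapter: always returns the same instant."""

    now: datetime

    def utc_now(self) -> datetime:
        return self.now


@dataclass
class InMemoryMetricsPublisher:
    """Hypothetical test adapter: records metrics instead of sending them."""

    published: list = field(default_factory=list)

    def publish(self, metrics: dict) -> None:
        self.published.append(dict(metrics))


def run_ingest_sketch(clock: Clock, publisher: MetricsPublisher, rows: list[dict]) -> dict:
    """Toy orchestration: derive run context, apply a placeholder rule, publish metrics."""
    now = clock.utc_now()
    run_id = now.strftime("%Y%m%dT%H%M%SZ")  # assumed format, not the project's
    ingest_date = now.date().isoformat()

    # Placeholder validation rule (assumption): a row is valid if it has an amount.
    valid = [row for row in rows if row.get("amount") is not None]

    metrics = {
        "run_id": run_id,
        "ingest_date": ingest_date,
        "input_rows": len(rows),
        "valid_rows": len(valid),
        "quarantined_rows": len(rows) - len(valid),
    }
    publisher.publish(metrics)
    return metrics


if __name__ == "__main__":
    clock = FixedClock(datetime(2026, 1, 15, 8, 30, 0, tzinfo=timezone.utc))
    publisher = InMemoryMetricsPublisher()
    print(run_ingest_sketch(clock, publisher, [{"amount": 10}, {"amount": None}]))
```

Because the orchestration function depends only on the two protocols, a test can pass a fixed clock and an in-memory publisher and assert on deterministic output, while a production run would pass the real-time and CloudWatch adapters instead.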
ETL Pseudocode
For full algorithmic pseudocode (validation, S3 operations, helpers), see ETL Algorithmic Pseudocode.
See also
- ETL Flow & Pseudocode - Simplified, abstracted versions of these diagrams
- ETL Complete Reference - All ETL diagrams, pseudocode, and code
- ETL Pseudocode - Detailed pseudocode
- ETL Implementation Code - Complete implementation code