
ETL Pipeline Diagrams

© 2026 Stephen Adei. All rights reserved. All content on this site is the intellectual property of Stephen Adei. See License for terms of use and attribution.

This section contains all detailed diagrams for the ETL pipeline. For simplified, abstracted versions, see the main ETL Flow & Pseudocode document.

Note: These diagrams reflect the PySpark implementation, which is recommended for production. The Pandas implementation, intended for development and testing, stubs some features (the quarantine history check).


ETL Diagrams

High-Level Data Flow


Detailed Validation Flow

Note: This diagram reflects the PySpark implementation (production). In the Pandas version, the quarantine history check is stubbed.


S3 Storage Structure


Component Interaction


Error Handling & Resilience


Data Quality Metrics Flow


Code structure (module layout)

The implementation is modular so flow logic can be tested and adapters swapped (e.g. for local runs or tests).

| Layer / area | Purpose |
| --- | --- |
| Entry points | Pandas: ingest_transactions.py (CLI) → application/ingest_use_case.run_ingest_pandas. PySpark: ingest_transactions_spark.py orchestrates inline (no separate use-case layer). |
| Application | application/ingest_use_case.py – Pandas run orchestration (extracted from CLI for testing). Can be extended to accept injected adapters. |
| Ports | application/ports.py – Clock (e.g. utc_now() for run_id / ingest_date), MetricsPublisher (e.g. publish(metrics)). Used for testability and swapping implementations. |
| Adapters | adapters/cloudwatch.py – CloudWatch metrics; adapters/clock.py – real time; adapters/bedrock.py – GenAI (e.g. quarantine descriptions). |
| Shared modules | config.py (constants), paths.py (Silver/quarantine/condemned prefixes), run_context.py (run_id, ingest_date), partition_pruning.py (event date range), metrics.py (calculate + publish). |
| Engine-specific | Pandas: validator, loop_prevention, metadata, s3_operations. PySpark: validator_spark, loop_prevention_spark, metadata_spark, s3_operations_spark. Same validation and loop-prevention rules; only the execution engine differs. |
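
To make the shared-module roles more concrete, here is a minimal sketch of how a run context and the storage prefixes could fit together. Only the ideas of a run_id, an ingest_date, and Silver/quarantine/condemned prefixes come from this page. The names RunContext, build_run_context, and the *_prefix helpers, as well as the timestamp format and the key layout, are assumptions made for illustration and may differ from the real run_context.py and paths.py.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunContext:
    """Identifiers shared by every step of one ingest run (hypothetical shape)."""
    run_id: str       # assumed format: compact UTC timestamp
    ingest_date: str  # assumed format: ISO calendar date


def build_run_context(now: datetime) -> RunContext:
    """Derive both identifiers from one timezone-aware timestamp so they cannot disagree."""
    now = now.astimezone(timezone.utc)
    return RunContext(
        run_id=now.strftime("%Y%m%dT%H%M%SZ"),
        ingest_date=now.date().isoformat(),
    )


def _prefix(zone: str, ctx: RunContext) -> str:
    # Assumed key layout; the real prefixes live in paths.py.
    return f"{zone}/ingest_date={ctx.ingest_date}/run_id={ctx.run_id}/"


def silver_prefix(ctx: RunContext) -> str:
    return _prefix("silver", ctx)


def quarantine_prefix(ctx: RunContext) -> str:
    return _prefix("quarantine", ctx)


def condemned_prefix(ctx: RunContext) -> str:
    return _prefix("condemned", ctx)


if __name__ == "__main__":
    ctx = build_run_context(datetime.now(timezone.utc))
    print(silver_prefix(ctx))
    print(quarantine_prefix(ctx))
    print(condemned_prefix(ctx))
```

Keeping the prefix construction in one place means the Pandas and PySpark entry points can write to the same locations without duplicating string formatting.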

The diagrams and pseudocode in this reference describe the flow; the code realises that flow with this structure so orchestration, I/O, and time are separable for tests and environments.
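
The separation of orchestration, I/O, and time can be sketched as follows. The port names Clock and MetricsPublisher and their methods utc_now() and publish(metrics) are taken from the table above. Everything else (SystemClock, FixedClock, InMemoryMetricsPublisher, the run_ingest signature, and the placeholder validation rule) is hypothetical and only illustrates how injected adapters make a run deterministic in tests.

```python
from datetime import datetime, timezone
from typing import Protocol


class Clock(Protocol):
    def utc_now(self) -> datetime: ...


class MetricsPublisher(Protocol):
    def publish(self, metrics: dict) -> None: ...


class SystemClock:
    """Real-time adapter (hypothetical stand-in for adapters/clock.py)."""

    def utc_now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Test adapter that always returns the same instant."""

    def __init__(self, now: datetime) -> None:
        self._now = now

    def utc_now(self) -> datetime:
        return self._now


class InMemoryMetricsPublisher:
    """Test adapter that records metrics instead of sending them to CloudWatch."""

    def __init__(self) -> None:
        self.published: list[dict] = []

    def publish(self, metrics: dict) -> None:
        self.published.append(dict(metrics))


def run_ingest(rows: list[dict], clock: Clock, metrics_publisher: MetricsPublisher) -> dict:
    """Hypothetical orchestration: time and metrics arrive through the ports."""
    run_id = clock.utc_now().strftime("%Y%m%dT%H%M%SZ")  # assumed run_id format
    # Placeholder rule for illustration only; the real rules live in the validator modules.
    valid = [row for row in rows if row.get("amount") is not None]
    metrics = {
        "run_id": run_id,
        "rows_in": len(rows),
        "rows_valid": len(valid),
        "rows_quarantined": len(rows) - len(valid),
    }
    metrics_publisher.publish(metrics)
    return metrics


if __name__ == "__main__":
    publisher = InMemoryMetricsPublisher()
    result = run_ingest(
        [{"amount": 10}, {"amount": None}],
        clock=FixedClock(datetime(2026, 1, 15, tzinfo=timezone.utc)),
        metrics_publisher=publisher,
    )
    print(result)
```

Because run_ingest depends only on the two protocols, a production run could pass SystemClock and a CloudWatch-backed publisher, while a unit test passes the fixed clock and the in-memory publisher without touching AWS.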


ETL Pseudocode

For full algorithmic pseudocode (validation, S3 operations, helpers), see ETL Algorithmic Pseudocode.


© 2026 Stephen Adei. CC BY 4.0