Metrics Discovery and Executive Signal Extraction
Context: Principal Data Engineer / Analytics Architect — repo-wide metrics discovery for FinTech (auditability, reliability, clarity over volume).
Scope: Entire ohpen-case-2026 repository (ETL, SQL, Terraform, Step Functions, Glue, S3, CI/CD, validation, documentation).
Rule: Only metrics that are realistically derivable from the existing implementation (or trivial extensions that do not change architecture).
1) Full Metrics Inventory (No Filtering)
Grouped by category. For each metric: Name, Category, How computed (source of truth), Where it would live, Exists or trivial to add.
1) DATA VOLUME & FLOW
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Records ingested (input rows) | Data volume | ETL: len(raw_df) / raw_df.count(); written to _SUCCESS.metrics.input_rows | CloudWatch (InputRows), _SUCCESS JSON, Lambda read_run_summary | Exists |
| Records validated (valid rows) | Data volume | ETL: valid_df row count; metrics['valid_rows'] | CloudWatch (ValidRows), _SUCCESS | Exists |
| Records quarantined | Data volume | ETL: quarantine_df count; metrics['quarantined_rows'] | CloudWatch (QuarantinedRows), _SUCCESS, Lambda summary | Exists |
| Records condemned | Data volume | ETL: condemned_df count; metrics['condemned_rows'] | CloudWatch (CondemnedRows), _SUCCESS, Lambda summary | Exists |
| Records promoted | Data volume | After promotion gate: valid_rows for runs where promotion_allowed=true; Lambda promote_silver copies run output to current/ | Step Functions execution history + _SUCCESS; no aggregate metric today | Trivial (aggregate from executions) |
| Data freshness / latency | Data volume | Time from trigger (EventBridge event_time or schedule) to _SUCCESS.completed_at; or max(TransactionTimestamp) in Silver vs now | CloudWatch Logs / custom metric from Step Functions or Glue | Trivial (compute from _SUCCESS + trigger in state) |
| File counts (Bronze/Silver/Quarantine) | Data volume | S3 ListObjectsV2 by prefix (e.g. bronze/..., silver/.../run_id=..., quarantine/...) | Athena (external table list) or Lambda/Glue script | Trivial |
| Bytes read (input) | Data volume | ETL: read_csv_from_s3 returns bytes_read; metrics['bytes_read'] | CloudWatch (BytesRead), _SUCCESS when passed | Exists (pandas path; Spark path trivial) |
| Bytes written (output) | Data volume | ETL: sum of write sizes; metrics['bytes_written'] | CloudWatch (BytesWritten), _SUCCESS | Exists (pandas); Spark can add |
| Partitions written (Silver) | Data volume | Count of distinct year/month in valid_df or S3 prefix list under run_id | _SUCCESS (could add), or S3 list | Trivial |
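The row counters above all converge in the per-run _SUCCESS marker. As an illustrative sketch (the function name and exact payload shape are assumptions, not the repo's implementation), the marker can be assembled and reconciled like this:

```python
from datetime import datetime, timezone

def build_success_payload(run_id, counts, execution_arn=None, trigger=None):
    """Assemble the run-level _SUCCESS marker from per-stage row counts.

    `counts` carries the four row counters the inventory lists:
    input_rows, valid_rows, quarantined_rows, condemned_rows.
    """
    input_rows = counts["input_rows"]
    # Sanity check: every input row must land in exactly one bucket.
    accounted = (counts["valid_rows"] + counts["quarantined_rows"]
                 + counts["condemned_rows"])
    if accounted != input_rows:
        raise ValueError(f"row counts do not reconcile: {accounted} != {input_rows}")
    return {
        "run_id": run_id,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "execution_arn": execution_arn,
        "trigger": trigger,
        "metrics": dict(counts),
    }
```

The reconciliation check (input = valid + quarantined + condemned) is what makes the volume metrics auditable: a run whose counters do not add up should never publish a marker.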
2) DATA QUALITY & GOVERNANCE
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Validation pass count | Data quality | valid_rows (see above) | CloudWatch, _SUCCESS | Exists |
| Validation fail (quarantine) count | Data quality | quarantined_rows | CloudWatch, _SUCCESS | Exists |
| Quarantine rate (%) | Data quality | quarantined_rows / input_rows * 100 | CloudWatch (QuarantineRate), Lambda summary, alarm (5% threshold) | Exists |
| Error categories and frequencies | Data quality | quarantine_df/condemned_df validation_error value_counts → quarantine_by_reason, condemned_by_reason | _SUCCESS.metrics, CloudWatch Logs (ETLComplete), Lambda summary | Exists |
| Currency code violations | Data quality | Count where validation_error == ERROR_CURRENCY (CURRENCY_ERROR) | Derived from quarantine_by_reason / condemned_by_reason | Exists (in breakdown) |
| Null / range violations | Data quality | NULL_VALUE_ERROR, TYPE_ERROR (amount), TIMESTAMP_ERROR from validation | Same | Exists |
| Reprocessing attempts (per row) | Data quality | attempt_count in quarantine/condemned; avg_attempt_count in metrics | CloudWatch (AvgAttemptCount), _SUCCESS | Exists |
| Duplicate detection (row_hash) | Data quality | Condemned with ERROR_DUPLICATE_FAILURE; duplicate_detection_rate in metrics | CloudWatch (DuplicateDetectionRate), _SUCCESS | Exists |
| TransactionID duplicate (Silver) | Data quality | Condemned with ERROR_DUPLICATE_TRANSACTION_ID | Same breakdown | Exists |
| Auto-condemnation rate (%) | Data quality | condemned_rows / input_rows * 100 | CloudWatch (AutoCondemnationRate), _SUCCESS | Exists |
| Quarantine retry exhaustion | Data quality | Rows condemned with ERROR_MAX_ATTEMPTS (attempt_count >= MAX_ATTEMPTS) | condemned_by_reason | Exists |
| Schema version distribution | Data quality | S3 path contains schema_v=v1; Glue table params | S3 prefix listing, Glue Catalog | Trivial |
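The quality metrics above are all derivable from per-row validation_error labels. A minimal sketch (function name and return keys are illustrative; the 5% threshold mirrors the alarm named in the table):

```python
from collections import Counter

QUARANTINE_RATE_THRESHOLD = 5.0  # %, mirrors the CloudWatch alarm threshold

def quality_summary(input_rows, quarantine_errors, condemned_errors):
    """Derive headline quality metrics from per-row error labels.

    `quarantine_errors` / `condemned_errors` are the validation_error
    values attached to each quarantined / condemned row.
    """
    quarantined = len(quarantine_errors)
    condemned = len(condemned_errors)
    quarantine_rate = 100.0 * quarantined / input_rows if input_rows else 0.0
    condemnation_rate = 100.0 * condemned / input_rows if input_rows else 0.0
    return {
        "quarantine_rate": round(quarantine_rate, 2),
        "auto_condemnation_rate": round(condemnation_rate, 2),
        "quarantine_by_reason": dict(Counter(quarantine_errors)),
        "condemned_by_reason": dict(Counter(condemned_errors)),
        "breaches_threshold": quarantine_rate > QUARANTINE_RATE_THRESHOLD,
    }
```

The value_counts-style breakdown (`Counter`) is what feeds quarantine_by_reason / condemned_by_reason in the _SUCCESS metrics.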
3) PIPELINE HEALTH & RELIABILITY
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Pipeline success/failure rate | Pipeline health | Step Functions execution status (SUCCEEDED / FAILED); Glue job run status; GetExecutionHistory / ListExecutions; GetJobRuns | CloudWatch Events, SNS, Step Functions logs | Exists (events + SNS on failure) |
| Retry counts (Glue) | Pipeline health | Terraform Retry (Glue.ThrottlingException, etc.) → Glue job run attempt count | Glue GetJobRun | Trivial (Glue API) |
| DLQ volume | Pipeline health | SQS ohpen-etl-failures-dlq ApproximateNumberOfMessages | SQS API / CloudWatch SQS metrics | Exists (SQS metrics) |
| Quarantine retry exhaustion count | Pipeline health | Count condemned with MAX_ATTEMPTS | _SUCCESS condemned_by_reason | Exists |
| Circuit breaker triggered | Pipeline health | Glue job fails with RuntimeError (circuit breaker); logged and state FAILED | CloudWatch Logs, Step Functions Catch → HandleFailure → SNS | Exists |
| Mean time to failure / recovery | Pipeline health | Time between last SUCCEEDED and FAILED; time from FAILED to next SUCCEEDED | Step Functions execution timestamps | Trivial (query history) |
| Backfill vs normal runs | Pipeline health | Trigger source in _SUCCESS.trigger (s3-eventbridge vs schedule) | _SUCCESS | Exists |
| Promotion gate pass/fail | Pipeline health | Lambda read_run_summary returns promotion_allowed; Step Functions CheckValidationGate | Step Functions state (ValidationGateFailed vs PromoteSilver) | Exists |
4) PERFORMANCE & SCALE
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Glue job duration | Performance | end_time - start_time in ETL; Glue GetJobRun CompletedOn - StartedOn | CloudWatch (DurationSeconds), _SUCCESS, Glue API | Exists |
| Records/sec throughput | Performance | input_rows / duration_seconds | Derived from CloudWatch or _SUCCESS | Trivial |
| Partition skew | Performance | Row count or file size per year/month from Silver listing | Athena or Glue partition stats | Trivial |
| Small-file ratio | Performance | Count of Parquet files under threshold size vs total (e.g. <1MB) | S3 list + size | Trivial |
| Athena scan volume (estimated) | Performance | Athena workgroup publish_cloudwatch_metrics_enabled → CloudWatch; per-query data scanned | Athena CloudWatch metrics / query result metadata | Exists (workgroup config) |
| Job resource utilization | Performance | Glue DPU / worker type (G.1X, number_of_workers) in Terraform; no per-run CPU % in repo | Glue console / CloudWatch Glue metrics | Trivial (Glue native) |
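Throughput and small-file ratio are pure derivations from metrics already listed. A sketch, assuming the 1 MB small-file threshold given in the table and object sizes taken from an S3 ListObjectsV2 pass:

```python
SMALL_FILE_BYTES = 1_000_000  # assumed 1 MB threshold from the table above

def throughput_rows_per_sec(input_rows, duration_seconds):
    """Records/sec from the InputRows and DurationSeconds metrics."""
    return round(input_rows / duration_seconds, 2) if duration_seconds else 0.0

def small_file_ratio(object_sizes):
    """Share of Parquet objects under the small-file threshold.

    `object_sizes` is the Size field of each object under the
    Silver prefix.
    """
    if not object_sizes:
        return 0.0
    small = sum(1 for s in object_sizes if s < SMALL_FILE_BYTES)
    return round(small / len(object_sizes), 3)
```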
5) COST & EFFICIENCY (ESTIMATED / DERIVED)
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Data scanned per query | Cost | Athena bytes scanned per query (workgroup) | Athena CloudWatch / query results | Exists |
| Storage growth by layer | Cost | S3 bucket/prefix size (ListObjects + Size sum or Storage Lens) | S3 API, Cost Explorer, Storage Lens | Trivial |
| Job runtime cost drivers | Cost | Glue duration × DPU × price; doc COST_ANALYSIS | Derived from duration + Terraform worker config | Trivial |
| Reprocessing overhead | Cost | Count of runs with same input_key or trigger; condemned + quarantine volume | _SUCCESS trigger + metrics | Trivial |
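The "job runtime cost drivers" row reduces to duration x DPU x price. A sketch; the hourly price is an assumption to be confirmed against current regional pricing, and worker type/count come from the Terraform job definition:

```python
GLUE_DPU_HOUR_USD = 0.44  # assumed list price; confirm for your region

WORKER_DPU = {"G.1X": 1, "G.2X": 2}  # DPU per worker by Glue worker type

def glue_run_cost(duration_seconds, worker_type, number_of_workers):
    """Estimated cost of one Glue run: duration x total DPU x hourly price."""
    dpus = WORKER_DPU[worker_type] * number_of_workers
    hours = duration_seconds / 3600.0
    return round(dpus * hours * GLUE_DPU_HOUR_USD, 4)
```

For example, a 30-minute run on 10 G.1X workers is 10 DPU for 0.5 hours.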
6) AUDITABILITY & TRACEABILITY
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Executions per day | Audit | Step Functions ListExecutions by date | Step Functions API, CloudWatch Logs | Trivial |
| Distinct source files processed | Audit | _SUCCESS.trigger.key or input_key per run; distinct count | Athena over _SUCCESS markers or log query | Trivial |
| Runs per schema version | Audit | S3 prefix count schema_v=v1 vs v2 | S3 list | Trivial |
| Time-to-detect vs time-to-correct | Audit | Quarantine event time vs next run success / human resolution (no automated resolution tracking in repo) | Manual / runbooks | Partial (detect from logs; correct not automated) |
| Execution ARN / run_id coverage | Audit | _SUCCESS.execution_arn, run_id in every run | _SUCCESS, CloudWatch dimensions | Exists |
| Promotion decisions (allow/block) | Audit | Lambda read_run_summary response; Step Functions state | Step Functions history, Lambda logs | Exists |
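"Distinct source files processed" falls out of the _SUCCESS trigger payloads. A sketch, assuming each marker is the parsed _SUCCESS JSON with the input key at trigger.key as the inventory states:

```python
def distinct_source_files(success_markers):
    """Distinct input files across runs, from _SUCCESS trigger payloads."""
    keys = set()
    for marker in success_markers:
        trigger = marker.get("trigger") or {}
        key = trigger.get("key")
        if key:
            keys.add(key)
    return sorted(keys)
```

The same pass over markers also yields executions per day and runs per schema version, which is why an Athena external table over the markers is listed as a trivial extension.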
2) Stakeholder Relevance Filter
Classification by audience and why it matters in plain language.
| Metric(s) | Audience | Why it matters |
|---|---|---|
| Quarantine rate; Condemned rows; Promotion gate (pass/fail) | Executive / Management | “What share of data is invalid and is bad data blocked from reporting?” Direct signal of control and risk. |
| Pipeline success/failure rate; Data freshness; Records promoted | Executive / Management | “Does the pipeline run reliably and is reporting data up to date?” Trust in delivery. |
| Records ingested; Records validated; Records quarantined/condemned | Operations / Support | “How much was processed and how much went to quarantine?” Operational capacity and issue volume. |
| Error categories (quarantine_by_reason); Circuit breaker triggered; DLQ volume | Operations / Support | “Why did rows fail and is intervention needed?” Triage and prioritization. |
| Pipeline success/failure; Retry counts; Duration; Promotion gate | Operations / Support | “Did the job succeed, how long did it take, and was promotion applied?” Day-to-day health. |
| Quarantine rate; Condemned rows; Auto-condemnation; Duplicate detection | Risk / Compliance | “How much bad data is caught and removed from the main dataset?” Audit and control. |
| Executions per day; Distinct source files; Execution ARN / run_id; Promotion decisions | Risk / Compliance | “Can each run be traced and what was approved?” Audit trail. |
| All of the above + bytes, throughput, Athena scan, storage growth, cost drivers | Data / Engineering | “How is the platform tuned, scaled, and cost-managed?” Technical ownership. |
3) Top 8 Metrics Selection (Critical)
Exactly eight metrics that are understandable to non-technical stakeholders, reflect control and reliability, are defensible in an interview, and align with business-case goals (FinTech: auditability, reliability, clarity).
| # | Metric name | One-line business explanation | Why this over others | Business-case alignment | AWS / component |
|---|---|---|---|---|---|
| 1 | Quarantine rate (%) | Share of rows held back for review instead of going to reporting. | Single number that summarizes “how clean is this run?”; threshold (e.g. 5%) is already in alarms and promotion gate. | Data quality and control over what reaches reporting. | ETL → CloudWatch (QuarantineRate); Lambda read_run_summary; alarm QuarantineRateQualityGate. |
| 2 | Condemned rows (count) | Rows permanently excluded (e.g. duplicates, too many retries). | Shows irreversible loss of rows and duplicate/retry control; <100 is gate for promotion. | Risk and audit: what is deliberately excluded. | ETL → CloudWatch (CondemnedRows); _SUCCESS; Lambda summary; promotion gate. |
| 3 | Pipeline success rate | Percentage of runs that complete successfully. | Direct reliability signal; failures already trigger SNS. | Reliability and operational stability. | Step Functions execution status; CloudWatch Events; SNS on failure. |
| 4 | Promotion gate (pass/fail) | Whether a run was allowed to update “current” reporting data. | Binary control: bad runs never reach reporting. | Control and governance. | Lambda read_run_summary (promotion_allowed); Step Functions CheckValidationGate. |
| 5 | Data freshness | How old the latest data in the reporting layer is. | Answers “is month-end data ready when needed?” | Timeliness and trust in reporting. | Derived from _SUCCESS.completed_at and trigger time; or max event time in Silver. |
| 6 | Records validated (promoted) | Number of rows that passed checks and were promoted to reporting. | Business value: volume of data trusted for reporting. | Value and scale of trusted data. | valid_rows for runs where promotion_allowed; Lambda promote_silver. |
| 7 | Error breakdown (top reasons) | Main reasons rows were quarantined or condemned. | Explains why data was held back; supports remediation. | Transparency and remediation. | _SUCCESS quarantine_by_reason / condemned_by_reason; Lambda summary. |
| 8 | Run traceability (execution ARN / run_id) | Every run can be traced to a unique execution and logs. | FinTech audit requirement: “evidence of what ran and when.” | Auditability and compliance. | _SUCCESS.execution_arn, run_id; Step Functions; CloudWatch dimensions. |
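Metrics 1, 2, and 4 above combine into the promotion gate. As a sketch of how read_run_summary might compute the decision (the function and threshold constant names are illustrative; the 5% and 100-row limits are the thresholds stated in the table):

```python
QUARANTINE_RATE_MAX = 5.0   # %, threshold named in metrics 1 and 4 above
CONDEMNED_ROWS_MAX = 100    # cap named in metric 2 above

def promotion_allowed(metrics):
    """Gate decision over a run's quality metrics.

    Returns (allowed, reasons); the reasons list is empty when the run
    may update the current/ reporting prefix.
    """
    reasons = []
    if metrics["quarantine_rate"] > QUARANTINE_RATE_MAX:
        reasons.append(
            f"quarantine_rate {metrics['quarantine_rate']}% exceeds {QUARANTINE_RATE_MAX}%")
    if metrics["condemned_rows"] >= CONDEMNED_ROWS_MAX:
        reasons.append(
            f"condemned_rows {metrics['condemned_rows']} at or above {CONDEMNED_ROWS_MAX}")
    if metrics.get("has_critical_errors"):
        reasons.append("critical validation errors present")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean keeps the decision auditable: the Step Functions state output records not just that a run was blocked, but why.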
4) Stakeholder Email Draft
To: Product, Operations, Risk, Management
From: Data Platform Team
Subject: Financial data pipeline — month-end run summary and key metrics
Word count: ~195
This summary covers the latest run of the financial data pipeline and the metrics used to manage quality and risk.
What the pipeline does
Raw transaction files (CSV) are ingested into a secure, immutable layer (Bronze). Each run validates every row. Rows that pass are written to the reporting-ready layer (Silver); rows that fail are quarantined with clear error reasons. Rows that are duplicates or exceed retry limits are condemned and never used for reporting. Only runs that meet the defined quality thresholds are allowed to update the “current” dataset used for month-end reporting.
How control and reliability are measured
Eight headline metrics are tracked:
- Quarantine rate — Share of rows held back for review (kept low and stable by design).
- Condemned rows — Rows permanently excluded (duplicates and retry exhaustion); promotion is capped when this is high.
- Pipeline success rate — Proportion of runs that complete successfully.
- Promotion gate — Whether a run was allowed to update reporting data (runs that exceed thresholds are blocked).
- Data freshness — How up to date the data in the reporting layer is.
- Records validated and promoted — Number of rows that passed checks and were promoted to reporting.
- Error breakdown — Main reasons rows were quarantined or condemned, to support remediation of upstream issues.
- Run traceability — Every run is tied to a unique execution ID and full logs for audit.
This month
[Insert 1–2 sentences with actual numbers, e.g. quarantine rate X%, Y runs succeeded, Z records promoted, and any notable error reason or action taken.]
These metrics will continue to be shared in regular updates. For more detail, the technical appendix lists all available metrics and where they are produced. If you have questions, contact the Data Platform Team.
Satisfies Business Case §5 – Communication and Documentation.
5) Technical Appendix Outline
High-level outline only (no full text). Purpose: list all other metrics, how they are computed, where they are stored/queryable, and where to look when deeper questions arise.
- Introduction
  - Purpose of the appendix (reference for auditors, engineering, and follow-up questions).
  - How it relates to the stakeholder email and the eight headline metrics.
- Data volume and flow
  - Input/output row counts: source (ETL metrics, _SUCCESS), CloudWatch names (InputRows, ValidRows, QuarantinedRows, CondemnedRows).
  - Bytes read/written: where set (pandas/Spark), CloudWatch (BytesRead, BytesWritten).
  - File and partition counts: S3 layout (Bronze, Silver, Quarantine), how to derive (list prefixes, _SUCCESS).
  - Data freshness / latency: definition, how to compute from _SUCCESS and trigger payload.
- Data quality and governance
  - Validation outcomes: valid vs quarantine vs condemned; quarantine_rate and CloudWatch alarm (5%).
  - Error categories: quarantine_by_reason, condemned_by_reason; mapping to ERROR_* constants (config.py).
  - Attempt counts and auto-condemnation: avg_attempt_count, auto_condemnation_rate, duplicate_detection_rate; MAX_ATTEMPTS and duplicate logic (loop_prevention, validator).
  - Schema version: where it appears (path schema_v=), Glue table parameters.
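The schema-version distribution named above can be derived from the path token alone. An illustrative sketch (function name assumed), counting Silver objects per schema_v= value:

```python
import re

def schema_version_counts(keys):
    """Count objects per schema version from the schema_v= path token."""
    counts = {}
    for key in keys:
        m = re.search(r"schema_v=([^/]+)", key)
        if m:
            v = m.group(1)
            counts[v] = counts.get(v, 0) + 1
    return counts
```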
- Pipeline health and reliability
  - Success/failure: Step Functions execution status, Glue job run status; where to query (ListExecutions, GetJobRuns).
  - Retries: Glue Retry (Terraform), SQS DLQ (ohpen-etl-failures-dlq), maxReceiveCount.
  - Circuit breaker: condition (CIRCUIT_BREAKER_THRESHOLD), how it fails the job, where it appears (logs, SNS).
  - Promotion gate: Lambda read_run_summary logic (quarantine_rate, condemned_rows, has_critical_errors); Step Functions state names.
- Performance and scale
  - ETL duration: CloudWatch DurationSeconds, Glue GetJobRun.
  - Throughput: records/sec from input_rows and duration.
  - Athena: workgroup (ohpen-analytics), publish_cloudwatch_metrics_enabled; where scan volume appears.
  - Partition skew and small-file ratio: how to derive (S3 list, partition layout in ARCHITECTURE.md).
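One way to quantify the partition skew mentioned above (the metric definition here is an assumption, not from the repo): the largest partition's row count over the mean, computed from per-partition counts obtained via Athena GROUP BY or an S3 listing:

```python
def partition_skew(partition_counts):
    """Skew factor: max partition row count over the mean.

    `partition_counts` maps a year/month partition to its row count.
    A value near 1.0 means even partitions; large values flag hot spots.
    """
    if not partition_counts:
        return 0.0
    counts = list(partition_counts.values())
    mean = sum(counts) / len(counts)
    return round(max(counts) / mean, 2) if mean else 0.0
```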
- Cost and efficiency (derived)
  - Glue cost drivers: worker type, DPU, duration (Terraform, COST_ANALYSIS.md).
  - S3 storage: buckets and prefixes; lifecycle (Silver, Gold, Quarantine) per main.tf.
  - Athena: cost per TB scanned; recommendation to use partition pruning.
- Auditability and traceability
  - Identifiers: execution_arn, run_id, Glue JobRunId (TRACEABILITY_DESIGN.md).
  - Where they are stored: _SUCCESS, CloudWatch dimensions, Step Functions state, SNS payload.
  - Executions per day and distinct source files: how to compute from Step Functions history and _SUCCESS.
- Where metrics are stored and queryable
  - CloudWatch: namespace Ohpen/ETL; metric names and dimensions (RunId, ExecutionArn when passed).
  - S3: _SUCCESS JSON schema (run_id, completed_at, metrics, execution_arn, trigger).
  - Step Functions: execution history, state output (validation_summary, promotion_result).
  - Lambda: read_run_summary return payload; CloudWatch Logs for both Lambdas.
  - Athena: Glue Catalog (silver_transactions), workgroup settings; optional external table over _SUCCESS markers for run-level analytics.
  - CI/CD: _STAGING.json (sha, build_time, ci_run_id, validation); test_report.json, coverage.json (TEST_METRICS_GUIDE.md).
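Since _SUCCESS carries completed_at, the data-freshness headline metric can be read straight off the latest marker. A sketch, assuming completed_at is an ISO 8601 timestamp as listed above:

```python
from datetime import datetime, timezone

def data_freshness_seconds(success_marker, now=None):
    """Freshness: seconds since the last successful run completed."""
    now = now or datetime.now(timezone.utc)
    completed = datetime.fromisoformat(success_marker["completed_at"])
    return (now - completed).total_seconds()
```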
- References
  - Key repo paths: tasks/data_ingestion_transformation/src/etl (metrics.py, ingest_transactions*.py, validator*.py, loop_prevention*.py, s3_operations*.py), tasks/devops_cicd/infra/terraform/main.tf, Lambda read_run_summary and promote_silver.
  - Docs: TRACEABILITY_DESIGN.md, COST_ANALYSIS.md, TEST_METRICS_GUIDE.md, ARCHITECTURE.md, AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS.md.