Metrics Discovery and Executive Signal Extraction
Context: Principal Data Engineer / Analytics Architect — repo-wide metrics discovery for FinTech (auditability, reliability, clarity over volume).
Scope: Entire ohpen-case-2026 repository (ETL, SQL, Terraform, Step Functions, Glue, S3, CI/CD, validation, documentation).
Rule: Only metrics that are realistically derivable from the existing implementation (or trivial extensions that do not change architecture).
1) Full Metrics Inventory (No Filtering)
Grouped by category. For each metric: Name, Category, How computed (source of truth), Where it would live, Exists or trivial to add.
1) DATA VOLUME & FLOW
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Records ingested (input rows) | Data volume | ETL: len(raw_df) / raw_df.count(); written to _SUCCESS.metrics.input_rows | CloudWatch (InputRows), _SUCCESS JSON, Lambda read_run_summary | Exists |
| Records validated (valid rows) | Data volume | ETL: valid_df row count; metrics['valid_rows'] | CloudWatch (ValidRows), _SUCCESS | Exists |
| Records quarantined | Data volume | ETL: quarantine_df count; metrics['quarantined_rows'] | CloudWatch (QuarantinedRows), _SUCCESS, Lambda summary | Exists |
| Records condemned | Data volume | ETL: condemned_df count; metrics['condemned_rows'] | CloudWatch (CondemnedRows), _SUCCESS, Lambda summary | Exists |
| Records promoted | Data volume | After promotion gate: valid_rows for runs where promotion_allowed=true; Lambda promote_silver copies run output to current/ | Step Functions execution history + _SUCCESS; no aggregate metric today | Trivial (aggregate from executions) |
| Data freshness / latency | Data volume | Time from trigger (EventBridge event_time or schedule) to _SUCCESS.completed_at; or max(TransactionTimestamp) in Silver vs now | CloudWatch Logs / custom metric from Step Functions or Glue | Trivial (compute from _SUCCESS + trigger in state) |
| File counts (Bronze/Silver/Quarantine) | Data volume | S3 ListObjectsV2 by prefix (e.g. bronze/..., silver/.../run_id=..., quarantine/...) | Athena (external table list) or Lambda/Glue script | Trivial |
| Bytes read (input) | Data volume | ETL: read_csv_from_s3 returns bytes_read; metrics['bytes_read'] | CloudWatch (BytesRead), _SUCCESS when passed | Exists (pandas path; Spark path trivial) |
| Bytes written (output) | Data volume | ETL: sum of write sizes; metrics['bytes_written'] | CloudWatch (BytesWritten), _SUCCESS | Exists (pandas); Spark can add |
| Partitions written (Silver) | Data volume | Count of distinct year/month in valid_df or S3 prefix list under run_id | _SUCCESS (could add), or S3 list | Trivial |
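The row counters above all converge in the per-run _SUCCESS marker. As an illustrative sketch (the function name and exact payload shape are assumptions, not the repo's implementation), the marker can be assembled and reconciled like this:

```python
from datetime import datetime, timezone

def build_success_payload(run_id, counts, execution_arn=None, trigger=None):
    """Assemble the run-level _SUCCESS marker from per-stage row counts.

    `counts` carries the four row counters the inventory lists:
    input_rows, valid_rows, quarantined_rows, condemned_rows.
    """
    input_rows = counts["input_rows"]
    # Sanity check: every input row must land in exactly one bucket.
    accounted = (counts["valid_rows"] + counts["quarantined_rows"]
                 + counts["condemned_rows"])
    if accounted != input_rows:
        raise ValueError(f"row counts do not reconcile: {accounted} != {input_rows}")
    return {
        "run_id": run_id,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "execution_arn": execution_arn,
        "trigger": trigger,
        "metrics": dict(counts),
    }
```

The reconciliation check (input = valid + quarantined + condemned) is what makes the volume metrics auditable: a run whose counters do not add up should never publish a marker.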
2) DATA QUALITY & GOVERNANCE
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Validation pass count | Data quality | valid_rows (see above) | CloudWatch, _SUCCESS | Exists |
| Validation fail (quarantine) count | Data quality | quarantined_rows | CloudWatch, _SUCCESS | Exists |
| Quarantine rate (%) | Data quality | quarantined_rows / input_rows * 100 | CloudWatch (QuarantineRate), Lambda summary, alarm (5% threshold) | Exists |
| Error categories and frequencies | Data quality | quarantine_df/condemned_df validation_error value_counts → quarantine_by_reason, condemned_by_reason | _SUCCESS.metrics, CloudWatch Logs (ETLComplete), Lambda summary | Exists |
| Currency code violations | Data quality | Count where validation_error == ERROR_CURRENCY (CURRENCY_ERROR) | Derived from quarantine_by_reason / condemned_by_reason | Exists (in breakdown) |
| Null / range violations | Data quality | NULL_VALUE_ERROR, TYPE_ERROR (amount), TIMESTAMP_ERROR from validation | Same | Exists |
| Reprocessing attempts (per row) | Data quality | attempt_count in quarantine/condemned; avg_attempt_count in metrics | CloudWatch (AvgAttemptCount), _SUCCESS | Exists |
| Duplicate detection (row_hash) | Data quality | Condemned with ERROR_DUPLICATE_FAILURE; duplicate_detection_rate in metrics | CloudWatch (DuplicateDetectionRate), _SUCCESS | Exists |
| TransactionID duplicate (Silver) | Data quality | Condemned with ERROR_DUPLICATE_TRANSACTION_ID | Same breakdown | Exists |
| Auto-condemnation rate (%) | Data quality | condemned_rows / input_rows * 100 | CloudWatch (AutoCondemnationRate), _SUCCESS | Exists |
| Quarantine retry exhaustion | Data quality | Rows condemned with ERROR_MAX_ATTEMPTS (attempt_count >= MAX_ATTEMPTS) | condemned_by_reason | Exists |
| Schema version distribution | Data quality | S3 path contains schema_v=v1; Glue table params | S3 prefix listing, Glue Catalog | Trivial |
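The quality metrics above are all derivable from per-row validation_error labels. A minimal sketch (function name and return keys are illustrative; the 5% threshold mirrors the alarm named in the table):

```python
from collections import Counter

QUARANTINE_RATE_THRESHOLD = 5.0  # %, mirrors the CloudWatch alarm threshold

def quality_summary(input_rows, quarantine_errors, condemned_errors):
    """Derive headline quality metrics from per-row error labels.

    `quarantine_errors` / `condemned_errors` are the validation_error
    values attached to each quarantined / condemned row.
    """
    quarantined = len(quarantine_errors)
    condemned = len(condemned_errors)
    quarantine_rate = 100.0 * quarantined / input_rows if input_rows else 0.0
    condemnation_rate = 100.0 * condemned / input_rows if input_rows else 0.0
    return {
        "quarantine_rate": round(quarantine_rate, 2),
        "auto_condemnation_rate": round(condemnation_rate, 2),
        "quarantine_by_reason": dict(Counter(quarantine_errors)),
        "condemned_by_reason": dict(Counter(condemned_errors)),
        "breaches_threshold": quarantine_rate > QUARANTINE_RATE_THRESHOLD,
    }
```

The value_counts-style breakdown (`Counter`) is what feeds quarantine_by_reason / condemned_by_reason in the _SUCCESS metrics.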
3) PIPELINE HEALTH & RELIABILITY
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Pipeline success/failure rate | Pipeline health | Step Functions execution status (SUCCEEDED / FAILED); Glue job run status; GetExecutionHistory / ListExecutions; GetJobRuns | CloudWatch Events, SNS, Step Functions logs | Exists (events + SNS on failure) |
| Retry counts (Glue) | Pipeline health | Terraform Retry (Glue.ThrottlingException, etc.) → Glue job run attempt count | Glue GetJobRun | Trivial (Glue API) |
| DLQ volume | Pipeline health | SQS ohpen-etl-failures-dlq ApproximateNumberOfMessages | SQS API / CloudWatch SQS metrics | Exists (SQS metrics) |
| Quarantine retry exhaustion count | Pipeline health | Count condemned with MAX_ATTEMPTS | _SUCCESS condemned_by_reason | Exists |
| Circuit breaker triggered | Pipeline health | Glue job fails with RuntimeError (circuit breaker); logged and state FAILED | CloudWatch Logs, Step Functions Catch → HandleFailure → SNS | Exists |
| Mean time to failure / recovery | Pipeline health | Time between last SUCCEEDED and FAILED; time from FAILED to next SUCCEEDED | Step Functions execution timestamps | Trivial (query history) |
| Backfill vs normal runs | Pipeline health | Trigger source in _SUCCESS.trigger (s3-eventbridge vs schedule) | _SUCCESS | Exists |
| Promotion gate pass/fail | Pipeline health | Lambda read_run_summary returns promotion_allowed; Step Functions CheckValidationGate | Step Functions state (ValidationGateFailed vs PromoteSilver) | Exists |
4) PERFORMANCE & SCALE
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Glue job duration | Performance | end_time - start_time in ETL; Glue GetJobRun CompletedOn - StartedOn | CloudWatch (DurationSeconds), _SUCCESS, Glue API | Exists |
| Records/sec throughput | Performance | input_rows / duration_seconds | Derived from CloudWatch or _SUCCESS | Trivial |
| Partition skew | Performance | Row count or file size per year/month from Silver listing | Athena or Glue partition stats | Trivial |
| Small-file ratio | Performance | Count of Parquet files under threshold size vs total (e.g. <1MB) | S3 list + size | Trivial |
| Athena scan volume (estimated) | Performance | Athena workgroup publish_cloudwatch_metrics_enabled → CloudWatch; per-query data scanned | Athena CloudWatch metrics / query result metadata | Exists (workgroup config) |
| Job resource utilization | Performance | Glue DPU / worker type (G.1X, number_of_workers) in Terraform; no per-run CPU % in repo | Glue console / CloudWatch Glue metrics | Trivial (Glue native) |
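Throughput and small-file ratio are pure derivations from metrics already listed. A sketch, assuming the 1 MB small-file threshold given in the table and object sizes taken from an S3 ListObjectsV2 pass:

```python
SMALL_FILE_BYTES = 1_000_000  # assumed 1 MB threshold from the table above

def throughput_rows_per_sec(input_rows, duration_seconds):
    """Records/sec from the InputRows and DurationSeconds metrics."""
    return round(input_rows / duration_seconds, 2) if duration_seconds else 0.0

def small_file_ratio(object_sizes):
    """Share of Parquet objects under the small-file threshold.

    `object_sizes` is the Size field of each object under the
    Silver prefix.
    """
    if not object_sizes:
        return 0.0
    small = sum(1 for s in object_sizes if s < SMALL_FILE_BYTES)
    return round(small / len(object_sizes), 3)
```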
5) COST & EFFICIENCY (ESTIMATED / DERIVED)
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Data scanned per query | Cost | Athena bytes scanned per query (workgroup) | Athena CloudWatch / query results | Exists |
| Storage growth by layer | Cost | S3 bucket/prefix size (ListObjects + Size sum or Storage Lens) | S3 API, Cost Explorer, Storage Lens | Trivial |
| Job runtime cost drivers | Cost | Glue duration × DPU × price; doc COST_ANALYSIS | Derived from duration + Terraform worker config | Trivial |
| Reprocessing overhead | Cost | Count of runs with same input_key or trigger; condemned + quarantine volume | _SUCCESS trigger + metrics | Trivial |
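The "job runtime cost drivers" row reduces to duration x DPU x price. A sketch; the hourly price is an assumption to be confirmed against current regional pricing, and worker type/count come from the Terraform job definition:

```python
GLUE_DPU_HOUR_USD = 0.44  # assumed list price; confirm for your region

WORKER_DPU = {"G.1X": 1, "G.2X": 2}  # DPU per worker by Glue worker type

def glue_run_cost(duration_seconds, worker_type, number_of_workers):
    """Estimated cost of one Glue run: duration x total DPU x hourly price."""
    dpus = WORKER_DPU[worker_type] * number_of_workers
    hours = duration_seconds / 3600.0
    return round(dpus * hours * GLUE_DPU_HOUR_USD, 4)
```

For example, a 30-minute run on 10 G.1X workers is 10 DPU for 0.5 hours.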
6) AUDITABILITY & TRACEABILITY
| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Executions per day | Audit | Step Functions ListExecutions by date | Step Functions API, CloudWatch Logs | Trivial |
| Distinct source files processed | Audit | _SUCCESS.trigger.key or input_key per run; distinct count | Athena over _SUCCESS markers or log query | Trivial |
| Runs per schema version | Audit | S3 prefix count schema_v=v1 vs v2 | S3 list | Trivial |
| Time-to-detect vs time-to-correct | Audit | Quarantine event time vs next run success / human resolution (no automated resolution tracking in repo) | Manual / runbooks | Partial (detect from logs; correct not automated) |
| Execution ARN / run_id coverage | Audit | _SUCCESS.execution_arn, run_id in every run | _SUCCESS, CloudWatch dimensions | Exists |
| Promotion decisions (allow/block) | Audit | Lambda read_run_summary response; Step Functions state | Step Functions history, Lambda logs | Exists |
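"Distinct source files processed" falls out of the _SUCCESS trigger payloads. A sketch, assuming each marker is the parsed _SUCCESS JSON with the input key at trigger.key as the inventory states:

```python
def distinct_source_files(success_markers):
    """Distinct input files across runs, from _SUCCESS trigger payloads."""
    keys = set()
    for marker in success_markers:
        trigger = marker.get("trigger") or {}
        key = trigger.get("key")
        if key:
            keys.add(key)
    return sorted(keys)
```

The same pass over markers also yields executions per day and runs per schema version, which is why an Athena external table over the markers is listed as a trivial extension.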
2) Stakeholder Relevance Filter
Classification by audience and why it matters in plain language.
| Metric(s) | Audience | Why it matters |
|---|---|---|
| Quarantine rate; Condemned rows; Promotion gate (pass/fail) | Executive / Management | “What share of data is invalid and is bad data blocked from reporting?” Direct signal of control and risk. |
| Pipeline success/failure rate; Data freshness; Records promoted | Executive / Management | “Does the pipeline run reliably and is reporting data up to date?” Trust in delivery. |
| Records ingested; Records validated; Records quarantined/condemned | Operations / Support | “How much was processed and how much went to quarantine?” Operational capacity and issue volume. |
| Error categories (quarantine_by_reason); Circuit breaker triggered; DLQ volume | Operations / Support | “Why did rows fail and is intervention needed?” Triage and prioritization. |
| Pipeline success/failure; Retry counts; Duration; Promotion gate | Operations / Support | “Did the job succeed, how long did it take, and was promotion applied?” Day-to-day health. |
| Quarantine rate; Condemned rows; Auto-condemnation; Duplicate detection | Risk / Compliance | “How much bad data is caught and removed from the main dataset?” Audit and control. |
| Executions per day; Distinct source files; Execution ARN / run_id; Promotion decisions | Risk / Compliance | “Can each run be traced and what was approved?” Audit trail. |
| All of the above + bytes, throughput, Athena scan, storage growth, cost drivers | Data / Engineering | “How is the platform tuned, scaled, and cost-managed?” Technical ownership. |
3) Top 8 Metrics Selection (Critical)
Exactly eight metrics that are understandable to non-technical stakeholders, reflect control and reliability, are defensible in an interview, and align with business-case goals (FinTech: auditability, reliability, clarity).
| # | Metric name | One-line business explanation | Why this over others | Business-case alignment | AWS / component |
|---|---|---|---|---|---|
| 1 | Quarantine rate (%) | Share of rows held back for review instead of going to reporting. | Single number that summarizes “how clean is this run?”; threshold (e.g. 5%) is already in alarms and promotion gate. | Data quality and control over what reaches reporting. | ETL → CloudWatch (QuarantineRate); Lambda read_run_summary; alarm QuarantineRateQualityGate. |
| 2 | Condemned rows (count) | Rows permanently excluded (e.g. duplicates, too many retries). | Shows irreversible loss of rows and duplicate/retry control; <100 is gate for promotion. | Risk and audit: what is deliberately excluded. | ETL → CloudWatch (CondemnedRows); _SUCCESS; Lambda summary; promotion gate. |
| 3 | Pipeline success rate | Percentage of runs that complete successfully. | Direct reliability signal; failures already trigger SNS. | Reliability and operational stability. | Step Functions execution status; CloudWatch Events; SNS on failure. |
| 4 | Promotion gate (pass/fail) | Whether a run was allowed to update “current” reporting data. | Binary control: bad runs never reach reporting. | Control and governance. | Lambda read_run_summary (promotion_allowed); Step Functions CheckValidationGate. |
| 5 | Data freshness | How old the latest data in the reporting layer is. | Answers “is month-end data ready when needed?” | Timeliness and trust in reporting. | Derived from _SUCCESS.completed_at and trigger time; or max event time in Silver. |
| 6 | Records validated (promoted) | Number of rows that passed checks and were promoted to reporting. | Business value: volume of data trusted for reporting. | Value and scale of trusted data. | valid_rows for runs where promotion_allowed; Lambda promote_silver. |
| 7 | Error breakdown (top reasons) | Main reasons rows were quarantined or condemned. | Explains why data was held back; supports remediation. | Transparency and remediation. | _SUCCESS quarantine_by_reason / condemned_by_reason; Lambda summary. |
| 8 | Run traceability (execution ARN / run_id) | Every run can be traced to a unique execution and logs. | FinTech audit requirement: “evidence of what ran and when.” | Auditability and compliance. | _SUCCESS.execution_arn, run_id; Step Functions; CloudWatch dimensions. |
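Metrics 1, 2, and 4 above combine into the promotion gate. As a sketch of how read_run_summary might compute the decision (the function and threshold constant names are illustrative; the 5% and 100-row limits are the thresholds stated in the table):

```python
QUARANTINE_RATE_MAX = 5.0   # %, threshold named in metrics 1 and 4 above
CONDEMNED_ROWS_MAX = 100    # cap named in metric 2 above

def promotion_allowed(metrics):
    """Gate decision over a run's quality metrics.

    Returns (allowed, reasons); the reasons list is empty when the run
    may update the current/ reporting prefix.
    """
    reasons = []
    if metrics["quarantine_rate"] > QUARANTINE_RATE_MAX:
        reasons.append(
            f"quarantine_rate {metrics['quarantine_rate']}% exceeds {QUARANTINE_RATE_MAX}%")
    if metrics["condemned_rows"] >= CONDEMNED_ROWS_MAX:
        reasons.append(
            f"condemned_rows {metrics['condemned_rows']} at or above {CONDEMNED_ROWS_MAX}")
    if metrics.get("has_critical_errors"):
        reasons.append("critical validation errors present")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean keeps the decision auditable: the Step Functions state output records not just that a run was blocked, but why.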
4) Stakeholder Email Draft
To: Product, Operations, Risk, Management
From: Data Platform Team
Subject: Financial data pipeline — month-end run summary and key metrics
Word count: ~195
This summary covers the latest run of the financial data pipeline and the metrics used to manage quality and risk.
What the pipeline does
Raw transaction files (CSV) are ingested into a secure, immutable layer (Bronze). Each run validates every row. Rows that pass are written to the reporting-ready layer (Silver); rows that fail are quarantined with clear error reasons. Rows that are duplicates or exceed retry limits are condemned and never used for reporting. Only runs that meet the defined quality thresholds are allowed to update the “current” dataset used for month-end reporting.
How control and reliability are measured
Eight headline metrics are tracked:
- Quarantine rate — Share of rows held back for review (kept low and stable by design).
- Condemned rows — Rows permanently excluded (duplicates and retry exhaustion); promotion is capped when this is high.
- Pipeline success rate — Proportion of runs that complete successfully.
- Promotion gate — Whether a run was allowed to update reporting data (runs that exceed thresholds are blocked).
- Data freshness — How up to date the data in the reporting layer is.
- Records validated and promoted — Number of rows that passed checks and were promoted to reporting.
- Error breakdown — Main reasons rows were quarantined or condemned, to support remediation of upstream issues.
- Run traceability — Every run is tied to a unique execution ID and full logs for audit.
This month
[Insert 1–2 sentences with actual numbers, e.g. quarantine rate X%, Y runs succeeded, Z records promoted, and any notable error reason or action taken.]
These metrics will continue to be shared in regular updates. For more detail, the technical appendix lists all available metrics and where they are produced. If you have questions, contact the Data Platform Team.
Satisfies Business Case §5 – Communication and Documentation.
5) Technical Appendix Outline
High-level outline only (no full text). Purpose: list all other metrics, how they are computed, where they are stored/queryable, and where to look when deeper questions arise.
- Introduction
  - Purpose of the appendix (reference for auditors, engineering, and follow-up questions).
  - How it relates to the stakeholder email and the eight headline metrics.
- Data volume and flow
  - Input/output row counts: source (ETL metrics, _SUCCESS), CloudWatch names (InputRows, ValidRows, QuarantinedRows, CondemnedRows).
  - Bytes read/written: where set (pandas/Spark), CloudWatch (BytesRead, BytesWritten).
  - File and partition counts: S3 layout (Bronze, Silver, Quarantine), how to derive (list prefixes, _SUCCESS).
  - Data freshness / latency: definition, how to compute from _SUCCESS and trigger payload.
- Data quality and governance
  - Validation outcomes: valid vs quarantine vs condemned; quarantine_rate and CloudWatch alarm (5%).
  - Error categories: quarantine_by_reason, condemned_by_reason; mapping to ERROR_* constants (config.py).
  - Attempt counts and auto-condemnation: avg_attempt_count, auto_condemnation_rate, duplicate_detection_rate; MAX_ATTEMPTS and duplicate logic (loop_prevention, validator).
  - Schema version: where it appears (path schema_v=), Glue table parameters.
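The schema-version distribution named above can be derived from the path token alone. An illustrative sketch (function name assumed), counting Silver objects per schema_v= value:

```python
import re

def schema_version_counts(keys):
    """Count objects per schema version from the schema_v= path token."""
    counts = {}
    for key in keys:
        m = re.search(r"schema_v=([^/]+)", key)
        if m:
            v = m.group(1)
            counts[v] = counts.get(v, 0) + 1
    return counts
```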
- Pipeline health and reliability
  - Success/failure: Step Functions execution status, Glue job run status; where to query (ListExecutions, GetJobRuns).
  - Retries: Glue Retry (Terraform), SQS DLQ (ohpen-etl-failures-dlq), maxReceiveCount.
  - Circuit breaker: condition (CIRCUIT_BREAKER_THRESHOLD), how it fails the job, where it appears (logs, SNS).
  - Promotion gate: Lambda read_run_summary logic (quarantine_rate, condemned_rows, has_critical_errors); Step Functions state names.
- Performance and scale
  - ETL duration: CloudWatch DurationSeconds, Glue GetJobRun.
  - Throughput: records/sec from input_rows and duration.
  - Athena: workgroup (ohpen-analytics), publish_cloudwatch_metrics_enabled; where scan volume appears.
  - Partition skew and small-file ratio: how to derive (S3 list, partition layout in ARCHITECTURE.md).
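One way to quantify the partition skew mentioned above (the metric definition here is an assumption, not from the repo): the largest partition's row count over the mean, computed from per-partition counts obtained via Athena GROUP BY or an S3 listing:

```python
def partition_skew(partition_counts):
    """Skew factor: max partition row count over the mean.

    `partition_counts` maps a year/month partition to its row count.
    A value near 1.0 means even partitions; large values flag hot spots.
    """
    if not partition_counts:
        return 0.0
    counts = list(partition_counts.values())
    mean = sum(counts) / len(counts)
    return round(max(counts) / mean, 2) if mean else 0.0
```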
- Cost and efficiency (derived)
  - Glue cost drivers: worker type, DPU, duration (Terraform, COST_ANALYSIS.md).
  - S3 storage: buckets and prefixes; lifecycle (Silver, Gold, Quarantine) per main.tf.
  - Athena: cost per TB scanned; recommendation to use partition pruning.
- Auditability and traceability
  - Identifiers: execution_arn, run_id, Glue JobRunId (TRACEABILITY_DESIGN.md).
  - Where they are stored: _SUCCESS, CloudWatch dimensions, Step Functions state, SNS payload.
  - Executions per day and distinct source files: how to compute from Step Functions history and _SUCCESS.
- Where metrics are stored and queryable
  - CloudWatch: namespace Ohpen/ETL; metric names and dimensions (RunId, ExecutionArn when passed).
  - S3: _SUCCESS JSON schema (run_id, completed_at, metrics, execution_arn, trigger).
  - Step Functions: execution history, state output (validation_summary, promotion_result).
  - Lambda: read_run_summary return payload; CloudWatch Logs for both Lambdas.
  - Athena: Glue Catalog (silver_transactions), workgroup settings; optional external table over _SUCCESS markers for run-level analytics.
  - CI/CD: _STAGING.json (sha, build_time, ci_run_id, validation); test_report.json, coverage.json (TEST_METRICS_GUIDE.md).
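Since _SUCCESS carries completed_at, the data-freshness headline metric can be read straight off the latest marker. A sketch, assuming completed_at is an ISO 8601 timestamp as listed above:

```python
from datetime import datetime, timezone

def data_freshness_seconds(success_marker, now=None):
    """Freshness: seconds since the last successful run completed."""
    now = now or datetime.now(timezone.utc)
    completed = datetime.fromisoformat(success_marker["completed_at"])
    return (now - completed).total_seconds()
```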
- References
  - Key repo paths: tasks/data_ingestion_transformation/src/etl (metrics.py, ingest_transactions*.py, validator*.py, loop_prevention*.py, s3_operations*.py), tasks/devops_cicd/infra/terraform/main.tf, Lambda read_run_summary and promote_silver.
  - Docs: TRACEABILITY_DESIGN.md, COST_ANALYSIS.md, TEST_METRICS_GUIDE.md, ARCHITECTURE.md, AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS.md.