
Metrics Discovery and Executive Signal Extraction

Context: Principal Data Engineer / Analytics Architect — repo-wide metrics discovery for FinTech (auditability, reliability, clarity over volume).
Scope: Entire ohpen-case-2026 repository (ETL, SQL, Terraform, Step Functions, Glue, S3, CI/CD, validation, documentation).
Rule: Only metrics that are realistically derivable from the existing implementation (or trivial extensions that do not change architecture).


1) Full Metrics Inventory (No Filtering)

Grouped by category. For each metric: Name, Category, How computed (source of truth), Where it would live, Exists or trivial to add.

1) DATA VOLUME & FLOW

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Records ingested (input rows) | Data volume | ETL: len(raw_df) / raw_df.count(); written to _SUCCESS.metrics.input_rows | CloudWatch (InputRows), _SUCCESS JSON, Lambda read_run_summary | Exists |
| Records validated (valid rows) | Data volume | ETL: valid_df row count; metrics['valid_rows'] | CloudWatch (ValidRows), _SUCCESS | Exists |
| Records quarantined | Data volume | ETL: quarantine_df count; metrics['quarantined_rows'] | CloudWatch (QuarantinedRows), _SUCCESS, Lambda summary | Exists |
| Records condemned | Data volume | ETL: condemned_df count; metrics['condemned_rows'] | CloudWatch (CondemnedRows), _SUCCESS, Lambda summary | Exists |
| Records promoted | Data volume | After promotion gate: valid_rows for runs where promotion_allowed=true; Lambda promote_silver copies run output to current/ | Step Functions execution history + _SUCCESS; no aggregate metric today | Trivial (aggregate from executions) |
| Data freshness / latency | Data volume | Time from trigger (EventBridge event_time or schedule) to _SUCCESS.completed_at; or max(TransactionTimestamp) in Silver vs now | CloudWatch Logs / custom metric from Step Functions or Glue | Trivial (compute from _SUCCESS + trigger in state) |
| File counts (Bronze/Silver/Quarantine) | Data volume | S3 ListObjectsV2 by prefix (e.g. bronze/..., silver/.../run_id=..., quarantine/...) | Athena (external table list) or Lambda/Glue script | Trivial |
| Bytes read (input) | Data volume | ETL: read_csv_from_s3 returns bytes_read; metrics['bytes_read'] | CloudWatch (BytesRead), _SUCCESS when passed | Exists (pandas path; Spark path trivial) |
| Bytes written (output) | Data volume | ETL: sum of write sizes; metrics['bytes_written'] | CloudWatch (BytesWritten), _SUCCESS | Exists (pandas); Spark can add |
| Partitions written (Silver) | Data volume | Count of distinct year/month in valid_df or S3 prefix list under run_id | _SUCCESS (could add), or S3 list | Trivial |
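The freshness/latency row above reduces to a timestamp subtraction once both the trigger time and the _SUCCESS completion time are in hand. A minimal sketch — the function name and ISO-string inputs are illustrative, not taken from the repo:

```python
from datetime import datetime

def freshness_seconds(trigger_time_iso: str, completed_at_iso: str) -> float:
    """Data freshness for a run: gap between the trigger event time
    (EventBridge event_time or schedule) and the completed_at timestamp
    recorded in the _SUCCESS marker. Inputs are ISO-8601 strings."""
    trigger = datetime.fromisoformat(trigger_time_iso)
    completed = datetime.fromisoformat(completed_at_iso)
    return (completed - trigger).total_seconds()

latency = freshness_seconds("2026-01-31T00:00:00+00:00",
                            "2026-01-31T00:05:00+00:00")  # 300.0 seconds
```

Emitting this as a custom CloudWatch metric per run would make the freshness headline metric queryable without touching the architecture.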

2) DATA QUALITY & GOVERNANCE

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Validation pass count | Data quality | valid_rows (see above) | CloudWatch, _SUCCESS | Exists |
| Validation fail (quarantine) count | Data quality | quarantined_rows | CloudWatch, _SUCCESS | Exists |
| Quarantine rate (%) | Data quality | quarantined_rows / input_rows * 100 | CloudWatch (QuarantineRate), Lambda summary, alarm (5% threshold) | Exists |
| Error categories and frequencies | Data quality | quarantine_df/condemned_df validation_error value_counts → quarantine_by_reason, condemned_by_reason | _SUCCESS.metrics, CloudWatch Logs (ETLComplete), Lambda summary | Exists |
| Currency code violations | Data quality | Count where validation_error == ERROR_CURRENCY (CURRENCY_ERROR) | Derived from quarantine_by_reason / condemned_by_reason | Exists (in breakdown) |
| Null / range violations | Data quality | NULL_VALUE_ERROR, TYPE_ERROR (amount), TIMESTAMP_ERROR from validation | Same | Exists |
| Reprocessing attempts (per row) | Data quality | attempt_count in quarantine/condemned; avg_attempt_count in metrics | CloudWatch (AvgAttemptCount), _SUCCESS | Exists |
| Duplicate detection (row_hash) | Data quality | Condemned with ERROR_DUPLICATE_FAILURE; duplicate_detection_rate in metrics | CloudWatch (DuplicateDetectionRate), _SUCCESS | Exists |
| TransactionID duplicate (Silver) | Data quality | Condemned with ERROR_DUPLICATE_TRANSACTION_ID | Same breakdown | Exists |
| Auto-condemnation rate (%) | Data quality | condemned_rows / input_rows * 100 | CloudWatch (AutoCondemnationRate), _SUCCESS | Exists |
| Quarantine retry exhaustion | Data quality | Rows condemned with ERROR_MAX_ATTEMPTS (attempt_count >= MAX_ATTEMPTS) | condemned_by_reason | Exists |
| Schema version distribution | Data quality | S3 path contains schema_v=v1; Glue table params | S3 prefix listing, Glue Catalog | Trivial |
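The quarantine rate and the error-category breakdown are both one pass over the per-row validation_error values. A stdlib-only sketch of the same arithmetic (the error-code strings and the tiny sample batch are illustrative; the repo does this with pandas value_counts):

```python
from collections import Counter

# Hypothetical validation results for a tiny batch: None means the row
# passed, a config.py-style error code means it was quarantined.
validation_errors = [None, "CURRENCY_ERROR", None,
                     "NULL_VALUE_ERROR", "CURRENCY_ERROR", None]

input_rows = len(validation_errors)
quarantined = [e for e in validation_errors if e is not None]

quarantine_rate = len(quarantined) / input_rows * 100   # alarmed at 5% in the repo
quarantine_by_reason = dict(Counter(quarantined))        # mirrors value_counts()
```

On this sample, quarantine_rate is 50.0 and quarantine_by_reason is {"CURRENCY_ERROR": 2, "NULL_VALUE_ERROR": 1} — exactly the shape written into _SUCCESS.metrics.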

3) PIPELINE HEALTH & RELIABILITY

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Pipeline success/failure rate | Pipeline health | Step Functions execution status (SUCCEEDED / FAILED); Glue job run status; GetExecutionHistory / ListExecutions; GetJobRuns | CloudWatch Events, SNS, Step Functions logs | Exists (events + SNS on failure) |
| Retry counts (Glue) | Pipeline health | Terraform Retry (Glue.ThrottlingException, etc.) → Glue job run attempt count | Glue GetJobRun | Trivial (Glue API) |
| DLQ volume | Pipeline health | SQS ohpen-etl-failures-dlq ApproximateNumberOfMessages | SQS API / CloudWatch SQS metrics | Exists (SQS metrics) |
| Quarantine retry exhaustion count | Pipeline health | Count condemned with MAX_ATTEMPTS | _SUCCESS condemned_by_reason | Exists |
| Circuit breaker triggered | Pipeline health | Glue job fails with RuntimeError (circuit breaker); logged and state FAILED | CloudWatch Logs, Step Functions Catch → HandleFailure → SNS | Exists |
| Mean time to failure / recovery | Pipeline health | Time between last SUCCEEDED and FAILED; time from FAILED to next SUCCEEDED | Step Functions execution timestamps | Trivial (query history) |
| Backfill vs normal runs | Pipeline health | Trigger source in _SUCCESS.trigger (s3-eventbridge vs schedule) | _SUCCESS | Exists |
| Promotion gate pass/fail | Pipeline health | Lambda read_run_summary returns promotion_allowed; Step Functions CheckValidationGate | Step Functions state (ValidationGateFailed vs PromoteSilver) | Exists |
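The success/failure rate is an aggregation over execution records rather than a stored number; a hedged sketch of the arithmetic, operating on records shaped like the items ListExecutions returns (the function name is illustrative):

```python
def success_rate(executions):
    """Pipeline success rate over Step Functions execution records
    (shape as in list_executions()["executions"]). RUNNING executions
    are excluded so an in-flight run does not skew the rate."""
    finished = [e for e in executions if e["status"] != "RUNNING"]
    if not finished:
        return None
    succeeded = sum(1 for e in finished if e["status"] == "SUCCEEDED")
    return succeeded / len(finished) * 100

history = [
    {"status": "SUCCEEDED"}, {"status": "SUCCEEDED"},
    {"status": "FAILED"}, {"status": "SUCCEEDED"}, {"status": "RUNNING"},
]
rate = success_rate(history)  # 75.0
```

In practice the history would come from boto3's stepfunctions client (list_executions(stateMachineArn=...)), whose records also carry startDate/stopDate for the mean-time-to-failure/recovery rows above.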

4) PERFORMANCE & SCALE

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Glue job duration | Performance | end_time - start_time in ETL; Glue GetJobRun CompletedOn - StartedOn | CloudWatch (DurationSeconds), _SUCCESS, Glue API | Exists |
| Records/sec throughput | Performance | input_rows / duration_seconds | Derived from CloudWatch or _SUCCESS | Trivial |
| Partition skew | Performance | Row count or file size per year/month from Silver listing | Athena or Glue partition stats | Trivial |
| Small-file ratio | Performance | Count of Parquet files under threshold size vs total (e.g. <1MB) | S3 list + size | Trivial |
| Athena scan volume (estimated) | Performance | Athena workgroup publish_cloudwatch_metrics_enabled → CloudWatch; per-query data scanned | Athena CloudWatch metrics / query result metadata | Exists (workgroup config) |
| Job resource utilization | Performance | Glue DPU / worker type (G.1X, number_of_workers) in Terraform; no per-run CPU % in repo | Glue console / CloudWatch Glue metrics | Trivial (Glue native) |
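Throughput and small-file ratio are the two "Trivial" rows here that are pure arithmetic over values already available; a sketch under those definitions (function names are illustrative):

```python
def records_per_second(input_rows: int, duration_seconds: float) -> float:
    """Throughput as defined in the table: input_rows / duration_seconds."""
    return input_rows / duration_seconds if duration_seconds else 0.0

def small_file_ratio(object_sizes_bytes, threshold: int = 1_000_000) -> float:
    """Share (%) of Parquet objects below the small-file threshold (~1 MB here)."""
    if not object_sizes_bytes:
        return 0.0
    small = sum(1 for size in object_sizes_bytes if size < threshold)
    return small / len(object_sizes_bytes) * 100

throughput = records_per_second(1_000, 50)                            # 20.0 rows/s
ratio = small_file_ratio([500_000, 2_000_000, 100_000, 5_000_000])    # 50.0 %
```

The object sizes would come from a paginated s3 list_objects_v2 over the Silver prefix; each returned object carries a Size field.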

5) COST & EFFICIENCY (ESTIMATED / DERIVED)

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Data scanned per query | Cost | Athena bytes scanned per query (workgroup) | Athena CloudWatch / query results | Exists |
| Storage growth by layer | Cost | S3 bucket/prefix size (ListObjects + Size sum or Storage Lens) | S3 API, Cost Explorer, Storage Lens | Trivial |
| Job runtime cost drivers | Cost | Glue duration × DPU × price; doc COST_ANALYSIS | Derived from duration + Terraform worker config | Trivial |
| Reprocessing overhead | Cost | Count of runs with same input_key or trigger; condemned + quarantine volume | _SUCCESS trigger + metrics | Trivial |
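The "duration × DPU × price" driver can be sketched as a one-liner. The hourly rate below is illustrative and should be checked against current regional Glue pricing; the G.1X-to-1-DPU mapping is an assumption to verify against the Terraform worker_type / number_of_workers config:

```python
GLUE_DPU_HOUR_USD = 0.44  # illustrative rate; verify against current pricing

def glue_run_cost_usd(duration_seconds: float, number_of_workers: int,
                      dpu_per_worker: float = 1.0) -> float:
    """Estimated cost of one Glue run: duration x total DPUs x hourly price.
    Assumes G.1X workers (1 DPU each); G.2X would be dpu_per_worker=2."""
    hours = duration_seconds / 3600
    return hours * number_of_workers * dpu_per_worker * GLUE_DPU_HOUR_USD

cost = glue_run_cost_usd(3600, 10)  # ~4.4 USD for a 1-hour, 10-worker run
```

Multiplying by runs per month gives a rough monthly figure that can be cross-checked against COST_ANALYSIS.md.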

6) AUDITABILITY & TRACEABILITY

| Metric | Category | How computed (source of truth) | Where it would live | Exists / Trivial |
|---|---|---|---|---|
| Executions per day | Audit | Step Functions ListExecutions by date | Step Functions API, CloudWatch Logs | Trivial |
| Distinct source files processed | Audit | _SUCCESS.trigger.key or input_key per run; distinct count | Athena over _SUCCESS markers or log query | Trivial |
| Runs per schema version | Audit | S3 prefix count schema_v=v1 vs v2 | S3 list | Trivial |
| Time-to-detect vs time-to-correct | Audit | Quarantine event time vs next run success / human resolution (no automated resolution tracking in repo) | Manual / runbooks | Partial (detect from logs; correct not automated) |
| Execution ARN / run_id coverage | Audit | _SUCCESS.execution_arn, run_id in every run | _SUCCESS, CloudWatch dimensions | Exists |
| Promotion decisions (allow/block) | Audit | Lambda read_run_summary response; Step Functions state | Step Functions history, Lambda logs | Exists |
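The executions-per-day row is a simple grouping over ListExecutions start times; a sketch operating on records shaped like that API's output (the function name is illustrative):

```python
from collections import Counter
from datetime import datetime, timezone

def executions_per_day(executions):
    """Audit metric: count Step Functions executions per UTC calendar day,
    grouping on the startDate each list_executions record carries."""
    return Counter(
        e["startDate"].astimezone(timezone.utc).date().isoformat()
        for e in executions
    )

history = [
    {"startDate": datetime(2026, 1, 30, 23, 0, tzinfo=timezone.utc)},
    {"startDate": datetime(2026, 1, 31, 1, 0, tzinfo=timezone.utc)},
    {"startDate": datetime(2026, 1, 31, 9, 0, tzinfo=timezone.utc)},
]
per_day = executions_per_day(history)  # {"2026-01-30": 1, "2026-01-31": 2}
```

The same loop could count distinct source files by swapping the key for _SUCCESS.trigger.key.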

2) Stakeholder Relevance Filter

Classification by audience and why it matters in plain language.

| Metric(s) | Audience | Why it matters |
|---|---|---|
| Quarantine rate; Condemned rows; Promotion gate (pass/fail) | Executive / Management | “What share of data is invalid and is bad data blocked from reporting?” Direct signal of control and risk. |
| Pipeline success/failure rate; Data freshness; Records promoted | Executive / Management | “Does the pipeline run reliably and is reporting data up to date?” Trust in delivery. |
| Records ingested; Records validated; Records quarantined/condemned | Operations / Support | “How much was processed and how much went to quarantine?” Operational capacity and issue volume. |
| Error categories (quarantine_by_reason); Circuit breaker triggered; DLQ volume | Operations / Support | “Why did rows fail and is intervention needed?” Triage and prioritization. |
| Pipeline success/failure; Retry counts; Duration; Promotion gate | Operations / Support | “Did the job succeed, how long did it take, and was promotion applied?” Day-to-day health. |
| Quarantine rate; Condemned rows; Auto-condemnation; Duplicate detection | Risk / Compliance | “How much bad data is caught and removed from the main dataset?” Audit and control. |
| Executions per day; Distinct source files; Execution ARN / run_id; Promotion decisions | Risk / Compliance | “Can each run be traced and what was approved?” Audit trail. |
| All of the above + bytes, throughput, Athena scan, storage growth, cost drivers | Data / Engineering | “How is the platform tuned, scaled, and cost-managed?” Technical ownership. |

3) Top 8 Metrics Selection (Critical)

Exactly eight metrics that are understandable to non-technical stakeholders, reflect control and reliability, are defensible in an interview, and align with the business-case goals (FinTech: auditability, reliability, clarity).

| # | Metric name | One-line business explanation | Why this over others | Business-case alignment | AWS / component |
|---|---|---|---|---|---|
| 1 | Quarantine rate (%) | Share of rows held back for review instead of going to reporting. | Single number that summarizes “how clean is this run?”; threshold (e.g. 5%) is already in alarms and promotion gate. | Data quality and control over what reaches reporting. | ETL → CloudWatch (QuarantineRate); Lambda read_run_summary; alarm QuarantineRateQualityGate. |
| 2 | Condemned rows (count) | Rows permanently excluded (e.g. duplicates, too many retries). | Shows irreversible loss of rows and duplicate/retry control; <100 is gate for promotion. | Risk and audit: what is deliberately excluded. | ETL → CloudWatch (CondemnedRows); _SUCCESS; Lambda summary; promotion gate. |
| 3 | Pipeline success rate | Percentage of runs that complete successfully. | Direct reliability signal; failures already trigger SNS. | Reliability and operational stability. | Step Functions execution status; CloudWatch Events; SNS on failure. |
| 4 | Promotion gate (pass/fail) | Whether a run was allowed to update “current” reporting data. | Binary control: bad runs never reach reporting. | Control and governance. | Lambda read_run_summary (promotion_allowed); Step Functions CheckValidationGate. |
| 5 | Data freshness | How old the latest data in the reporting layer is. | Answers “is month-end data ready when needed?” | Timeliness and trust in reporting. | Derived from _SUCCESS.completed_at and trigger time; or max event time in Silver. |
| 6 | Records validated (promoted) | Number of rows that passed checks and were promoted to reporting. | Business value: volume of data trusted for reporting. | Value and scale of trusted data. | valid_rows for runs where promotion_allowed; Lambda promote_silver. |
| 7 | Error breakdown (top reasons) | Main reasons rows were quarantined or condemned. | Explains why data was held back; supports remediation. | Transparency and remediation. | _SUCCESS quarantine_by_reason / condemned_by_reason; Lambda summary. |
| 8 | Run traceability (execution ARN / run_id) | Every run can be traced to a unique execution and logs. | FinTech audit requirement: “evidence of what ran and when.” | Auditability and compliance. | _SUCCESS.execution_arn, run_id; Step Functions; CloudWatch dimensions. |

4) Stakeholder Email Draft

To: Product, Operations, Risk, Management
From: Data Platform Team
Subject: Financial data pipeline — month-end run summary and key metrics

Word count: ~195


This summary covers the latest run of the financial data pipeline and the metrics used to manage quality and risk.

What the pipeline does
Raw transaction files (CSV) are ingested into a secure, immutable layer (Bronze). Each run validates every row. Rows that pass are written to the reporting-ready layer (Silver); rows that fail are quarantined with clear error reasons. Rows that are duplicates or exceed retry limits are condemned and never used for reporting. Only runs that meet the defined quality thresholds are allowed to update the “current” dataset used for month-end reporting.

How control and reliability are measured
Eight headline metrics are tracked:

  1. Quarantine rate — Share of rows held back for review (kept low and stable by design).
  2. Condemned rows — Rows permanently excluded (duplicates and retry exhaustion); promotion is capped when this is high.
  3. Pipeline success rate — Proportion of runs that complete successfully.
  4. Promotion gate — Whether a run was allowed to update reporting data (runs that exceed thresholds are blocked).
  5. Data freshness — How up to date the data in the reporting layer is.
  6. Records validated and promoted — Number of rows that passed checks and were promoted to reporting.
  7. Error breakdown — Main reasons rows were quarantined or condemned, to support remediation of upstream issues.
  8. Run traceability — Every run is tied to a unique execution ID and full logs for audit.

This month
[Insert 1–2 sentences with actual numbers, e.g. quarantine rate X%, Y runs succeeded, Z records promoted, and any notable error reason or action taken.]

These metrics will continue to be shared in regular updates. For more detail, the technical appendix lists all available metrics and where they are produced. If you have questions, contact the Data Platform Team.


Satisfies Business Case §5 – Communication and Documentation.


5) Technical Appendix Outline

High-level outline only (no full text). Purpose: list all other metrics, how they are computed, where they are stored/queryable, and where to look when deeper questions arise.

  1. Introduction

    • Purpose of the appendix (reference for auditors, engineering, and follow-up questions).
    • How it relates to the stakeholder email and the eight headline metrics.
  2. Data volume and flow

    • Input/output row counts: source (ETL metrics, _SUCCESS), CloudWatch names (InputRows, ValidRows, QuarantinedRows, CondemnedRows).
    • Bytes read/written: where set (pandas/Spark), CloudWatch (BytesRead, BytesWritten).
    • File and partition counts: S3 layout (Bronze, Silver, Quarantine), how to derive (list prefixes, _SUCCESS).
    • Data freshness / latency: definition, how to compute from _SUCCESS and trigger payload.
  3. Data quality and governance

    • Validation outcomes: valid vs quarantine vs condemned; quarantine_rate and CloudWatch alarm (5%).
    • Error categories: quarantine_by_reason, condemned_by_reason; mapping to ERROR_* constants (config.py).
    • Attempt counts and auto-condemnation: avg_attempt_count, auto_condemnation_rate, duplicate_detection_rate; MAX_ATTEMPTS and duplicate logic (loop_prevention, validator).
    • Schema version: where it appears (path schema_v=), Glue table parameters.
  4. Pipeline health and reliability

    • Success/failure: Step Functions execution status, Glue job run status; where to query (ListExecutions, GetJobRuns).
    • Retries: Glue Retry (Terraform), SQS DLQ (ohpen-etl-failures-dlq), maxReceiveCount.
    • Circuit breaker: condition (CIRCUIT_BREAKER_THRESHOLD), how it fails the job, where it appears (logs, SNS).
    • Promotion gate: Lambda read_run_summary logic (quarantine_rate, condemned_rows, has_critical_errors); Step Functions state names.
  5. Performance and scale

    • ETL duration: CloudWatch DurationSeconds, Glue GetJobRun.
    • Throughput: records/sec from input_rows and duration.
    • Athena: workgroup (ohpen-analytics), publish_cloudwatch_metrics_enabled; where scan volume appears.
    • Partition skew and small-file ratio: how to derive (S3 list, partition layout in ARCHITECTURE.md).
  6. Cost and efficiency (derived)

    • Glue cost drivers: worker type, DPU, duration (Terraform, COST_ANALYSIS.md).
    • S3 storage: buckets and prefixes; lifecycle (Silver, Gold, Quarantine) per main.tf.
    • Athena: cost per TB scanned; recommendation to use partition pruning.
  7. Auditability and traceability

    • Identifiers: execution_arn, run_id, Glue JobRunId (TRACEABILITY_DESIGN.md).
    • Where they are stored: _SUCCESS, CloudWatch dimensions, Step Functions state, SNS payload.
    • Executions per day and distinct source files: how to compute from Step Functions history and _SUCCESS.
  8. Where metrics are stored and queryable

    • CloudWatch: namespace Ohpen/ETL; metric names and dimensions (RunId, ExecutionArn when passed).
    • S3: _SUCCESS JSON schema (run_id, completed_at, metrics, execution_arn, trigger).
    • Step Functions: execution history, state output (validation_summary, promotion_result).
    • Lambda: read_run_summary return payload; CloudWatch Logs for both Lambdas.
    • Athena: Glue Catalog (silver_transactions), workgroup settings; optional external table over _SUCCESS markers for run-level analytics.
    • CI/CD: _STAGING.json (sha, build_time, ci_run_id, validation); test_report.json, coverage.json (TEST_METRICS_GUIDE.md).
  9. References

    • Key repo paths: tasks/data_ingestion_transformation/src/etl (metrics.py, ingest_transactions*.py, validator*.py, loop_prevention*.py, s3_operations*.py), tasks/devops_cicd/infra/terraform/main.tf, Lambda read_run_summary and promote_silver.
    • Docs: TRACEABILITY_DESIGN.md, COST_ANALYSIS.md, TEST_METRICS_GUIDE.md, ARCHITECTURE.md, AUDIT_AND_NOTIFICATIONS_CONSIDERATIONS.md.
© 2026 Stephen Adei · CC BY 4.0