Amazon Bedrock Implementations

This document describes the GenAI (Amazon Bedrock) implementations added to the Ohpen data lake project. All four use Claude via Bedrock to generate plain-language explanations, stakeholder narratives, catalog descriptions, and SQL/pipeline documentation.

🤖 GenAI possibility — Bedrock is used in four spots in this project; there are 50+ more across the platform (validation hints, runbook drafts, NL-to-SQL, cost narratives). See GenAI in the Ohpen Case & Opportunities.

Overview

| Implementation | Purpose | Location | Integration |
|---|---|---|---|
| Quarantine explanations | Human-readable explanation + suggested fix for quarantined/condemned rows | Task 01 ETL | Optional post-ETL or Lambda |
| Report narrative | 2–4 sentence stakeholder paragraph from run metrics | Task 05 scripts | Operational reports, stakeholder email |
| Catalog/quality descriptions | Table description + quality summary for Glue Catalog | Task 01 scripts | Glue table description, docs |
| SQL/pipeline docs | Markdown explanation of SQL or ETL flow | Task 05 scripts | SQL_BREAKDOWN, runbooks |

Prerequisites

  • AWS credentials with access to Bedrock in the target region.
  • Bedrock model access: Request access to Claude (e.g. anthropic.claude-3-haiku-20240307-v1:0) in the AWS Console → Bedrock → Model access.
  • IAM: Glue service role has bedrock:InvokeModel on arn:aws:bedrock:*::foundation-model/anthropic.claude-* (see Terraform in tasks/devops_cicd/infra/terraform/main.tf).

Optional environment variables:

  • BEDROCK_MODEL_ID — Model ID (default: anthropic.claude-3-haiku-20240307-v1:0).
  • AWS_REGION / BEDROCK_REGION — Region for Bedrock (default: eu-west-1).

1. Quarantine explanations

Module: tasks/data_ingestion_transformation/src/etl/bedrock_quarantine.py

Purpose: Turn validation error codes and row data into short, actionable explanations for data stewards (e.g. “Currency code XBT not in ISO-4217; use EUR, USD, or another supported code.”).

Usage:

  • Single row:
    explain_quarantine_row(validation_error, row_dict, use_bedrock=True)
  • Batch (add explanation column to a quarantine DataFrame):
    explain_quarantine_batch(quarantine_df, max_rows=50, use_bedrock=True)

Integration options:

  • Post-ETL script/Lambda: After writing quarantine Parquet, read it, call explain_quarantine_batch(), write back with an explanation column (or to DynamoDB).
  • Human-review UI: Call explain_quarantine_row() per row when a steward opens a record.

Fallback: If Bedrock is disabled (use_bedrock=False) or the API call fails, static per-error-type fallback text is returned instead, so the pipeline does not depend on Bedrock being available.
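The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the error codes, fallback strings, and the stubbed-out Bedrock call are all assumptions; the real mapping lives in bedrock_quarantine.py.

```python
# Hypothetical error codes and fallback texts -- illustrative only.
FALLBACK_TEXT = {
    "INVALID_CURRENCY": "Currency code is not a valid ISO-4217 code; correct it at the source.",
    "NEGATIVE_AMOUNT": "Amount is negative where only positive values are allowed.",
}
DEFAULT_TEXT = "Row failed validation; see the error code and review manually."

def explain_with_fallback(validation_error: str, row: dict, use_bedrock: bool = True) -> str:
    """Return a Bedrock explanation, or static text if Bedrock is off/unavailable."""
    if use_bedrock:
        try:
            # Real code would call e.g. invoke_claude(build_prompt(validation_error, row)).
            raise RuntimeError("Bedrock call stubbed out in this sketch")
        except Exception:
            pass  # fall through to static text; the pipeline must not block
    return FALLBACK_TEXT.get(validation_error, DEFAULT_TEXT)
```

The key design point is that any exception from the Bedrock call degrades to static text rather than failing the row.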


2. Report narrative

Script: tasks/communication_documentation/scripts/bedrock_report_narrative.py

Purpose: Generate a 2–4 sentence stakeholder-facing paragraph from structured run metrics (run date, total/passed/quarantined, top errors).

Usage:

# From JSON string
python bedrock_report_narrative.py --metrics '{"run_date":"2026-01-31","total":1450200,"passed":1450150,"quarantined":50}'

# From file
python bedrock_report_narrative.py --metrics-file reports/metrics.json --out narrative.txt

# From stdin
echo '{"run_date":"2026-01-31"}' | python bedrock_report_narrative.py

Integration: Call from the same workflow that produces operational_month_end_final_report.md or stakeholder emails; append or insert the narrative into the report body.
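A narrative prompt from metrics like those above might be assembled as in this sketch. The wording of the instruction and the prompt layout are assumptions; the real prompt lives in bedrock_report_narrative.py.

```python
import json

def build_narrative_prompt(metrics: dict) -> str:
    """Format run metrics into a prompt asking for a 2-4 sentence stakeholder
    paragraph. Illustrative only -- the script's actual prompt may differ."""
    return (
        "Write a 2-4 sentence stakeholder summary of this data pipeline run. "
        "Use plain language, mention totals, and flag any quarantined rows.\n\n"
        f"Metrics:\n{json.dumps(metrics, indent=2)}"
    )

metrics = {"run_date": "2026-01-31", "total": 1450200, "passed": 1450150, "quarantined": 50}
prompt = build_narrative_prompt(metrics)
```

The returned prompt string would then be passed to the shared Bedrock client and the response appended to the report body.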


3. Catalog / quality descriptions

Script: tasks/data_ingestion_transformation/scripts/bedrock_quality_descriptions.py

Purpose: Generate a short human-readable description for a table/dataset, optionally including quality metrics, and optionally update the Glue table description.

Usage:

# Description only (stdout)
python bedrock_quality_descriptions.py --table silver_transactions --metrics '{"completeness":99.99}'

# Write to file
python bedrock_quality_descriptions.py --table silver_transactions --metrics-file metrics.json --out desc.txt

# Update Glue table description
python bedrock_quality_descriptions.py --table silver_transactions --metrics-file metrics.json --glue-database ohpen_data_lake --update-glue

Integration: Run after ETL or quality checks; use --update-glue to keep Glue Catalog descriptions in sync.
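Updating a Glue table description has one subtlety worth noting: glue.get_table() returns read-only fields (DatabaseName, CreateTime, etc.) that glue.update_table() rejects inside TableInput, so they must be stripped before writing back. A sketch of that pattern, assuming the script works this way:

```python
def table_input_with_description(get_table_response: dict, description: str) -> dict:
    """Build a TableInput for glue.update_table() with a new Description.

    Strips the read-only fields that get_table() returns but update_table()
    does not accept. A sketch of the pattern; the real script may differ.
    """
    table = dict(get_table_response["Table"])
    for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
        table.pop(key, None)
    table["Description"] = description
    return table
```

Usage would look like `glue.update_table(DatabaseName="ohpen_data_lake", TableInput=table_input_with_description(resp, desc))` on a boto3 Glue client.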


4. SQL / pipeline doc generation

Script: tasks/communication_documentation/scripts/bedrock_sql_docs.py

Purpose: Generate Markdown explanation of SQL or ETL pipeline text for documentation (e.g. SQL_BREAKDOWN.md, runbooks).

Usage:

# From SQL file
python bedrock_sql_docs.py --sql-file ../../sql/balance_history_2024_q1.sql --out EXPLANATION.md

# Inline text
python bedrock_sql_docs.py --text "SELECT * FROM t" --out out.md

# ETL flow description
python bedrock_sql_docs.py --text "Bronze -> Silver validation..." --kind etl --out etl_explanation.md

Integration: Run in CI or manually when updating SQL or ETL docs; merge output into existing docs or publish to the docs site.
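Large SQL files can exceed the model's practical context budget, so a documentation script like this would plausibly trim input before prompting. The helper below is an assumption of this sketch, not a documented feature of bedrock_sql_docs.py:

```python
def truncate_sql(sql_text: str, max_chars: int = 12000) -> str:
    """Trim very long SQL before prompting, keeping the head and tail so the
    model still sees the final SELECT. Hypothetical helper -- the real script
    may send the full text."""
    if len(sql_text) <= max_chars:
        return sql_text
    half = max_chars // 2
    return sql_text[:half] + "\n-- ... truncated ...\n" + sql_text[-half:]
```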


Shared client

Module: tasks/data_ingestion_transformation/src/etl/bedrock_client.py

  • invoke_claude(prompt, max_tokens=512, model_id=..., region=...) — Single user prompt.
  • invoke_claude_with_system(system, user, ...) — System + user (for structured outputs).
  • safe_invoke(prompt, default="", ...) — Optional: invoke with fallback on error (used where GenAI is non-blocking).

Task 05 scripts add tasks/data_ingestion_transformation/src to sys.path to import this client so there is a single Bedrock implementation.


Terraform (IAM)

File: tasks/devops_cicd/infra/terraform/main.tf

  • Resource: aws_iam_role_policy.glue_bedrock attached to aws_iam_role.glue_service_role.
  • Actions: bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream.
  • Resource ARN: arn:aws:bedrock:*::foundation-model/anthropic.claude-*.

If you add a Lambda for post-ETL quarantine explanations or report narrative, attach the same policy (or a dedicated Bedrock policy) to the Lambda execution role.
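A dedicated Lambda policy could mirror the Glue one. In this sketch the role name `quarantine_lambda_role` is an assumption; the actions and resource ARN are the ones listed above:

```hcl
# Hypothetical policy for a Lambda execution role (role name is an assumption;
# actions and ARN mirror the existing Glue policy).
resource "aws_iam_role_policy" "lambda_bedrock" {
  name = "lambda-bedrock-invoke"
  role = aws_iam_role.quarantine_lambda_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"]
      Resource = "arn:aws:bedrock:*::foundation-model/anthropic.claude-*"
    }]
  })
}
```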


Summary

| Component | Path |
|---|---|
| Bedrock client | tasks/data_ingestion_transformation/src/etl/bedrock_client.py |
| Quarantine explanations | tasks/data_ingestion_transformation/src/etl/bedrock_quarantine.py |
| Report narrative script | tasks/communication_documentation/scripts/bedrock_report_narrative.py |
| Quality/catalog script | tasks/data_ingestion_transformation/scripts/bedrock_quality_descriptions.py |
| SQL/pipeline doc script | tasks/communication_documentation/scripts/bedrock_sql_docs.py |
| IAM (Glue + Bedrock) | tasks/devops_cicd/infra/terraform/main.tf |

All implementations degrade gracefully when Bedrock is unavailable (fallback text or clear error), so the pipeline remains operational without GenAI.

© 2026 Stephen Adei · CC BY 4.0