How to Use — QA Assumptions¶
droughty qa runs an AI agent that validates your warehouse data against a set of rules you define. Those rules live in a YAML file called the assumptions file. This guide explains what goes in it, where to put it, and how to write good expectations.
Where the file goes¶
By default droughty looks for the assumptions file at the root of your git repository:
my-analytics-repo/
├── droughty_qa_assumptions.yaml ← default location
├── droughty_project.yaml
├── dbt_project.yml
└── models/
You can override the path with the --assumptions-dir flag:
droughty qa --assumptions-dir ./qa/my_assumptions.yaml
If the file does not exist at the expected path and no override is provided, the command exits without running any checks.
File structure¶
The assumptions file is a YAML document with this structure:
datasets:
<dataset_name>:
tables:
<table_name>:
columns:
<column_name>:
- expectation: "natural language description of what to check"
Each level maps directly to your warehouse:
Key |
What it refers to |
|---|---|
|
Top-level key — always required |
|
BigQuery dataset name (or Snowflake schema). Must match the actual name in the warehouse. |
|
Required under each dataset |
|
Table name within the dataset. Must match the actual table name. |
|
Required under each table |
|
Column name to evaluate. Must exist in the table. |
|
A natural language description of what the data should look like. Passed to the LLM as-is. |
Full example¶
datasets:
my_analytics_dataset:
tables:
fct_orders:
columns:
order_pk:
- expectation: "Check if all order PKs are unique and not null"
order_status:
- expectation: "Check that order_status only contains the values: pending, processing, shipped, delivered, cancelled"
order_value:
- expectation: "Check that order_value is always greater than zero"
dim_customers:
columns:
customer_email:
- expectation: "Check if all emails are in valid email format. Flag any that look like test or dummy addresses."
customer_pk:
- expectation: "Check if all customer PKs are unique and not null"
persona_type:
- expectation: "Check that all customers have a persona type assigned and that the values align with ISO 8000 standards for data quality"
my_staging_dataset:
tables:
stg_orders:
columns:
raw_order_id:
- expectation: "Check for duplicate raw_order_ids which would indicate the source is sending duplicate records"
Writing good expectations¶
The expectation string is passed verbatim to an LLM (GPT-4), which queries sample data from the warehouse and interprets the results against what you described. This means expectations can be written in plain English and can express complex rules — you are not limited to simple SQL checks.
Be specific about what a violation looks like:
# Vague — the LLM will make assumptions about what "correct" means
- expectation: "Check the email column"
# Better — tells the LLM exactly what to look for
- expectation: "Check that all email addresses match the pattern user@domain.tld and flag any nulls, blank strings, or addresses without an @ sign"
Reference domain knowledge the LLM might not have:
- expectation: "Check that country_code only contains ISO 3166-1 alpha-2 codes (two-letter country codes like GB, US, DE). Flag anything that is longer than 2 characters or contains numbers."
Ask for PII detection:
- expectation: "Check if this column exposes PII. Flag any values that look like names, email addresses, phone numbers, or national ID numbers."
Check business rules:
- expectation: "Check that shipped_date is never earlier than order_date. Any row where shipped_date < order_date is a data error."
Check for expected value ranges:
- expectation: "Order values should be between £0.01 and £10,000. Flag any values outside this range as potential data entry errors or test records."
Running the QA agent¶
# Using the default file location (git root)
droughty qa
# Using a custom path
droughty qa --assumptions-dir ./qa/droughty_qa_assumptions.yaml
# With CI/CD environment variables
droughty qa --assumptions-dir ./droughty_qa_assumptions.yaml --env-vars enabled
The agent:
Loads the assumptions file
For each column expectation, queries sample data from the warehouse
Passes the data and expectation to GPT-4
GPT-4 returns a structured verdict:
metorviolated, with evidencePrints a summary report to the terminal
Writes a Mermaid diagram of the agent DAG to
langgraph/droughty_qa_agent_dag.mmd
Example output:
QA Results
----------
✓ fct_orders.order_pk — uniqueness check: met
All 50,000 sampled values are unique.
✗ fct_orders.order_status — allowed values check: violated
Found unexpected values: ['refunded', 'hold']. These are not in the
expected set [pending, processing, shipped, delivered, cancelled].
✓ dim_customers.customer_email — email format check: met
All sampled email addresses match the expected format.
✗ dim_customers.persona_type — completeness check: violated
1,243 rows (2.5% of sample) have a null persona_type.
Requirements¶
The QA agent requires:
openai_secretset in~/.droughty/profile.yaml— used to call GPT-4BigQuery or Snowflake credentials configured in the same profile
(Optional)
langsmith_secretandlangsmith_projectin the profile for LLM call tracing via LangSmith
# ~/.droughty/profile.yaml
my_profile:
warehouse_name: big_query
project_name: my-gcp-project
schema_name: my_analytics_dataset
openai_secret: sk-...
langsmith_secret: ls-... # optional
langsmith_project: my-project # optional
Multi-dataset checks¶
You can check tables across multiple datasets in a single file:
datasets:
analytics:
tables:
fct_orders:
columns:
order_pk:
- expectation: "Check for uniqueness and nulls"
staging:
tables:
stg_raw_events:
columns:
event_id:
- expectation: "Check for duplicate event IDs which indicate deduplication has not run"
session_id:
- expectation: "Check that session_id is not null — a null here means the session attribution failed"
When to run QA¶
Situation |
Why |
|---|---|
Before a stakeholder demo or data release |
Catch data quality issues before they’re visible to end users |
After a significant dbt model change |
Validate that business rules still hold after the refactor |
On a schedule (e.g. daily in CI/CD) |
Proactive monitoring for data drift or upstream source issues |
When onboarding a new data source |
Validate the source data quality before building models on top of it |