How to Use — QA Assumptions

droughty qa runs an AI agent that validates your warehouse data against a set of rules you define. Those rules live in a YAML file called the assumptions file. This guide explains what goes in it, where to put it, and how to write good expectations.


Where the file goes

By default droughty looks for the assumptions file at the root of your git repository:

my-analytics-repo/
├── droughty_qa_assumptions.yaml    ← default location
├── droughty_project.yaml
├── dbt_project.yml
└── models/

You can override the path with the --assumptions-dir flag:

droughty qa --assumptions-dir ./qa/my_assumptions.yaml

If the file does not exist at the expected path and no override is provided, the command exits without running any checks.


File structure

The assumptions file is a YAML document with this structure:

datasets:
  <dataset_name>:
    tables:
      <table_name>:
        columns:
          <column_name>:
            - expectation: "natural language description of what to check"

Each level maps directly to your warehouse:

Key

What it refers to

datasets

Top-level key — always required

<dataset_name>

BigQuery dataset name (or Snowflake schema). Must match the actual name in the warehouse.

tables

Required under each dataset

<table_name>

Table name within the dataset. Must match the actual table name.

columns

Required under each table

<column_name>

Column name to evaluate. Must exist in the table.

expectation

A natural language description of what the data should look like. Passed to the LLM as-is.


Full example

datasets:
  my_analytics_dataset:
    tables:
      fct_orders:
        columns:
          order_pk:
            - expectation: "Check if all order PKs are unique and not null"
          order_status:
            - expectation: "Check that order_status only contains the values: pending, processing, shipped, delivered, cancelled"
          order_value:
            - expectation: "Check that order_value is always greater than zero"

      dim_customers:
        columns:
          customer_email:
            - expectation: "Check if all emails are in valid email format. Flag any that look like test or dummy addresses."
          customer_pk:
            - expectation: "Check if all customer PKs are unique and not null"
          persona_type:
            - expectation: "Check that all customers have a persona type assigned and that the values align with ISO 8000 standards for data quality"

  my_staging_dataset:
    tables:
      stg_orders:
        columns:
          raw_order_id:
            - expectation: "Check for duplicate raw_order_ids which would indicate the source is sending duplicate records"

Writing good expectations

The expectation string is passed verbatim to an LLM (GPT-4), which queries sample data from the warehouse and interprets the results against what you described. This means expectations can be written in plain English and can express complex rules — you are not limited to simple SQL checks.

Be specific about what a violation looks like:

# Vague — the LLM will make assumptions about what "correct" means
- expectation: "Check the email column"

# Better — tells the LLM exactly what to look for
- expectation: "Check that all email addresses match the pattern user@domain.tld and flag any nulls, blank strings, or addresses without an @ sign"

Reference domain knowledge the LLM might not have:

- expectation: "Check that country_code only contains ISO 3166-1 alpha-2 codes (two-letter country codes like GB, US, DE). Flag anything that is longer than 2 characters or contains numbers."

Ask for PII detection:

- expectation: "Check if this column exposes PII. Flag any values that look like names, email addresses, phone numbers, or national ID numbers."

Check business rules:

- expectation: "Check that shipped_date is never earlier than order_date. Any row where shipped_date < order_date is a data error."

Check for expected value ranges:

- expectation: "Order values should be between £0.01 and £10,000. Flag any values outside this range as potential data entry errors or test records."

Running the QA agent

# Using the default file location (git root)
droughty qa

# Using a custom path
droughty qa --assumptions-dir ./qa/droughty_qa_assumptions.yaml

# With CI/CD environment variables
droughty qa --assumptions-dir ./droughty_qa_assumptions.yaml --env-vars enabled

The agent:

  1. Loads the assumptions file

  2. For each column expectation, queries sample data from the warehouse

  3. Passes the data and expectation to GPT-4

  4. GPT-4 returns a structured verdict: met or violated, with evidence

  5. Prints a summary report to the terminal

  6. Writes a Mermaid diagram of the agent DAG to langgraph/droughty_qa_agent_dag.mmd

Example output:

QA Results
----------
✓  fct_orders.order_pk — uniqueness check: met
   All 50,000 sampled values are unique.

✗  fct_orders.order_status — allowed values check: violated
   Found unexpected values: ['refunded', 'hold']. These are not in the
   expected set [pending, processing, shipped, delivered, cancelled].

✓  dim_customers.customer_email — email format check: met
   All sampled email addresses match the expected format.

✗  dim_customers.persona_type — completeness check: violated
   1,243 rows (2.5% of sample) have a null persona_type.

Requirements

The QA agent requires:

  • openai_secret set in ~/.droughty/profile.yaml — used to call GPT-4

  • BigQuery or Snowflake credentials configured in the same profile

  • (Optional) langsmith_secret and langsmith_project in the profile for LLM call tracing via LangSmith

# ~/.droughty/profile.yaml
my_profile:
  warehouse_name: big_query
  project_name: my-gcp-project
  schema_name: my_analytics_dataset

  openai_secret: sk-...
  langsmith_secret: ls-...         # optional
  langsmith_project: my-project    # optional

Multi-dataset checks

You can check tables across multiple datasets in a single file:

datasets:
  analytics:
    tables:
      fct_orders:
        columns:
          order_pk:
            - expectation: "Check for uniqueness and nulls"

  staging:
    tables:
      stg_raw_events:
        columns:
          event_id:
            - expectation: "Check for duplicate event IDs which indicate deduplication has not run"
          session_id:
            - expectation: "Check that session_id is not null  a null here means the session attribution failed"

When to run QA

Situation

Why

Before a stakeholder demo or data release

Catch data quality issues before they’re visible to end users

After a significant dbt model change

Validate that business rules still hold after the refactor

On a schedule (e.g. daily in CI/CD)

Proactive monitoring for data drift or upstream source issues

When onboarding a new data source

Validate the source data quality before building models on top of it