How to Use — QA Assumptions ============================ ``droughty qa`` runs an AI agent that validates your warehouse data against a set of rules you define. Those rules live in a YAML file called the **assumptions file**. This guide explains what goes in it, where to put it, and how to write good expectations. ---- Where the file goes -------------------- By default droughty looks for the assumptions file at the root of your git repository: .. code-block:: text my-analytics-repo/ ├── droughty_qa_assumptions.yaml ← default location ├── droughty_project.yaml ├── dbt_project.yml └── models/ You can override the path with the ``--assumptions-dir`` flag: .. code-block:: bash droughty qa --assumptions-dir ./qa/my_assumptions.yaml If the file does not exist at the expected path and no override is provided, the command exits without running any checks. ---- File structure -------------- The assumptions file is a YAML document with this structure: .. code-block:: yaml datasets: : tables: : columns: : - expectation: "natural language description of what to check" Each level maps directly to your warehouse: .. list-table:: :header-rows: 1 :widths: 25 75 * - Key - What it refers to * - ``datasets`` - Top-level key — always required * - ```` - BigQuery dataset name (or Snowflake schema). Must match the actual name in the warehouse. * - ``tables`` - Required under each dataset * - ```` - Table name within the dataset. Must match the actual table name. * - ``columns`` - Required under each table * - ```` - Column name to evaluate. Must exist in the table. * - ``expectation`` - A natural language description of what the data should look like. Passed to the LLM as-is. ---- Full example ------------- .. code-block:: yaml datasets: my_analytics_dataset: tables: fct_orders: columns: order_pk: - expectation: "Check if all order PKs are unique and not null" order_status: - expectation: "Check that order_status only contains the values: pending, processing, shipped, delivered, cancelled" order_value: - expectation: "Check that order_value is always greater than zero" dim_customers: columns: customer_email: - expectation: "Check if all emails are in valid email format. Flag any that look like test or dummy addresses." customer_pk: - expectation: "Check if all customer PKs are unique and not null" persona_type: - expectation: "Check that all customers have a persona type assigned and that the values align with ISO 8000 standards for data quality" my_staging_dataset: tables: stg_orders: columns: raw_order_id: - expectation: "Check for duplicate raw_order_ids which would indicate the source is sending duplicate records" ---- Writing good expectations -------------------------- The expectation string is passed verbatim to an LLM (GPT-4), which queries sample data from the warehouse and interprets the results against what you described. This means expectations can be written in plain English and can express complex rules — you are not limited to simple SQL checks. **Be specific about what a violation looks like:** .. code-block:: yaml # Vague — the LLM will make assumptions about what "correct" means - expectation: "Check the email column" # Better — tells the LLM exactly what to look for - expectation: "Check that all email addresses match the pattern user@domain.tld and flag any nulls, blank strings, or addresses without an @ sign" **Reference domain knowledge the LLM might not have:** .. code-block:: yaml - expectation: "Check that country_code only contains ISO 3166-1 alpha-2 codes (two-letter country codes like GB, US, DE). Flag anything that is longer than 2 characters or contains numbers." **Ask for PII detection:** .. code-block:: yaml - expectation: "Check if this column exposes PII. Flag any values that look like names, email addresses, phone numbers, or national ID numbers." **Check business rules:** .. code-block:: yaml - expectation: "Check that shipped_date is never earlier than order_date. Any row where shipped_date < order_date is a data error." **Check for expected value ranges:** .. code-block:: yaml - expectation: "Order values should be between £0.01 and £10,000. Flag any values outside this range as potential data entry errors or test records." ---- Running the QA agent --------------------- .. code-block:: bash # Using the default file location (git root) droughty qa # Using a custom path droughty qa --assumptions-dir ./qa/droughty_qa_assumptions.yaml # With CI/CD environment variables droughty qa --assumptions-dir ./droughty_qa_assumptions.yaml --env-vars enabled The agent: 1. Loads the assumptions file 2. For each column expectation, queries sample data from the warehouse 3. Passes the data and expectation to GPT-4 4. GPT-4 returns a structured verdict: ``met`` or ``violated``, with evidence 5. Prints a summary report to the terminal 6. Writes a Mermaid diagram of the agent DAG to ``langgraph/droughty_qa_agent_dag.mmd`` **Example output:** .. code-block:: text QA Results ---------- ✓ fct_orders.order_pk — uniqueness check: met All 50,000 sampled values are unique. ✗ fct_orders.order_status — allowed values check: violated Found unexpected values: ['refunded', 'hold']. These are not in the expected set [pending, processing, shipped, delivered, cancelled]. ✓ dim_customers.customer_email — email format check: met All sampled email addresses match the expected format. ✗ dim_customers.persona_type — completeness check: violated 1,243 rows (2.5% of sample) have a null persona_type. ---- Requirements ------------- The QA agent requires: - ``openai_secret`` set in ``~/.droughty/profile.yaml`` — used to call GPT-4 - BigQuery or Snowflake credentials configured in the same profile - (Optional) ``langsmith_secret`` and ``langsmith_project`` in the profile for LLM call tracing via LangSmith .. code-block:: yaml # ~/.droughty/profile.yaml my_profile: warehouse_name: big_query project_name: my-gcp-project schema_name: my_analytics_dataset openai_secret: sk-... langsmith_secret: ls-... # optional langsmith_project: my-project # optional ---- Multi-dataset checks --------------------- You can check tables across multiple datasets in a single file: .. code-block:: yaml datasets: analytics: tables: fct_orders: columns: order_pk: - expectation: "Check for uniqueness and nulls" staging: tables: stg_raw_events: columns: event_id: - expectation: "Check for duplicate event IDs which indicate deduplication has not run" session_id: - expectation: "Check that session_id is not null — a null here means the session attribution failed" ---- When to run QA --------------- .. list-table:: :header-rows: 1 :widths: 50 50 * - Situation - Why * - Before a stakeholder demo or data release - Catch data quality issues before they're visible to end users * - After a significant dbt model change - Validate that business rules still hold after the refactor * - On a schedule (e.g. daily in CI/CD) - Proactive monitoring for data drift or upstream source issues * - When onboarding a new data source - Validate the source data quality before building models on top of it