How to Use — dbt & BigQuery =========================== This guide walks through using droughty end-to-end with a dbt project on Google BigQuery. It covers everything from initial setup through to an ongoing development workflow. .. note:: This guide is specific to **dbt + BigQuery**. For Snowflake usage see :doc:`warehouse_support`, and for the general command reference see :doc:`commands`. ---- Prerequisites ------------- Before you start, you need: - A Google Cloud project with BigQuery enabled - ``gcloud`` CLI installed — `install guide `_ - dbt installed (``pip install dbt-bigquery``) - droughty installed (``pip install droughty``) - A dbt project with at least a staging schema deployed to BigQuery Your BigQuery project should have at least two datasets. Three is the full setup droughty is built for: .. code-block:: text myproject_staging ← raw source data, lightly transformed myproject_integration ← cleaned and joined data myproject_analytics ← final analytics layer (what Looker/Cube reads) ---- Step 1 — Authenticate with BigQuery ------------------------------------- The simplest local setup uses Application Default Credentials: .. code-block:: bash gcloud auth application-default login Follow the browser prompts. Once complete, droughty will use these credentials automatically — no key file needed. For CI/CD or shared environments, use a service account instead. See :doc:`authentication` for full details. ---- Step 2 — Set up your dbt project for droughty ------------------------------------------------ Droughty reads ``INFORMATION_SCHEMA`` metadata, including column descriptions. To make sure your dbt column descriptions flow through to the warehouse (and back into droughty), add ``persist_docs`` to your ``dbt_project.yml``: .. code-block:: yaml # dbt_project.yml models: +persist_docs: relation: true columns: true my_project: staging: materialized: view schema: staging integration: materialized: view schema: integration warehouse: materialized: table With ``persist_docs`` enabled, any ``description`` you write in your dbt YAML files gets written into BigQuery's table and column metadata. Droughty picks this up and includes it in generated LookML views, Cube cubes, and docs — so your descriptions only need to live in one place. .. tip:: Run ``dbt run`` after enabling ``persist_docs`` to push your existing descriptions into BigQuery before running droughty. ---- Step 3 — Create your droughty profile ---------------------------------------- Create the file ``~/.droughty/profile.yaml``: .. code-block:: bash mkdir -p ~/.droughty touch ~/.droughty/profile.yaml Add your project configuration: .. code-block:: yaml my_profile: warehouse_name: big_query project_name: my-gcp-project # your GCP project ID schema_name: myproject_analytics # your analytics / warehouse dataset openai_secret: sk-... # only needed for `droughty docs` and `droughty qa` test_schemas: # datasets droughty dbt will scan - myproject_staging - myproject_integration - myproject_analytics dbml_schemas: # datasets droughty dbml will scan - myproject_staging - myproject_integration - myproject_analytics .. tip:: ``test_schemas`` controls which BigQuery datasets get scanned for dbt test generation. Include every dataset you want test coverage on — typically all three layers. ---- Step 4 — Create your project file ------------------------------------ Create ``droughty_project.yaml`` at the **root of your dbt git repository**: .. code-block:: yaml profile: my_profile # dbt test output dbt_path: models/ dbt_tests_filename: schema # LookML output lookml_path: lookml/ lookml_base_filename: base lookml_explore_filename: explores lookml_measures_filename: measures lookml_base_path: views/ # DBML / ERD output dbml_path: db_docs/ dbml_filenames: - analytics_erd # Cube output (if using Cube) cube_path: schema/ cube_base_filename: base cube_integration_filename: integrations cube_measures_filename: measures # Staging model output stage_path: models/staging/ # AI field descriptions openai_field_descriptions_path: warehouse_docs openai_field_descriptions_filename: openai_field_descriptions ---- Step 5 — Generate your staging layer --------------------------------------- When you add a new raw data source to BigQuery, use ``droughty stage`` to generate the dbt staging boilerplate automatically instead of writing it by hand. .. code-block:: bash droughty stage -p my-gcp-project -d raw_source_dataset This reads every table in ``raw_source_dataset`` and generates: .. code-block:: text models/staging/sources.yml ← BigQuery source definitions models/staging/stg_orders.sql ← one staging model per table models/staging/stg_customers.sql models/staging/stg_products.sql ... The staging SQL models select all columns from the source and cast types where appropriate. From here, you refine the models — renaming columns, adding business logic, defining surrogate keys — rather than starting from a blank file. To target a single table instead of the whole dataset: .. code-block:: bash droughty stage -p my-gcp-project -d raw_source_dataset -t orders ---- Step 6 — Generate dbt tests ------------------------------ After running your dbt models (``dbt run``), generate schema tests for all your schemas: .. code-block:: bash droughty dbt This scans every table in your ``test_schemas`` and writes ``models/schema.yml`` with: - A ``unique`` test on every column ending in ``_pk`` - A ``not_null`` test on every column - An ``expression_is_true: valid_from <= valid_to`` test where those columns exist **Example output** (excerpt from ``models/schema.yml``): .. code-block:: yaml version: 2 models: - name: dim_customers columns: - name: customer_pk tests: - unique - not_null - name: customer_name tests: - not_null - name: email tests: - not_null - name: fct_orders columns: - name: order_pk tests: - unique - not_null - name: customer_fk tests: - not_null Run the tests with: .. code-block:: bash dbt test **Customising what gets tested** To override the tests on a specific column: .. code-block:: yaml # droughty_project.yaml test_overwrite: models: fct_orders: order_pk: - not_null - unique - dbt_utils.at_least_one To skip test generation for a table entirely: .. code-block:: yaml test_ignore: models: - raw_events_log - tmp_dedupe_working ---- Step 7 — Generate LookML -------------------------- Once your analytics layer is deployed in BigQuery, generate the Looker base layer: .. code-block:: bash droughty lookml This reads your ``schema_name`` dataset (``myproject_analytics``) from BigQuery and generates: .. code-block:: text lookml/views/base.lkml ← one LookML view per table lookml/explores.lkml ← explore definitions lookml/measures.lkml ← measure definitions **What the generated views look like** (excerpt): .. code-block:: text view: dim_customers { sql_table_name: `my-gcp-project.myproject_analytics.dim_customers` ;; dimension: customer_pk { type: number sql: ${TABLE}.customer_pk ;; hidden: yes } dimension: customer_name { type: string sql: ${TABLE}.customer_name ;; label: "Customer Name" } dimension_group: created { type: time timeframes: [raw, date, week, month, quarter, year] sql: ${TABLE}.created_at ;; } } PKs and FKs are hidden automatically. Timestamps get ``dimension_group`` with standard timeframes. If your dbt column descriptions flowed through via ``persist_docs``, they appear as ``description:`` fields in the view. **The right way to extend the base layer** Don't edit the generated files directly — droughty will overwrite them next time you run. Instead, use Looker `refinements `_ in a separate file: .. code-block:: text lookml/ ├── views/ │ └── base.lkml ← generated — do not edit ├── explores.lkml ← generated — do not edit ├── measures.lkml ← generated — do not edit └── refinements/ └── dim_customers.lkml ← your customisations go here ---- Step 8 — Generate your ERD ----------------------------- Before raising a pull request, regenerate the ERD: .. code-block:: bash droughty dbml This generates ``db_docs/analytics_erd.dbml`` containing all tables across your ``dbml_schemas``, with primary key and foreign key relationships inferred from ``_pk`` / ``_fk`` column suffixes. Upload it to `dbdocs.io `_ for a browsable ERD: .. code-block:: bash npx dbdocs build db_docs/analytics_erd.dbml Or paste it directly into `dbdiagram.io `_ for a quick visual. ---- Step 9 — Generate AI field descriptions (optional) ---------------------------------------------------- If your warehouse has undocumented fields, droughty can use OpenAI to infer a first-pass description for every column that doesn't already have one: .. code-block:: bash droughty docs This writes ``warehouse_docs/openai_field_descriptions.md``. Review the output, then copy approved descriptions back into your dbt column YAML. With ``persist_docs`` enabled, they'll flow back into BigQuery ``INFORMATION_SCHEMA`` and be picked up by droughty on the next run. Requires ``openai_secret`` in your ``profile.yaml``. .. warning:: Always review AI-generated descriptions before adding them to production documentation. The model makes reasonable guesses but can be wrong about domain-specific terminology. ---- Step 10 — Run QA validation (optional) ----------------------------------------- Before a data release or stakeholder demo, run the QA agent to validate your warehouse data against a set of assumptions: .. code-block:: bash droughty qa --assumptions-dir ./droughty_qa_assumptions.yaml Requires ``openai_secret`` in your ``profile.yaml``. Optionally add ``langsmith_secret`` and ``langsmith_project`` to trace all LLM calls through LangSmith. ---- Putting it all together — Recommended workflow ------------------------------------------------ Here is how droughty fits into a day-to-day dbt + BigQuery development cycle: **When onboarding a new source:** .. code-block:: bash # 1. Generate staging boilerplate from the raw BigQuery dataset droughty stage -p my-gcp-project -d new_source_dataset # 2. Refine the generated staging models in your editor # 3. Run dbt to deploy the new models dbt run --select staging.new_source+ **When building or modifying models:** .. code-block:: bash # 1. Build your models dbt run # 2. Regenerate schema tests to cover new tables/columns droughty dbt # 3. Run tests dbt test **Before committing:** .. code-block:: bash # Regenerate the Looker base layer so it reflects current warehouse droughty lookml # Regenerate the ERD for the PR description droughty dbml **Before a release or QA checkpoint:** .. code-block:: bash droughty qa --assumptions-dir ./droughty_qa_assumptions.yaml **As a pre-commit hook** (keep Looker permanently in sync): .. code-block:: bash # .git/hooks/pre-commit #!/bin/bash set -e droughty lookml git add lookml/ ---- CI/CD with BigQuery -------------------- In a CI/CD pipeline, use environment variables instead of local files: .. code-block:: yaml # .github/workflows/droughty.yml (example) name: Droughty on: push: branches: [main] jobs: generate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: python-version: '3.11' - name: Install droughty run: pip install droughty - name: Generate dbt tests env: PROFILE_NAME: my_profile WAREHOUSE_NAME: big_query PROJECT_NAME: ${{ secrets.GCP_PROJECT_ID }} SCHEMA_NAME: myproject_analytics KEY_FILE: /tmp/key.json run: | echo '${{ secrets.GCP_SA_KEY }}' > /tmp/key.json droughty dbt --env-vars enabled droughty lookml --env-vars enabled - name: Commit generated files run: | git config user.name "droughty-bot" git config user.email "bot@example.com" git add models/schema.yml lookml/ git diff --staged --quiet || git commit -m "chore: regenerate droughty outputs" git push ---- Naming conventions reference ------------------------------ Droughty's BigQuery integration depends on these column naming conventions: .. list-table:: :header-rows: 1 :widths: 25 75 * - Convention - Effect * - Column ends in ``_pk`` - Treated as primary key — hidden in LookML, tested with ``unique`` + ``not_null`` in dbt, adds ``Ref`` source in DBML * - Column ends in ``_fk`` - Treated as foreign key — hidden in LookML, adds ``Ref`` target in DBML * - Column named ``valid_from`` + ``valid_to`` - Adds an ``expression_is_true: valid_from <= valid_to`` test in dbt * - Column type ``TIMESTAMP`` / ``DATETIME`` - Generates a ``dimension_group`` with time, date, week, month, quarter, year timeframes in LookML * - Column description in BigQuery (via ``persist_docs``) - Included as ``description:`` in LookML views and Cube cubes