Version control for BigQuery

Version control is an essential tool for data teams when creating tables and data sets for consumption by other people and teams.

While a single data analyst might make do without one, as soon as you have a data team, version control is a key tool to enable collaboration and active participation by all team members without compromising data integrity.

With version control, you can ensure that any changes to your SQL queries are reviewed, approved and tracked. If bad data shows up in a report, you can go back and see which change caused it.

If you are using a tool such as dbt, version control is part of your workflow from day one. The same goes for those using Looker’s or Omni’s modelling layer.

But if you just want to add version control to an existing BigQuery project, this is the quickest way to do so.

1. Structure your GitHub repository

Create a repository in your company’s GitHub account. By using the same version control tooling as the engineering org, you create better visibility into the data team’s workflows and can re-use the engineering org’s best practices.

The contents of the sql folder should mirror your BigQuery datasets, making it easy to find the query behind any given table.

data-project/
├── .github/workflows
│   ├── bigquery.yml
├── sql/
│   ├── users/
│   └── reports/
└── README.md

2. Set up continuous deployment from GitHub to BigQuery

To make sure that BigQuery stays in sync with the latest version on GitHub, you can set up a continuous deployment pipeline using GitHub Actions.

First, you need to create a service account for GitHub with the following roles:

  • BigQuery Data Editor
  • BigQuery Job User
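Assuming the gcloud CLI is installed and authenticated, the service account and roles can be set up from the terminal; the project ID and account name below are placeholders:

```shell
# Create a service account for GitHub Actions (names are illustrative)
gcloud iam service-accounts create github-actions \
  --project=my-gcp-project \
  --display-name="GitHub Actions deploy"

# Grant the two BigQuery roles
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:github-actions@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:github-actions@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Download a JSON key to store as a GitHub secret
gcloud iam service-accounts keys create key.json \
  --iam-account=github-actions@my-gcp-project.iam.gserviceaccount.com
```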

In GitHub, create the following file in a folder called .github/workflows:

name: bigquery queries

on:
  push:
    branches:
      - master
  workflow_dispatch:

jobs:
  run-queries:
    runs-on: ubuntu-latest

    steps:
      - name: checkout code
        uses: actions/checkout@v2

      # Authenticate with Google Cloud using service account key
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      # Set up gcloud with the authenticated account
      - name: setup gcloud
        uses: google-github-actions/setup-gcloud@v1
        with:
          project_id: ${{ secrets.GCP_PROJECT }}

      # Run your BigQuery queries
      - name: run bigquery queries
        run: |
          find ./sql -name '*.sql' | sort | while read sql_file; do
            echo "Running $sql_file"
            bq query --use_legacy_sql=false < "$sql_file"
          done

Next, add the GCP project name and the service account key to your GitHub repository secrets as GCP_PROJECT and GCP_SA_KEY, matching the names referenced in the workflow.
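One way to add those secrets, assuming the GitHub CLI (`gh`) is installed and authenticated and the key was saved as key.json, is from the terminal:

```shell
# Store the service account key and project ID as repository secrets
gh secret set GCP_SA_KEY < key.json
gh secret set GCP_PROJECT --body "my-gcp-project"
```

You can also paste the same values by hand under Settings → Secrets and variables → Actions in the repository.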

3. Migrate your code to GitHub

You are now ready to move your code over to GitHub. Watch out for dependencies between your queries: the workflow above executes them in alphabetical order by file path.
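One simple way to control execution order is a numeric prefix on each file name, so that alphabetical order matches dependency order. The file names below are illustrative:

```shell
# Upstream models get lower prefixes so they run first
mkdir -p sql/users
touch sql/users/01_users_base.sql sql/users/02_users_enriched.sql

# Sorted alphabetically, 01_users_base.sql runs before 02_users_enriched.sql
find ./sql -name '*.sql' | sort
```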

Once you’re done, make sure you update your code in GitHub only. Any changes made directly in BigQuery will be overwritten the next time the workflow runs.
