An Introduction to Testing with Data Build Tool

Manually testing and using different platforms for data integrity with DBT

By Chanakya Kaspa and Deborah Paul

Data analysts and engineers commonly separate development work and testing, performing each task with separate frameworks. Our team is no exception. Separating our testing from the platform used to execute jobs has allowed us greater choice in the frameworks we can use for data quality checks. However, this decoupling comes with some downsides.

  • First, without integrated tests, it is up to the individual developer or team to create a policy that enforces the creation of tests along with new objects or changes to existing objects. This type of policy can be harder to enforce.
  • Secondly, when making changes to tests and data atomically, with separate platforms, there is a possibility that a logical change is made to a dataset before the tests for it are updated to match. Doing so can cause the test to falsely report an error.

Our team has been exploring using Data Build Tool (DBT) as a solution to these problems and has shared on this extensively with our peers in the Capital One Python Community of Practice. In this post, we will cover many of our learnings, focusing on how DBT can improve on decoupled development and testing platforms, as well as some basics of how DBT is structured. This includes:

  • What is Data Build Tool and how can it improve the current process
  • Components of Data Build Tool (models, macros, types of tests, packages)
  • Reusable code in Data Build Tool
  • Automated unit testing in Data Build Tool
  • How is this approach better for developers and end users?

What is Data Build Tool and how can it improve the current process?

Data Build Tool (DBT) is an open source framework that integrates data quality tests into the development environment. Unlike the current process where test and development environments are separate, DBT allows users to update tests atomically, along with code/logic changes.

Components of Data Build Tool

There are four noteworthy components in DBT that we’ll cover here - models, macros, types of tests, and packages.

What are models in DBT?

Models are the central component of DBT. Each model is defined as a SELECT statement, and DBT takes care of materializing it in the database with whichever materialization type you specify (i.e. table, view, incremental, ephemeral). An example model is included below.



What are macros in DBT?

Macros are code templates in Jinja that can be used to create reusable components. These reusable pieces can be called on in the code for models and tests to avoid duplication and increase modularity. In addition to creating reusable components, there are many default macros that provide key functionality to DBT, such as ref(), which is used to define dependencies between models.

The following two snippets from the DBT documentation show how the ref() macro is used to build hierarchical models that build on each other.

code snippet illustrating using the reference macro from dbt to build a hierarchical relationship between models

Included below is an example of a model using Jinja templating and the code it compiles to.

    {% set drink_names = ["cappuccino", "latte", "mocha"] %}

    {% for drink in drink_names %}
    sum(case when drink_name = '{{drink}}' then price end) as {{drink}}_sales_amount,
    {% endfor %}
    sum(price) as total_sales
from coffee_shop.sales
group by 1

The code above compiles to the following.

    sum(case when drink_name = 'cappuccino' then price end) as cappuccino_sales_amount,
    sum(case when drink_name = 'latte' then price end) as latte_sales_amount,
    sum(case when drink_name = 'mocha' then price end) as mocha_sales_amount,
sum(price) as total_sales
from coffee_shop.sales
group by 1

What are the types of tests in DBT?

Tests come in two main types in DBT: schema and data.

  • Schema tests - Are simpler and only need the information of a single record to pass or fail. For example, enforcing a column to be not NULL could be a schema test.
  • Data tests - Can be more complex and are simply defined with a SELECT statement. By default, if the statement returns any records from the database, that test is considered as failing. Writing tests in this way allows for essentially any test to be performed, whether that is a test on the model that requires information from all records, or a more complicated test that involves more than one model.

What are packages in DBT?

Packages in DBT are essentially standalone DBT projects. Since packages take this form, they can contain any of the fundamental DBT components. For example, a package full of macros defined for specific utilities could be used across different models so that macro doesn’t need to be defined multiple times, once in each project. In addition to user-defined packages, there are also official packages which contain specific functionality or integrations.

Reusable code in Data Build Tool

Along with integrating testing and development into the same environment, DBT provides another major advantage: modularization. With DBT, your team can repeat patterns in code for a model and its tests and be able to reuse those patterns across all their models and tests.

Using Jinja to create macros for blocks of code

As mentioned above, macros use Jinja templating to define an amount of SQL code based on some optional input. Repeated patterns in code for models or tests can be expressed using a macro and called upon again  within models. Instead of needing to change multiple models due to a logic change, a central macro can be changed instead.

Leveraging DBT macros and packages to create a transformation library

Since packages are just DBT projects, a DBT project that contains lots of useful macros can be imported and used in other projects. For example, a basic SUM() window function to compute an MTD value based on some partitions will always look the same, with the only differences being the column itself, and columns being partitioned by and ordered on. A macro could be defined that represents this specific transformation, and used in models by simply calling them with the necessary columns. By using a package to hold these macros, there can be a single, version controlled source that contains the logic used for specific transformations.

Automated unit testing in Data Build Tool

The component layout and modularity of DBT make it the perfect tool to automate model testing and address the pain points highlighted in our current manual testing process.

How to use DBT tests to automate model testing

As mentioned above, DBT tests can take two different forms - schema tests and data tests. Both forms can be combined together to run data quality tests on models. Since the tests are defined as macros or SQL statements, the developer can create them without the need for a separate framework, and use them throughout the process of updating existing models.

How is this approach different from using other testing platforms?

Two reasons - tests are updated atomically with changes to models and testing can be more granular than an outside approach.

  • Tests are updated automatically with changes to models. One main advantage of updating tests atomically with changes to models is that it avoids the scenario where data quality checks are updated after models, leading to a period of time where they will erroneously report a failure. In a decentralized model, it is up to the developer to time the updates to the separate model and test frameworks to minimize this time. With DBT, a single commit that changes a model should also change a test when necessary. This way, the test logic will never be out of sync with the logic for a model.
  • Testing can be more granular than an outside approach. Another advantage of DBT tests is that they can apply to any model. Typically, testing frameworks will pull data from database objects. What if a developer is working on a single database object that contains multiple transformations, like an insert statement with multiple CTEs? In DBT, each CTE can be expressed as its own model via the “ephemeral” materialization, then  tests can be written as pulling data from those models. This allows for testing throughout the entire transformation, or, in this case, the entire insert statement, without actually changing how the models are run on the database once compiled.

How is this approach better for developers and end users?

Using a centralized framework like DBT for data tests and transformations can allow teams to standardized testing across developers, and preserve and edit test cases parallel to code changes. As models/tables are updated, relevant tests that validate incremental changes are added to the model. These tests persist in the model even when other members of the team perform more code changes, ensuring no one breaks previous development efforts. For more information on DBT, and to determine if it’s a good fit for your use case, check out their GitHub.

Capital One Tech

Stories and ideas on development from the people who build it at Capital One.

Related Content