Why data contracts are becoming the default for data quality

Bad data used to keep me up at night. I would wake up from dreams about logic problems I had been wrestling with, the same way some people suddenly sit up wondering if they left the stove on.

I’m a data engineer with over twenty years of experience. I build pipelines, BI dashboards and real-time monitoring systems. In the early 2000s, I worked for a company that developed specialized business intelligence software for utilities. Our team built outage management pipelines and dashboards to produce regulatory reports for SAIDI (System Average Interruption Duration Index), SAIFI (System Average Interruption Frequency Index) and CAIDI (Customer Average Interruption Duration Index). These metrics measure how often customers lose power, how long outages last and how quickly service is restored. In other words, they tell regulators whether people can trust the grid when it matters most. We also monitored live storms to see whether hospitals or homes relying on life-sustaining equipment were running or at risk. The urgency and criticality of this work taught me early on that data quality is non-negotiable, and in this case, performance was just as essential.

Over the years, I have come to see data engineering through the lens of a hierarchy of needs. For data teams to succeed, they must establish six layers of capability: quality, resilience, performance, cost, completeness and security. This post will explore what I consider to be the most important of these layers: quality.

Introducing the data contract

You can’t talk about data quality without a data contract. A data contract is a formal agreement between data producers and consumers that defines the structure, quality and semantics of data being shared. It establishes clear expectations, sets validation rules and helps ensure the data is reliable and accurate. At the end of the day, a data contract helps prevent errors and boosts stakeholder trust in the data.
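As a minimal sketch, a contract can be as simple as a shared dictionary of field rules plus a validation function that both producers and consumers agree on. The field names and rules below are hypothetical, not tied to any particular contract framework:

```python
# A minimal, hypothetical data contract: field names, types and bounds
# are illustrative only.
CONTRACT = {
    "user_id": {"type": str, "required": True},
    "amount": {"type": int, "required": True, "min": 0, "max": 10_000},
}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations for one record (empty list = valid)."""
    errors = []
    for field, rules in CONTRACT.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: above maximum {rules['max']}")
    return errors
```

In practice you would express this in a schema registry or a tool-specific format, but the idea is the same: the rules live in one agreed place, and validation is mechanical.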

The data contract as the foundation for "shift left"

The traditional approach to data quality often relies on downstream checks: monitoring pipelines and reacting to issues after the fact. This is a costly, reactive strategy.

The data contract makes shift-left possible. It moves data quality from being a simple, passive check at the end of a pipeline to a formal, active pre-condition enforced at the source of the data.

What shift left means for data engineering

The idea of shift left came from software testing and DevOps. It meant running tests earlier in the development lifecycle instead of waiting until the end, literally shifting validation to the left on the project timeline. By catching problems sooner, teams avoided expensive failures later.

In data engineering, we apply the same concept at two levels:

  1. During development, with CI/CD pipelines that run automated tests on pull requests.

  2. At runtime, with schema enforcement, declarative expectations and data quality frameworks that validate data as it flows.

The cost of waiting too long to shift left

Without shift left practices, problems show up late and cause expensive rework. Common examples include schema drift, floods of nulls in critical columns, corrupted files and broken dashboards or degraded ML models.

Example: A new column was added to a production table without proper enforcement. This broke the dashboards and caused engineering to spend multiple days backfilling and realigning joins. The cost was not just developer time. Executives lost trust in the reporting for months, and once you lose the trust of people using your data, it is extremely difficult to win it back.
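A pre-write schema check is the simplest guard against this failure mode. The sketch below compares an incoming column set against the contracted one in plain Python (column names are hypothetical); this is conceptually what schema enforcement in a table format does for you at write time:

```python
def check_schema(expected: set[str], incoming: set[str]) -> dict:
    """Compare incoming columns against the contracted schema.

    Returns the drift so the write can be rejected (or routed through a
    managed schema-evolution process) instead of silently breaking joins.
    """
    return {
        "added": sorted(incoming - expected),    # new, uncontracted columns
        "missing": sorted(expected - incoming),  # contracted columns that vanished
        "ok": incoming == expected,
    }

# Hypothetical scenario: a producer adds "discount_code" without warning.
contracted = {"user_id", "amount", "order_date"}
arriving = {"user_id", "amount", "order_date", "discount_code"}
drift = check_schema(contracted, arriving)
```

Caught here, the new column is a one-line conversation with the producer. Caught in a broken dashboard, it is days of backfills.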

Automated testing in CI/CD

Shift left begins in development. By integrating automated tests into CI/CD, you can block bad code and schema changes before they reach production.

Using custom SQL checks or off-the-shelf frameworks such as dbt tests or Great Expectations, validation suites can run every time a pull request is opened. If a schema change or transformation breaks a rule, the pipeline fails before the merge.

This brings the discipline software engineers have relied on for years into data engineering.

Example using two popular tools (dbt tests and GitHub Actions):

name: dbt-tests

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install dbt-core dbt-bigquery
      - name: Run dbt tests
        run: dbt test --profiles-dir . --target ci

Schema contracts with Delta Lake

The next layer of shift left happens at data ingestion. One of the most effective ways to enforce data contracts early is with Delta Lake's schema enforcement. In previous roles, I saw how much value customers gained from schema enforcement and schema evolution. Instead of silently accepting broken data, Delta Lake fails the write if the incoming schema does not match the table definition.

With Delta Lake, you can also control schema evolution so intentional changes are allowed through a managed process. This is the foundation of what many now call “data contracts.”

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

expected_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("amount", IntegerType(), False)
])

df = spark.read.schema(expected_schema).json("s3://raw/sales/")

# Writing to Delta with schema enforcement
df.write.format("delta").mode("append").option("mergeSchema", "false").save("/mnt/delta/sales")

Example: A customer let schema evolution run unchecked. Within six months, the same table had 40 schema versions. Half the downstream jobs had incompatible assumptions, and the engineering team had to stop feature development to invest in stabilizing the data model. Schema evolution absolutely has appropriate use cases, but in this case it was abused. Schema contracts would have prevented that chaos.

Lakeflow Declarative Pipelines

Databricks’ Lakeflow Declarative Pipelines (formerly Delta Live Tables) extend this even further. Instead of procedural jobs, you declare what your tables should look like and embed expectations directly into the pipeline.

You can configure what happens when a record fails: warn, drop or fail the pipeline. Because expectations are part of the pipeline itself, they become pipeline contracts.

CREATE OR REFRESH LIVE TABLE sales_clean (
  CONSTRAINT valid_user_id EXPECT (user_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_amount EXPECT (amount >= 0 AND amount <= 10000) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.sales_raw;

Note: syntax may vary slightly by version.

Lakeflow focuses on structural rules and contracts inside Databricks.

Quarantining suspect data

Catching bad data without a plan can lead to broken jobs or silent loss. A better pattern is to quarantine suspect records into a separate location for review. This protects downstream jobs while preserving raw inputs for investigation.

Here is a simple PySpark example that shows the pattern:

from pyspark.sql.functions import col

# Load raw data
df = spark.read.format("delta").load("/mnt/raw/sales")

# Rule: user_id must not be null
valid_records = df.filter(col("user_id").isNotNull())
invalid_records = df.filter(col("user_id").isNull())

# Write valid data to main table
valid_records.write.format("delta").mode("append").save("/mnt/delta/sales_clean")

# Write invalid data to quarantine table
invalid_records.write.format("delta").mode("append").save("/mnt/delta/sales_quarantine")

This example is simplified for illustration. In practice, quarantine tables should have alerting and triage workflows to avoid becoming junk drawers of bad data.
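One simple guard against the junk-drawer problem is alerting when the quarantined share of a batch crosses a threshold. A minimal pure-Python sketch of that rule (the threshold and counts are made up for illustration):

```python
def quarantine_alert(total: int, quarantined: int, threshold: float = 0.05) -> bool:
    """Return True when the quarantined fraction of a batch exceeds the threshold.

    In a real pipeline this would page the owning team or open a ticket;
    here it just signals that triage is needed.
    """
    if total == 0:
        return False
    return quarantined / total > threshold

# Hypothetical batch: 10,000 rows, 620 quarantined (6.2% > 5% threshold)
needs_triage = quarantine_alert(10_000, 620)
```

A handful of quarantined rows is normal noise; a sudden spike usually means an upstream contract was broken and someone needs to look today, not next quarter.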

Example end-to-end workflow

A modern shift left workflow might look like this:

  1. Development: CI/CD runs validation tests on pull requests

  2. Ingestion: Data lands in Delta Lake with schema enforcement (data contracts)

  3. Business rules: Data integrity framework validates additional checks

  4. Transformation: Lakeflow pipelines enforce embedded expectations

  5. Exception handling: Quarantine tables isolate invalid records

  6. Monitoring: Anomaly detection tracks drift and null rates

This layered approach ensures data quality is enforced at every stage, from code commit to production monitoring.
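For the monitoring stage, one lightweight check is tracking the null rate per column and flagging jumps against a historical baseline. A minimal sketch in plain Python (the baseline and tolerance values are illustrative):

```python
def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where the column is null or missing."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def drifted(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag when the current null rate exceeds the baseline by more than the tolerance."""
    return current - baseline > tolerance

# Hypothetical batch: user_id was near-complete historically (0.1% nulls),
# but 10% of today's rows arrive with it missing.
rows = [{"user_id": "u1"}] * 90 + [{"user_id": None}] * 10
rate = null_rate(rows, "user_id")
alert = drifted(rate, baseline=0.001)
```

Production systems would compute this per partition and per column at scale, but the logic is the same: compare today's profile to an agreed baseline and alert on the gap.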

Conclusion

My experience supporting utilities taught me that data quality is not optional. When outages threatened hospitals, there was no room for broken pipelines or silent data drift. If you do not shift left, you pay the price later in wasted developer hours, broken dashboards and business stakeholders who stop trusting the numbers. Once you lose the trust of people using your data, it is extremely difficult to win it back.

Shift left data quality is no longer a nice-to-have. It is table stakes for modern data platforms. Teams that ignore it will spend more time firefighting than building, while teams that embrace it build pipelines that are trusted, fast and resilient.

As AI becomes part of the modern data stack, it will not replace the fundamentals of shift left, but it will change how we enforce them. Large language models and anomaly detection systems are already helping teams auto-generate quality tests, identify drift and even suggest fixes when pipelines fail.


Pete Tamisin, Director of Solution Architecture at Capital One Software

Pete Tamisin is a data engineering and technical GTM leader with over 20 years of experience in delivering data products, optimizing cloud computing and technical consulting. As a Director of Solution Architecture at Capital One Software, he leads a field engineering team focused on helping customers maximize the value of Slingshot. Previously, he was one of the first Customer Success Engineers at Databricks, where he helped shape global success strategies. He has also held engineering roles at Motorola, Siemens and multiple startups, gaining deep technical expertise across industries. Passionate about driving real results, Pete partners with customers to bridge technology and business impact.