Leveraging Databricks Asset Bundles
How we used Databricks Asset Bundles with Slingshot to streamline pipelines.
Capital One Slingshot is a data cloud management solution built and used by Capital One. We recently expanded Slingshot’s core offering beyond Snowflake to include Databricks optimization, releasing numerous new cost optimization features for Databricks.
As we expanded Slingshot’s offering to Databricks, we quickly ran into challenges: the platform’s complexity made it hard to keep pipelines consistent, repeatable and maintainable. This was especially true because multiple engineering teams across the organization use Slingshot to deploy and manage pipelines on Databricks.
Our engineering teams also have certain requirements that influenced the solution we chose to implement. We needed:
- Repeatable pipelines across multiple environments (dev, QA, prod)
- The ability to integrate with enterprise CI/CD tools (e.g., Jenkins) with minimal effort
- Support for a growing number of jobs and scalability
- The ability to easily allow different teams to have independent jobs, while also sharing code as needed
- Support for native features like Serverless Jobs, DLT, dashboards, etc.
After some research we landed on two options: building our own Python-based deployment system, or using a Databricks-native offering called Asset Bundles.
Databricks Asset Bundles
Databricks Asset Bundles is a CI/CD tool that allows you to deploy various assets (Jobs, DLT tables, etc.) together to Databricks workspaces. The tool has a concept of a “bundle,” which is essentially a collection of assets that can be repeatedly deployed across workspaces. A bundle will contain the following:
- Config files: Configuration of a job, workspace, etc.
- Source files: Notebooks, Python wheel files, etc.
Databricks Asset Bundles meets our requirements: seamless CI/CD integration, version control for deployments, consistent portability across environments and simplified management of processes and pipelines.
However, we discovered a significant gap. A crucial aspect of our workflows hinges on effective catalog management, which in turn relies on a precise structure of schemas and tables being in place. This becomes especially problematic when deploying into new or isolated environments where the required catalog structure simply doesn't exist yet.
To solve this, we integrated a dedicated setup task into our primary job. This task automatically creates or updates the catalog and the required tables at the beginning of every run. This addition has dramatically enhanced the reliability of our deployments and effectively eliminated the need for tedious manual pre-configuration.
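As a rough sketch of how this can look in a bundle's job definition (the job and task names and the setup notebook path below are illustrative, not our actual ones), the setup task runs first and every downstream task depends on it; the setup notebook issues idempotent CREATE ... IF NOT EXISTS statements for the catalog, schemas and tables:

resources:
  jobs:
    primary_job:
      tasks:
        # Illustrative setup task: creates or updates the catalog, schemas and tables
        - task_key: catalog_setup
          notebook_task:
            notebook_path: ./setup_catalog_and_tables.py
        # Downstream work only starts once the catalog structure exists
        - task_key: process_data
          depends_on:
            - task_key: catalog_setup
          notebook_task:
            notebook_path: ./process_data.py

Because the setup statements are idempotent, the same job can be deployed into a brand-new workspace or re-run in an existing one without any manual pre-configuration.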
The setup
Given the benefits we identified with Asset Bundles and the additional task for catalog management, we decided to move forward with implementation.
Here’s a basic layout of the architecture implemented for Asset Bundles:
As you can see, we follow a standard flow to deploy with Asset Bundles. Developers create a PR, which is deployed via Jenkins. Jenkins then pulls the Asset Bundles image (a Python package) and deploys the files. We inject secrets, define the target environment and so on in the CI/CD pipeline. Deployment happens automatically on merge for all environments except production, which requires prior approval from our change management system.
For a deeper look, let’s walk through a code example of what we do to deploy Asset Bundles. One unique aspect is that we set up multiple assets to be deployed: for example, ‘Team A’ might have one asset and ‘Team B’ another, with the code structured so it can be shared between the two where necessary. And because we now have many asset files, we also changed how we structure files so they are easier to manage.
File structure
src/
  team_a_bundle/
    databricks.yml
  team_b_bundle/
    bundle_configs/
      job_1_config.yaml
      job_2_config.yaml
    databricks.yml
tests/
Jenkinsfile

As you can see, we have maintained a fairly clean file structure:
- Each “bundle” has its own folder that contains all the code for that specific bundle (a shared folder in a bundle is also an option, but can be more complex).
- Within each bundle we have a databricks.yml file. This is the primary code for a Databricks bundle and contains common variables that can be leveraged across jobs and other assets. This file can either contain the definition of an asset, or reference the asset files from another location (in the example above, under bundle_configs); a short sketch of that referencing pattern follows below.
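As a sketch of how that referencing can look (the file names, paths and job names here are illustrative), the databricks.yml in team_b_bundle can pull in the job definitions via an include list, while each file under bundle_configs defines a single job:

# team_b_bundle/databricks.yml
bundle:
  name: team_b_bundle

include:
  - bundle_configs/*.yaml

# bundle_configs/job_1_config.yaml
resources:
  jobs:
    job_1:
      name: "[${bundle.environment}] Team B Job 1"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/job_1_main.py

This keeps each team's job definitions in separate files while common settings stay in the team's databricks.yml.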
Let’s take this a step further and look at a fuller example of a databricks.yml file:
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: sample_dbx_bundle
variables:
  s3_bucket:
    description: The S3 bucket for this sample
    default: ""
workspace:
  host: https://sample.cloud.databricks.com
resources:
  jobs:
    sample_job:
      name: "[${bundle.environment}] Sample DBx Job"
      tasks:
        - task_key: serverless_task
          depends_on:
            - task_key: dlt_medium_pipeline
          notebook_task:
            notebook_path: ./fe_medium_report.py
        - task_key: standard_compute_task
          depends_on:
            - task_key: serverless_task
          notebook_task:
            notebook_path: ./fe_medium_report.py
          new_cluster:
            spark_version: 13.1.x-scala2.12
            num_workers: 1
            node_type_id: i3.xlarge
environments:
  development:
    default: true
  dev:
    workspace:
      host: https://dev-sample-workspace.cloud.databricks.com/
    variables:
      s3_bucket: "s3://sdev-bucket"
  qa:
    workspace:
      host: https://qa-sample-workspace.cloud.databricks.com/
    variables:
      s3_bucket: "s3://sdev-bucket"
  prod:
    workspace:
      host: https://prod-sample-workspace.cloud.databricks.com/
    variables:
      s3_bucket: "s3://sdev-bucket"

Bundle definition
The above databricks.yml file defines a Databricks Asset Bundle configuration. Let's break down its structure and components.
- `bundle`: The top-level key, indicating the start of the bundle definition.
  - `name`: Specifies the name of the Databricks bundle: `sample_dbx_bundle`.

Variables
- `variables`: Defines variables that can be used within the bundle configuration.
  - `s3_bucket`: The name of the variable being defined.
    - `description`: Describes what the variable represents.
    - `default`: Sets the default value for the variable to an empty string (`""`).

Workspace
- `workspace`: Configures the Databricks workspace.
  - `host`: Sets the default host URL for the Databricks workspace: `https://sample.cloud.databricks.com`.

Resources
- `resources`: Defines the resources that will be deployed as part of the bundle.
  - `jobs`: Specifies the Databricks jobs to be deployed.
    - `sample_job`: Defines a job named `sample_job`.
      - `name`: Sets the name of the job, including the environment: `"[${bundle.environment}] Sample DBx Job"`.
      - `tasks`: Defines the tasks within the job.
        - `task_key`: Defines a task named `serverless_task`.
          - `depends_on`: Specifies that this task depends on `dlt_medium_pipeline`.
          - `notebook_task`: Defines a notebook task.
            - `notebook_path`: Sets the path to the notebook: `./fe_medium_report.py`.
        - `task_key`: Defines a task named `standard_compute_task`.
          - `depends_on`: Specifies that this task depends on `serverless_task`.
          - `notebook_task`: Defines a notebook task.
            - `notebook_path`: Sets the path to the notebook: `./fe_medium_report.py`.
          - `new_cluster`: Defines a new cluster for the task.
            - `spark_version`: Sets the Spark version: `13.1.x-scala2.12`.
            - `num_workers`: Sets the number of workers: `1`.
            - `node_type_id`: Sets the node type ID: `i3.xlarge`.

Environments
- `environments`: Defines the different deployment environments, with `development` marked as the default.
  - `dev` / `qa` / `prod`: Each named environment overrides the defaults as needed.
    - `workspace`: Overrides the default workspace settings.
      - `host`: Sets the host URL for that environment's workspace.
    - `variables`: Overrides the default variables.
      - `s3_bucket`: Sets the environment-specific value for the variable.

This databricks.yml file configures a Databricks Asset Bundle named `sample_dbx_bundle` with a job, variables and specific settings for different environments (dev, QA and prod). It defines how the job will be executed, including tasks, dependencies and cluster configuration, while also allowing for environment-specific overrides for workspace details and variables like the S3 bucket.
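One detail the sample doesn't show is how a variable such as `s3_bucket` actually reaches a task. A minimal sketch of one way to wire it up, assuming the notebook reads a parameter named `s3_bucket` (that parameter name is our illustration, not part of the sample above), is to pass the variable through as a base parameter:

      tasks:
        - task_key: serverless_task
          notebook_task:
            notebook_path: ./fe_medium_report.py
            # Illustrative only: resolves to the per-environment override of s3_bucket
            base_parameters:
              s3_bucket: ${var.s3_bucket}

With this in place, deploying the same bundle to dev, QA or prod automatically injects the matching bucket without any per-environment code changes.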
Conclusion
Databricks Asset Bundles was a great choice to make our deployment pipeline simpler and more scalable across development teams. Its structured approach enables our engineering teams to easily standardize operations, reduce manual errors and achieve consistent deployments across diverse environments.
If your organization is looking to optimize your Databricks deployment strategy, we recommend Asset Bundles as a powerful enabler for robust and repeatable CI/CD.

