Building a GenAI cost supervisor agent in Databricks

How we turned Databricks System Tables into a knowledge base for an AI agent that answers any GenAI cost questions on demand.

The problem with dashboards

We built a GenAI cost dashboard. Six KPI counters, twelve charts, four detail tables. It tracked spend by service, user, model and use case. It measured governance gaps. It computed the cost per request. The first feedback we got was, “interesting, but hard to see the value when it’s so vague.”

The critique was fair, and it points to a fundamental limitation: a dashboard gives you one fixed view. What happens when you want answers to pointed questions that go beyond the dashboard's prebuilt capabilities, such as:

  • “How much did my team spend on GPT-5.2 last week?”
  • “Which endpoint should I tag first to fix our chargeback gap?” 
  • “What would we save if we moved simple queries from GPT-5.2 to Sonnet?” 
  • “Are errors getting worse? What’s that costing us?”  

No dashboard is going to be able to anticipate every question, but an AI agent can answer any of them, if it has the right knowledge layer to refer to. That’s when it hit us. All the SQL queries we wrote for our dashboard are actually much more than chart definitions—they’re analytical patterns. If we registered them as Unity Catalog functions and gave an agent access to them, then every user could have access to a personal GenAI cost analyst.

Setup: From zero to agent in three steps

No code is required beyond the initial function registration notebook, which you can find here. The entire agent runs in AI Playground, Databricks' built-in interface for prototyping tool-calling agents.

Prerequisites

| Requirement | What it unlocks |
| --- | --- |
| Unity Catalog enabled | Function registration and System Tables access |
| system.billing.usage + system.billing.list_prices | All 6 cost functions + tag_coverage, untagged_endpoints |
| system.serving.endpoint_usage + system.serving.served_entities | All attribution, outcomes and remaining governance functions |
| At least one model serving endpoint with AI Gateway | Data in the serving tables |
| CREATE FUNCTION privilege on target schema | Registering the 20 functions |

Step 1: Register the knowledge layer

Import and run the accompanying notebook. It creates a schema, registers all 20 Unity Catalog SQL functions with CREATE OR REPLACE FUNCTION and runs tiered smoke tests to validate each function against your actual data.
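To make the registration step concrete, here is a minimal sketch of what a notebook like this does for each function: render a CREATE OR REPLACE FUNCTION statement into your chosen catalog and schema, then execute it. The function name, body and comment below are illustrative stand-ins, not the notebook's actual SQL, and in a real notebook you would run the rendered DDL with spark.sql.

```python
# Hypothetical sketch of the per-function registration step. The DDL shape
# (parameter, RETURNS TABLE, COMMENT) follows the pattern described in the
# article; the body SQL here is simplified for illustration.

def render_function_ddl(catalog: str, schema: str, name: str,
                        body_sql: str, comment: str) -> str:
    """Build the DDL for one Unity Catalog SQL table function."""
    return (
        f"CREATE OR REPLACE FUNCTION {catalog}.{schema}.{name}(lookback_days INT)\n"
        f"RETURNS TABLE\n"
        f"COMMENT '{comment}'\n"
        f"RETURN {body_sql}"
    )

ddl = render_function_ddl(
    catalog="main",
    schema="genai_supervisor",
    name="genai_total_spend",
    body_sql=(
        "SELECT SUM(usage_quantity) AS total_dbus "
        "FROM system.billing.usage "
        "WHERE usage_date >= current_date() - lookback_days"
    ),
    comment="Total GenAI spend with product breakdown.",
)
# In a Databricks notebook you would then execute: spark.sql(ddl)
print(ddl.splitlines()[0])
```

The COMMENT clause matters: it is the description the agent reads when deciding which tool to call.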

You'll notice there are more than 20 functions available to register in the notebook. The agent supervisor configuration is currently limited to 20 tools, but we've included a few extra functions that operate on AI Gateway-enabled inference tables, which surface latency and performance insights by endpoint and across your workspace.

Set the following three widgets before running the notebook:

  • catalog: your Unity Catalog name (e.g. main) 
  • schema: where to register functions (e.g. genai_supervisor) 
  • payload_table: inference table path, if you have one (optional, only for Tier 3 functions not covered here) 

The notebook includes a validation step to ensure the functions were registered:

SELECT routine_name, comment
FROM system.information_schema.routines
WHERE routine_catalog = '{catalog}'
  AND routine_schema = '{schema}'
  AND routine_name LIKE 'genai_%'
ORDER BY routine_name

You should see at least 20 rows. Each function’s comment is a description the agent reads to decide when to call it.

Step 2: Build and configure the agent

In your Databricks workspace, navigate to Agents (left sidebar) and select "Supervisor Agent" from the available options. Enter a name and a description, then configure the agent with the 20 functions we just registered in Unity Catalog.

Note that there is an optional dropdown at the bottom of the configuration page which allows you to enter instructions for your model. 

Once all the pieces are in place, you can click "Create Agent," which takes you to one more build pane where you can make edits before deploying your agent to the Playground.

Here is the system prompt we used for our supervisor:

You are the GenAI Cost Supervisor, an AI analyst that helps Databricks users understand and optimize their GenAI spend, governance posture, and operational efficiency. 

You have access to 20 Unity Catalog functions in the genai_supervisor schema. Each function takes a lookback_days parameter and returns a table.

Reasoning Guidelines:

  • Start broad, then drill down. Cost questions begin with genai_total_spend, then narrow by product, team, user, or model based on follow-up. 
  • Explain missing data. If a function returns empty or sparse results, tell the user why and what to do about it. 
  • Compare periods. When asked "is X getting worse?", call the same function with two lookback windows or use genai_spend_comparison. 
  • Present numbers in context. "$3,200 on GPT-5.2" means more as "62% of total GenAI spend" and "$0.0003 per request." 
  • Chain functions. "What should I fix first?" requires dormant_endpoints, untagged_endpoints, model_cost_compare, and token_efficiency, ranked by dollar impact. 
  • Show your work. When calling multiple functions, briefly explain why you chose each one. 

Cost Metrics:

  • List Price = usage_quantity (DBUs) x list_prices.pricing.default 
  • Allocated Cost = endpoint_daily_cost x (user_tokens / total_endpoint_tokens) 
  • Cost per Request = allocated_cost / total_requests 
  • Cost per Success = allocated_cost / successful_requests 
  • Cost of Errors = allocated_cost x (error_requests / total_requests)
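The cost metrics in the prompt are simple ratios. As a sanity check, here is a plain-Python sketch of each formula with made-up numbers (the inputs are illustrative, not real billing data):

```python
# The cost metrics from the system prompt, expressed as plain functions.
# All numeric inputs below are illustrative.

def allocated_cost(endpoint_daily_cost: float, user_tokens: int,
                   total_endpoint_tokens: int) -> float:
    """Token-share allocation: the user's slice of the endpoint's daily cost."""
    return endpoint_daily_cost * (user_tokens / total_endpoint_tokens)

def cost_per_request(alloc: float, total_requests: int) -> float:
    return alloc / total_requests

def cost_per_success(alloc: float, successful_requests: int) -> float:
    return alloc / successful_requests

def cost_of_errors(alloc: float, error_requests: int, total_requests: int) -> float:
    """The share of allocated cost burned on failed requests."""
    return alloc * (error_requests / total_requests)

# A user consumed 600 of an endpoint's 1,000 tokens on a $10 day:
alloc = allocated_cost(10.0, user_tokens=600, total_endpoint_tokens=1000)
print(alloc, cost_per_request(alloc, 100), cost_of_errors(alloc, 5, 100))
```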

Step 3: Start chatting with the agent in the playground

We have created our genai_cost_supervisor; now let's see what it can do.

The knowledge layer: 20 functions across 5 domains

Below is a table of functions that the agent has access to via the knowledge layer. Each function is registered in Unity Catalog with a COMMENT that tells the agent when to use it. Every function takes a lookback_days parameter and returns a table.

All 20 of these require only Tier 1-2 prerequisites: billing and serving info from System Tables (not AI Gateway-enabled inference tables). If you do enable inference tables, the notebook also includes additional functions based on them that you can swap into the agent's tool configuration.

| # | Function | What it answers |
| --- | --- | --- |
| Cost (6) | | |
| 1 | genai_total_spend | Total GenAI spend with product breakdown. Entry point for every cost question. |
| 2 | genai_spend_by_product | Daily spend by product, SKU, usage type. The workhorse for trending. |
| 3 | genai_spend_by_team | Cost by custom_tags['team'] with (untagged) bucket. Chargeback-ready. |
| 4 | genai_spend_comparison | Two consecutive periods side-by-side. Any window size. |
| 5 | genai_spend_anomalies | Days where spend deviated >2σ from the trailing 14-day average. |
| 6 | genai_spend_run_rate | Monthly/annual projection from trailing average. "Are we on budget?" |
| Governance (5) | | |
| 7 | genai_tag_coverage | Daily tagged vs untagged spend. The governance health score. |
| 8 | genai_untagged_endpoints | Top 25 untagged endpoints ranked by spend. The fix-it list. |
| 9 | genai_missing_context | Per-requester usage_context coverage. Names who to contact. |
| 10 | genai_endpoint_inventory | Full endpoint inventory: model, type, tag status, cost, users. |
| 11 | genai_dormant_endpoints | Billing with zero requests. Shutdown candidates. |
| Attribution (3) | | |
| 12 | genai_user_model_cost | Token-share allocated cost per user × model × day. |
| 13 | genai_model_rankings | Model rollup: tokens, requests, input/output split. |
| 14 | genai_user_summary | Per-user rollup: requests, tokens, models, activity dates. |
| Outcomes & efficiency (4) | | |
| 15 | genai_use_case_breakdown | Attribution by usage_context: use_case, project, app. |
| 16 | genai_cost_per_request | Unit economics: $/request, $/success, error rate, error cost. |
| 17 | genai_token_efficiency | Input-output ratio per model. Prompt optimization targets. |
| 18 | genai_error_analysis | Errors by status code, endpoint, day. Wasted tokens. |
| Model optimization (2) | | |
| 19 | genai_model_cost_compare | Side-by-side $/1K tokens, $/request, error rate across models. |
| 20 | genai_model_migration_savings | What-if: move X% traffic from model A to B, see savings. |
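The what-if function at the bottom of the table is worth unpacking, since it is pure arithmetic. A back-of-the-envelope sketch of what a migration-savings calculation like genai_model_migration_savings computes, with made-up token volumes and per-1K-token rates (the real function derives these from your System Tables):

```python
# Hypothetical sketch of a model-migration what-if: move X% of model A's
# token volume to model B's effective rate and compare costs.
# Rates and volumes are illustrative, not real pricing.

def migration_savings(tokens_a: int, rate_a_per_1k: float,
                      rate_b_per_1k: float, move_pct: float) -> float:
    """Dollar savings from serving move_pct% of model A's tokens at model B's rate."""
    moved_tokens = tokens_a * (move_pct / 100.0)
    cost_before = moved_tokens / 1000 * rate_a_per_1k
    cost_after = moved_tokens / 1000 * rate_b_per_1k
    return cost_before - cost_after

# e.g. move 50% of 2M tokens from a $0.01/1K model to a $0.003/1K model
savings = migration_savings(tokens_a=2_000_000, rate_a_per_1k=0.01,
                            rate_b_per_1k=0.003, move_pct=50)
print(f"${savings:.2f}")
```

The agent layers judgment on top of this arithmetic: which queries are "simple" enough to move is exactly the kind of follow-up question it can reason about.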

How cost allocation works: The token-share method

One of the more interesting parts of the knowledge layer is in the attribution section since it bridges billing data (dollars) with usage tracking data (tokens). Here’s the challenge:

  • Billing (system.billing.usage) tells you the endpoint cost per day: total DBUs x list price. This is a great start, but it doesn't break down the cost by model or user, which is what you need in order to take action.
  • Usage tracking (system.serving.endpoint_usage) tells you who uses which model and how many tokens were consumed. But it doesn’t associate dollar amounts with users or models. 

The genai_user_model_cost function bridges the gap with token-share allocation. For each user/model combination, it computes that combination's share of the total tokens consumed on the endpoint, then multiplies that share by that day's endpoint cost:

Allocated_cost = endpoint_daily_cost x (user_model_tokens / total_endpoint_tokens)

This gives you two derived metrics: 

  • Allocated spend: How much of the endpoint cost is attributable to a specific user/model pair.
  • Cost per 1k tokens (approx): The effective cost rate for each model, letting you answer questions like “is GPT-5.2 actually more expensive per token, or just used more?”

The SQL that powers this uses a three-CTE pattern:

  • endpoint_cost (from billing)
  • user_tokens (from usage tracking joined to served_entities for model metadata)
  • totals (as the denominator)

The final SELECT does the proportional allocation. The result isn't an exact value but a token-weighted approximation; still, it's far better than treating every endpoint as a black box. It also correctly reflects AI Gateway features like fallbacks and traffic splitting. If a request falls back from GPT to Claude, the usage tracking table records it under Claude's served_entity_id, so the cost allocation follows the actual model that served the request.
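The three-CTE pattern can be mirrored in plain Python over toy data, which makes the allocation logic easy to verify by eye. The endpoint names, users and numbers below are illustrative stand-ins for the billing and usage-tracking rows the SQL actually reads:

```python
# The three-CTE pattern over toy data. Each stage corresponds to one CTE
# in the SQL described above; all values are illustrative.

endpoint_cost = {"chat-endpoint": 10.0}  # CTE 1: daily cost per endpoint (billing)

user_tokens = [  # CTE 2: tokens per user/model (usage tracking + served_entities)
    {"endpoint": "chat-endpoint", "user": "alice", "model": "gpt-5-2", "tokens": 600},
    {"endpoint": "chat-endpoint", "user": "bob", "model": "sonnet", "tokens": 400},
]

totals: dict[str, int] = {}  # CTE 3: total tokens per endpoint (the denominator)
for row in user_tokens:
    totals[row["endpoint"]] = totals.get(row["endpoint"], 0) + row["tokens"]

# Final SELECT: proportional (token-share) allocation of the endpoint cost
allocation = [
    {**row,
     "allocated_cost": endpoint_cost[row["endpoint"]] * row["tokens"] / totals[row["endpoint"]]}
    for row in user_tokens
]
print(allocation)
```

Note that the per-row allocations always sum back to the endpoint's daily cost, which is the property that makes the approximation safe for chargeback.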

The agent in action

Everything that follows is from the AI Playground. The agent has access to all 20 UC functions as tools and decides which to call, interprets the results and responds in natural language.

Below are a few examples of prompts and responses that showcase the agent's capabilities on a mock workspace. The final two show multi-function chaining, the capability that really separates an agent from a dashboard.

Prompt 1: “How much are we spending on GenAI?”

Agent response:

Based on the last 30 days of data, here's your GenAI spending overview:

Total GenAI Spend: $0.20

Breakdown by Endpoint: 

1. databricks-meta-llama-3-3-70b-instruct - $0.12 (60%) 
    68 requests, 152K tokens 
    1 active user 
    Chat/completion workload 

2. databricks-claude-sonnet-4-6 - $0.05 (25%) 
    Dormant endpoint - $0.05 billed with 0 requests 
    Shutdown candidate 

3. databricks-gpt-5-2 - $0.03 (15%)
    1 request, 1.9K tokens 
    1 active user 

4. databricks-gte-large-en - $0.00 (0%) 
    35 requests, 4.7K tokens 
    2 active users 
    Embeddings workload 

Key Observations: 
Very low overall spend - your organization is in early-stage GenAI adoption
Governance gap - none of your endpoints have team tags, making cost allocation difficult
Waste opportunity - the Claude Sonnet endpoint is costing $0.05/month
Limited adoption - only 1-2 users per endpoint suggests pilot or experimental usage 

Would you like me to dive deeper into any specific area, such as cost trends over time, user-level breakdowns, or optimization opportunities?

This is the entry point. The agent returns total spend broken down by model, requests and users to help you see what's driving the bill. A dashboard would show the same number, but the difference is in what happens next: you can follow up with “break that down by team” without navigating anywhere else.

Notice how the agent is already drawing key observations from the results, even alerting us to a dormant endpoint that could be a shutdown candidate. Let's go to the next prompt.

Prompt 2: “Break that down by team”

Reads custom_tags[‘team’] from billing and rolls up cost per team with an (untagged) bucket. This is the chargeback view. If the untagged bucket dominates, like it does in our example below, then the agent doesn’t just show an empty chart. It explains the gap and offers to show you which endpoints to tag first.

Agent response:

I’ll query the spend breakdown by team to show you the chargeback allocation. 

Team Chargeback: Not Possible!

Current State: 
Total Spend: $9.23 (over 30 days) 
(untagged): $9.23 (100%)
	Model Serving: $0.20
	Other GenAI Services: $9.03

The Problem: None of your GenAI endpoints have been tagged with team identifiers (custom_tags.team), making cost allocation and chargeback impossible. 

What you need to do: 
Tag your endpoints - add custom_tags.team to each serving endpoint configuration 
Establish a tagging policy - require team tags before endpoints can be deployed
Fix existing endpoints - I can show you which specific endpoints need tagging 

Would you like me to: 
Show the top untagged endpoints ranked by cost so you know where to start?
Check your tagging coverage trend to see if it's getting better or worse?
Provide the endpoint inventory with owner information to help assign teams?

The agent also prompts the user with clear next steps and other relevant metrics to help increase tagging coverage. We can follow up with "Which endpoints should I tag first to fix our chargeback gap?" to get a list of untagged endpoints ranked by spend, with the agent computing the marginal impact of tagging each one.

Now let’s ask it something which triggers a multi-function chain of thinking.

Prompt 3: “I want to reduce our GenAI bill. Where should I start?”

Here the agent identifies the biggest levers by calling multiple tools in parallel and extrapolating from their combined results. In this case, it returns a prioritized list of actions.

Agent response:

GenAI Cost Reduction Plan 

Based on my analysis, here's where to focus your optimization efforts, ranked by dollar impact. 

Priority 1: Stop GPU Training Waste - Save $9.03/month (98% reduction) 

The Big Issue: Your biggest cost isn't GenAI inference at all - it's GPU training time on interactive clusters

Feb 12: $7.02 (10.8 GPU hours)
Feb 13: $2.01 (3.1 GPU hours)
Total: $9.03 = 98% of your "GenAI" bill

Actions: 
Identify who's running these GPU workloads and why
Shut down clusters when not in use (these appear to be left running)
Consider spot instances or smaller GPU types if training is needed 

Priority 2: Shut Down Dormant Endpoints - Save $0.08/month (1% reduction) 

Wasted Endpoints: 

databricks-claude-sonnet-4-6 - $0.05/month, zero requests ever 
databricks-gpt-5-2 - $0.03/month, last used Feb 13 (14 days ago) 

Action: 
Delete or pause these endpoints immediately
Implement a policy: endpoints with <10 requests/week get flagged for review 

Priority 3: Optimize Model Selection - Potential 40-60% savings 

Current Usage: 

Llama 3.3 70B: 68 requests, 152K tokens → $0.12
2,243 tokens/request average
GPT-5-2: 1 request, 1.9K tokens → $0.03
Single high-cost request 

Action: 
For the Llama workload: consider switching to Llama 3.1 8B for simpler tasks (3-5x cheaper)
Analyze whether all 2,243 tokens/request are necessary - could prompt engineering reduce output length?

Priority 4: Fix Governance - Enable Future Optimization 

Current State: 100% untagged, making it impossible to: 

Identify wasteful teams/projects
Set budgets and alerts
Track ROI by use case 

Action: 
Tag all endpoints with custom_tags.team before the next billing cycle
This unlocks chargeback and accountability 

Bottom line: Your real problem isn't GenAI inference ($0.20/mo) - it's GPU training clusters left running ($9.03/mo). Fix that first, then optimize your model choices.

Would you like me to provide more detail on any of these recommendations?

Taking it to production

The AI Playground is a prototyping environment. Moving to production means deploying the agent as a governed, monitored endpoint that teams can rely on for recurring cost reviews and ad-hoc analysis.

From AI Playground to Databricks Apps

The AI Playground’s export dropdown includes a Databricks Apps option that generates a complete deployable project from your current session: the model, the UC functions you selected as tools and the system prompt.

When you click Export → Databricks Apps, you're asked to select an MLflow experiment. This is the observability layer for your production agent: every interaction with the deployed agent—every tool call, every response—gets recorded as a trace in this experiment. It's where you go to debug bad answers, compare agent versions and run evaluations. Create one called something like agent-cost-supervisor and treat it as the agent's flight recorder; once deployed, it becomes your main debugging interface.

How smooth the transition from Playground to deployed production agent is depends largely on the sophistication of your agent. MCP servers are the more common tool-connection path and arguably easier to implement, since they require fewer changes to the agent code itself. For fully deterministic patterns in the deployed agent, however, registered UC functions are the obvious choice. 

The exported project follows the architecture that Databricks recommends for production agents (which you used to have to create yourself in the legacy version):

  • Agent framework: Your agent's code lives in agent.py, using the OpenAI Agents SDK (or any framework). The key is wrapping your agent with MLflow's ResponsesAgent interface, which gives you compatibility with AI Playground, Agent Evaluation, tracing and monitoring. 
  • Tool connection: The agent connects to your UC functions via Databricks MCP servers, which handle tool discovery, execution and security. This is the production pattern, an evolution of the UCFunctionToolkit import you'd otherwise have used in a notebook; for this experiment, however, we wrap our functions in DatabricksFunctionClient. 
  • Serving layer: MLflow Agent Server provides an async FastAPI server with an /invocations endpoint, built-in tracing, streaming responses and a production-ready chat UI with Databricks authentication.

Which path to choose:

| Feature | Databricks Apps (recommended) | Model Serving (legacy) |
| --- | --- | --- |
| Chat UI | Built-in, production-ready with streaming and auth | Review App (evaluation-focused) |
| Auth model | OAuth with per-user or app-level service principal | Token-based |
| Deployment | One-click from Playground export | agent.deploy() via notebook |
| Customization | Sync source locally, edit, redeploy | Edit notebook, re-log, redeploy |
| Tool connection | MCP servers | UCFunctionToolkit in agent code |
| Tracing | Automatic via AgentServer → MLflow experiment | Manual via mlflow.langchain.autolog() |
| Best for | Production agent with end users | API-only or embedded in other services |

Access control and governance

Building on Unity Catalog means governance comes for free regardless of deployment path:

Function-level permissions: GRANT EXECUTE ON FUNCTION genai_supervisor.genai_total_spend TO finance_group. In this way we can restrict which functions/tools specific roles can use.

What's more, Databricks has a new feature in public preview called Agent Framework: On Behalf of User Authorization. In the past, and still today for many use cases, Databricks Apps has been limited to machine-to-machine (M2M) authentication: creating a new Databricks App creates a service principal for that app, and you grant that service principal the permissions it needs to execute against your workspace.

This is good in a pinch, but you lose the fine-grained access control that comes with user-specific authorization: different users of the same app (and thus the same service principal) effectively have the same permissions. With the new feature, permissions are applied all the way down to the model, so you can enforce fine-grained access control on model responses and tool execution themselves.

Monitoring the agent

The agent monitors GenAI cost, but what monitors the agent? This is where MLflow experiments become especially important: the experiment is our primary observability surface. Every agent interaction is recorded as a trace that keeps track of which tools were called, what results came back and how the agent composed its response. When we need to diagnose why a particular answer is wrong, we can trace the failure back to:

  • The function itself (bad SQL for example)
  • The tool selection (the wrong function was called)
  • The synthesis of the response (both the function and tool selection were correct but its analysis of the results was wrong)

Bonus: meta-monitoring. Thanks to the analytical patterns exposed by our UC functions, the agent’s own token consumption shows up in system.serving.endpoint_usage. So the cost supervisor’s own cost is visible and usable by the supervisor agent itself!

[Screen capture: an example of what the user's own token consumption data looks like.]

The Databricks Apps template also includes evaluation code out of the box (agent_server/evaluate_agent.py). The template evaluates agent.py using MLflow's evaluation framework (mlflow.genai.evaluate), checking the relevance and safety of your agent's responses. Under the hood, this is powered by built-in LLM judge scorers, not traditional ML metrics.

LLM Judges are a type of MLflow scorer that uses LLMs themselves for quality assessment. While code-based scorers use programmatic logic, judges leverage the reasoning capabilities of LLMs to make quality assessments for criteria like safety, relevance, correctness and groundedness.

The template specifically uses the MLflow 3 scorers API. The built-in judges available are:

  • Safety: Checks for harmful or unsafe content
  • RelevanceToQuery: Does the response actually address the user’s question
  • Correctness: Factual accuracy (requires ground truth)
  • RetrievalGroundedness: Is the response grounded in retrieved context (for RAG) 
  • RetrievalSufficiency: Did the retrieval fetch enough relevant context (requires ground truth)
  • Guidelines: Custom pass/fail criteria you define in natural language
  • ExpectationGuidelines: Per-example pass/fail criteria

The code for which looks like this:

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_agent,
    scorers=[RelevanceToQuery(), Safety()],
)

By default, each judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments, so you're not paying to run these through your own endpoint; Databricks hosts the judge models.

In particular, the GenAI cost supervisor agent we built primarily calls tools against structured data, so the retrieval-focused judges (RetrievalGroundedness, RetrievalSufficiency) won't make much of a difference here. The ones that matter most to us are RelevanceToQuery, Safety and potentially Guidelines, where you define custom criteria like "the agent must always call a tool before stating cost figures" or "the agent must format currency values with dollar signs."

If necessary, you can also write fully custom code-based scorers with the @scorer decorator for deterministic checks: things like verifying that the agent actually invoked the expected UC function, which you can inspect from the MLflow trace.
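As a sketch of such a deterministic check, here is the kind of function you might wrap with MLflow's @scorer decorator. The trace structure below is a simplified, hypothetical stand-in for a real MLflow trace (which exposes spans with types and names through its own API), so treat this as the shape of the check rather than a drop-in scorer:

```python
# Hedged sketch of a deterministic "did the agent call the right tool?" check.
# The span dicts are simplified stand-ins for MLflow trace spans.

def called_expected_tool(trace_spans: list[dict], expected_tool: str) -> bool:
    """Pass if any tool-call span in the trace invoked the expected UC function."""
    return any(
        span.get("span_type") == "TOOL" and expected_tool in span.get("name", "")
        for span in trace_spans
    )

spans = [
    {"span_type": "LLM", "name": "chat"},
    {"span_type": "TOOL", "name": "genai_total_spend"},
]
print(called_expected_tool(spans, "genai_total_spend"))  # True
```

A check like this catches the "right answer, wrong tool" failure mode that LLM judges tend to miss.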

Safety enhancements using the UC function approach

One of the strongest arguments for the UC function approach over text-to-SQL agents is that it structurally eliminates SQL injection from the pattern. Every one of our functions has its SQL hardcoded at CREATE FUNCTION time. The agent can't write SQL; it can only call genai_total_spend(lookback_days → 30) and pass typed parameters. The parameter is an INT, not a string that gets concatenated into a query. There's no vector for DROP TABLE billing.usage; because the agent never constructs the SQL itself. It's the difference between a parameterized stored procedure and a raw query builder.
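The difference can be illustrated with two toy stand-ins (both functions below are hypothetical, written only to contrast the two paradigms, not taken from the notebook):

```python
# Toy illustration of the injection difference. In the UC-function path,
# only a typed parameter travels; in the text-to-SQL path, whatever the
# model generated becomes the query.

def call_uc_function(lookback_days: int) -> str:
    """Safe path: the SQL shape is fixed; only an integer parameter varies."""
    if not isinstance(lookback_days, int):
        raise TypeError("lookback_days must be an int")
    return f"SELECT * FROM genai_total_spend({lookback_days})"

def text_to_sql(model_output: str) -> str:
    """Unsafe path: the model's output is the query, whatever it says."""
    return model_output  # nothing stops 'DROP TABLE billing.usage;'

print(call_uc_function(30))
```

In the safe path, a prompt injection can at worst change an integer; in the unsafe path, it can change the entire statement.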

Compare this paradigm to less deterministic text-to-SQL agents (like Genie or custom LLMs), where the model generates nearly arbitrary SQL at runtime. Those have a real injection surface: a prompt injection could trick the model into generating a DELETE or UPDATE statement, or into querying tables outside the intended scope.

Conclusion

We started with a dashboard, but quickly realized that the same analytical patterns, the SQL we had already written, could power an agent with a deterministic, maintained and safe knowledge layer.

The 20 UC functions, which you can register in your own Unity Catalog instance using the notebook, encode every cost formula, governance check and attribution method from the dashboard as tools an agent can compose, chain and reason about. When someone asks, "What should I fix first?" the agent calls four functions and ranks the results by dollar impact. No dashboard can do that out of the box.

The UC function architecture also makes this pattern well suited for production. In particular, when it comes to safety, we found that by fixing the SQL at registration time and limiting the agent's influence to the parameters themselves, there is no surface through which SQL can be injected, no hallucinated queries and no risk of the model inventing a join that silently miscalculates cost allocation. The cost math is locked in the functions. The agent reasons about the results, not the queries.

On top of that, because everything runs on Unity Catalog, governance is intrinsic to the functions themselves. Function-level permissions control who can ask which questions, data access inheritance means users get analytical results without touching raw billing tables, and every interaction is traced to its own MLflow experiment!

This, combined with the new “Deploy to Databricks Apps,” makes the path to production shorter than before: OAuth, streaming, chat UI and tracing all wired up for you without having to manually do the work yourself in a notebook.

As your AI workloads grow in sophistication, the tools you use to manage and budget them can’t stay stuck on static dashboards. These tools need to reason, chain and adapt alongside the systems they govern.


Sachin Seth, Technical Writer

Sachin Seth is a data platform architect and analytics product builder known for his deep work benchmarking Databricks compute and delivering high-performance data applications at scale. He develops full-stack analytics solutions—ranging from billion-point time-series engines to portfolio optimization apps and real-time financial dashboards—blending Databricks, Rust, Arrow and modern web technologies. He writes to bring clarity, measurement and engineering rigor to the rapidly evolving world of Databricks and modern data platforms.
