Building a GenAI cost supervisor agent in Databricks
How we turned Databricks System Tables into a knowledge base for an AI agent that answers GenAI cost questions on demand.
The problem with dashboards
We built a GenAI cost dashboard. Six KPI counters, twelve charts, four detail tables. It tracked spend by service, user, model and use case. It measured governance gaps. It computed the cost per request. The first feedback we got was, “interesting, but hard to see the value when it’s so vague.”
The critique was fair, and it points to a fundamental limitation: a dashboard gives you one fixed view. What happens when you want answers to pointed questions that extend beyond its prebuilt capabilities, such as:
- “How much did my team spend on GPT-5.2 last week?”
- “Which endpoint should I tag first to fix our chargeback gap?”
- “What would we save if we moved simple queries from GPT-5.2 to Sonnet?”
- “Are errors getting worse? What’s that costing us?”
No dashboard is going to be able to anticipate every question, but an AI agent can answer any of them, if it has the right knowledge layer to refer to. That’s when it hit us. All the SQL queries we wrote for our dashboard are actually much more than chart definitions—they’re analytical patterns. If we registered them as Unity Catalog functions and gave an agent access to them, then every user could have access to a personal GenAI cost analyst.
Setup: From zero to agent in three steps
There is no code required beyond the initial function registration notebook, which you can find here. The entire agent runs in AI Playground, Databricks’ built-in interface for prototyping tool-calling agents.
Prerequisites
| Requirement | What it unlocks |
|---|---|
| Unity Catalog enabled | Function registration and System Tables access |
|  | All 6 cost functions + |
|  | All attributions, outcomes and remaining governance functions |
| At least one model serving endpoint with AI Gateway | Data in the serving tables |
Step 1: Register the knowledge layer
Import and run the accompanying notebook. It creates a schema, registers all 20 Unity Catalog SQL functions with CREATE OR REPLACE FUNCTION and runs tiered smoke tests to validate each function against your actual data.
You’ll notice the notebook contains more than 20 functions. The agent supervisor configuration is currently limited to 20 tools, so we’ve also included a few extra functions that operate on AI Gateway-enabled inference tables, surfacing latency and performance insights by endpoint and across your workspace.
Set the following three widgets before running the notebook:
- Catalog: your Unity Catalog name (e.g. main)
- Schema: where to register functions (e.g. genai_supervisor)
- payload_table: inference table path, if you have one (optional; only for Tier 3 functions not covered here)
The notebook includes a validation step to ensure the functions were registered:
```sql
SELECT routine_name, comment
FROM system.information_schema.routines
WHERE routine_catalog = '{catalog}'
  AND routine_schema = '{schema}'
  AND routine_name LIKE 'genai_%'
ORDER BY routine_name
```
You should see at least 20 rows. Each function’s comment is a description the agent reads to decide when to call it.
Step 2: Build and configure the agent
In your Databricks workspace, navigate to Agents (left sidebar) and select “Supervisor Agent” from the available options. Enter a name, description and begin configuring the agent with the 20 functions we just registered to Unity Catalog.
Note that there is an optional dropdown at the bottom of the configuration page which allows you to enter instructions for your model.
Once all the pieces are in place, click “Create Agent.” This takes you to one more build pane where you can make edits before deploying your agent to the Playground.
Here is the system prompt we used for our supervisor:
You are the GenAI Cost Supervisor, an AI analyst that helps Databricks users understand and optimize their GenAI spend, governance posture, and operational efficiency.
You have access to 20 Unity Catalog functions in the genai_supervisor schema. Each function takes a lookback_days parameter and returns a table.
Reasoning Guidelines:
Start broad, then drill down. Cost questions begin with genai_total_spend, then narrow by product, team, user, or model based on follow-up.
Explain missing data. If a function returns empty or sparse results, tell the user why and what to do about it.
Compare periods. When asked “is X getting worse?”, call the same function with two lookback windows or use genai_spend_comparison.
Present numbers in context. “$3,200 on GPT-5.2” means more as “62% of total genai spend” and “$0.0003 per request.”
Chain functions. “What should I fix first?” requires dormant_endpoints, untagged_endpoints, model_cost_compare, and token_efficiency, ranked by dollar impact.
Show your work. When calling multiple functions, briefly explain why you chose each one.
Cost Metrics:
List Price = usage_quantity (DBUs) x list_prices.pricing.default
Allocated Cost = endpoint_daily_cost x (user_tokens/total_endpoint_tokens)
Cost per Request = allocated_cost/ total_requests
Cost per Success = allocated_cost / successful_requests
Cost of Errors = allocated_cost x (error_requests/total_requests)
Step 3: Start chatting with the agent in the playground
We have created our genai_cost_supervisor, now let's see what it can do.
The knowledge layer: 20 functions across 5 domains
Below is a table of functions that the agent has access to via the knowledge layer. Each function is registered in Unity Catalog with a COMMENT that tells the agent when to use it. Every function takes a lookback_days parameter and returns a table.
All 20 of these require only Tier 1–2 prerequisites: billing and serving info from System Tables (not AI Gateway-enabled inference tables). If you do enable inference tables, the additional functions built on them are already included in the notebook, so you can register them and reconfigure the agent.
| # | Function | What it answers |
|---|---|---|
| Cost (6) | | |
| 1 | genai_total_spend | Total GenAI spend with product breakdown. Entry point for every cost question. |
| 2 |  | Daily spend by product, SKU, usage type. The workhorse for trending. |
| 3 |  | Cost by |
| 4 | genai_spend_comparison | Two consecutive periods side-by-side. Any window size. |
| 5 |  | Days where spend deviated >2σ from trailing 14-day average. |
| 6 |  | Monthly/annual projection from trailing average. “Are we on budget?” |
| Governance (5) | | |
| 7 |  | Daily tagged vs untagged spend. The governance health score. |
| 8 | untagged_endpoints | Top 25 untagged endpoints ranked by spend. The fix-it list. |
| 9 |  | Per-requester |
| 10 |  | Full endpoint inventory: model, type, tag status, cost, users. |
| 11 | dormant_endpoints | Billing with zero requests. Shutdown candidates. |
| Attribution (3) | | |
| 12 | genai_user_model_cost | Token-share allocated cost per user × model × day. |
| 13 |  | Model rollup: tokens, requests, input/output split. |
| 14 |  | Per-user rollup: requests, tokens, models, activity dates. |
| Outcomes & efficiency (4) | | |
| 15 |  | Attribution by |
| 16 |  | Unit economics: $/request, $/success, error rate, error cost. |
| 17 | token_efficiency | Input-output ratio per model. Prompt optimization targets. |
| 18 |  | Errors by status code, endpoint, day. Wasted tokens. |
| Model optimization (2) | | |
| 19 | model_cost_compare | Side-by-side $/1K tokens, $/request, error rate across models. |
| 20 |  | What-if: move X% traffic from model A to B, see savings. |
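To make the what-if arithmetic concrete, here is a minimal sketch of the savings calculation behind a traffic-shift scenario. This is illustrative Python, not the UC function itself; the rates and token volume are assumptions.

```python
def switch_savings(cost_per_1k_a: float, cost_per_1k_b: float,
                   monthly_tokens: int, shift_pct: float) -> float:
    """Monthly savings from moving shift_pct% of model A's tokens to model B."""
    shifted_tokens = monthly_tokens * (shift_pct / 100)
    return (cost_per_1k_a - cost_per_1k_b) * shifted_tokens / 1000

# Hypothetical rates: move 60% of 2M monthly tokens from a $0.015/1K model
# to a $0.003/1K model.
savings = switch_savings(0.015, 0.003, 2_000_000, 60)
print(f"${savings:.2f}/month saved")  # $14.40/month saved
```

The real function reads both models’ observed $/1K rates from billing and usage data, but the core of the answer is exactly this rate-difference-times-shifted-volume product.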
How cost allocation works: The token-share method
One of the more interesting parts of the knowledge layer is the attribution section, because it bridges billing data (dollars) with usage tracking data (tokens). Here’s the challenge:
- Billing (system.billing.usage) tells you the endpoint cost per day: total DBUs x list price. This is a great start, but it doesn't break down the cost by model or user, which is what you need in order to take action.
- Usage tracking (system.serving.endpoint_usage) tells you who uses which model and how many tokens were consumed. But it doesn’t associate dollar amounts with users or models.
The genai_user_model_cost function bridges the gap with token-share allocation. For each day, it computes each user/model combination’s share of the total tokens consumed on the endpoint, then multiplies that share by that day’s endpoint cost:
Allocated_cost = endpoint_daily_cost x (user_model_tokens / total_endpoint_tokens)
This gives you two derived metrics:
- Allocated spend: How much of the endpoint cost is attributable to a specific user/model pair.
- Cost per 1k tokens (approx): The effective cost rate for each model, letting you answer questions like “is GPT-5.2 actually more expensive per token, or just used more?”
The SQL that powers this uses a three-CTE pattern:
- endpoint_cost (from billing)
- user_tokens (from usage tracking joined to served_entities for model metadata)
- totals (as the denominator)
The final SELECT does the proportional allocation. The result isn’t an exact value but a token-weighted approximation; still, it’s far better than treating every endpoint as a black box. It also correctly reflects AI Gateway features like fallbacks and traffic splitting: if a request falls back from GPT to Claude, the usage tracking table records it under Claude’s served_entity_id, so the cost allocation follows the actual model that served the request.
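As a sanity check on the formula, here is the same token-share allocation sketched in plain Python. The user/model pairs and numbers are made up; in the real function this happens in SQL over system.billing.usage and system.serving.endpoint_usage.

```python
def allocate(endpoint_daily_cost: float, tokens_by_pair: dict) -> dict:
    """Split one endpoint-day's cost across (user, model) pairs by token share."""
    total_tokens = sum(tokens_by_pair.values())
    return {pair: endpoint_daily_cost * t / total_tokens
            for pair, t in tokens_by_pair.items()}

# Illustrative day: $4.00 billed to the endpoint, 200K tokens total.
usage = {("alice", "gpt-5-2"): 150_000,
         ("bob", "gpt-5-2"): 30_000,
         ("bob", "llama-3-3-70b"): 20_000}

costs = allocate(4.00, usage)
print(costs[("alice", "gpt-5-2")])        # 3.0 -> 75% of tokens, 75% of cost
print(costs[("alice", "gpt-5-2")] / 150)  # 0.02 -> approx. cost per 1K tokens
```

The second print is the derived “cost per 1K tokens (approx)” metric: allocated spend divided by the pair’s tokens in thousands.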
The agent in action
Everything that follows is from the AI Playground. The agent has access to all 20 UC functions as tools and decides which to call, interprets the results and responds in natural language.
Below are a few example prompts and responses that showcase the agent’s capabilities on a mock workspace. The final two show multi-function chaining, the capability that really separates an agent from a dashboard.
Prompt 1: “How much are we spending on GenAI?”
Agent response:
Based on the last 30 days of data, here’s your GenAI spending overview:
Total GenAI Spend: $0.20
Breakdown by Endpoint:
1. databricks-meta-llama-3-3-70b-instruct – $0.12 (60%)
68 requests, 152K tokens
1 active user
Chat/completion workload
2. databricks-claude-sonnet-4-6 – $0.05 (25%)
Dormant endpoint – $0.05 billed with 0 requests
Shutdown candidate
3. databricks-gpt-5-2 – $0.03 (15%)
1 request, 1.9K tokens
1 active user
4. databricks-gte-large-en – $0.00 (0%)
35 requests, 4.7K tokens
2 active users
Embeddings workload
Key Observations:
Very low overall spend – your organization is in early-stage GenAI adoption
Governance gap – none of your endpoints have team tags, making cost allocation difficult
Waste opportunity – the Claude Sonnet endpoint is costing $0.05/month
Limited adoption – only 1–2 users per endpoint suggests pilot or experimental usage
Would you like me to dive deeper into any specific area, such as cost trends over time, user-level breakdowns, or optimization opportunities?

This is the entry point. The agent returns total spend broken down by model, requests and users to help you see what's driving the bill. A dashboard would show the same number, but the difference is in what happens next: you can follow up with “break that down by team” without navigating anywhere else.
Notice how the agent is already drawing key observations from the results, even alerting us to a dormant endpoint that could be a shutdown candidate. Let's go to the next prompt.
Prompt 2: “Break that down by team”
This prompt maps to a function that reads custom_tags[‘team’] from billing and rolls up cost per team, with an (untagged) bucket. This is the chargeback view. If the untagged bucket dominates, as it does in our example below, the agent doesn’t just show an empty chart; it explains the gap and offers to show you which endpoints to tag first.
Agent response:
I’ll query the spend breakdown by team to show you the chargeback allocation.
Team Chargeback: Not Possible!
Current State:
Total Spend: $9.23 (over 30 days)
(untagged): $9.23 (100%)
Model Serving: $0.20
Other GenAI Services: $9.03
The Problem: None of your GenAI endpoints have been tagged with team identifiers (custom_tags.team), making cost allocation and chargeback impossible.
What you need to do:
Tag your endpoints – add custom_tags.team to each serving endpoint configuration
Establish a tagging policy – require team tags before endpoints can be deployed
Fix existing endpoints – I can show you which specific endpoints need tagging
Would you like me to:
Show the top untagged endpoints ranked by cost so you know where to start?
Check your tagging coverage trend to see if it's getting better or worse?
Provide the endpoint inventory with owner information to help assign teams?

The agent prompts the user with clear next steps and other relevant metrics to help increase tagging coverage. We can also ask things like “Which endpoints should I tag first to fix our chargeback gap?” to get a list of untagged endpoints ranked by spend, with the agent computing the marginal impact of each.
Now let’s ask it something which triggers a multi-function chain of thinking.
Prompt 3: “I want to reduce our GenAI bill. Where should I start?”
Here the agent identifies and extrapolates using multiple tools in parallel to arrive at the answer: a prioritized list of actions.
Agent response:
GenAI Cost Reduction Plan
Based on my analysis, here’s where to focus your optimization efforts ranked by dollar impact.
Priority 1: Stop GPU Training Waste - Save $9.03/month (98% reduction)
The Big Issue: Your biggest cost isn’t GenAI inference at all- it’s GPU training time on interactive clusters
Feb 12: $7.02 (10.8 GPU hours)
Feb 13: $2.01 (3.1 GPU hours)
Total $9.03 = 98% of your “GenAI” bill
Actions:
Identify who's running these GPU workloads and why
Shut down clusters when not in use (these appear to be left running)
Consider spot instances or smaller GPU types if training is needed
Priority 2: Shut Down Dormant Endpoints - Save $0.08/month (1% reduction)
Wasted Endpoints:
Databricks-claude-sonnet-4-6 - $0.05/month, zero requests ever
Databricks-gpt5-2 – $0.03/month, last used Feb 13 (14 days ago)
Action:
Delete or pause these endpoints immediately
Implement a policy: endpoints with <10 requests/week get flagged for review
Priority 3: Optimize Model Selection - Potential 40-60% savings
Current Usage:
Llama 3.3 70B: 68 requests, 152K tokens → $0.12
2,243 tokens/request average
GPT-5-2: 1 request, 1.9K tokens → $0.03
Single high-cost request
Action:
For the Llama workload: Consider switching to Llama 3.1 8B for simpler tasks (3-5x cheaper)
Analyze if all 2243 tokens/request are necessary - could prompt engineering reduce output length?
Priority 4: Fix Governance - Enable Future Optimization
Current State: 100% untagged, making it impossible to:
Identify wasteful teams/projects
Set budgets and alerts
Track ROI by use case
Action:
Tag all endpoints with custom_tags.team before next billing cycle
This unlocks chargeback and accountability
Bottom line: Your real problem isn’t GenAI inference ($0.20/mo) – it’s GPU training clusters left running ($9.03/mo). Fix that first, then optimize your model choices.
Would you like me to provide more details on any of these recommendations?

Taking it to production
The AI playground is a prototyping environment. Moving to production means deploying the agent as a governed, monitored endpoint that teams can rely on for recurring cost reviews and ad-hoc analysis.
From AI Playground to Databricks Apps
The AI Playground’s export dropdown includes a Databricks Apps option that generates a complete deployable project from your current session: the model, the UC functions you selected as tools and the system prompt.
When you click Export → Databricks Apps, you're asked to select an MLflow experiment. This is the observability layer for your production agent: every interaction with the deployed agent—every tool call, every response—gets recorded as a trace in this experiment. It's where you go to debug bad answers, compare agent versions and run evaluations. Create one called something like agent-cost-supervisor and treat it as the agent’s flight recorder; it will be your main debugging interface once the agent is deployed.
How smooth the transition from Playground to a deployed production agent is depends largely on the sophistication of your agent. MCP servers are more common and arguably easier to implement, since they require fewer changes to the agent code itself. For fully deterministic patterns in the deployed agent, however, registered UC functions are the obvious choice.
The exported project follows the architecture that Databricks recommends for production agents (which you used to have to create yourself in the legacy version):
- Agent framework: Your agent’s code lives in agent.py, using the OpenAI Agents SDK (or any framework). The key is wrapping your agent with MLflow’s ResponsesAgent interface, which gives you compatibility with AI Playground, Agent Evaluation, tracing and monitoring.
- Tool connection: The agent connects to your UC functions via Databricks MCP servers, which handle tool discovery, execution and security. This is the production pattern: an evolution of the UCFunctionToolkit import you’d otherwise use in a notebook. For this experiment, however, we wrap our functions in DatabricksFunctionClient.
- Serving layer: MLflow Agent Server provides an async FastAPI server with an /invocations endpoint, built-in tracing, streaming responses and a production-ready chat UI with Databricks authentication.
Which path to choose:
| Feature | Databricks Apps (recommended) | Model Serving (legacy) |
|---|---|---|
| Chat UI | Built-in, production-ready with streaming and auth | Review App (evaluation focused) |
| Auth model | OAuth with per-user or app-level service principal | Token-based |
| Deployment | One-click from Playground Export |  |
| Customization | Sync source locally, edit, redeploy | Edit notebook, re-log, redeploy |
| Tool connection | MCP servers |  |
| Tracing | Automatic via AgentServer → MLflow experiment | Manual via |
| Best for | Production agent with end users | API-only or embedded in other services |
Access control and governance
Building on Unity Catalog means governance comes for free regardless of deployment path:
Function-level permissions: GRANT EXECUTE ON FUNCTION genai_supervisor.genai_total_spend TO finance_group. In this way we can restrict which functions (tools) specific roles can use.
What’s more, Databricks has a new feature in public preview called Agent Framework: On Behalf of User Authorization. In the past (and still today for many use cases), Databricks Apps has been limited to machine-to-machine (M2M) authentication: when you create a new Databricks App, it creates a service principal for that app, and you grant permissions to the service principal so it can execute against your Databricks workspace.
This works in a pinch, but you lose the fine-grained access control that comes with user-specific authorization: different users of the same app (and thus the same service principal) effectively share the same permissions. With the new feature, permission levels are applied all the way down to the model, so you can enforce fine-grained permissions on model responses and tool execution itself.
Monitoring the agent
The agent monitors GenAI cost, but what monitors the agent? This is where MLflow experiments become especially important: the experiment is our primary observability surface. Every agent interaction is recorded as a trace that keeps track of which tools were called, what results came back and how the agent composed its response. When we need to diagnose why a particular answer is wrong, we can trace it back to:
- The function itself (bad SQL for example)
- The tool selection (the wrong function was called)
- The synthesis of the response (both the function and tool selection were correct but its analysis of the results was wrong)
Bonus: meta-monitoring. Thanks to the analytical patterns exposed by our UC functions, the agent’s own token consumption shows up in system.serving.endpoint_usage. So the cost supervisor’s own cost is visible and usable by the supervisor agent itself!
The Databricks Apps template also includes evaluation code out of the box (agent_server/evaluate_agent.py). The template evaluates agent.py using MLflow’s evaluation framework (mlflow.genai.evaluate), assessing the relevance and safety of your agent’s responses. Under the hood, this is powered by built-in LLM judge scorers, not traditional ML metrics.
LLM Judges are a type of MLflow scorer that uses LLMs themselves for quality assessment. While code-based scorers use programmatic logic, judges leverage the reasoning capabilities of LLMs to make quality assessments for criteria like safety, relevance, correctness and groundedness.
The template specifically uses the MLflow 3 scorers API. The built-in judges available are:
- Safety: Checks for harmful or unsafe content
- RelevanceToQuery: Does the response actually address the user’s question?
- Correctness: Factual accuracy (requires ground truth)
- RetrievalGroundedness: Is the response grounded in retrieved context? (for RAG)
- RetrievalSufficiency: Did the retrieval fetch enough relevant context? (requires ground truth)
- Guidelines: Custom pass/fail criteria you define in natural language
- ExpectationGuidelines: Per-example pass/fail criteria
The code for which looks like this:
```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_agent,
    scorers=[RelevanceToQuery(), Safety()],
)
```
By default, each judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments, so you're not paying to run these through your own endpoint: Databricks hosts the judge models.
In particular, the GenAI cost supervisor agent we built primarily calls tools against structured data, so the retrieval-focused judges (RetrievalGroundedness, RetrievalSufficiency) won’t make much of a difference here. The ones that matter most to us are RelevanceToQuery, Safety and potentially Guidelines, where you define custom criteria like “the agent must always call a tool before stating cost figures” or “the agent must format currency values with dollar signs.”
If necessary, you can also write fully custom code-based scorers with the @scorer decorator for deterministic checks: things like verifying the agent actually invoked the expected UC function, which you can inspect from the MLflow trace.
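For instance, a deterministic scorer of that kind might check the trace’s tool calls and the response formatting. The sketch below shows the plain-Python logic only; in practice you would wrap functions like these with MLflow’s @scorer decorator and read the tool names from the trace object. The function names and the regex are illustrative assumptions.

```python
import re

def called_expected_tool(tool_calls: list, expected: str) -> bool:
    """Pass/fail: the agent invoked the expected UC function before answering."""
    return expected in tool_calls

def currency_has_dollar_sign(response: str) -> bool:
    """Pass/fail: two-decimal cost figures in the response carry a $ prefix."""
    # Flags bare decimals like "3.20" that aren't part of "$3.20".
    return not re.search(r"(?<![$\d.])\d+\.\d{2}\b", response)

assert called_expected_tool(["genai_total_spend"], "genai_total_spend")
assert not called_expected_tool(["untagged_endpoints"], "genai_total_spend")
assert currency_has_dollar_sign("Total spend: $3.20 over 30 days")
assert not currency_has_dollar_sign("Total spend: 3.20 over 30 days")
```

Because these checks are pure code, they run for free on every trace, complementing the LLM judges rather than replacing them.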
Safety enhancements using the UC function approach
One of the strongest arguments for the UC function approach over text-to-SQL agents is that it structurally eliminates SQL injection. Every one of our functions has its SQL hardcoded at CREATE FUNCTION time. The agent can’t write SQL; it can only call genai_total_spend(lookback_days → 30) and pass typed parameters. The parameter is an INT, not a string that gets concatenated into a query. There’s no vector for DROP TABLE billing.usage; because the agent never constructs the SQL itself. It's the difference between a parameterized stored procedure and a raw query builder.
Compare this paradigm to less deterministic text-to-SQL agents (like Genie or custom LLMs), where the model generates nearly arbitrary SQL at runtime. Those have a real injection surface: a prompt injection could trick the model into generating a DELETE or UPDATE statement, or into querying tables outside the intended scope.
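The contrast is easy to demonstrate. In the sketch below (the helper names are hypothetical; the schema and function name match the ones used in this post), the UC-function path accepts only a typed integer, while the naive text-to-SQL path interpolates whatever string it is handed:

```python
def call_uc_function(lookback_days: int) -> str:
    # The SQL shape is fixed; the agent controls only a typed parameter.
    if not isinstance(lookback_days, int):
        raise TypeError("lookback_days must be an INT")
    return f"SELECT * FROM genai_supervisor.genai_total_spend({lookback_days})"

def naive_text_to_sql(user_input: str) -> str:
    # Anti-pattern: raw string interpolation is the injection surface.
    return f"SELECT cost FROM billing.usage WHERE days <= {user_input}"

print(call_uc_function(30))
# SELECT * FROM genai_supervisor.genai_total_spend(30)

print(naive_text_to_sql("30; DROP TABLE billing.usage"))
# SELECT cost FROM billing.usage WHERE days <= 30; DROP TABLE billing.usage
```

Passing the malicious string to call_uc_function raises a TypeError instead of emitting SQL, which is the whole point of fixing the query shape at registration time.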
Conclusion
We started with a dashboard, but quickly realized that the same analytical patterns encoded in the SQL we wrote could power an agent with a deterministic, maintained and safe knowledge layer.
The 20 UC functions, which you can register in your own Unity Catalog instance using the notebook, encode every cost formula, governance check and attribution method from the dashboard as tools an agent can compose, chain and reason about. When someone asks, “What should I fix first?” the agent calls four functions and ranks the results by dollar impact. No dashboard can do that out of the box.
The UC function architecture also makes this pattern well suited for production use. In particular, when it comes to safety, fixing the SQL at registration time and limiting the agent’s influence to the parameters themselves means there is no surface through which SQL can be injected, no hallucinated queries and no risk of the model inventing a join that silently miscalculates cost allocation. The cost math is locked in the functions. The agent reasons about the results, not the queries.
On top of that, because everything runs on Unity Catalog, governance is intrinsic to the functions themselves. Function-level permissions control who can ask which questions. Data access inheritance means users get analytical results without touching raw billing tables, and every interaction is traced to its own MLflow experiment.
This, combined with the new “Deploy to Databricks Apps,” makes the path to production shorter than before: OAuth, streaming, chat UI and tracing all wired up for you without having to manually do the work yourself in a notebook.
As your AI workloads grow in sophistication, the tools you use to manage and budget them can’t stay stuck on static dashboards. These tools need to reason, chain and adapt alongside the systems they govern.

