Free Snowflake health check app to help guide optimization
Every Snowflake user knows that the health and efficiency of their account impact everything from query performance to operational costs and even the reliability of the data platform. To help teams address these challenges, Capital One Slingshot has built a free Snowflake Health Check application. This Snowflake native application surfaces insights in the form of 20 key reports covering warehouses, storage, queries and overall account health.
We built this free Snowflake health check based on our learnings from using and scaling the platform to petabytes of data and thousands of users running millions of queries a day. It queries ACCOUNT_USAGE and ORGANIZATION_USAGE views to find waste and inefficiencies and surface opportunities for optimization across your account. Use it to identify your most expensive warehouses, detect potentially under- or overprovisioned compute, uncover hidden storage costs and redundant tables, pinpoint inefficient query patterns and much more.
In this post, we share the insights available from the health check dashboard broken down into four categories: account overview, warehouses, storage and queries. Let’s dive in.
Account overview
The account overview provides a high-level view of spending trends and credit consumption, both crucial for budgeting and planning purposes. This page contains three high-level reports that provide users with a financial snapshot of their Snowflake account, with insights into:
- How the org is spending on Snowflake: a breakdown of monthly costs by service type, e.g. compute, storage, serverless functions and data transfers. With 12 months of data, it’s easy to detect trends and pinpoint the services that are driving up costs (a sample query sketch follows this list).
- Credit burn rate: projections on when you’ll run out of credits. This report projects the cumulative daily spending trend against active contracts to provide an estimation of when the account will run out of pre-paid credits.
- Storage credit trends: a breakdown of credits consumed by live data, staged files, backups and retention (i.e. Active and Stage storage vs. costs from Time Travel, Fail-safe and Retained for Clone storage). High Stage costs can indicate that ETL jobs aren’t properly cleaning up temporary files, while high Time Travel costs can identify tables where the DATA_RETENTION_TIME_IN_DAYS parameter might need to be reviewed and lowered.
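For teams that want to explore the underlying data themselves, a service-level cost breakdown can be approximated with a query along these lines. This is an illustrative sketch, not the app's actual query, and it assumes your role has been granted access to the SNOWFLAKE.ORGANIZATION_USAGE schema:

```sql
-- Illustrative sketch (not necessarily the app's exact query): monthly spend
-- by usage type (compute, storage, data transfer, etc.) over the last 12 months.
SELECT
    DATE_TRUNC('month', usage_date)  AS usage_month,
    usage_type,
    SUM(usage_in_currency)           AS cost_in_currency
FROM snowflake.organization_usage.usage_in_currency_daily
WHERE usage_date >= DATEADD('month', -12, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
```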
Warehouse performance and optimization
The warehouses page contains six reports to help you rightsize and effectively manage your virtual warehouses. Warehouse compute charges are typically among the largest components of a Snowflake bill, if not the largest. This page is designed to create a logical workflow, moving from identifying high-cost resources to surfacing potential root causes for those costs.
The process begins with a list of your 10 most expensive warehouses in the last 30 days. This provides a prioritized list of targets that consumed the most credits and require further analysis and optimization. By identifying “which” warehouses your engineers should look into, you can ensure they focus their efforts on high-impact targets.
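To give a sense of what this kind of ranking is built on, a simplified query against the ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view surfaces the same information; treat it as a sketch rather than the app's exact logic:

```sql
-- Illustrative sketch: top 10 warehouses by credits consumed in the last 30 days.
SELECT
    warehouse_name,
    SUM(credits_used) AS credits_last_30d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30d DESC
LIMIT 10;
```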
Next, a series of reports help determine “why” a warehouse is expensive, with insights into:
- Idle costs: credits consumed by warehouses that ran but did not execute queries. A high idle cost is a direct indicator of an autosuspend setting that is too high; lowering this value is a quick way to reduce waste.
- Potentially overprovisioned warehouses: warehouses that show signs of underutilization. This report surfaces the warehouses with low concurrency and no queueing over the last two weeks. It provides the data needed to consider downsizing these resources, as they are not being used to their full potential.
- Likely underprovisioned warehouses: warehouses that are struggling to keep up. This report identifies warehouses with consistently high query queueing over the last two weeks, marking them as candidates for upsizing to improve throughput (see the query sketch after this list).
- Dormant warehouses: warehouses that have most likely been abandoned. These warehouses have consumed less than one credit in the last two weeks. We recommend consolidating or archiving these resources for account hygiene and to prevent accidental usage.
- Mixed workloads: warehouses that are running a diverse set of workloads. This report identifies a common source of bottlenecks and frustration, where long-running ETL jobs interfere with short BI queries. It provides data engineers with the data they need to make informed decisions on splitting workloads into dedicated warehouses to improve concurrency and the user experience.
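As a rough illustration of how a queueing-based signal like the underprovisioned-warehouse report can be derived, the sketch below aggregates ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY over two weeks. The 0.5 threshold is an arbitrary example for illustration, not the app's actual cutoff:

```sql
-- Illustrative sketch: warehouses with sustained query queueing in the last two weeks.
SELECT
    warehouse_name,
    AVG(avg_running)     AS avg_running_queries,
    AVG(avg_queued_load) AS avg_queued_queries
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('day', -14, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
HAVING AVG(avg_queued_load) > 0.5   -- example threshold: consistent queueing suggests upsizing
ORDER BY avg_queued_queries DESC;
```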
Storage trends and hidden costs
The storage page focuses on the often-overlooked costs associated with data storage and retention. We start with a high-level analysis of storage trends. The first report tracks the growth of your storage footprint in gigabytes over the last 12 months, with a breakdown of the storage used each month for live data in tables vs staged files vs Fail-safe storage.
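A similar trend can be approximated from the ACCOUNT_USAGE.STORAGE_USAGE view, which reports daily account-level bytes for table storage, stages and Fail-safe. The sketch below simply averages those daily figures into monthly GB values and is illustrative rather than the app's exact query:

```sql
-- Illustrative sketch: average monthly storage footprint (GB) by category, last 12 months.
SELECT
    DATE_TRUNC('month', usage_date)       AS usage_month,
    AVG(storage_bytes)  / POWER(1024, 3)  AS table_storage_gb,
    AVG(stage_bytes)    / POWER(1024, 3)  AS stage_gb,
    AVG(failsafe_bytes) / POWER(1024, 3)  AS failsafe_gb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -12, CURRENT_DATE())
GROUP BY 1
ORDER BY 1;
```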
The rest of the page surfaces insights into potential waste and cost drivers, including:
- Potential redundant tables: a two-pronged approach to detecting tables that are incurring unnecessary storage costs. First, we look for structurally identical tables. These tables tend to be redundant and are often leftover dev/test objects in production. Then we identify large tables (over 1 GB) that haven’t been accessed in the last two weeks. These tables are prime candidates for archiving or deletion, as they represent cold data that consumes storage without providing value.
- Your most and least used tables: vital context for data cloud management. This report identifies the most and least accessed tables in your account over the last 30 days, distinguishing between direct access and base (indirect) access. A high base access count reveals critical, and sometimes hidden, dependencies for views or dashboards. This report gives analysts and engineers much-needed context before altering or dropping a table and can help focus clustering efforts on tables that provide the most downstream benefit.
- Hidden storage costs: find zombie storage and tables that retain data for too long. This report ranks the top 10 active and dropped tables by the amount of storage they consume for Time Travel and Fail-safe, uncovering costs incurred for historical data from tables that no longer exist but are still being retained. For active tables, this report helps identify candidates for shorter Time Travel policies, while for dropped tables it offers a path to cost savings by permanently purging unneeded historical data (a sample query sketch follows this list).
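For readers who want to poke at the retention-cost side themselves, a simplified version of this analysis can be run against ACCOUNT_USAGE.TABLE_STORAGE_METRICS. This is a sketch; the app's actual report may rank and filter differently:

```sql
-- Illustrative sketch: top 10 tables (active or dropped) by Time Travel + Fail-safe storage.
SELECT
    table_catalog,
    table_schema,
    table_name,
    table_dropped IS NOT NULL                              AS is_dropped,
    (time_travel_bytes + failsafe_bytes) / POWER(1024, 3)  AS retention_gb
FROM snowflake.account_usage.table_storage_metrics
ORDER BY retention_gb DESC
LIMIT 10;
```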
Query tuning and insights
The query page contains six reports to help data engineers and analysts detect, diagnose and resolve query performance issues. The page is structured to create a logical debugging workflow, starting with a high-level overview that tracks total query count and average execution time over the last 30 days. A sudden spike in the average execution time is often the first indicator of a performance issue that requires further investigation.
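A simplified version of that overview can be pulled from ACCOUNT_USAGE.QUERY_HISTORY. The sketch below is illustrative; the warehouse filter is just a rough way to exclude purely metadata operations and is not necessarily how the app defines internal system queries:

```sql
-- Illustrative sketch: daily query count and average elapsed time (seconds), last 30 days.
SELECT
    DATE_TRUNC('day', start_time)   AS query_day,
    COUNT(*)                        AS query_count,
    AVG(total_elapsed_time) / 1000  AS avg_elapsed_seconds   -- TOTAL_ELAPSED_TIME is in milliseconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND warehouse_name IS NOT NULL    -- rough filter: keep queries that ran on a warehouse
GROUP BY 1
ORDER BY 1;
```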
The other reports provide more context into potential root causes, with insights into:
- Longest running queries: a "hit list" of your slowest queries in the last two weeks. These queries are often the primary cause of slow dashboards and can drive up warehouse costs. This report provides a list of targets for data engineers to investigate for performance troubleshooting and optimization.
- Under-resourced queries: warehouses that are likely too small. This report identifies the top 10 queries that spilled the most data to disk in the last two weeks. Disk spill is a clear sign that the warehouse lacks sufficient memory, as it is forced to use slower disk storage to complete workloads.
- Poor partition pruning: improving pruning is one of the most effective ways to optimize queries. Poor pruning drives up costs because the warehouse has to read more data than it should. This report flags queries that scanned more than 50% of a table's micro-partitions for further investigation and optimization (see the sketch after this list). If the table is large, make sure it has an appropriate clustering key and that queries filter on that key, as this is often the cause of poor pruning.
- Frequently executed parameterized queries: queries that should use the query result cache to save time and credits. This report provides a side-by-side view of your most frequent queries in the past two weeks and your frequently executed parameterized queries (same structure, different literal values). While your most executed queries are obvious candidates for result caching, frequently executed parameterized queries can also benefit from Snowflake’s query result cache. Note that you will need to use bind variables instead of literal values for Snowflake to be able to reuse cached results.
- Query peak and off hours: a heatmap of your busiest times to run queries. This report analyzes the number of queries that ran in the last 30 days by day of the week and hour of the day. The resulting heatmap can help engineers schedule large loads and slow ETL jobs during off-peak hours to avoid resource contention. It can also inform more complex warehouse schedules where the size of the warehouse changes throughout the day based on usage trends.
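As an example of the pruning check described above, a sketch against ACCOUNT_USAGE.QUERY_HISTORY might look like the following. The 50% cutoff mirrors the report, while the ordering and limit are arbitrary choices for illustration:

```sql
-- Illustrative sketch: queries from the last two weeks that scanned more than
-- half of a table's micro-partitions (a sign of poor pruning).
SELECT
    query_id,
    warehouse_name,
    partitions_scanned,
    partitions_total,
    partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio,
    query_text
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -14, CURRENT_TIMESTAMP())
  AND partitions_total > 0
  AND partitions_scanned / NULLIF(partitions_total, 0) > 0.5
ORDER BY partitions_scanned DESC
LIMIT 10;
```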
Capital One Slingshot can help you automatically rightsize your warehouse throughout the day, so you’re not paying for compute you don’t need.
Snowflake health check app insights table
| Reported object | What it shows | Why it's important |
| --- | --- | --- |
| Monthly costs by service | A breakdown of your monthly Snowflake bill by service (e.g. compute, storage, serverless). | You need to see the full picture to manage your budget. This report provides a holistic view of your Snowflake spending, allowing you to understand total incurred costs, forecast costs, identify trends and focus optimization efforts on high-impact targets. |
| Credit burn rate | A projection of when you will fully exhaust your pre-paid credits (based on total value across active contracts). | This report provides a forward-looking view of your Snowflake commitment. It enables you to see the full picture across contracts, avoid surprises and leverage this data for strategic renewal planning. |
| Storage credit trends | Your monthly storage credit consumption for active data vs. storage used for backups and Fail-safe. | Storage costs can be a significant part of your bill. This report helps you pinpoint cost drivers. It answers questions like “Are credits being spent on live tables, staged files or data retention?” |
| Most expensive warehouses | The top 10 warehouses by credits consumed over the last 30 days. | This report is fundamental for cost optimization, as it helps you see whether a warehouse's cost is appropriate for its size. It surfaces insights into top spenders, inefficient warehouses (an XS warehouse near the top might be constantly running inefficient queries) and low-hanging fruit. |
| Most idle warehouses | The top 10 most expensive warehouses by idle cost in the last 30 days. | Idle cost is a direct measure of wasted spend. It represents costs incurred by the warehouse for the time between the completion of the last query and when the warehouse auto-suspended. This report identifies warehouses whose autosuspend settings could likely be shortened, for engineers to review and take action. |
| Underutilized or dormant warehouses | Your unused or underutilized warehouses in the last two weeks. | This report identifies inefficient warehouses by flagging two types of underutilization: underutilized warehouses running fewer than two queries at a time, and dormant warehouses that consumed less than one credit in the last two weeks, indicating they have most likely been abandoned. The output is a list of warehouses you can likely consolidate or archive. |
| Warehouses with mixed workloads | Warehouses that handled queries with widely varying runtimes in the last 30 days. | A warehouse handling a wide range of jobs is a major red flag for performance issues. A single "slow" query can easily monopolize the warehouse, forcing dozens of "fast" and "medium" queries to wait in a long queue, often called "head-of-line blocking." This report provides the insights needed to pinpoint critical bottlenecks, justify workload separation and improve predictability. |
| Underprovisioned warehouses | Warehouses that were struggling to keep up in the last two weeks. | Queued queries are a direct sign of a performance bottleneck and a poor user experience. This report flags warehouses where queries have consistently been forced to wait in a queue because the warehouse was too busy. It helps you pinpoint overloaded warehouses, identify candidates for upsizing and tune scaling policies. |
| Storage trends | Your monthly storage in GB for primary data, staged files and retention features. | Monitoring your storage trend is key for managing and forecasting costs. This report tracks your data storage footprint over the last 12 months, provides insights into data growth and storage drivers, and helps anticipate future storage costs. |
| Redundant tables | Potentially duplicative or unaccessed tables, based on activity in the last two weeks. | This is a two-pronged analysis to identify tables that may be unnecessary and are incurring storage costs, surfacing: 1. structurally identical tables, often created as backups or as part of development, which may hold duplicate data; 2. large, inactive tables with more than 1 GB of data that have not been accessed in the last two weeks, which consume considerable storage while providing no value. |
| Most and least accessed tables | The hottest and coolest tables in the last 30 days. | Understanding how your tables are used is critical for effective data management and governance. This report counts every read or write operation to determine which datasets are the most and least active, and breaks down how they are accessed (direct vs. base access). It provides insights into hidden dependencies (e.g. tables with high base access counts) and helps guide your optimization efforts. |
| Hidden storage costs | Zombie storage and tables with high retention costs. | Zombie storage and data retention costs can add up to significant amounts. These reports provide insight into what you’re storing (live data, historical data, backups and copies) and identify deleted tables that continue to incur costs. This two-pronged approach provides the context required to guide optimization. |
| Query performance trends | Query load and performance trends for the last 30 days. | A high-level view of the overall health and usage patterns of your compute resources. This report tracks your total query count and the average execution time (in seconds) for those queries, including time spent in queue. It excludes internal system queries to provide a clear view of user-driven activity and shows the daily relationship between query volume and performance. |
| Longest running queries | Your longest running queries in the last two weeks. | These queries are often the primary cause of user complaints. This report lists your top 10 queries with the longest “total elapsed time” (execution time plus time spent in queue). It provides insights into inefficient query code and identifies targets for cost and performance optimization. |
| Under-resourced queries | Queries that spilled the most data to disk in the last two weeks. | Spilling is a primary cause of slow query performance and a clear sign of a resource bottleneck. This report lists the top 10 queries by spilled data for further investigation and optimization. An easy fix is to upsize the warehouse and see if the extra memory prevents spilling and improves query performance. |
| Poor partition pruning | The queries that scanned more than half of a table’s partitions in the last two weeks. | Improving pruning is one of the most effective ways to optimize query performance. This report flags the top 10 queries that scanned more than 50% of a table's partitions, indicating that pruning was not effective. To improve performance, review the query's WHERE clause and the table's structure. For large tables, ensure an appropriate clustering key is in place and that queries filter on that key. |
| Queries that should have used query result cache | Frequently executed parameterized queries, based on the last two weeks. | This report surfaces queries that should use bind variables by analyzing your most frequent queries alongside structurally similar queries: those with the same structure that are run repeatedly with different literal values (e.g. WHERE id = 123 vs. WHERE id = 456). We recommend using bind variables for these queries so Snowflake can reuse cached results to improve performance and lower costs. |
| Query peak and off hours | Your query heatmap, based on the last 30 days. | Understanding your account's peak usage times is crucial for efficient resource management. This report analyzes query activity to create a heatmap that aggregates the number of queries run in each hour of each day of the week. It provides insights for capacity planning and schedule efficiency (i.e. when to run large data loads, ETL jobs, etc. to avoid impacting performance). Slingshot can help you turn these insights into automatic dynamic warehouse schedules. |
Conclusion
The Slingshot health check app for Snowflake is a Streamlit-based utility that queries ACCOUNT_USAGE and ORGANIZATION_USAGE to provide data engineers and DBAs with data-driven diagnostics for cost, compute, storage and query optimization. Its value lies in the learnings that it is based on. The health check was built to surface insights that we found crucial as we scaled on Snowflake to thousands of users running millions of queries daily.
The Snowflake health check app consolidates key metrics and analyses into a single dashboard. It was designed to enable systematic identification of inefficiencies and waste across a Snowflake account with a series of interconnected reports.
We hope this dashboard helps you eliminate waste and identify low-hanging fruit for optimization. If you need any help interpreting the results, or would like to learn how Slingshot can take action on those results on your behalf, book some time with the team here.