Snowflake warehouse TPC-DS benchmark analysis
Gen1 vs. Gen2 vs. Snowpark-optimized warehouses
TL;DR We ran the full TPC-DS benchmark at 1TB scale on three Snowflake warehouse types and across multiple sizes (small through XL). Comparing credit consumption and performance of Gen1 vs. Gen2 vs. Snowpark-optimized warehouses, we found significant performance differences driven by memory architecture. Read on for clear guidance on when each warehouse type provides optimal value.
Introduction: How much of an improvement is Gen2?
When Snowflake announced Gen2 warehouses, the marketing was compelling: Gen2 warehouses are the next evolution in warehouse compute architecture, powered by ARM-based processors and DDR5 memory, promising improved performance and concurrency without requiring any code changes. The implicit message was clear: you should upgrade to Gen2.
But as someone who spends considerable time optimizing cloud data costs, I've learned to be skeptical. The questions that kept me up at night were:
- Is Gen2 always better, or just in specific scenarios?
- What exactly changed architecturally, and does it matter for my workloads?
- Is the upgrade worth the operational overhead, or is this a solution looking for a problem?
So I did what any self-respecting data engineer would do: ran the benchmarks.
We ran the full TPC-DS benchmark at a 1TB scale to simulate real-life production workloads and assess where Gen2 and Snowpark-optimized warehouses stand in terms of cost and performance compared to Gen1 warehouses. This blog presents our findings from a comprehensive analysis of more than 1,200 TPC-DS queries across multiple warehouse sizes (small through XL), exploring how memory architecture impacts data spilling.
Methodology
The benchmark was conducted using Snowflake's official TPC-DS query set against a 1TB dataset sampled from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL. We tested Gen1 vs. Gen2 vs. Snowpark-optimized warehouses across four sizes (small, medium, large and XL). Each warehouse configuration executed all 103 TPC-DS queries (1 iteration per query) with cold cache to eliminate any unfair advantages.
Methodology Note: We deliberately ran single iterations with cold cache. This represents a "worst case" scenario that exposes architectural differences. Production workloads with warm cache would show smaller gaps. Your mileage will vary, and that's precisely the point.
The contenders:
| Type | Architecture | Snowflake description (*Source: Snowflake Documentation*) |
|---|---|---|
| Gen1 Standard | x86 CPUs, DDR4 memory | The original virtual warehouse |
| Gen2 Standard | ARM (Graviton3) CPUs, DDR5 memory | 2.1x faster, 4.4x DML improvement |
| Snowpark-Optimized | 16x memory per node vs. standard | For memory-intensive ML workloads |
Test configuration
| Parameter | Value |
|---|---|
| Benchmark | TPC-DS |
| Queries | 103 TPC-DS queries |
| Scale Factor | SF1000 (1TB) |
| Iterations | 1 per query per warehouse |
| Cache State | Cold start (0% cache hit observed) |
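A cold-cache run of this kind boils down to disabling the result cache and resuming a freshly suspended warehouse before each query. Here is a minimal sketch of that setup (the warehouse name is taken from the results below; the actual harness automates this and is not shown):

```sql
-- Minimal cold-cache setup per run (illustrative sketch)
ALTER SESSION SET USE_CACHED_RESULT = FALSE;   -- bypass the global result cache

-- Suspending and resuming the warehouse clears its local SSD cache.
-- (SUSPEND raises an error if the warehouse is already suspended.)
ALTER WAREHOUSE BENCH_GEN1_S SUSPEND;
ALTER WAREHOUSE BENCH_GEN1_S RESUME;

-- ...then execute the TPC-DS query under test against the 1TB dataset.
```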
The results: Not what we expected
Let me show you the headline numbers, then explain why they're misleading if you stop there:
| Warehouse | Type | Size | Total runtime (min) | Avg query (sec) |
|---|---|---|---|---|
| BENCH_SP_S | Snowpark | Small | 0.55 | 0.32 |
| BENCH_GEN1_L | Gen1 | Large | 0.62 | 0.36 |
| BENCH_GEN2_S | Gen2 | Small | 0.64 | 0.37 |
| BENCH_GEN1_M | Gen1 | Medium | 0.71 | 0.41 |
| BENCH_GEN2_L | Gen2 | Large | 0.85 | 0.50 |
| BENCH_GEN1_S | Gen1 | Small | 38.24 | 22.28 |
At first glance, the anomaly really stands out. Small Gen1 warehouses took 38 minutes while a small Gen2 took 0.64 minutes. That's a 60x difference. Either we made a mistake, or something very interesting is happening here.
The mystery: Why did Gen1 small fail?
Our first instinct was to blame the cache. Maybe Gen1 ran first with cold cache while Gen2 benefited from warmed storage. However, the cache hit rates told a different story.
| Warehouse | Cache hit % | TB scanned |
|---|---|---|
| BENCH_GEN1_S | 0.6% | 2.05 |
| All others | 0% | ~0 |
Every query ran cold, with ALTER SESSION SET USE_CACHED_RESULT = FALSE. The difference wasn't cache; it was something architectural.
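If you want to run the same sanity check on your own account, per-query cache usage is visible in ACCOUNT_USAGE; a sketch (the warehouse name pattern is an assumption):

```sql
-- Cache usage per benchmark warehouse (name pattern is illustrative)
SELECT
    WAREHOUSE_NAME,
    AVG(PERCENTAGE_SCANNED_FROM_CACHE)  AS avg_scanned_from_cache,  -- local disk cache reads
    SUM(BYTES_SCANNED) / POWER(1024, 4) AS tb_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE WAREHOUSE_NAME ILIKE 'BENCH_%'
  AND START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME
ORDER BY avg_scanned_from_cache DESC;
```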
Everything clicked when we checked spillage metrics:
|
Warehouse |
Queries spilled |
% spilled |
GB spilled |
|---|---|---|---|
|
BENCH_GEN1_S |
12 |
11.4% |
153.91 GB |
|
All others |
0 |
0% |
0 |
The aha moment: Gen1 small was the only warehouse that spilled to disk. When a query's intermediate results exceed available memory, Snowflake writes to local SSD (spilling).
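Spilling is easy to spot in your own query history. A sketch along these lines (assuming access to SNOWFLAKE.ACCOUNT_USAGE) surfaces the worst offenders:

```sql
-- Queries that spilled in the last 7 days, largest local spill first
SELECT
    QUERY_ID,
    WAREHOUSE_NAME,
    WAREHOUSE_SIZE,
    BYTES_SPILLED_TO_LOCAL_STORAGE  / POWER(1024, 3) AS gb_spilled_local,
    BYTES_SPILLED_TO_REMOTE_STORAGE / POWER(1024, 3) AS gb_spilled_remote
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND (BYTES_SPILLED_TO_LOCAL_STORAGE > 0
       OR BYTES_SPILLED_TO_REMOTE_STORAGE > 0)
ORDER BY gb_spilled_local DESC
LIMIT 20;
```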
But here's what puzzled us: a small Gen1 and a small Gen2 are both "Standard" small sized warehouses. They should have the same memory allocation, right? So why did Gen2 small handle the workload gracefully while Gen1 small suffered significant performance issues?
The architecture revelation
Taking a deeper dive into the architecture, we find that Gen2 isn't just "newer"; it's fundamentally different hardware:
| Component | Gen1 | Gen2 | Impact |
|---|---|---|---|
| CPU | Intel x86 | ARM Graviton3 | Better IPC (instructions per cycle) |
| Memory | DDR4 | DDR5 | Higher bandwidth, lower latency |
| L2 Cache | Standard | Larger | Fewer cache misses → less memory pressure |
Gen2's advantage isn't just being "faster," but being more memory-efficient. The DDR5 and larger L2 caches mean Gen2 can handle the same workload with less memory pressure. Gen2 small doesn’t have more memory than Gen1 small, but Gen2 uses memory more efficiently.
This explains the 60x performance gap (38 mins vs. 0.64 mins). Gen1 small hit memory limits and spilled 154GB to SSD. Gen2 small, with identical nominal specs but superior memory architecture, processed the same queries entirely in-memory.
The software angle
The hardware itself, however, only tells part of the story. Gen2 also includes targeted software optimizations that reduce memory pressure before it ever builds up. To better understand Gen2's software improvements, let's get a few things straight.
Gen2 does not change:
- Storage format/table layout fundamentals: still Snowflake micro-partitions (columnar, immutable storage units)
- SQL semantics
- Data model
So the dramatic reduction in spilling on Gen2 warehouses comes from the same query plan producing smaller distributed intermediates. The data shows Gen2 warehouses consistently scan fewer partitions than Gen1 warehouses for identical queries. Less data scanned means fewer intermediate results to hold in memory.
Why does memory spilling occur in the first place?
The docs are explicit: when a warehouse runs out of memory, it must “spill” intermediate data, first to local disk storage and then to remote cloud-provider storage if more space is needed. Performance degrades with each step.
Spills come from stateful operators like:
- Hash joins (the built side hash table gets too big)
- Group-by aggregations
- Sorts/window functions
- Big shuffles with skew
Memory spilling in Snowflake
Snowflake’s columnar, immutable micro-partition design is wonderful for analytics, but it means micro-partitions are never updated in place: any change requires writing new partitions.
We’ve described the practical impact of this in a previous post. In a nutshell, UPDATE, DELETE and MERGE (DML operations) can require rewriting entire column segments/partitions rather than modifying individual rows the way a row store would. That’s the natural price of prioritizing columnar, analytics-optimized storage in OLAP systems over the high-throughput, single-row updates of OLTP systems.
Let’s take a look at a simplified DML pipeline:
Phase 1: Identify affected data
- Determine which micro-partitions might contain rows that match predicates/join keys
- Read only the required columns
- Apply pruning using metadata and possible indexes
Phase 2: Compute row-level changes
- For MERGE, this is typically a join between source and target keys, producing:
- Matched rows to update
- Unmatched rows to insert
- Rows to delete
Phase 3: Write new micro-partitions
- Rewrite the affected portions into new micro-partitions
- Commit by updating metadata pointers (conceptually “copy-on-write”)
Phase 2 (joins/grouping/sorting) and phase 3 (repartitioning data for writes) can create large intermediates and heavy shuffles, which are classic spill triggers. This is where spilling occurs in the pipeline.
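To make phase 2 concrete, here is a hypothetical MERGE (table and column names are purely illustrative). The join on the key produces the matched and unmatched row sets, and every micro-partition containing an affected row is rewritten in phase 3:

```sql
-- Illustrative MERGE: the source-to-target join is the phase 2 intermediate
MERGE INTO orders AS t                          -- target table (hypothetical)
USING daily_order_updates AS s                  -- staged changes (hypothetical)
    ON t.order_id = s.order_id                  -- join keys drive the intermediate size
WHEN MATCHED AND s.is_deleted THEN
    DELETE                                      -- rows to delete
WHEN MATCHED THEN
    UPDATE SET t.status     = s.status,         -- matched rows to update
               t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, status, updated_at)       -- unmatched rows to insert
    VALUES (s.order_id, s.status, s.updated_at);
```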
How Snowflake reduces spilling with Gen2 improvements
Snowflake Gen2 documentation calls out both table scan operation enhancements and DELETE, UPDATE, MERGE improvements.
Those are the two stages with the most influence on whether downstream operators spill. But the more mechanistic explanation comes from the 2025 performance improvement notes and Optima docs. Snowflake Optima is only available on Gen2 warehouses.
Optima indexing:
- Continuously analyzes workload patterns
- Automatically creates “hidden indexes” in the background
- Specifically calls out “repetitive point-lookup queries” as a target
- Is built on top of Search Optimization Service (SOS)
Search Optimization Service itself works by building a search access path, a persistent structure that tracks which column values may appear in which micro-partitions so more partitions can be skipped at scan time. Fewer micro-partitions scanned → fewer rows/bytes survive the scan → smaller hash tables, aggregation states and sort buffers → less spilling.
Evidence of this is exposed in the Query Profile, where Optima adds a status like “Partitions pruned by Snowflake Optima.” If you saw spill drop after Gen2 migration, this line item is one of the first things to check.
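Optima builds its hidden indexes automatically and only on Gen2; the underlying Search Optimization Service can also be enabled explicitly on any table. For illustration (the TPC-DS table and column here are chosen arbitrarily):

```sql
-- Explicitly build a search access path for point lookups on one column
-- (Snowflake Optima does the equivalent automatically on Gen2 warehouses).
ALTER TABLE store_sales ADD SEARCH OPTIMIZATION ON EQUALITY(ss_customer_sk);

-- Check which search access paths exist and whether they are built yet
DESCRIBE SEARCH OPTIMIZATION ON store_sales;
```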
The general direction seems to be: less data early = less spill later. We corroborated this with internal research. We analyzed more than 2 billion query profiles in April 2025 and found that Gen2 warehouses scanned significantly fewer partitions than Gen1 warehouses.
As part of the research, we monitored and analyzed the following metrics to identify characteristics of the spillage.
| Metric | Source | Why It Matters |
|---|---|---|
| BYTES_SPILLED_TO_LOCAL_STORAGE | QUERY_HISTORY | Memory pressure indicator |
| BYTES_SPILLED_TO_REMOTE_STORAGE | QUERY_HISTORY | Severe memory pressure |
| QUERY_TYPE | QUERY_HISTORY | DML vs. SELECT ratio |
| PARTITIONS_SCANNED / PARTITIONS_TOTAL | QUERY_HISTORY | Pruning efficiency |
We also examined the number of credits consumed by query type. We found that Gen2 delivers substantial improvements to DELETE, UPDATE and MERGE operations, with as much as a 25% reduction in cost for UPDATE.
Credit-weighted impact by query type
Here are our findings from internal research into 2 billion query profiles:
| Query type | Gen2 impact | Notes |
|---|---|---|
| UPDATE | ~25% cost reduction | Significant improvement across most sizes |
| DELETE | ~20% cost reduction | Improvement varies by size |
| MERGE | ~5% cost reduction | Modest improvement |
| PARTITIONED_UNLOAD | ~2-5% cost reduction | Small improvement |
| SELECT | 1-18% cost increase | Regression at all sizes |
| INSERT | Variable (up to +46%) | Regresses at larger sizes |
| CREATE_TABLE_AS_SELECT | Variable | Mixed results |
As you can see in the table above, Gen2 was found to be more cost-effective for DML operations, while Gen1 wins for SELECT.
Bear in mind that these results are based on internal research and may not be indicative of future results for your business.
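If you'd like a comparable breakdown for your own account, one way to approximate it is to join per-query credit attribution to query types. A sketch, assuming the ACCOUNT_USAGE views (including QUERY_ATTRIBUTION_HISTORY) are available in your account and ignoring idle warehouse time:

```sql
-- Approximate compute credits by query type over the last 28 days
SELECT
    qh.QUERY_TYPE,
    COUNT(*)                           AS query_count,
    SUM(qa.CREDITS_ATTRIBUTED_COMPUTE) AS compute_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY AS qa
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY AS qh
  ON qa.QUERY_ID = qh.QUERY_ID
WHERE qh.START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
GROUP BY qh.QUERY_TYPE
ORDER BY compute_credits DESC;
```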
TPC-DS queries most affected by spilling
To cross-verify the DML findings of our internal study, we delved a little deeper and took a look at the exact TPC-DS queries that benefited the most from Gen2 warehouses.
The following queries showed the most dramatic performance difference between Gen1 and Gen2 at a small size:
| Query | Gen1 (sec) | Gen2 (sec) | Gen2 faster |
|---|---|---|---|
| query04 | 370.42 | 0.31 | 99.9% |
| query11 | 218.66 | 0.29 | 99.9% |
| query23_p2 | 147.81 | 0.39 | 99.7% |
| query23_p1 | 141.32 | 0.35 | 99.8% |
| query78 | 134.54 | 0.36 | 99.7% |
| query74 | 111.59 | 0.24 | 99.8% |
| query87 | 75.45 | 0.40 | 99.5% |
| query09 | 61.46 | 0.43 | 99.3% |
These queries involve complex joins, aggregations and window functions that require significant intermediate memory.
The SELECT regression problem
Our internal study also found that Gen2 shows a cost regression on SELECT queries. This is an interesting finding as SELECT queries are the bread and butter of analytics.
| Size | SELECT cost change (Gen1 → Gen2) |
|---|---|
| X-Small | +18.2% (regression) |
| Small | +14.6% (regression) |
| Medium | +1.9% (regression) |
| Large | +0.7% (regression) |
| X-Large | +2.2% (regression) |
Meanwhile, DML operations improved significantly on Gen2:
| Operation | Typical improvement |
|---|---|
| UPDATE | ~25% cost reduction |
| DELETE | ~20% cost reduction |
| MERGE | ~5% cost reduction |
What this means: Gen2 is optimized for data engineering workloads (ETL, DML-heavy pipelines), not pure analytics. The ARM architecture and query engine optimizations favor write operations. For read-heavy analytical workloads, which is what TPC-DS measures, Gen1's x86 architecture is still competitive and sometimes offers better performance.
TPC-DS is 100% SELECT queries. That's why Gen1 wins at large sizes: there is no memory pressure to expose its weakness, and its SELECT-friendly architecture plays to its strength.
Plot twist: Gen1 wins at large sizes
Here's where it gets interesting. If Gen2's memory efficiency is its superpower, what happens when memory isn't the bottleneck?
We determined the fastest warehouse type at each size per query. The results show a clear pattern based on memory requirements:
- Gen1 warehouses win at large+ sizes, where memory is sufficient, providing the best performance for 62% of queries tested.
- Gen2 warehouses dominate execution times at small/medium sizes, thanks to built-in query optimizations and higher memory bandwidth, providing the best performance for more than half of the queries.
Here are the win rates by warehouse size:
| Size | Winner | Win % |
|---|---|---|
| Large | Gen1 | 62.1% |
| Large | Gen2 | 37.9% |
| Medium | Gen2 | 56.3% |
| Medium | Gen1 | 43.7% |
| Small | Gen2 | 100% |
| Small | Gen1 | 0% |
At large size, Gen1 wins 62% of queries. At small, Gen1 wins zero queries. The crossover happens at medium.
This is the finding that should make you pause before upgrading: Gen2 is not universally better. It depends on multiple factors, including warehouse size and the types of queries you run. This just reinforces what every Snowflake admin already knows: warehouse optimization requires experimentation.
This is precisely why we built Capital One Slingshot: to help data teams test, validate, and optimize their data clouds. You can easily set up or migrate a warehouse to Gen2 in Slingshot to assess the impact of that change. Read more about that in this post.
Cost analysis
Up to this point we’ve covered the performance differences between warehouse types. Now let’s look at the credit consumption and overall cost of each warehouse tested.
Both the benchmark and our internal research indicate that while Gen2 has a higher cost per hour, its efficiency gains can negate the higher rate. In our analysis of 2 billion query profiles in April 2025, we found that Gen2 can be up to 25% more cost-effective for DML operations.
In the benchmark, we found that Gen1 small (10.58 credits) and Gen2 small (9.66 credits) consumed nearly identical credits despite a 60x difference in query runtime. This seems counterintuitive: if Gen2 finished in under a minute, shouldn’t it cost dramatically less? The explanation lies in how Snowflake bills compute.
Snowflake credits accrue based on warehouse active time, not query execution time. Gen1 small’s queries ran for 38 minutes but much of that time was spent waiting on disk I/O during spilling. This means that the warehouse was “active” but throttled by storage bottlenecks, not efficiently consuming compute. Gen2 small completed the same work in 40 seconds using actual compute cycles.
The near identical credit consumption reveals something important: you’re paying roughly the same either way, but Gen2 can give you 60x better performance for the spend. For that reason it makes more sense to say Gen2 delivers more value per credit.
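Because billing follows warehouse active time, the cleanest way to compare spend is at the warehouse level rather than the query level. A sketch using metering data (the warehouse name pattern is illustrative):

```sql
-- Credits billed per warehouse, however the active time was spent
SELECT
    WAREHOUSE_NAME,
    SUM(CREDITS_USED_COMPUTE) AS compute_credits,
    SUM(CREDITS_USED)         AS total_credits    -- compute + cloud services
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND WAREHOUSE_NAME ILIKE 'BENCH_%'
GROUP BY WAREHOUSE_NAME
ORDER BY total_credits DESC;
```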
At the other sizes, where Gen1 does not fail, we found that Gen1 can be more cost-effective than Gen2 warehouses. Bear in mind that the TPC-DS queries are SELECT queries, which is where Gen1 really shines.
As for Snowpark-optimized warehouses, they offered cost-effectiveness comparable to or better than their most efficient standard counterparts, making them a viable option, especially at small sizes where their 16x memory per node makes a significant impact.
| WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_CREDITS | COST_USD | Difference |
|---|---|---|---|---|---|
| GEN1_L | GEN1_STANDARD | L | 1.62 | 3.23 | 30% more efficient than Gen2 |
| GEN1_M | GEN1_STANDARD | M | 0.56 | 1.12 | 52% more efficient than Gen2 |
| GEN1_S | GEN1_STANDARD | S | 10.58 | 21.15 | 10% less efficient than Gen2 |
| GEN1_XL | GEN1_STANDARD | XL | 3.31 | 6.62 | 27% more efficient than Gen2 |
| GEN2_L | GEN2_STANDARD | L | 2.33 | 4.67 | |
| GEN2_M | GEN2_STANDARD | M | 1.18 | 2.36 | |
| GEN2_S | GEN2_STANDARD | S | 9.66 | 19.33 | |
| GEN2_XL | GEN2_STANDARD | XL | 4.51 | 9.01 | |
| SP_L_64X | SNOWPARK-OPTIMIZED | L | 1.38 | 2.77 | 14% more efficient than Gen1 |
| SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.56 | 1.13 | Comparable to Gen1 |
| SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.19 | 0.37 | 52x more efficient than Gen2 |
When should you migrate to Gen2?
Based on the benchmark analysis and our internal research, here’s an honest path to make your decision:
Upgrade to Gen2 if:
- >30% of credits come from DML (UPDATE, DELETE, MERGE): Gen2’s sweet spot
- You’re running small/medium sized warehouses: Gen2’s memory efficiency shines here
- You see frequent spilling: check BYTES_SPILLED_TO_LOCAL_STORAGE in QUERY_HISTORY
- >30% of queries scan >1TB: large scans benefit from DDR5 bandwidth
Think twice about Gen2 if:
- Your workload is SELECT-heavy analytics: TPC-DS shows a 1-18% regression
- You’re already running large or XL warehouses: Gen1 wins 62% of queries at this tier
- Cost efficiency is paramount: Gen1 is often more cost-effective than Gen2 and offers equivalent performance on read workloads
Consider Snowpark-optimized if:
- Memory pressure is your primary bottleneck
- You’d otherwise need to scale large/XL just to avoid spilling
- You’re running ML, complex UDFs, or large DataFrame operations
Here’s a query to help you assess if a warehouse is a good candidate for Gen2.
```sql
-- Gen2 Candidate Assessment Query
-- Note: CREDITS_USED_CLOUD_SERVICES is used as a per-query weighting proxy,
-- since QUERY_HISTORY does not expose per-query warehouse compute credits.
WITH warehouse_metrics AS (
    SELECT
        WAREHOUSE_NAME,
        COUNT(*) AS total_queries,
        AVG(PARTITIONS_SCANNED / NULLIF(PARTITIONS_TOTAL, 0)) * 100
            AS pct_partitions_scanned,
        SUM(CASE WHEN QUERY_TYPE IN ('DELETE', 'MERGE', 'UPDATE')
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_dml,
        SUM(CASE WHEN BYTES_SCANNED > 1e12               -- scans larger than ~1 TB
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_large_scans,
        SUM(CASE WHEN TOTAL_ELAPSED_TIME > 1800000       -- queries longer than 30 minutes
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_long_queries
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
    GROUP BY WAREHOUSE_NAME
)
SELECT
    WAREHOUSE_NAME,
    ROUND(pct_partitions_scanned, 1)   AS pct_partitions,
    ROUND(pct_credits_dml, 1)          AS pct_dml,
    ROUND(pct_credits_large_scans, 1)  AS pct_large_scans,
    ROUND(pct_credits_long_queries, 1) AS pct_long_queries,
    (CASE WHEN pct_partitions_scanned > 40 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_dml > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_large_scans > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_long_queries > 30 THEN 1 ELSE 0 END) AS criteria_met
FROM warehouse_metrics
ORDER BY criteria_met DESC;
```

Run this query on your warehouses to see whether they are good candidates for Gen2 based on the criteria we shared above.
Alternatively, you can let Slingshot assess that for you. Slingshot offers a low-risk way to test and validate whether Gen2 is right for your workloads. If you’d like to see this in action, book a time to meet with the team.
Conclusion
We ran the full TPC-DS benchmark on Snowflake Gen1, Gen2 and Snowpark-optimized warehouses in an attempt to assess when to use each warehouse type. Overall, the data points to differences in cost and performance across warehouse types, based on size and query types.
If you take one thing from this analysis, let it be this: profile before you upgrade.
Run the assessment query against your warehouses and see what kind of workloads you are running. If you see significant DML credits or spilling, Gen2 will very likely help. If you have SELECT-dominated workloads with no spilling, Gen1 is probably the better bet.
Snowflake’s most important strength isn't any single warehouse generation. It’s the model: decoupled storage and compute, a consistent SQL surface area and increasingly specialized compute choices you can swap in and out as your workload evolves.
Gen2 is a meaningful part of that evolution. In the scenarios it targets (DML-heavy pipelines, memory-sensitive execution and performance under pressure), it delivers exactly the kind of stability and efficiency improvements that modern data engineering teams need.
For completeness, here is the full table of results from our study.
| WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_MIN | AVG_SEC | MEDIAN_SEC | P95_SEC | P95 comparison (Gen2 vs. Gen1 vs. Snowpark-optimized) |
|---|---|---|---|---|---|---|---|
| GEN1_L | GEN1_STANDARD | L | 0.62 | 0.36 | 0.28 | 0.76 | |
| GEN1_XL | GEN1_STANDARD | XL | 0.65 | 0.38 | 0.31 | 0.79 | |
| GEN2_L | GEN2_STANDARD | L | 0.85 | 0.50 | 0.42 | 1.01 | 33% less efficient than Gen1 |
| GEN2_XL | GEN2_STANDARD | XL | 0.67 | 0.39 | 0.31 | 0.86 | 8% less efficient than Gen1 |
| GEN1_M | GEN1_STANDARD | M | 0.71 | 0.41 | 0.32 | 0.93 | |
| GEN2_M | GEN2_STANDARD | M | 0.65 | 0.38 | 0.29 | 0.87 | 6% more efficient than Gen1 |
| GEN1_S | GEN1_STANDARD | S | 38.24 | 22.28 | 8.43 | 107.3 | |
| GEN2_S | GEN2_STANDARD | S | 0.64 | 0.37 | 0.31 | 0.89 | ~120x more efficient than Gen1 |
| SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.70 | 0.41 | 0.34 | 0.80 | 8% more efficient than Gen2 |
| SP_L_64X | SNOWPARK-OPTIMIZED | L | 0.59 | 0.34 | 0.30 | 0.64 | 19% more efficient than Gen1 |
| SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.55 | 0.32 | 0.27 | 0.63 | 29% more efficient than Gen2 |

