Snowflake warehouse TPC-DS benchmark analysis
Gen1 vs. Gen2 vs. Snowpark-optimized warehouses
TL;DR We ran the full TPC-DS benchmark at 1TB scale on three Snowflake warehouse types and across multiple sizes (small through XL). Comparing credit consumption and performance of Gen1 vs. Gen2 vs. Snowpark-optimized warehouses, we found significant performance differences driven by memory architecture. Read on for clear guidance on when each warehouse type provides optimal value.
Introduction: How much of an improvement is Gen2?
When Snowflake announced Gen2 warehouses, the marketing was compelling: Gen2 warehouses are the next evolution in warehouse compute architecture, powered by ARM-based processors and DDR5 memory, promising improved performance and concurrency without requiring any code changes. The implicit message was clear: you should upgrade to Gen2.
But as someone who spends considerable time optimizing cloud data costs, I've learned to be skeptical. The questions that kept me up at night were:
- Is Gen2 always better, or just in specific scenarios?
- What exactly changed architecturally, and does it matter for my workloads?
- Is the upgrade worth the operational overhead, or is this a solution looking for a problem?
So I did what any self-respecting data engineer would do: ran the benchmarks.
We ran the full TPC-DS benchmark at a 1TB scale to simulate real-life production workloads and assess where Gen2 and Snowpark-optimized warehouses stand in terms of cost and performance compared to Gen1 warehouses. This blog presents our findings from a comprehensive analysis of more than 1,200 TPC-DS queries across multiple warehouse sizes (small through XL), exploring how memory architecture impacts data spilling.
Methodology
The benchmark was conducted using Snowflake's official TPC-DS query set against a 1TB dataset sampled from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL. We tested Gen1 vs. Gen2 vs. Snowpark-optimized warehouses across four sizes (small, medium, large and XL). Each warehouse configuration executed all 103 TPC-DS queries (1 iteration per query) with cold cache to eliminate any unfair advantages.
Methodology Note: We deliberately ran single iterations with cold cache. This represents a "worst case" scenario that exposes architectural differences. Production workloads with warm cache would show smaller gaps. Your mileage will vary, and that's precisely the point.
The contenders:
| Type | Architecture | Snowflake description (*Source: Snowflake Documentation*) |
|---|---|---|
| Gen1 Standard | x86 CPUs, DDR4 memory | The original virtual warehouse |
| Gen2 Standard | ARM (Graviton3) CPUs, DDR5 memory | 2.1x faster, 4.4x DML improvement |
| Snowpark-Optimized | 16x memory per node vs. standard | For memory-intensive ML workloads |
Test configuration
| Parameter | Value |
|---|---|
| Benchmark | TPC-DS |
| Queries | 103 TPC-DS queries |
| Scale Factor | SF1000 (1TB) |
| Iterations | 1 per query per warehouse |
| Cache State | Cold start (0% cache hit observed) |
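A cold-cache run of this kind boils down to disabling the result cache and resuming a freshly suspended warehouse before each query. Here is a minimal sketch of that setup (the warehouse name is taken from the results below; the actual harness automates this and is not shown):

```sql
-- Minimal cold-cache setup per run (illustrative sketch)
ALTER SESSION SET USE_CACHED_RESULT = FALSE;   -- bypass the global result cache

-- Suspending and resuming the warehouse clears its local SSD cache.
-- (SUSPEND raises an error if the warehouse is already suspended.)
ALTER WAREHOUSE BENCH_GEN1_S SUSPEND;
ALTER WAREHOUSE BENCH_GEN1_S RESUME;

-- ...then execute the TPC-DS query under test against the 1TB dataset.
```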
The results: Not what we expected
Let me show you the headline numbers, then explain why they're misleading if you stop there:
| Warehouse | Type | Size | Total runtime (min) | Avg query (sec) |
|---|---|---|---|---|
| BENCH_SP_S | Snowpark | Small | 0.55 | 0.32 |
| BENCH_GEN1_L | Gen1 | Large | 0.62 | 0.36 |
| BENCH_GEN2_S | Gen2 | Small | 0.64 | 0.37 |
| BENCH_GEN1_M | Gen1 | Medium | 0.71 | 0.41 |
| BENCH_GEN2_L | Gen2 | Large | 0.85 | 0.50 |
| BENCH_GEN1_S | Gen1 | Small | 38.24 | 22.28 |
At first glance, the anomaly really stands out. Small Gen1 warehouses took 38 minutes while a small Gen2 took 0.64 minutes. That's a 60x difference. Either we made a mistake, or something very interesting is happening here.
The mystery: Why did Gen1 small fail?
Our first instinct was to blame the cache. Maybe Gen1 ran first with cold cache while Gen2 benefited from warmed storage. However, the cache hit rates told a different story.
| Warehouse | Cache hit % | TB scanned |
|---|---|---|
| BENCH_GEN1_S | 0.6% | 2.05 |
| All others | 0% | ~0 |
Every query ran cold, with ALTER SESSION SET USE_CACHED_RESULT = FALSE. The difference wasn't cache; it was something architectural.
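If you want to run the same sanity check on your own account, per-query cache usage is visible in ACCOUNT_USAGE; a sketch (the warehouse name pattern is an assumption):

```sql
-- Cache usage per benchmark warehouse (name pattern is illustrative)
SELECT
    WAREHOUSE_NAME,
    AVG(PERCENTAGE_SCANNED_FROM_CACHE)  AS avg_scanned_from_cache,  -- local disk cache reads
    SUM(BYTES_SCANNED) / POWER(1024, 4) AS tb_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE WAREHOUSE_NAME ILIKE 'BENCH_%'
  AND START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME
ORDER BY avg_scanned_from_cache DESC;
```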
Everything clicked when we checked spillage metrics:
|
Warehouse |
Queries spilled |
% spilled |
GB spilled |
|---|---|---|---|
|
BENCH_GEN1_S |
12 |
11.4% |
153.91 GB |
|
All others |
0 |
0% |
0 |
The aha moment: Gen1 small was the only warehouse that spilled to disk. When a query's intermediate results exceed available memory, Snowflake writes to local SSD (spilling).
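Spilling is easy to spot in your own query history. A sketch along these lines (assuming access to SNOWFLAKE.ACCOUNT_USAGE) surfaces the worst offenders:

```sql
-- Queries that spilled in the last 7 days, largest local spill first
SELECT
    QUERY_ID,
    WAREHOUSE_NAME,
    WAREHOUSE_SIZE,
    BYTES_SPILLED_TO_LOCAL_STORAGE  / POWER(1024, 3) AS gb_spilled_local,
    BYTES_SPILLED_TO_REMOTE_STORAGE / POWER(1024, 3) AS gb_spilled_remote
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND (BYTES_SPILLED_TO_LOCAL_STORAGE > 0
       OR BYTES_SPILLED_TO_REMOTE_STORAGE > 0)
ORDER BY gb_spilled_local DESC
LIMIT 20;
```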
But here's what puzzled us: a small Gen1 and a small Gen2 are both "Standard" small sized warehouses. They should have the same memory allocation, right? So why did Gen2 small handle the workload gracefully while Gen1 small suffered significant performance issues?
The architecture revelation
Taking a deeper dive into the architecture, we find that Gen2 isn't just "newer"; it's fundamentally different hardware:
| Component | Gen1 | Gen2 | Impact |
|---|---|---|---|
| CPU | Intel x86 | ARM Graviton3 | Better IPC (instructions per cycle) |
| Memory | DDR4 | DDR5 | Higher bandwidth, lower latency |
| L2 Cache | Standard | Larger | Fewer cache misses → less memory pressure |
Gen2's advantage isn't just being "faster," but being more memory-efficient. The DDR5 and larger L2 caches mean Gen2 can handle the same workload with less memory pressure. Gen2 small doesn’t have more memory than Gen1 small, but Gen2 uses memory more efficiently.
This explains the 60x performance gap (38 mins vs. 0.64 mins). Gen1 small hit memory limits and spilled 154GB to SSD. Gen2 small, with identical nominal specs but superior memory architecture, processed the same queries entirely in-memory.
The software angle
The hardware itself, however, only tells part of the story. Gen2 also includes targeted software optimizations that reduce memory pressure before it ever builds up. To better understand Gen2's software improvements, let's get a few things straight.
Gen2 does not change:
- Storage format/table layout fundamentals: still Snowflake micro-partitions (columnar, immutable storage units)
- SQL semantics
- Data model
So the dramatic reduction in spilling on Gen2 warehouses comes from the same query plan producing smaller distributed intermediates. The data shows Gen2 warehouses consistently scan fewer partitions than Gen1 warehouses for identical queries. Less data scanned means fewer intermediate results to hold in memory.
Why does memory spilling occur in the first place?
The docs are explicit: when a warehouse runs out of memory, it must “spill” intermediate data, first to local disk storage and then to remote cloud-provider storage if more space is needed. Performance degrades with each step.
Spills come from stateful operators like:
- Hash joins (the built side hash table gets too big)
- Group-by aggregations
- Sorts/window functions
- Big shuffles with skew
Memory spilling in Snowflake
Snowflake’s columnar, immutable micro-partition design is wonderful for analytics, but it means micro-partitions are never updated in place: any change requires writing new partitions.
We’ve described the practical impact of this in a previous post. In a nutshell, UPDATE, DELETE and MERGE (DML operations) can require rewriting entire column segments/partitions rather than modifying individual rows the way a row store would. That’s the natural price of prioritizing columnar, analytics-optimized storage in OLAP systems over the high-throughput, single-row updates of OLTP systems.
Let’s take a look at a simplified DML pipeline:
Phase 1: Identify affected data
- Determine which micro-partitions might contain rows that match predicates/join keys
- Read only the required columns
- Apply pruning using metadata and possible indexes
Phase 2: Compute row-level changes
- For MERGE, this is typically a join between source and target keys, producing:
- Matched rows to update
- Unmatched rows to insert
- Rows to delete
Phase 3: Write new micro-partitions
- Rewrite the affected portions into new micro-partitions
- Commit by updating metadata pointers (conceptually “copy-on-write”)
Phase 2 (joins/grouping/sorting) and phase 3 (repartitioning data for writes) can create large intermediates and heavy shuffles, which are classic spill triggers. This is where spilling occurs in the pipeline.
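To make phase 2 concrete, here is a hypothetical MERGE (table and column names are purely illustrative). The join on the key produces the matched and unmatched row sets, and every micro-partition containing an affected row is rewritten in phase 3:

```sql
-- Illustrative MERGE: the source-to-target join is the phase 2 intermediate
MERGE INTO orders AS t                          -- target table (hypothetical)
USING daily_order_updates AS s                  -- staged changes (hypothetical)
    ON t.order_id = s.order_id                  -- join keys drive the intermediate size
WHEN MATCHED AND s.is_deleted THEN
    DELETE                                      -- rows to delete
WHEN MATCHED THEN
    UPDATE SET t.status     = s.status,         -- matched rows to update
               t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, status, updated_at)       -- unmatched rows to insert
    VALUES (s.order_id, s.status, s.updated_at);
```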
How Snowflake reduces spilling with Gen2 improvements
Snowflake Gen2 documentation calls out both table scan operation enhancements and DELETE, UPDATE, MERGE improvements.
Those are the two stages with the most influence on whether downstream operators spill. But the more mechanistic explanation comes from the 2025 performance improvement notes and Optima docs. Snowflake Optima is only available on Gen2 warehouses.
Optima indexing:
- Continuously analyzes workload patterns
- Automatically creates “hidden indexes” in the background
- Specifically calls out “repetitive point-lookup queries” as a target
- Is built on top of Search Optimization Service (SOS)
Search Optimization Service itself works by building a search access path, a persistent structure that tracks which column values may appear in which micro-partitions so more partitions can be skipped at scan time. Fewer micro-partitions scanned → fewer rows/bytes survive the scan → smaller hash tables, aggregation states and sort buffers → less spilling.
Evidence of this is exposed in the Query Profile, where Optima adds a status like “Partitions pruned by Snowflake Optima.” If you saw spill drop after Gen2 migration, this line item is one of the first things to check.
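Optima builds its hidden indexes automatically and only on Gen2; the underlying Search Optimization Service can also be enabled explicitly on any table. For illustration (the TPC-DS table and column here are chosen arbitrarily):

```sql
-- Explicitly build a search access path for point lookups on one column
-- (Snowflake Optima does the equivalent automatically on Gen2 warehouses).
ALTER TABLE store_sales ADD SEARCH OPTIMIZATION ON EQUALITY(ss_customer_sk);

-- Check which search access paths exist and whether they are built yet
DESCRIBE SEARCH OPTIMIZATION ON store_sales;
```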
The general direction seems to be: less data early = less spill later. We corroborated this with internal research. We analyzed more than 2 billion query profiles in April 2025 and found that Gen2 warehouses scanned significantly fewer partitions than Gen1 warehouses.
As part of the research, we monitored and analyzed the following metrics to identify characteristics of the spillage.
| Metric | Source | Why It Matters |
|---|---|---|
| BYTES_SPILLED_TO_LOCAL_STORAGE | QUERY_HISTORY | Memory pressure indicator |
| BYTES_SPILLED_TO_REMOTE_STORAGE | QUERY_HISTORY | Severe memory pressure |
| QUERY_TYPE | QUERY_HISTORY | DML vs. SELECT ratio |
| PARTITIONS_SCANNED / PARTITIONS_TOTAL | QUERY_HISTORY | Pruning efficiency |
We also examined the number of credits consumed by query type. We found that Gen2 delivers substantial improvements to DELETE, UPDATE and MERGE operations, with as much as a 25% reduction in cost for UPDATE.
Credit-weighted impact by query type
Here are our findings from internal research into 2 billion query profiles:
| Query type | Gen2 impact | Notes |
|---|---|---|
| UPDATE | ~25% cost reduction | Significant improvement across most sizes |
| DELETE | ~20% cost reduction | Improvement varies by size |
| MERGE | ~5% cost reduction | Modest improvement |
| PARTITIONED_UNLOAD | ~2-5% cost reduction | Small improvement |
| SELECT | 1-18% cost increase | Regression at all sizes |
| INSERT | Variable (up to +46%) | Regresses at larger sizes |
| CREATE_TABLE_AS_SELECT | Variable | Mixed results |
As you can see in the table above, Gen2 was found to be more cost-effective for DML operations, while Gen1 wins for SELECT.
Bear in mind that these results are based on internal research and may not be indicative of future results for your business.
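If you'd like a comparable breakdown for your own account, one way to approximate it is to join per-query credit attribution to query types. A sketch, assuming the ACCOUNT_USAGE views (including QUERY_ATTRIBUTION_HISTORY) are available in your account and ignoring idle warehouse time:

```sql
-- Approximate compute credits by query type over the last 28 days
SELECT
    qh.QUERY_TYPE,
    COUNT(*)                           AS query_count,
    SUM(qa.CREDITS_ATTRIBUTED_COMPUTE) AS compute_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY AS qa
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY AS qh
  ON qa.QUERY_ID = qh.QUERY_ID
WHERE qh.START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
GROUP BY qh.QUERY_TYPE
ORDER BY compute_credits DESC;
```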
TPC-DS queries most affected by spilling
To cross-verify the DML findings of our internal study, we delved a little deeper and took a look at the exact TPC-DS queries that benefited the most from Gen2 warehouses.
The following queries showed the most dramatic performance difference between Gen1 and Gen2 at a small size:
| Query | Gen1 (sec) | Gen2 (sec) | Gen2 faster |
|---|---|---|---|
| query04 | 370.42 | 0.31 | 99.9% |
| query11 | 218.66 | 0.29 | 99.9% |
| query23_p2 | 147.81 | 0.39 | 99.7% |
| query23_p1 | 141.32 | 0.35 | 99.8% |
| query78 | 134.54 | 0.36 | 99.7% |
| query74 | 111.59 | 0.24 | 99.8% |
| query87 | 75.45 | 0.40 | 99.5% |
| query09 | 61.46 | 0.43 | 99.3% |
These queries involve complex joins, aggregations and window functions that require significant intermediate memory.
The SELECT regression problem
Our internal study also found that Gen2 shows a cost regression on SELECT queries. This is an interesting finding as SELECT queries are the bread and butter of analytics.
| Size | SELECT cost change (Gen1 → Gen2) |
|---|---|
| X-Small | +18.2% (regression) |
| Small | +14.6% (regression) |
| Medium | +1.9% (regression) |
| Large | +0.7% (regression) |
| X-Large | +2.2% (regression) |
Meanwhile, DML operations improved significantly on Gen2:
| Operation | Typical improvement |
|---|---|
| UPDATE | ~25% cost reduction |
| DELETE | ~20% cost reduction |
| MERGE | ~5% cost reduction |
What this means: Gen2 is optimized for data engineering workloads (ETL, DML-heavy pipelines), not pure analytics. The ARM architecture and query engine optimizations favor write operations. For read-heavy analytical workloads, which is what TPC-DS measures, Gen1's x86 architecture is still competitive and sometimes offers better performance.
TPC-DS is 100% SELECT queries. That's why Gen1 wins at large sizes: there is no memory pressure to expose its weakness, and its SELECT-friendly architecture plays to its strength.
Plot twist: Gen1 wins at large sizes
Here's where it gets interesting. If Gen2's memory efficiency is its superpower, what happens when memory isn't the bottleneck?
We determined the fastest warehouse type at each size per query. The results show a clear pattern based on memory requirements:
- Gen1 warehouses win at large+ sizes, where memory is sufficient, providing the best performance for 62% of queries tested.
- Gen2 warehouses dominate execution times at small/medium sizes, thanks to built-in query optimizations and higher memory bandwidth, providing the best performance for more than half of the queries.
Here are the win rates by warehouse size:
| Size | Winner | Win % |
|---|---|---|
| Large | Gen1 | 62.1% |
| Large | Gen2 | 37.9% |
| Medium | Gen2 | 56.3% |
| Medium | Gen1 | 43.7% |
| Small | Gen2 | 100% |
| Small | Gen1 | 0% |
At large size, Gen1 wins 62% of queries. At small, Gen1 wins zero queries. The crossover happens at medium.
This is the finding that should make you pause before upgrading: Gen2 is not universally better. It depends on multiple factors, including warehouse size and the types of queries you run. This just reinforces what every Snowflake admin already knows: warehouse optimization requires experimentation.
This is precisely why we built Capital One Slingshot: to help data teams test, validate, and optimize their data clouds. You can easily set up or migrate a warehouse to Gen2 in Slingshot to assess the impact of that change. Read more about that in this post.
Cost analysis
Up to this point we’ve covered the performance differences between warehouse types. Now let’s look at the credit consumption and overall cost of each warehouse tested.
Both the benchmark and our internal research indicate that while Gen2 has a higher cost per hour, its efficiency gains can negate the higher rate. In our analysis of 2 billion query profiles in April 2025, we found that Gen2 can be up to 25% more cost-effective for DML operations.
In the benchmark, we found that Gen1 small (10.58 credits) and Gen2 small (9.66 credits) consumed nearly identical credits despite a 60x difference in query runtime. This seems counterintuitive: if Gen2 finished in under a minute, shouldn’t it cost dramatically less? The explanation lies in how Snowflake bills compute.
Snowflake credits accrue based on warehouse active time, not query execution time. Gen1 small’s queries ran for 38 minutes but much of that time was spent waiting on disk I/O during spilling. This means that the warehouse was “active” but throttled by storage bottlenecks, not efficiently consuming compute. Gen2 small completed the same work in 40 seconds using actual compute cycles.
The near identical credit consumption reveals something important: you’re paying roughly the same either way, but Gen2 can give you 60x better performance for the spend. For that reason it makes more sense to say Gen2 delivers more value per credit.
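Because billing follows warehouse active time, the cleanest way to compare spend is at the warehouse level rather than the query level. A sketch using metering data (the warehouse name pattern is illustrative):

```sql
-- Credits billed per warehouse, however the active time was spent
SELECT
    WAREHOUSE_NAME,
    SUM(CREDITS_USED_COMPUTE) AS compute_credits,
    SUM(CREDITS_USED)         AS total_credits    -- compute + cloud services
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND WAREHOUSE_NAME ILIKE 'BENCH_%'
GROUP BY WAREHOUSE_NAME
ORDER BY total_credits DESC;
```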
At the other sizes, where Gen1 does not fail, we found that Gen1 can be more cost-effective than Gen2 warehouses. Bear in mind that the TPC-DS queries are SELECT queries, which is where Gen1 really shines.
As for Snowpark-optimized warehouses, they offered cost-effectiveness comparable to or better than their most efficient standard counterparts, making them a viable option, especially at small sizes where their 16x memory per node makes a significant impact.
| WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_CREDITS | COST_USD | Difference |
|---|---|---|---|---|---|
| GEN1_L | GEN1_STANDARD | L | 1.62 | 3.23 | 30% more efficient than Gen2 |
| GEN1_M | GEN1_STANDARD | M | 0.56 | 1.12 | 52% more efficient than Gen2 |
| GEN1_S | GEN1_STANDARD | S | 10.58 | 21.15 | 10% less efficient than Gen2 |
| GEN1_XL | GEN1_STANDARD | XL | 3.31 | 6.62 | 27% more efficient than Gen2 |
| GEN2_L | GEN2_STANDARD | L | 2.33 | 4.67 | |
| GEN2_M | GEN2_STANDARD | M | 1.18 | 2.36 | |
| GEN2_S | GEN2_STANDARD | S | 9.66 | 19.33 | |
| GEN2_XL | GEN2_STANDARD | XL | 4.51 | 9.01 | |
| SP_L_64X | SNOWPARK-OPTIMIZED | L | 1.38 | 2.77 | 14% more efficient than Gen1 |
| SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.56 | 1.13 | Comparable to Gen1 |
| SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.19 | 0.37 | 52x more efficient than Gen2 |
When should you migrate to Gen2?
Based on the benchmark analysis and our internal research, here’s an honest path to make your decision:
Upgrade to Gen2 if:
- >30% of credits come from DML (UPDATE, DELETE, MERGE): Gen2’s sweet spot
- You’re running small/medium sized warehouses: Gen2’s memory efficiency shines here
- You see frequent spilling: check BYTES_SPILLED_TO_LOCAL_STORAGE in QUERY_HISTORY
- >30% of queries scan >1TB: large scans benefit from DDR5 bandwidth
Think twice about Gen2 if:
- Your workload is SELECT-heavy analytics: TPC-DS shows a 1-18% regression
- You’re already running large or XL warehouses: Gen1 wins 62% of queries at this tier
- Cost efficiency is paramount: Gen1 is often more cost-effective than Gen2 and offers equivalent performance on read workloads
Consider Snowpark-optimized if:
- Memory pressure is your primary bottleneck
- You’d otherwise need to scale large/XL just to avoid spilling
- You’re running ML, complex UDFs, or large DataFrame operations
Here’s a query to help you assess if a warehouse is a good candidate for Gen2.
```sql
-- Gen2 Candidate Assessment Query
-- Note: CREDITS_USED_CLOUD_SERVICES is used as a per-query weighting proxy,
-- since QUERY_HISTORY does not expose per-query warehouse compute credits.
WITH warehouse_metrics AS (
    SELECT
        WAREHOUSE_NAME,
        COUNT(*) AS total_queries,
        AVG(PARTITIONS_SCANNED / NULLIF(PARTITIONS_TOTAL, 0)) * 100
            AS pct_partitions_scanned,
        SUM(CASE WHEN QUERY_TYPE IN ('DELETE', 'MERGE', 'UPDATE')
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_dml,
        SUM(CASE WHEN BYTES_SCANNED > 1e12               -- scans larger than ~1 TB
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_large_scans,
        SUM(CASE WHEN TOTAL_ELAPSED_TIME > 1800000       -- queries longer than 30 minutes
                 THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100
            AS pct_credits_long_queries
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
    GROUP BY WAREHOUSE_NAME
)
SELECT
    WAREHOUSE_NAME,
    ROUND(pct_partitions_scanned, 1)   AS pct_partitions,
    ROUND(pct_credits_dml, 1)          AS pct_dml,
    ROUND(pct_credits_large_scans, 1)  AS pct_large_scans,
    ROUND(pct_credits_long_queries, 1) AS pct_long_queries,
    (CASE WHEN pct_partitions_scanned > 40 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_dml > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_large_scans > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_long_queries > 30 THEN 1 ELSE 0 END) AS criteria_met
FROM warehouse_metrics
ORDER BY criteria_met DESC;
```

Run this query on your warehouses to see whether they are good candidates for Gen2 based on the criteria we shared above.
Alternatively, you can let Slingshot assess that for you. Slingshot offers a low-risk way to test and validate whether Gen2 is right for your workloads. If you’d like to see this in action, book a time to meet with the team.
Conclusion
We ran the full TPC-DS benchmark on Snowflake Gen1, Gen2 and Snowpark-optimized warehouses in an attempt to assess when to use each warehouse type. Overall, the data points to differences in cost and performance across warehouse types, based on size and query types.
If you take one thing from this analysis, let it be this: profile before you upgrade.
Run the assessment query against your warehouses and see what kind of workloads you are running. If you see significant DML credits or spilling, Gen2 will very likely help. If you have SELECT-dominated workloads with no spilling, Gen1 is probably the better bet.
Snowflake’s most important strength isn't any single warehouse generation. It’s the model: decoupled storage and compute, a consistent SQL surface area and increasingly specialized compute choices you can swap in and out as your workload evolves.
Gen2 is a meaningful part of that evolution. In the scenarios it targets (DML-heavy pipelines, memory-sensitive execution and performance under pressure), it delivers exactly the kind of stability and efficiency improvements that modern data engineering teams need.
For completeness, here is the full table of results from our study.
| WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_MIN | AVG_SEC | MEDIAN_SEC | P95_SEC | P95 comparison (Gen2 vs. Gen1 vs. Snowpark-optimized) |
|---|---|---|---|---|---|---|---|
| GEN1_L | GEN1_STANDARD | L | 0.62 | 0.36 | 0.28 | 0.76 | |
| GEN1_XL | GEN1_STANDARD | XL | 0.65 | 0.38 | 0.31 | 0.79 | |
| GEN2_L | GEN2_STANDARD | L | 0.85 | 0.50 | 0.42 | 1.01 | 33% less efficient than Gen1 |
| GEN2_XL | GEN2_STANDARD | XL | 0.67 | 0.39 | 0.31 | 0.86 | 8% less efficient than Gen1 |
| GEN1_M | GEN1_STANDARD | M | 0.71 | 0.41 | 0.32 | 0.93 | |
| GEN2_M | GEN2_STANDARD | M | 0.65 | 0.38 | 0.29 | 0.87 | 6% more efficient than Gen1 |
| GEN1_S | GEN1_STANDARD | S | 38.24 | 22.28 | 8.43 | 107.3 | |
| GEN2_S | GEN2_STANDARD | S | 0.64 | 0.37 | 0.31 | 0.89 | ~120x more efficient than Gen1 |
| SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.70 | 0.41 | 0.34 | 0.80 | 8% more efficient than Gen2 |
| SP_L_64X | SNOWPARK-OPTIMIZED | L | 0.59 | 0.34 | 0.30 | 0.64 | 19% more efficient than Gen1 |
| SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.55 | 0.32 | 0.27 | 0.63 | 29% more efficient than Gen2 |

