Snowflake warehouse TPC-DS benchmark analysis

Gen1 vs. Gen2 vs. Snowpark-optimized warehouses

TL;DR: We ran the full TPC-DS benchmark at 1TB scale on three Snowflake warehouse types across multiple sizes (small through XL). Comparing credit consumption and performance of Gen1 vs. Gen2 vs. Snowpark-optimized warehouses, we found significant performance differences driven by memory architecture. Read on for clear guidance on when each warehouse type provides optimal value.

Introduction: How much of an improvement is Gen2?

When Snowflake announced Gen2 warehouses, the marketing was compelling: Gen2 warehouses are the next evolution in warehouse compute architecture. Powered by ARM-based processors and DDR5 memory, they promise improved performance and concurrency without requiring any code changes. The implicit message was clear: you should upgrade to Gen2.

But as someone who spends considerable time optimizing cloud data costs, I've learned to be skeptical. The questions that kept me up at night were:

  • Is Gen2 always better, or just in specific scenarios?
  • What exactly changed architecturally, and does it matter for my workloads?
  • Is the upgrade worth the operational overhead, or is this a solution looking for a problem?

So I did what any self-respecting data engineer would do: ran the benchmarks.

We ran the full TPC-DS benchmark at 1TB scale to simulate real-life production workloads and assess where Gen2 and Snowpark-optimized warehouses stand on cost and performance compared to Gen1 warehouses. This blog presents our findings from a comprehensive analysis of more than 1,200 TPC-DS query executions across multiple warehouse sizes (small through XL), exploring how memory architecture impacts data spilling.

Methodology

The benchmark was conducted using Snowflake's official TPC-DS query set against a 1TB dataset sampled from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL. We tested Gen1 vs. Gen2 vs. Snowpark-optimized warehouses across four sizes (small, medium, large and XL). Each warehouse configuration executed all 103 TPC-DS queries (1 iteration per query) with cold cache to eliminate any unfair advantages.

Methodology Note: We deliberately ran single iterations with cold cache. This represents a "worst case" scenario that exposes architectural differences. Production workloads with warm cache would show smaller gaps. Your mileage will vary, and that's precisely the point.
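For reference, here's roughly the session setup a cold-cache run like this requires. This is a minimal sketch; the warehouse name is a placeholder, and you'd repeat the suspend/resume cycle before each configuration's run:

-- Minimal cold-cache setup (BENCH_GEN2_S is a placeholder name).
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- disable the result cache

-- Suspending and resuming a warehouse drops its local SSD cache,
-- so the next query starts cold.
ALTER WAREHOUSE BENCH_GEN2_S SUSPEND;
ALTER WAREHOUSE BENCH_GEN2_S RESUME;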

The contenders:

Gen1 Standard (Snowflake: "The original virtual warehouse")

  • Traditional Snowflake compute architecture
  • Intel x86 processors
  • DDR4 memory
  • Standard L2 cache sizes
  • Baseline query engine

Gen2 Standard (Snowflake: "2.1x faster, 4.4x DML improvement")

  • ARM-based CPUs (AWS Graviton3) with improved IPC (instructions per cycle)
  • DDR5 memory – faster memory bandwidth
  • Larger L2 caches – reduce memory pressure
  • Optimized query engine – targeted improvements for DML operations
  • Drop-in replacement – no code changes required

Snowpark-Optimized (Snowflake: "For memory-intensive ML workloads")

  • 16x memory per node compared to standard warehouses
  • Designed for memory-intensive operations (ML, large DataFrames, complex UDFs)

*Descriptions quoted from Snowflake documentation.

Test configuration

Parameter | Value
Benchmark | TPC-DS
Queries | 103 TPC-DS queries
Scale Factor | SF1000 (1TB)
Iterations | 1 per query per warehouse
Cache State | Cold start (~0% cache hit observed)

The results: Not what we expected

Let me show you the headline numbers, then explain why they're misleading if you stop there:

Warehouse | Type | Size | Total runtime | Avg query
BENCH_SP_S | Snowpark | Small | 0.55 min | 0.32 sec
BENCH_GEN1_L | Gen1 | Large | 0.62 min | 0.36 sec
BENCH_GEN2_S | Gen2 | Small | 0.64 min | 0.37 sec
BENCH_GEN1_M | Gen1 | Medium | 0.71 min | 0.41 sec
BENCH_GEN2_L | Gen2 | Large | 0.85 min | 0.50 sec
BENCH_GEN1_S | Gen1 | Small | 38.24 min | 22.28 sec

At first glance, the anomaly really stands out. Small Gen1 warehouses took 38 minutes while a small Gen2 took 0.64 minutes. That's a 60x difference. Either we made a mistake, or something very interesting is happening here.

The mystery: Why did Gen1 small fail?

Our first instinct was to blame the cache. Maybe Gen1 ran first with cold cache while Gen2 benefited from warmed storage. However, the cache hit rates told a different story.

Warehouse | Cache hit % | TB scanned
BENCH_GEN1_S | 0.6% | 2.05 TB
All others | 0% | ~0


Every query ran cold, with ALTER SESSION SET USE_CACHED_RESULT = FALSE in effect. The difference wasn't cache; it was something architectural.

Everything clicked when we checked spillage metrics:

Warehouse | Queries spilled | % spilled | GB spilled
BENCH_GEN1_S | 12 | 11.4% | 153.91 GB
All others | 0 | 0% | 0


The aha moment: Gen1 small was the only warehouse that spilled to disk. When a query's intermediate results exceed available memory, Snowflake writes to local SSD (spilling). 
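If you want to check your own account for the same symptom, QUERY_HISTORY exposes spill bytes directly. A minimal sketch:

-- Recent queries that spilled to local disk, worst first.
-- (ACCOUNT_USAGE views lag real time by up to ~45 minutes.)
SELECT
    QUERY_ID,
    WAREHOUSE_NAME,
    BYTES_SPILLED_TO_LOCAL_STORAGE  / POWER(1024, 3) AS gb_spilled_local,
    BYTES_SPILLED_TO_REMOTE_STORAGE / POWER(1024, 3) AS gb_spilled_remote,
    TOTAL_ELAPSED_TIME / 1000                        AS elapsed_sec
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND BYTES_SPILLED_TO_LOCAL_STORAGE > 0
ORDER BY gb_spilled_local DESC
LIMIT 20;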

But here's what puzzled us: a small Gen1 and a small Gen2 are both "Standard" small sized warehouses. They should have the same memory allocation, right? So why did Gen2 small handle the workload gracefully while Gen1 small suffered significant performance issues?

The architecture revelation

Diving deeper into the architecture, we find that Gen2 isn't just "newer"; it's fundamentally different hardware:

Component | Gen1 | Gen2 | Impact
CPU | Intel x86 | ARM Graviton3 | Better IPC (instructions per cycle)
Memory | DDR4 | DDR5 | Higher bandwidth, lower latency
L2 cache | Standard | Larger | Fewer cache misses → less memory pressure


Gen2's advantage isn't just being "faster," but being more memory-efficient. DDR5 memory and larger L2 caches mean Gen2 can handle the same workload with less memory pressure. Gen2 small doesn't have more memory than Gen1 small; it uses the memory it has more efficiently.

This explains the 60x performance gap (38 mins vs. 0.64 mins). Gen1 small hit memory limits and spilled 154GB to SSD. Gen2 small, with identical nominal specs but superior memory architecture, processed the same queries entirely in-memory.

The software angle

The hardware, however, only tells part of the story. Gen2 also includes targeted software optimizations that reduce memory pressure before it starts. To understand the software improvements in Gen2, let's get a few things straight.
Gen2 does not change: 

  • Storage format/table layout fundamentals – still Snowflake micro-partitions: columnar, immutable storage units 
  • SQL semantics 
  • Data model

So the dramatic reduction in spilling on Gen2 warehouses comes down to the same query plan producing smaller distributed intermediates. The data shows Gen2 warehouses consistently scan fewer partitions than Gen1 warehouses for identical queries. Less data scanned means fewer intermediate results to hold in memory. 

Why does memory spilling occur in the first place?

The docs are explicit: performance degrades when a warehouse runs out of memory and must "spill" intermediate data. Data spills first to local disk storage, then to remote cloud-provider storage if more space is needed, and performance degrades further with each step.

Spills come from stateful operators like the following (see the sketch after this list): 

  • Hash joins (the build-side hash table gets too big) 
  • Group-by aggregations 
  • Sorts/window functions
  • Big shuffles with skew 
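To make that concrete, here's an illustrative query shape (hypothetical TPC-DS-style tables and columns) that stacks three of these operators:

-- Hypothetical spill-prone shape: hash join + group-by aggregation + window sort.
SELECT
    c.customer_region,
    c.customer_sk,
    SUM(s.net_paid) AS total_paid,                  -- aggregation state per group
    RANK() OVER (PARTITION BY c.customer_region     -- window sort over the groups
                 ORDER BY SUM(s.net_paid) DESC) AS rank_in_region
FROM store_sales s                                  -- large fact table
JOIN customer c                                     -- hash join: one side built in memory
  ON s.customer_sk = c.customer_sk
GROUP BY c.customer_region, c.customer_sk
QUALIFY rank_in_region <= 10;

If the hash table, aggregation state or sort buffer outgrows warehouse memory at any of those steps, the operator spills.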

Memory spilling in Snowflake

Snowflake’s columnar, immutable micro-partition design is wonderful for analytics, but micro-partitions are not updated in place: changes require creating new partitions. 

We’ve described the practical impact of this in a previous post. In a nutshell, UPDATE, DELETE and MERGE (DML operations) can require rewriting entire column segments/partitions rather than modifying individual rows the way a row store would. It’s the natural price we pay for columnar, analytics-optimized storage in OLAP systems vs. the high-throughput, single-row updates of OLTP systems. 

Let’s take a look at a simplified DML pipeline:

Phase 1:  Identify affected data 

  • Determine which micro-partitions might contain rows that match predicates/join keys 
  • Read only the required columns 
  • Apply pruning using metadata and possible indexes 

Phase 2: Compute row-level changes 

  • For MERGE, this is often a join between source and target keys, producing: 
    • Matched rows to update
    • Unmatched rows to insert
    • Rows to delete 

Phase 3: Write new micro-partitions 

  • Rewrite the affected portions into new micro-partitions
  • Commit by updating metadata pointers (conceptually “copy-on-write”)

Phase 2 (joins/grouping/sorting) and Phase 3 (repartitioning data for writes) can create large intermediates and heavy shuffles – classic spill triggers. This is where spilling occurs in the pipeline. 
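A minimal MERGE (hypothetical tables) makes the mapping to these phases visible:

-- Hypothetical MERGE: the ON clause drives the Phase 2 source-target join;
-- Phase 3 then rewrites every micro-partition containing a matched row.
MERGE INTO customers t
USING daily_updates s
    ON t.customer_id = s.customer_id      -- Phase 2: join on keys
WHEN MATCHED THEN
    UPDATE SET t.email = s.email          -- matched rows to update
WHEN NOT MATCHED THEN
    INSERT (customer_id, email)           -- unmatched rows to insert
    VALUES (s.customer_id, s.email);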

How Snowflake reduces spilling with Gen2 improvements

Snowflake Gen2 documentation calls out both table scan operation enhancements and DELETE, UPDATE, MERGE improvements. 

Those are the two stages with the most influence on whether downstream operators spill. But the more mechanistic explanation comes from the 2025 performance improvement notes and Optima docs. Snowflake Optima is only available on Gen2 warehouses.

Optima indexing:

  • Continuously analyzes workload patterns
  • Automatically creates “hidden indexes” in the background
  • Specifically calls out “repetitive point-lookup queries” as a target
  • Is built on top of Search Optimization Service (SOS) 

Search Optimization Service itself works by building a search access path: a persistent structure that tracks which column values may appear in which micro-partitions, so more partitions can be skipped at scan time. Fewer micro-partitions scanned → fewer rows/bytes survive the scan → smaller hash tables, aggregation state and sort buffers → less spilling. 
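Optima builds these structures automatically on Gen2, but the underlying service can also be enabled by hand. A sketch with hypothetical table and column names:

-- Manually enabling the Search Optimization Service on equality lookups
-- (Optima does the equivalent automatically, in the background, on Gen2).
ALTER TABLE orders ADD SEARCH OPTIMIZATION ON EQUALITY(order_id, customer_id);

-- SHOW TABLES reports build progress and the storage overhead
-- of the search access path.
SHOW TABLES LIKE 'orders';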

Evidence of this is exposed in the Query Profile, where Optima adds a status like “Partitions pruned by Snowflake Optima.” If you saw spill drop after Gen2 migration, this line item is one of the first things to check. 

The general direction seems to be: less data early = less spill later. We corroborated this with internal research. We analyzed more than 2 billion query profiles in April of 2025 and found that Gen2 warehouses scanned significantly fewer partitions than Gen1 warehouses: 

[Figure: line graph comparing partitions scanned by Gen1 (red) and Gen2 (blue) warehouses]

As part of the research, we monitored and analyzed the following metrics to identify characteristics of the spillage. 

Metric | Source | Why it matters
BYTES_SPILLED_TO_LOCAL_STORAGE | QUERY_HISTORY | Memory pressure indicator
BYTES_SPILLED_TO_REMOTE_STORAGE | QUERY_HISTORY | Severe memory pressure
QUERY_TYPE distribution | QUERY_HISTORY | DML vs. SELECT ratio
PARTITIONS_SCANNED / PARTITIONS_TOTAL | QUERY_HISTORY | Pruning efficiency
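
For reference, a sketch of the kind of ACCOUNT_USAGE query that pulls these metrics per warehouse:

-- Spill, DML share and pruning efficiency per warehouse, last 28 days.
SELECT
    WAREHOUSE_NAME,
    SUM(BYTES_SPILLED_TO_LOCAL_STORAGE)  / POWER(1024, 3) AS gb_spilled_local,
    SUM(BYTES_SPILLED_TO_REMOTE_STORAGE) / POWER(1024, 3) AS gb_spilled_remote,
    100 * SUM(CASE WHEN QUERY_TYPE IN ('UPDATE', 'DELETE', 'MERGE')
              THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0)    AS pct_dml_queries,
    100 * AVG(PARTITIONS_SCANNED / NULLIF(PARTITIONS_TOTAL, 0))
                                                          AS pct_partitions_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME
ORDER BY gb_spilled_local DESC;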


We also examined the number of credits consumed by query type. We found that Gen2 delivers substantial improvements to DELETE, UPDATE and MERGE operations, with as much as a 25% reduction in cost for UPDATE.

Credit-weighted impact by query type

Here are our findings from internal research into 2 billion query profiles:

Query type | Gen2 impact | Notes
UPDATE | ~25% cost reduction | Significant improvement across most sizes
DELETE | ~20% cost reduction | Improvement varies by size
MERGE | ~5% cost reduction | Modest improvement
PARTITIONED_UNLOAD | ~2-5% cost reduction | Small improvement
SELECT | 1-18% cost increase | Regression at all sizes
INSERT | Variable (up to +46%) | Regresses at larger sizes
CREATE_TABLE_AS_SELECT | Variable | Mixed results

As you can see in the table above, Gen2 was found to be more cost-effective for DML operations, while Gen1 wins for SELECT.

Bear in mind that these results are based on internal research and may not be indicative of future results for your business.

TPC-DS queries most affected by spilling

To cross-verify the DML findings of our internal study, we delved a little deeper and took a look at the exact TPC-DS queries that benefited the most from Gen2 warehouses.

The following queries showed the most dramatic performance difference between Gen1 and Gen2 at a small size:

Query | Gen1 (sec) | Gen2 (sec) | Gen2 faster
query04 | 370.42 | 0.31 | 99.9%
query11 | 218.66 | 0.29 | 99.9%
query23_p2 | 147.81 | 0.39 | 99.7%
query23_p1 | 141.32 | 0.35 | 99.8%
query78 | 134.54 | 0.36 | 99.7%
query74 | 111.59 | 0.24 | 99.8%
query87 | 75.45 | 0.40 | 99.5%
query09 | 61.46 | 0.43 | 99.3%


These queries involve complex joins, aggregations and window functions that require significant intermediate memory.

The SELECT regression problem

Our internal study also found that Gen2 shows a cost regression on SELECT queries. This is an interesting finding as SELECT queries are the bread and butter of analytics.

Size | SELECT cost change (Gen1 → Gen2)
X-Small | +18.2% (regression)
Small | +14.6% (regression)
Medium | +1.9% (regression)
Large | +0.7% (regression)
X-Large | +2.2% (regression)


Meanwhile, DML operations improved significantly on Gen2:

Operation | Typical improvement
UPDATE | ~25% cost reduction
DELETE | ~20% cost reduction
MERGE | ~5% cost reduction


What this means: Gen2 is optimized for data engineering workloads (ETL, DML-heavy pipelines), not pure analytics. The ARM architecture and query engine optimizations favor write operations. For read-heavy analytical workloads (which is what TPC-DS measures), Gen1's x86 architecture is still competitive, and sometimes offers better performance. 

TPC-DS is 100% SELECT queries. That's why Gen1 wins at large sizes: memory pressure (its weakness) is absent, and its SELECT-optimized architecture (its strength) takes over.

Plot twist: Gen1 wins at large sizes

Here's where it gets interesting. If Gen2's memory efficiency is its superpower, what happens when memory isn't the bottleneck? 

We determined the fastest warehouse type at each size per query. The results show a clear pattern based on memory requirements:

  1. Gen1 warehouses win at large+ sizes, where memory is sufficient. They provide the best performance for 62% of queries tested.
  2. Gen2 warehouses dominate execution times at small/medium sizes, thanks to built-in query optimizations and higher memory bandwidth. They provide the best performance for more than half of the queries. 

Here are the win rates by warehouse size: 

Size | Winner | Win %
Large | Gen1 | 62.1%
Large | Gen2 | 37.9%
Medium | Gen2 | 56.3%
Medium | Gen1 | 43.7%
Small | Gen2 | 100%
Small | Gen1 | 0%


At large size, Gen1 wins 62% of queries. At small, Gen1 wins zero queries. The crossover happens at medium.

This is the finding that should make you pause before upgrading: Gen2 is not universally better. It depends on multiple factors, such as warehouse size and the type of query you're running. This reinforces what every Snowflake admin already knows: Snowflake warehouse optimization requires experimentation. 

This is precisely why we built Capital One Slingshot: to help data teams test, validate, and optimize their data clouds. You can easily set up or migrate a warehouse to Gen2 in Slingshot to assess the impact of that change. Read more about that in this post.

Cost analysis

Up until now we've covered the differences in performance between warehouse types. Now let's look at the credit consumption and overall cost of each warehouse tested. 

Both the benchmark and our internal research indicate that while Gen2 has a higher cost per hour, its efficiency gains can negate the higher rate. In our analysis of 2 billion query profiles in April 2025, we found that Gen2 can be up to 25% more cost-effective for DML operations.

In the benchmark, we found that Gen1 small (10.58 credits) and Gen2 small (9.66 credits) consumed nearly identical credits despite a 60x difference in query runtime. This seems counterintuitive: if Gen2 finished in under a minute, shouldn't it cost dramatically less? The explanation lies in how Snowflake bills compute. 

Snowflake credits accrue based on warehouse active time, not query execution time. Gen1 small’s queries ran for 38 minutes but much of that time was spent waiting on disk I/O during spilling. This means that the warehouse was “active” but throttled by storage bottlenecks, not efficiently consuming compute. Gen2 small completed the same work in 40 seconds using actual compute cycles. 

The near-identical credit consumption reveals something important: you're paying roughly the same either way, but Gen2 can give you 60x better performance for the spend. For that reason, it makes more sense to say Gen2 delivers more value per credit.
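You can observe the same effect in your own account by putting billed credits (active time, from WAREHOUSE_METERING_HISTORY) next to actual execution time (from QUERY_HISTORY). A rough sketch:

-- Credits billed vs. time spent executing queries, per warehouse, last 7 days.
SELECT
    m.WAREHOUSE_NAME,
    SUM(m.CREDITS_USED)       AS credits_billed,
    MAX(q.total_exec_minutes) AS total_exec_minutes
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY m
JOIN (
    SELECT WAREHOUSE_NAME,
           SUM(EXECUTION_TIME) / 60000 AS total_exec_minutes  -- ms to minutes
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY WAREHOUSE_NAME
) q ON q.WAREHOUSE_NAME = m.WAREHOUSE_NAME
WHERE m.START_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY m.WAREHOUSE_NAME;

A warehouse whose billed hours far exceed its execution minutes is idling or waiting on I/O rather than doing useful compute.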

When we look at other sizes where Gen1 does not fail, we found Gen1 can be more cost-effective than Gen2 warehouses. Bear in mind that the TPC-DS queries are SELECT queries, which is where Gen1 really shines. 

As for Snowpark-optimized warehouses, they offered either comparable or better cost-effectiveness than their most effective standard counterparts, making them a viable option especially for small sized warehouses where their 16x memory per node makes a significant impact.

WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_CREDITS | COST_USD | Difference
GEN1_L | GEN1_STANDARD | L | 1.62 | 3.23 | 30% more efficient than Gen2
GEN1_M | GEN1_STANDARD | M | 0.56 | 1.12 | 52% more efficient than Gen2
GEN1_S | GEN1_STANDARD | S | 10.58 | 21.15 | 10% less efficient than Gen2
GEN1_XL | GEN1_STANDARD | XL | 3.31 | 6.62 | 27% more efficient than Gen2
GEN2_L | GEN2_STANDARD | L | 2.33 | 4.67 |
GEN2_M | GEN2_STANDARD | M | 1.18 | 2.36 |
GEN2_S | GEN2_STANDARD | S | 9.66 | 19.33 |
GEN2_XL | GEN2_STANDARD | XL | 4.51 | 9.01 |
SP_L_64X | SNOWPARK-OPTIMIZED | L | 1.38 | 2.77 | 14% more efficient than Gen1
SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.56 | 1.13 | Comparable to Gen1
SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.19 | 0.37 | 52x more efficient than Gen2

 

When should you migrate to Gen2?

Based on the benchmark analysis and our internal research, here's an honest framework for the decision: 

Upgrade to Gen2 if: 

  • >30% of credits come from DML (UPDATE, DELETE, MERGE) – Gen2's sweet spot

  • You're running small/medium warehouses – Gen2's memory efficiency shines here

  • You see frequent spilling – check BYTES_SPILLED_TO_LOCAL_STORAGE in QUERY_HISTORY

  • >30% of queries scan >1TB – large scans benefit from DDR5 bandwidth

Think twice about Gen2 if: 

  • Your workload is SELECT-heavy analytics – TPC-DS shows a 1-18% regression 
  • You're already running large or XL warehouses – Gen1 wins 62% of queries at this tier 
  • Cost efficiency is paramount – Gen1 is often more cost-effective than Gen2 and offers equivalent performance on read workloads 

Consider Snowpark-optimized if: 

  • Memory pressure is your primary bottleneck 
  • You’d otherwise need to scale large/XL just to avoid spilling 
  • You’re running ML, complex UDFs, or large DataFrame operations
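If you decide to test, the switch itself is a single statement. A hedged sketch (warehouse names are placeholders; at the time of writing the generation is controlled via the RESOURCE_CONSTRAINT parameter, but verify the exact values and regional availability against current Snowflake documentation):

-- Move an existing warehouse to Gen2 (placeholder warehouse name).
ALTER WAREHOUSE my_wh SET RESOURCE_CONSTRAINT = STANDARD_GEN_2;

-- Create a Snowpark-optimized warehouse with 16x memory per node.
CREATE WAREHOUSE my_sp_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    RESOURCE_CONSTRAINT = 'MEMORY_16X';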

Here’s a query to help you assess if a warehouse is a good candidate for Gen2.

-- Gen2 Candidate Assessment Query
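-- Scores each warehouse 0-4 against the criteria above, using the last
-- 28 days of ACCOUNT_USAGE.QUERY_HISTORY (which lags real time slightly).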
WITH warehouse_metrics AS (
    SELECT 
        WAREHOUSE_NAME,
        COUNT(*) as total_queries,
        AVG(PARTITIONS_SCANNED / NULLIF(PARTITIONS_TOTAL, 0)) * 100 
            as pct_partitions_scanned,
        SUM(CASE WHEN QUERY_TYPE IN ('DELETE', 'MERGE', 'UPDATE') 
            THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) / 
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100 
            as pct_credits_dml,
        SUM(CASE WHEN BYTES_SCANNED > 1e12 
            THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100 
            as pct_credits_large_scans,
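        -- TOTAL_ELAPSED_TIME is in milliseconds; 1,800,000 ms = 30-minute queries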
        SUM(CASE WHEN TOTAL_ELAPSED_TIME > 1800000 
            THEN CREDITS_USED_CLOUD_SERVICES ELSE 0 END) /
            NULLIF(SUM(CREDITS_USED_CLOUD_SERVICES), 0) * 100 
            as pct_credits_long_queries
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE START_TIME > DATEADD(day, -28, CURRENT_TIMESTAMP())
    GROUP BY WAREHOUSE_NAME
)
SELECT 
    WAREHOUSE_NAME,
    ROUND(pct_partitions_scanned, 1) as pct_partitions,
    ROUND(pct_credits_dml, 1) as pct_dml,
    ROUND(pct_credits_large_scans, 1) as pct_large_scans,
    ROUND(pct_credits_long_queries, 1) as pct_long_queries,
    (CASE WHEN pct_partitions_scanned > 40 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_dml > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_large_scans > 30 THEN 1 ELSE 0 END +
     CASE WHEN pct_credits_long_queries > 30 THEN 1 ELSE 0 END) as criteria_met
FROM warehouse_metrics
ORDER BY criteria_met DESC;

Run this query on your warehouses to see if they are a good candidate for Gen2 based on the criteria we shared above.

Alternatively, you can let Slingshot assess that for you. Slingshot offers a low-risk way to test and validate whether Gen2 is right for your workloads. If you’d like to see this in action, book a time to meet with the team.

Conclusion

We ran the full TPC-DS benchmark on Snowflake Gen1, Gen2 and Snowpark-optimized warehouses in an attempt to assess when to use each warehouse type. Overall, the data points to differences in cost and performance across warehouse types, based on size and query types. 

If you take one thing from this analysis, let it be this: profile before you upgrade. 

Run the assessment query against your warehouses and see what kinds of workloads you're running. If you see significant DML credits or spilling, Gen2 will very likely help. If you have SELECT-dominated workloads with no spilling, Gen1 is probably the better bet.

Snowflake’s most important strength isn't any single warehouse generation. It’s the model: decoupled storage and compute, a consistent SQL surface area and increasingly specialized compute choices you can swap in and out as your workload evolves. 

Gen2 is a meaningful part of that evolution. In the scenarios it targets (DML-heavy pipelines, memory-sensitive execution, performance under pressure), it delivers exactly the kind of stability and efficiency improvements that modern data engineering teams need.

For completeness, here is the full table of results from our study. 

WAREHOUSE | WAREHOUSE_TYPE | SIZE | TOTAL_MIN | AVG_SEC | MEDIAN_SEC | P95_SEC | P95 comparison (Gen2 vs. Gen1 vs. Snowpark-optimized)
GEN1_L | GEN1_STANDARD | L | 0.62 | 0.36 | 0.28 | 0.76 |
GEN1_XL | GEN1_STANDARD | XL | 0.65 | 0.38 | 0.31 | 0.79 |
GEN2_L | GEN2_STANDARD | L | 0.85 | 0.5 | 0.42 | 1.01 | 33% less efficient than Gen1
GEN2_XL | GEN2_STANDARD | XL | 0.67 | 0.39 | 0.31 | 0.86 | 8% less efficient than Gen1
GEN1_M | GEN1_STANDARD | M | 0.71 | 0.41 | 0.32 | 0.93 |
GEN2_M | GEN2_STANDARD | M | 0.65 | 0.38 | 0.29 | 0.87 | 6% more efficient than Gen1
GEN1_S | GEN1_STANDARD | S | 38.24 | 22.28 | 8.43 | 107.3 |
GEN2_S | GEN2_STANDARD | S | 0.64 | 0.37 | 0.31 | 0.89 | 120x more efficient than Gen1
SP_M_16X | SNOWPARK-OPTIMIZED | M | 0.7 | 0.41 | 0.34 | 0.8 | 8% more efficient than Gen2
SP_L_64X | SNOWPARK-OPTIMIZED | L | 0.59 | 0.34 | 0.3 | 0.64 | 19% more efficient than Gen1
SP_S_1X | SNOWPARK-OPTIMIZED | S | 0.55 | 0.32 | 0.27 | 0.63 | 29% more efficient than Gen2

 

Run the Snowflake benchmark

Download the worksheet to run the TPC-DS Snowflake benchmark & verify our findings.

Sachin Seth, Technical Writer

Sachin Seth is a data platform architect and analytics product builder known for his deep work benchmarking Databricks compute and delivering high-performance data applications at scale. He develops full-stack analytics solutions—ranging from billion-point time-series engines to portfolio optimization apps and real-time financial dashboards—blending Databricks, Rust, Arrow and modern web technologies. He writes to bring clarity, measurement and engineering rigor to the rapidly evolving world of Databricks and modern data platforms.
