Reduce your Snowflake costs by accurately estimating the cost of each SQL statement.
Abstract: A way to accurately estimate the fair cost of Snowflake SQL statements, using SQL only, leveraging SQL:2016 match_recognize
to reconstruct warehouse activity periods from the query history table. The approach is also efficient, computing costs at approximately 160,000 queries per second on an X-Small warehouse. The estimates can be used to focus on which SQL patterns need optimization.
Snowflake has become my go-to database for heavy analytics in the last two years due to its performance, advanced features and usability.
Unlike other databases, costs are driven mainly by the time virtual warehouses are active[1], so there is a strong incentive to keep your SQL nimble, your warehouse clusters right-sized, and your scaling policies optimal.
Sadly, as of today, users are not given good information about how much each SQL statement contributes to the overall cost. In the admin console, users can see how much each warehouse costs, the storage allocation, and a list of the "most expensive queries":
However, this view is misleading, as it only considers total execution time and warehouse size. This approach will both:
- Overestimate costs: Snowflake warehouses can run multiple queries in parallel in each cluster, so a query's fair cost should be its share of the warehouse activity cost, split across all the queries that ran in the same activity window.
- Underestimate costs: a Snowflake query that executes very quickly (say, 1 ms) but runs every minute can keep a warehouse permanently busy, incurring costs even though most of the warehouse compute capacity goes to waste (see the worked example after this list).
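To make the second point concrete, here is a worked example with hypothetical numbers: a 1 ms query scheduled every minute on an X-Small warehouse with a 60-second auto-suspend keeps the warehouse running around the clock, so a naive execution-time estimate misses almost all of the real cost.
-- Hypothetical numbers: 1,440 executions of a 1 ms query per day on an X-Small warehouse
-- (1 credit/hour) that never gets a chance to suspend (60 s auto-suspend).
select 24 * 1                  as credits_actually_billed_per_day, -- 24 credits
       1440 * 0.001 / 3600 * 1 as naive_execution_time_estimate;   -- ~0.0004 credits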
What if we could reconstruct the warehouse activity using only the query history table? The query history table provides:
- Query start and end time
- Warehouse name
- Warehouse size
- Cluster number
This information can be used to reconstruct warehouse cluster activity windows alongside the queries that execute inside them. By applying Snowflake billing rules, we can estimate the total cost of each activity window and split it across its queries in proportion to their share of the total execution time.
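For reference, here is a quick look at the columns we will rely on (assuming your role has access to the snowflake.account_usage share):
select query_id, warehouse_name, warehouse_size, cluster_number,
       start_time, end_time, total_elapsed_time
from snowflake.account_usage.query_history
where warehouse_size is not null
order by end_time desc
limit 10;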
Preparation
Before going into the computation details, we must prepare the database to analyse the query history table efficiently. This is not strictly necessary, but queries on a local copy of the table can be over ten times faster than the same query on the base snowflake.account_usage table:
-- clone table
create or replace table query_history_backup as
select *
from snowflake.account_usage.query_history
limit 0;
-- copy the data incrementally
insert into query_history_backup
select qh.*
from snowflake.account_usage.query_history qh
where qh.end_time > (
select coalesce(max(end_time), '2000-01-01'::timestamp_ntz)
from query_history_backup
)
order by start_time;
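Optionally, the incremental copy can be kept fresh with a scheduled task; a minimal sketch, where the warehouse name and schedule are placeholders:
create or replace task refresh_query_history_backup
  warehouse = compute_wh -- placeholder warehouse name
  schedule = '60 minute'
as
insert into query_history_backup
select qh.*
from snowflake.account_usage.query_history qh
where qh.end_time > (
  select coalesce(max(end_time), '2000-01-01'::timestamp_ntz)
  from query_history_backup
);
-- tasks are created suspended
alter task refresh_query_history_backup resume;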
To account for warehouse suspend time, we also need to capture warehouse information:
show warehouses;
create table warehouses as select * from table(result_scan(last_query_id()));
Sadly, Snowflake doesn't currently provide a way to programmatically fetch the warehouse auto-suspend setting for a specific time interval, so this approach assumes the auto-suspend settings haven't changed over the analysed period.
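To double-check which auto-suspend values were captured, you can inspect the backup table (SHOW WAREHOUSES produces lowercase, quoted column names):
select "name", "size", "auto_suspend"
from warehouses
order by "name";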
To assess how close the estimated costs are to the real warehouse costs, it is also convenient to back up the warehouse metering history table:
create or replace table warehouse_metering_history_backup as
select *
from snowflake.account_usage.warehouse_metering_history
limit 0;
insert into warehouse_metering_history_backup
select wmh.*
from snowflake.account_usage.warehouse_metering_history wmh
where wmh.end_time > (
select coalesce(max(end_time), '2000-01-01'::timestamp_ntz)
from warehouse_metering_history_backup
)
order by start_time;
Efficiently estimating query costs:
create or replace table warehouse_costs_md as
with wh_etl as (
select qhb.query_id,
qhb.query_tag,
qhb.user_name,
qhb.role_name,
qhb.start_time,
qhb.end_time,
min(qhb.start_time) over ( -- see explanation below
partition by qhb.warehouse_name, qhb.warehouse_size, qhb.cluster_number
order by end_time desc
rows between unbounded preceding and current row
) as min_start,
qhb.warehouse_name,
qhb.warehouse_size,
qhb.warehouse_type,
qhb.cluster_number,
qhb.total_elapsed_time,
qhb.credits_used_cloud_services,
qhb.query_parameterized_hash,
w."auto_suspend" as auto_suspend
from query_history_backup qhb
inner join warehouses w
on (qhb.warehouse_name = w."name")
where warehouse_size is not null -- if warehouse size is null compute costs are 0
),
queries as (
select criteria,
match_nr,
warehouse_name,
cluster_number,
warehouse_size,
warehouse_type,
query_id,
query_parameterized_hash,
query_tag,
user_name,
role_name,
start_time,
end_time,
min_start,
active_start,
active_end,
activity,
total_activity,
credits_used_cloud_services,
total_elapsed_time,
billable_time,
case
when warehouse_size is null then 0
else (activity / total_activity) * billable_time end
as execution_share,
case when warehouse_type = 'STANDARD' then 1
when warehouse_type = 'SNOWPARK-OPTIMIZED' then 1.5
else null
end as cost_multiplier,
execution_share / 3600 *
case
when warehouse_size = 'X-Small' then 1
when warehouse_size = 'Small' then 2
when warehouse_size = 'Medium' then 4
when warehouse_size = 'Large' then 8
when warehouse_size = 'X-Large' then 16
when warehouse_size = '2X-Large' then 32
when warehouse_size = '3X-Large' then 64
when warehouse_size = '4X-Large' then 128
when warehouse_size = '5X-Large' then 256
when warehouse_size = '6X-Large' then 512
else
null
end * cost_multiplier as estimated_credits
from wh_etl match_recognize (
partition by warehouse_name, warehouse_size, cluster_number
order by end_time desc
measures
final min(min_start) as active_start,
final max(timestampadd('seconds', auto_suspend, end_time)) as active_end,
total_elapsed_time as activity,
final sum(total_elapsed_time) as total_activity,
match_number() as match_nr,
classifier as criteria
all rows per match
pattern (warehouse_end warehouse_continue*)
define
warehouse_end as true,
warehouse_continue as end_time >= timestampadd(second, -auto_suspend, lag(min_start) )
) as matchr,
lateral (
select timestampdiff(second, active_start, active_end) as diff,
case when warehouse_size is null then 0
when diff < 60 then 60
else diff end as billable_time
)
)
select *
from queries
order by start_time;
Let's unpack things from the query above:
min(qhb.start_time) over (
partition by qhb.warehouse_name, qhb.warehouse_size, qhb.cluster_number
order by end_time desc
rows between unbounded preceding and current row
) as min_start
The snippet above computes, for each row, the minimum start_time over all preceding queries within the partition defined by warehouse_name, warehouse_size and cluster_number, with rows ordered by end_time descending.
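A minimal illustration with made-up rows shows the running minimum when scanning from the latest end_time backwards:
with sample (query_id, start_time, end_time) as (
  select column1, column2::timestamp, column3::timestamp
  from values
    ('q1', '2024-01-01 10:00:00', '2024-01-01 10:01:00'),
    ('q2', '2024-01-01 10:05:00', '2024-01-01 10:11:00'),
    ('q3', '2024-01-01 10:10:00', '2024-01-01 10:12:00')
)
select query_id, start_time, end_time,
       min(start_time) over (
         order by end_time desc
         rows between unbounded preceding and current row
       ) as min_start
from sample;
-- q3 -> 10:10:00, q2 -> 10:05:00, q1 -> 10:00:00: the earliest start seen so far while walking backwards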
from wh_etl match_recognize (
partition by warehouse_name, warehouse_size, cluster_number
order by end_time desc
---(...)
pattern (warehouse_end warehouse_continue*)
define
warehouse_end as true,
warehouse_continue as
end_time >= timestampadd(second,
-auto_suspend,
lag(min_start) )
) as matchr,
The snippet above is where the Snowflake magic happens. The match_recognize statement identifies activity windows for each warehouse cluster: a pattern starts at the end of an activity window (warehouse_end) and extends to the last row that matches the warehouse_continue predicate. To match, the end_time of the current row must be greater than or equal to the min_start of the previous row minus the current warehouse's auto_suspend setting.
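A quick, hypothetical check of that condition: with a 60-second auto_suspend, a query ending at 10:04:30 still belongs to a window whose previous min_start is 10:05:00.
select '2024-01-01 10:04:30'::timestamp >=
       timestampadd(second, -60, '2024-01-01 10:05:00'::timestamp) as continues_window; -- TRUE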
from wh_etl match_recognize (
-- (...)
measures
final min(min_start) as active_start,
final max(timestampadd('seconds', auto_suspend, end_time)) as active_end,
total_elapsed_time as activity,
final sum(total_elapsed_time) as total_activity,
match_number() as match_nr,
classifier as criteria
--(...)
For each activity window, we're computing:
- When it started (active_start)
- When it ended (active_end), taking the warehouse suspend time into account
- How long this query used the current cluster (activity)
- The total activity of all the queries in the current activity window (total_activity)
- Debugging information: match_nr and the predicate that matched (classifier)
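Once warehouse_costs_md has been created, these measures are easy to eyeball, for example by aggregating one row per reconstructed activity window:
select warehouse_name, warehouse_size, cluster_number, match_nr,
       min(active_start) as window_start,
       max(active_end) as window_end,
       count(*) as queries_in_window,
       sum(estimated_credits) as window_credits
from warehouse_costs_md
group by 1, 2, 3, 4
order by window_start desc
limit 20;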
lateral (
select timestampdiff(second, active_start, active_end) as diff,
case when warehouse_size is null then 0
when diff < 60 then 60
else diff end as billable_time
)
The snippet above computes the total billable time for the activity window, accounting for Snowflake's 60-second minimum billing period.
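For instance (hypothetical timestamps), a 25-second activity window is still billed as 60 seconds:
select timestampdiff(second,
         '2024-01-01 10:00:00'::timestamp,
         '2024-01-01 10:00:25'::timestamp) as diff,
       case when diff < 60 then 60 else diff end as billable_time; -- 25 -> 60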
case
when warehouse_size is null then 0
else (activity / total_activity) * billable_time end
as execution_share,
case when warehouse_type = 'STANDARD' then 1
when warehouse_type = 'SNOWPARK-OPTIMIZED' then 1.5
else null
end as cost_multiplier,
execution_share / 3600 *
case
when warehouse_size = 'X-Small' then 1
when warehouse_size = 'Small' then 2
when warehouse_size = 'Medium' then 4
when warehouse_size = 'Large' then 8
when warehouse_size = 'X-Large' then 16
when warehouse_size = '2X-Large' then 32
when warehouse_size = '3X-Large' then 64
when warehouse_size = '4X-Large' then 128
when warehouse_size = '5X-Large' then 256
when warehouse_size = '6X-Large' then 512
else
null
end * cost_multiplier as estimated_credits
The snippet above computes the fair cost of a single query: the window's billable time is scaled by the ratio between this query's activity and the total activity of all queries in the activity window, then converted to credits using the warehouse size and type multipliers.
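As a quick sanity check on the formula, with hypothetical numbers: a query responsible for 90 of the 300 seconds of total activity in a 600-second billable window on a Medium STANDARD warehouse (4 credits/hour, cost multiplier 1) is charged 0.2 credits.
select (90 / 300) * 600 as execution_share_seconds,          -- 180 s
       (90 / 300) * 600 / 3600 * 4 * 1 as estimated_credits; -- 0.2 credits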
How accurate is the estimate?
The sum of all the individual query cost estimates should be close to the total compute costs reported in the warehouse metering table. The statement below compares the estimated monthly compute costs against the real ones. Results are sorted in descending order of estimated costs.
with compare as (
select time,
warehouse_name,
sum(query_count) as query_count,
round(sum(estimated_credits),2) as estimated,
round(sum(credits_used_compute),2) as warehouse_metering,
round(sum(estimated_credits) - sum(credits_used_compute),2) as diff,
case when sum(credits_used_compute) = 0 then null else
round((sum(estimated_credits) - sum(credits_used_compute)) / sum(credits_used_compute),2) end
as relative_diff
from (
select substring(end_time,1,7) as time, warehouse_name,
1 as query_count,
estimated_credits as estimated_credits,
0 as credits_used_compute
from warehouse_costs_md
union all
select substring(end_time,1,7) as time, warehouse_name, 0, 0, credits_used_compute
from warehouse_metering_history_backup
where end_time between
(select min(end_time) from warehouse_costs_md)
and
(select max(end_time) from warehouse_costs_md)
)
group by 1,2
)
select *
from compare
order by 4 desc;
Diff contains the difference in credits between the estimate and the actual value. The sum of the estimates is within 6% of the real value, which is accurate enough to reason about query anti-patterns and wasteful warehouse usage.
Now that we have estimated query costs, we can build a meaningful cost dashboard that highlights the top 100 queries by total costs:
SQL for top_100_expensive_patterns:
with top_queries as (
select query_parameterized_hash,
count(*) as cnt,
sum(estimated_credits) as total_credits,
round(sum(estimated_credits) / (
select sum(x.estimated_credits)
from warehouse_costs_md x
where start_time > '2024-01-01')*100,2) as credits_share_pct
from warehouse_costs_md wc
where start_time > '2024-01-01'
group by 1
order by 3 desc
limit 100
)
select
max(cnt) as execution_count,
max(total_credits) as total_credits,
max(credits_share_pct) as credits_share_pct,
max(query_text) as sql_sample
from top_queries tq inner join query_history_backup qh on (tq.query_parameterized_hash = qh.query_parameterized_hash)
where qh.start_time > '2024-01-01'
group by tq.query_parameterized_hash
order by 2 desc;
SQL for top_users:
select substring(start_time, 1, 10) as day, user_name, sum(estimated_credits) as estimated_credits
from warehouse_costs_md
where start_time > '2024-01-01'
group by 1,2
order by 1 asc;
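A possible variant, using the query_tag column already captured in warehouse_costs_md, is daily credits by query tag:
select substring(start_time, 1, 10) as day, query_tag, sum(estimated_credits) as estimated_credits
from warehouse_costs_md
where start_time > '2024-01-01'
group by 1,2
order by 1 asc;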
[1] — https://docs.snowflake.com/en/user-guide/cost-understanding-compute