Reduce your Snowflake costs by accurately estimating the cost of each SQL statement.

Miguel Duarte
7 min read · Mar 4, 2024


Abstract: A way to accurately estimate the fair cost of Snowflake SQL statements using SQL only, leveraging the SQL:2016 match_recognize clause to reconstruct warehouse activity periods from the query history table. The approach is also efficient, computing costs at approximately 160,000 queries per second on an X-SMALL warehouse. The estimates can be used to identify which SQL patterns need optimization.

Snowflake has become my go-to database for heavy analytics in the last two years due to its performance, advanced features and usability.

Unlike other databases, costs are driven mainly by the time virtual warehouses are active[1], so there is a strong incentive to keep your SQL nimble, your warehouse clusters right-sized, and your scaling policies optimal.

Sadly, as of today, users are not provided with good information about how much each SQL statement contributes to the overall cost. In the admin console, users can see how much each warehouse costs, the storage allocation, and a list of the "most expensive queries":

Snowflake Cost Management dashboard

However, this view is misleading, as it only considers total execution time and warehouse size. This approach will both:

  • Overestimate costs: Snowflake warehouses can run multiple queries in parallel in each cluster. Therefore, a query's fair cost should split the warehouse activity cost across all the queries that ran in the same activity window.
  • Underestimate costs: a Snowflake query that executes very quickly (say, 1 ms) but is run every minute can keep a warehouse permanently busy, incurring costs even though most of the warehouse compute capacity goes to waste (see the sketch below).
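
As a back-of-the-envelope sketch of the second point (hypothetical numbers: a Medium warehouse at 4 credits per hour with a 60-second auto-suspend, and a 1 ms query fired every minute), the warehouse never gets a chance to suspend, so the full hour is billed:

-- Hypothetical numbers: 60 runs of 1 ms per hour on a Medium warehouse (4 credits/hour)
-- that never suspends because the 60-second auto-suspend countdown keeps being reset.
select 60 * 0.001              as seconds_of_real_work_per_hour,
       3600                    as seconds_billed_per_hour,
       4                       as credits_billed_per_hour,
       4 * (60 * 0.001) / 3600 as credits_worth_of_actual_work; -- ~0.00007 credits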

What if we could reconstruct the warehouse activity using only the query history table? The query history table provides:

  • Query start and end time
  • Warehouse name
  • Warehouse size
  • Cluster number

This information can be used to reconstruct warehouse cluster activity windows alongside the queries that executed inside them. By applying Snowflake billing rules, we can estimate the total cost of each activity window and split it across its queries, proportionally to each query's share of the total execution time.
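
To make the split concrete, here is a minimal sketch with made-up numbers: two queries share an activity window with 600 billable seconds; query A was active for 90 seconds and query B for 30 seconds, so A is assigned 75% of the window and B 25%:

-- Hypothetical window: 600 billable seconds shared by two queries.
with window_queries(query_id, elapsed_seconds) as (
    select * from values ('A', 90), ('B', 30)
)
select query_id,
       elapsed_seconds / sum(elapsed_seconds) over ()       as activity_share,
       600 * elapsed_seconds / sum(elapsed_seconds) over () as fair_billable_seconds
from window_queries;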

Preparation

Before going into the computation details, we should prepare the database to analyse the query history table efficiently. This step is not strictly necessary, but queries on a local copy can be over ten times faster than the same query on the base snowflake.account_usage table:

-- create an empty copy with the same structure as query_history
create or replace table query_history_backup as
select *
from snowflake.account_usage.query_history
limit 0;

-- copy the data incrementally
insert into query_history_backup
select qh.*
from snowflake.account_usage.query_history qh
where qh.end_time > (
select coalesce(max(end_time), '2000-01-01'::timestamp_ntz)
from query_history_backup
)
order by start_time;

To account for warehouse auto-suspend time, we also need to capture the warehouse configuration:

show warehouses;
create table warehouses as select * from table(result_scan(last_query_id()));

Sadly, Snowflake doesn't currently provide a way to programmatically fetch a warehouse's auto-suspend setting for a specific time interval, so this approach assumes the suspend times haven't changed during the analysed period.
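
To sanity-check this assumption, the captured settings can be inspected directly; the quoted lowercase column names below come from the SHOW WAREHOUSES result set ("size" is assumed to be present alongside "name" and "auto_suspend"):

-- review the auto-suspend configuration captured above
select "name", "size", "auto_suspend"
from warehouses
order by "name";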

To assess how close the estimated costs are to the real warehouse costs, it is also convenient to back up the warehouse metering history table:

create or replace table warehouse_metering_history_backup as
select *
from snowflake.account_usage.warehouse_metering_history
limit 0;

insert into warehouse_metering_history_backup
select wmh.*
from snowflake.account_usage.warehouse_metering_history wmh
where wmh.end_time > (
select coalesce(max(end_time), '2000-01-01'::timestamp_ntz)
from warehouse_metering_history_backup
)
order by start_time;

Efficiently estimating query costs:

create or replace table warehouse_costs_md as
with wh_etl as (
select qhb.query_id,
qhb.query_tag,
qhb.user_name,
qhb.role_name,
qhb.start_time,
qhb.end_time,
min(qhb.start_time) over ( -- see explanation below
partition by qhb.warehouse_name, qhb.warehouse_size, qhb.cluster_number
order by end_time desc
rows between unbounded preceding and current row
) as min_start,
qhb.warehouse_name,
qhb.warehouse_size,
qhb.warehouse_type,
qhb.cluster_number,
qhb.total_elapsed_time,
qhb.credits_used_cloud_services,
qhb.query_parameterized_hash,
w."auto_suspend" as auto_suspend
from query_history_backup qhb
inner join warehouses w
on (qhb.warehouse_name = w."name")
where warehouse_size is not null -- if warehouse size is null compute costs are 0
),
queries as (
select criteria,
match_nr,
warehouse_name,
cluster_number,
warehouse_size,
warehouse_type,
query_id,
query_parameterized_hash,
query_tag,
user_name,
role_name,
start_time,
end_time,
min_start,
active_start,
active_end,
activity,
total_activity,
credits_used_cloud_services,
total_elapsed_time,
billable_time,
case
when warehouse_size is null then 0
else (activity / total_activity) * billable_time end
as execution_share,
case when warehouse_type = 'STANDARD' then 1
when warehouse_type = 'SNOWPARK-OPTIMIZED' then 1.5
else null
end as cost_multiplier,
execution_share / 3600 *
case
when warehouse_size = 'X-Small' then 1
when warehouse_size = 'Small' then 2
when warehouse_size = 'Medium' then 4
when warehouse_size = 'Large' then 8
when warehouse_size = 'X-Large' then 16
when warehouse_size = '2X-Large' then 32
when warehouse_size = '3X-Large' then 64
when warehouse_size = '4X-Large' then 128
when warehouse_size = '5X-Large' then 256
when warehouse_size = '6X-Large' then 512
else
null
end * cost_multiplier as estimated_credits
from wh_etl match_recognize (
partition by warehouse_name, warehouse_size, cluster_number
order by end_time desc
measures
final min(min_start) as active_start,
final max(timestampadd('seconds', auto_suspend, end_time)) as active_end,
total_elapsed_time as activity,
final sum(total_elapsed_time) as total_activity,
match_number() as match_nr,
classifier as criteria
all rows per match
pattern (warehouse_end warehouse_continue*)
define
warehouse_end as true,
warehouse_continue as end_time >= timestampadd(second, -auto_suspend, lag(min_start) )
) as matchr,
lateral (
select timestampdiff(second, active_start, active_end) as diff,
case when warehouse_size is null then 0
when diff < 60 then 60
else diff end as billable_time
)
)
select *
from queries
order by start_time;

Let's unpack things from the query above:

min(qhb.start_time) over (
partition by qhb.warehouse_name, qhb.warehouse_size, qhb.cluster_number
order by end_time desc
rows between unbounded preceding and current row
) as min_start

The snippet above computes, for each row, the minimum start_time across all preceding rows within the partition defined by warehouse_name, warehouse_size and cluster_number, when ordered by end_time in descending order.
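
Here is a minimal, self-contained sketch with made-up timestamps of what min_start looks like for three queries on a single cluster; note how Q1 inherits 10:00 as its min_start because the longer-running Q2 started earlier:

-- Made-up queries on one cluster, ordered by end_time desc. min_start is the
-- earliest start_time seen so far, so long-running overlapping queries keep
-- the window anchored at their start.
with q(query_id, start_time, end_time) as (
    select * from values
        ('Q3', '2024-01-01 10:20:00'::timestamp, '2024-01-01 10:25:00'::timestamp),
        ('Q2', '2024-01-01 10:00:00'::timestamp, '2024-01-01 10:22:00'::timestamp),
        ('Q1', '2024-01-01 10:05:00'::timestamp, '2024-01-01 10:06:00'::timestamp)
)
select query_id, start_time, end_time,
       min(start_time) over (
           order by end_time desc
           rows between unbounded preceding and current row
       ) as min_start
from q
order by end_time desc;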

from wh_etl match_recognize (
partition by warehouse_name, warehouse_size, cluster_number
order by end_time desc
---(...)
pattern (warehouse_end warehouse_continue*)
define
warehouse_end as true,
warehouse_continue as
end_time >= timestampadd(second,
-auto_suspend,
lag(min_start) )
) as matchr,

The snippet above is where the Snowflake magic happens. The match_recognize clause identifies activity windows for each warehouse cluster: each match starts at the end of an activity window (warehouse_end) and extends through every following row that satisfies the warehouse_continue predicate. A row continues the window when its end_time is greater than or equal to the previous row's min_start minus the warehouse's auto_suspend setting.
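
To see the pattern in isolation, here is a minimal, self-contained sketch (made-up rows, with auto_suspend hard-coded to 60 seconds) that splits three queries into two activity windows:

-- Q2 ends after Q3's min_start minus 60 s, so Q2 and Q3 share match 1;
-- Q1 ended hours earlier, so it opens a new activity window (match 2).
with q(query_id, min_start, end_time) as (
    select * from values
        ('Q3', '2024-01-01 10:20:00'::timestamp, '2024-01-01 10:25:00'::timestamp),
        ('Q2', '2024-01-01 10:00:00'::timestamp, '2024-01-01 10:22:00'::timestamp),
        ('Q1', '2024-01-01 08:00:00'::timestamp, '2024-01-01 08:00:05'::timestamp)
)
select query_id, match_nr
from q match_recognize (
    order by end_time desc
    measures match_number() as match_nr
    all rows per match
    pattern (warehouse_end warehouse_continue*)
    define
        warehouse_end as true,
        warehouse_continue as end_time >= timestampadd(second, -60, lag(min_start))
);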

from wh_etl match_recognize (
-- (...)
measures
final min(min_start) as active_start,
final max(timestampadd('seconds', auto_suspend, end_time)) as active_end,
total_elapsed_time as activity,
final sum(total_elapsed_time) as total_activity,
match_number() as match_nr,
classifier as criteria
--(...)

For each activity window, we're computing:

  • When it started (active_start)
  • When it ended (active_end), taking the warehouse auto-suspend time into account
  • How long the current query was active inside the window (activity)
  • The total activity of all the queries in the current activity window (total_activity)
  • The match number (match_nr) and the predicate that matched (criteria), used for debugging

lateral (
select timestampdiff(second, active_start, active_end) as diff,
case when warehouse_size is null then 0
when diff < 60 then 60
else diff end as billable_time
)

The snippet above computes the total billable time for the activity window, accounting for Snowflake's minimum billing of 60 seconds per activity window.
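
For example, with hypothetical window durations, anything under 60 seconds is rounded up to one minute:

-- Windows of 5 s and 45 s are billed as 60 s; a 600 s window is billed as-is.
select column1 as window_seconds,
       case when column1 < 60 then 60 else column1 end as billable_time
from values (5), (45), (600);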

case
when warehouse_size is null then 0
else (activity / total_activity) * billable_time end
as execution_share,
case when warehouse_type = 'STANDARD' then 1
when warehouse_type = 'SNOWPARK-OPTIMIZED' then 1.5
else null
end as cost_multiplier,
execution_share / 3600 *
case
when warehouse_size = 'X-Small' then 1
when warehouse_size = 'Small' then 2
when warehouse_size = 'Medium' then 4
when warehouse_size = 'Large' then 8
when warehouse_size = 'X-Large' then 16
when warehouse_size = '2X-Large' then 32
when warehouse_size = '3X-Large' then 64
when warehouse_size = '4X-Large' then 128
when warehouse_size = '5X-Large' then 256
when warehouse_size = '6X-Large' then 512
else
null
end * cost_multiplier as estimated_credits

The snippet above computes the fair cost of a single query: the window's billable time is split in proportion to the ratio between this query's activity and the total activity of all queries in the activity window, and the result is converted to credits using the warehouse size and type multipliers.
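
As a worked example with hypothetical numbers: a query with 30 seconds of activity inside a window with 120 seconds of total activity and 600 seconds of billable time, on a STANDARD Medium warehouse (4 credits per hour), would be assigned roughly 0.17 credits:

-- execution_share = (activity / total_activity) * billable_time = 150 s
-- estimated_credits = execution_share / 3600 * size_multiplier * type_multiplier
select (30 / 120) * 600                 as execution_share_seconds,
       (30 / 120) * 600 / 3600 * 4 * 1  as estimated_credits; -- ~0.1667 credits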

How accurate is the estimate?

The sum of all the individual query cost estimates should be close to the total compute cost reported in the warehouse metering table. The statement below compares the estimated monthly compute costs against the real ones. Results are sorted in descending order of estimated costs.

with compare as (
select time,
warehouse_name,
sum(query_count) as query_count,
round(sum(estimated_credits),2) as estimated,
round(sum(credits_used_compute),2) as warehouse_metering,
round(sum(estimated_credits) - sum(credits_used_compute),2) as diff,
case when sum(credits_used_compute) = 0 then null else
round((sum(estimated_credits) - sum(credits_used_compute)) / sum(credits_used_compute),2) end
as relative_diff
from (
select substring(end_time,1,7) as time, warehouse_name,
1 as query_count,
estimated_credits as estimated_credits,
0 as credits_used_compute
from warehouse_costs_md
union all
select substring(end_time,1,7) as time, warehouse_name, 0, 0, credits_used_compute
from warehouse_metering_history_backup
where end_time between
(select min(end_time) from warehouse_costs_md)
and
(select max(end_time) from warehouse_costs_md)
)
group by 1,2
)
select *
from compare
order by 4 desc;

The diff column contains the difference in credits between the estimate and the actual value. The sum of the estimates is within 6% of the real value, which is accurate enough to reason about query anti-patterns and wasteful warehouse usage.

Now that we have estimated query costs, we can build a meaningful cost dashboard that highlights the top 100 query patterns by total cost:

SQL for top_100_expensive_patterns:

with top_queries as (
select query_parameterized_hash,
count(*) as cnt,
sum(estimated_credits) as total_credits,
round(sum(estimated_credits) / (
select sum(x.estimated_credits)
from warehouse_costs_md x
where start_time > '2024-01-01')*100,2) as credits_share_pct
from warehouse_costs_md wc
where start_time > '2024-01-01'
group by 1
order by 3 desc
limit 100
)
select
max(cnt) as execution_count,
max(total_credits) as total_credits,
max(credits_share_pct) as credits_share_pct,
max(query_text) as sql_sample
from top_queries tq
inner join query_history_backup qh
on (tq.query_parameterized_hash = qh.query_parameterized_hash)
where qh.start_time > '2024-01-01'
group by tq.query_parameterized_hash
order by 2 desc;

SQL for top_users:

select substring(start_time, 1, 10) as day, user_name, sum(estimated_credits) as estimated_credits
from warehouse_costs_md
where start_time > '2024-01-01'
group by 1,2
order by 1 asc;

[1] — https://docs.snowflake.com/en/user-guide/cost-understanding-compute
