Databricks has become a default choice for big data computation in Azure, on its own merit. As more and more clients embrace it (and Apache Spark) for their versatile use cases, some have started complaining about the hefty Azure bill they receive, and about Azure Databricks’ contribution to it!
Though cloud services have cut infrastructure & service provisioning time from months to seconds, appropriate governance & controls have become all the more important.
So, instead of blaming the cloud service (here, Databricks), why not learn the cost optimization techniques and spend money based on our business needs only?
First, let’s look at how to get the cost information for Azure Databricks. We’ll go straight to the Cost Management + Billing section & select Cost Management > Cost analysis for the subscription.
We generally look for Azure Databricks under the Service name dashboard, but that only gives the cost of the Azure Databricks service itself; the actual cost is higher once we include the cost contributed by the underlying Azure infrastructure: virtual machines, storage, virtual network, etc.
So, if we really want to understand the total cost of a particular Databricks installation or instance, we should check the Cost analysis of the:
(i) Resource group, for Databricks service cost and
(ii) Managed Resource Group, for Azure infrastructure cost.
1. Click on the Resource group > go to Cost Management > Cost analysis > check the cost of the Azure Databricks service.
2. Click on the Managed Resource Group > go to Cost Management > Cost analysis > check the cost split across the different infrastructure components used.
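If you prefer scripting over the portal, the same breakdown can be pulled from the Azure Cost Management Query REST API. The sketch below only builds the request body that groups actual cost by resource group; the subscription scope, endpoint URL, and authentication token are assumptions you’d fill in for your own environment:

```python
# Sketch: build a Cost Management "query" request body that groups actual costs
# by resource group, so the Databricks resource group and its managed resource
# group show up side by side. You'd POST this to (scope is an assumption):
#   https://management.azure.com/subscriptions/{subscription_id}
#     /providers/Microsoft.CostManagement/query?api-version=2021-10-01

def build_cost_query(granularity="Monthly"):
    """Return a Cost Management query body: month-to-date actual cost per resource group."""
    return {
        "type": "ActualCost",
        "timeframe": "MonthToDate",
        "dataset": {
            "granularity": granularity,
            "aggregation": {
                "totalCost": {"name": "Cost", "function": "Sum"},
            },
            "grouping": [
                {"type": "Dimension", "name": "ResourceGroupName"},
            ],
        },
    }

body = build_cost_query()
print(body["dataset"]["grouping"][0]["name"])  # ResourceGroupName
```

Running this query for both the resource group and the managed resource group gives the combined picture described above.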
Once we have these details at the resource group or project level we can then prioritize the projects which need cost optimization based on their spending.
The Databricks cost mostly depends on the following items:
- Infrastructure: the Azure VM instance types & numbers (for the driver & workers) we choose while configuring the Databricks cluster. In addition, costs are incurred for managed disks, public IP addresses, and any other resources such as Azure Storage etc.
- Pricing Tier: Premium, Standard.
- Workload: Data Analytics, Data Engineering, Data Engineering Light.
- Apache Spark coding (how optimized our code is — very important; however, out of scope for today’s discussion).
In this blog we’ll look at the first three points, which can help us save some cost. (Please go here for the latest Azure Databricks pricing details.)
Note: Most of the optimizations described below are more relevant for development environments, where we can afford less powerful clusters.
Infrastructure — choosing the right VM types
For this analysis, let us assume —
- Number of VM (worker/driver) = 1
- Hours to run per day = 24
- Days to run = 30 (so total hours/month = 720 hours)
- Cores I’m looking for (especially for development) = 8
- General workload, while we’re in development phase
Now if we take the low end VMs and calculate the VM costs & DBU costs, we can come up with the total costs of different cluster types.
1. VM Cost = [Total Hours] × [No. of Instances] × [Linux VM Price]
2. DBU Cost = [Total Hours] × [No. of Instances] × [DBU] × [DBU Price/hour, Standard/Premium Tier]
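As a quick sanity check, the two formulas are easy to script. The rates below are placeholders, not current Azure list prices; substitute the VM and DBU rates for your own region and tier:

```python
def vm_cost(total_hours, instances, vm_price_per_hour):
    # VM Cost = [Total Hours] x [No. of Instances] x [Linux VM Price]
    return total_hours * instances * vm_price_per_hour

def dbu_cost(total_hours, instances, dbu_per_hour, dbu_price):
    # DBU Cost = [Total Hours] x [No. of Instances] x [DBU] x [DBU Price/hour]
    return total_hours * instances * dbu_per_hour * dbu_price

# Assumptions from the text: 1 VM, 24 hours/day, 30 days => 720 hours/month.
hours = 24 * 30
# Placeholder rates (NOT real prices): $0.50/h VM, 1.5 DBU/h, $0.40 per DBU-hour.
total = vm_cost(hours, 1, 0.50) + dbu_cost(hours, 1, 1.5, 0.40)
print(f"${total:.2f}/month")  # $792.00/month
```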
The tables above show the total costs of a few cluster types for different pricing tiers & workloads. If we zoom into the green boxes, the three General Purpose types are generally the cheaper options.
Now, if we want to select one among the 3 cheapest VM types, we should find the best option in terms of performance. Find below another comparison (for further details, check here):
Though DS4 v2 has less memory than D8 v3 & D8s v3 and is costlier as well, it is better in terms of storage & disk throughput and network bandwidth.
Considering these we can choose Standard_DS4_v2 for our driver and worker VM types to start with.
We can also run the desired workload on different VM types and measure the job completion time.
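A minimal harness for that comparison might look like the sketch below. The VM type labels and the dummy workload are illustrative only; in practice `run_workload` would submit your real Spark job to a cluster of each type:

```python
import time

def time_workload(run_workload, cluster_label):
    """Run the given workload callable once and report wall-clock time.
    `run_workload` is a stand-in for submitting your actual Spark job."""
    start = time.perf_counter()
    run_workload()
    elapsed = time.perf_counter() - start
    print(f"{cluster_label}: {elapsed:.2f}s")
    return elapsed

# Dummy workload for illustration; replace with your real job submission.
results = {label: time_workload(lambda: sum(range(10**6)), label)
           for label in ("Standard_DS4_v2", "Standard_D8_v3", "Standard_D8s_v3")}
fastest = min(results, key=results.get)
```

Combining the measured time with each VM type’s hourly cost gives a simple cost-per-job figure to compare.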
- For specific use cases where we’re looking for performance optimizations in the development environment as well, we can move to memory-/storage-/compute-optimized cluster types.
- For production workloads, we should select the cluster type based on performance first, and then consider the cost.
Choosing the right Pricing Tier & Workload Type
Let’s see a short description about the tiers & types:
- Data Analytics — Interactive workloads.
- Data Engineering — Job cluster (faster).
- Data Engineering Light — Job cluster, but many Databricks features are not supported.
- Premium — RBAC, JDBC/ODBC Endpoint Authentication, Audit logs (preview)
- Standard — Interactive, Delta, collaboration, ML flow etc.
Check the full comparison from here.
Now, if we assume Cluster type = General Purpose, VM Type = Standard_DS4_v2, Hours to Run per Day = 24, and Days to Run = 30, then for the different Databricks pricing tiers & workloads the total cost would be:
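The same comparison can be scripted. The DBU rates per tier/workload below are placeholders I’ve filled in for illustration (check the Azure Databricks pricing page for real, current numbers), and the Standard_DS4_v2 is assumed at 1.5 DBU/hour with a $0.50/hour VM price:

```python
# Placeholder DBU prices ($ per DBU-hour) per (tier, workload). NOT authoritative;
# look up current rates on the Azure Databricks pricing page.
DBU_RATES = {
    ("Standard", "Data Engineering Light"): 0.07,
    ("Standard", "Data Engineering"):       0.15,
    ("Standard", "Data Analytics"):         0.40,
    ("Premium",  "Data Engineering Light"): 0.22,
    ("Premium",  "Data Engineering"):       0.30,
    ("Premium",  "Data Analytics"):         0.55,
}
HOURS, VM_PRICE, DBU_PER_HOUR = 24 * 30, 0.50, 1.5  # assumptions from the text

def total_cost(tier, workload):
    # monthly cost = VM cost + DBU cost, for a single instance
    return HOURS * (VM_PRICE + DBU_PER_HOUR * DBU_RATES[(tier, workload)])

for (tier, workload) in sorted(DBU_RATES, key=lambda k: total_cost(*k)):
    print(f"{tier:8s} {workload:22s} ${total_cost(tier, workload):8.2f}/month")
```

Sorting by total cost makes the spread between Premium Data Analytics and Standard Data Engineering Light immediately visible.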
If we observe the above, we can find:
- Premium — Data Analytics — very costly!
- Standard — Data Engineering / Light — cheaper.
- Data Engineering Light — slower; the cheaper ‘total cost’ may be offset by its lower speed!
- For development purposes, start with a smaller cluster; General Purpose — Standard_DS4_v2 or similar VMs should give a cost benefit compared to other types.
- Go for compute-/memory-optimized and other special cluster types for your specific use cases only.
- In most cases, we probably don’t need the Databricks Premium tier in development. If you already have it and want to migrate, follow this.
- For development, you can use the interactive cluster (Data Analytics); for testing, try to use the job cluster.
- A very common ‘costly’ usage: configuring a Databricks interactive cluster as a linked service for ADFv2 pipelines.
- Try to use auto-scaling wherever possible.
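Auto-scaling is configured on the cluster itself: the Databricks Clusters API accepts an `autoscale` block (min/max workers) in place of a fixed `num_workers`. A sketch of such a spec is below; the cluster name, runtime version, and worker counts are illustrative choices, not recommendations:

```python
# Sketch of a cluster spec for the Databricks Clusters API, using an `autoscale`
# block instead of a fixed `num_workers` so the cluster shrinks when idle.
cluster_spec = {
    "cluster_name": "dev-autoscaling",    # hypothetical name
    "spark_version": "7.3.x-scala2.12",   # pick a runtime supported in your workspace
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {
        "min_workers": 1,   # floor: scale down to 1 worker under light load
        "max_workers": 4,   # ceiling: caps the spend even under heavy load
    },
    "autotermination_minutes": 30,  # also stop idle interactive clusters
}
```

`autotermination_minutes` pairs well with auto-scaling for development clusters, since forgotten interactive clusters are a classic source of waste.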
- For a production cluster you’ll probably need the Premium tier, as it supports one important feature: role-based access control.
Few more tips
- Group small batches into larger ones, to reduce repeated VM warm-ups & cool-downs and improve cluster utilization.
- Choose a small number of larger VM types over a large number of smaller VM types, to reduce data shuffling.
- Keep data & computations in the same region, to avoid inter-region data transfers.
- Watch out for unused ADFv2 pipelines — once development phase is over and we move on, we may forget to stop the running pipelines which would be hitting Databricks clusters incurring unnecessary costs. Use Azure Advisor to identify failing ADF pipelines.
Thanks for reading. In case you have any use case and want to connect, please ping me via LinkedIn.