Databricks — AutoML & Model Serving
In our previous blog, we covered the different MLflow components, concentrating on tracking runs, managing models, and deploying them to the Model Registry. In this blog, we'll look at the Databricks AutoML feature and MLflow Model Serving.
AutoML
Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.
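To make the "set of trials" concrete, here is a small illustrative sketch (plain scikit-learn, not AutoML itself) of the kind of loop AutoML automates: fit several candidate models, score each on a validation split, and rank them by a metric. The model choices, hyperparameters, and synthetic dataset below are my own stand-ins, not what AutoML actually runs.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for a real dataset such as California housing.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A handful of "trials": different model families and hyperparameters.
trials = {
    "ridge_alpha_1": Ridge(alpha=1.0),
    "ridge_alpha_10": Ridge(alpha=10.0),
    "random_forest_50": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate and record its validation R^2, as AutoML records metrics per trial.
results = {}
for name, model in trials.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_val, model.predict(X_val))

# Pick the best-scoring trial, analogous to AutoML surfacing the best run.
best = max(results, key=results.get)
print(best, round(results[best], 3))
```

AutoML additionally generates a reviewable notebook per trial and a data-exploration notebook, which this sketch does not attempt to reproduce.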
To start with, let’s first prepare the dataset (I have taken a sample dataset from sklearn) to train the model and save this as a Delta table.
%python
import sklearn.datasets

# fetch_california_housing returns a Bunch; .frame gives the pandas DataFrame
input_pdf = sklearn.datasets.fetch_california_housing(as_frame=True).frame
chDf = spark.createDataFrame(input_pdf)

chDf.write \
    .format("delta") \
    .save("/mnt/delta/california_housing")

spark.sql("CREATE TABLE default.california_housing USING DELTA LOCATION '/mnt/delta/california_housing'")
Once the training dataset is ready, we can use the Databricks AutoML experience to train multiple models and present the results for comparison.
We'll select a cluster (I'm using Azure Databricks Runtime for Machine Learning 8.3 ML Beta). Since this is a regression problem, we'll set the ML problem type to Regression (the other available option is Classification).
Next, we'll browse to and select the training dataset, i.e. the Delta table we created, and choose the prediction target column.
Once the configuration is set, AutoML starts training. By default, the process runs for up to 60 minutes (the timeout is configurable), and we can stop it early if required.
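The same experiment can also be launched programmatically rather than through the UI. Below is a minimal sketch using the `databricks.automl` Python API (available on Databricks ML runtimes only, so it won't run elsewhere); the target column name `MedHouseVal` is scikit-learn's name for the California housing target, which the blog itself does not specify.

```python
# Sketch: launching the regression experiment via the AutoML Python API.
# Runs only on a Databricks ML runtime; shown here for illustration.
from databricks import automl

# The Delta table we registered earlier.
df = spark.table("default.california_housing")

summary = automl.regress(
    dataset=df,
    target_col="MedHouseVal",  # sklearn's target column for California housing
    timeout_minutes=60,        # same default as the UI-driven run
)

# The returned summary exposes the best trial and its generated notebook.
print(summary.best_trial.model_path)
```

As with the UI flow, each trial's generated notebook can be opened, reviewed, and modified afterwards.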