Azure Databricks & Version Management

Azure Databricks supports notebook integration with GitHub, Bitbucket Cloud, and Azure DevOps Services. In this post we’ll see how Databricks can be integrated with GitHub and Azure DevOps Repos.

Integration of Databricks with GitHub

1.1. Create a GitHub Personal Access Token: In GitHub, go to Settings > Developer settings (ref: https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/).

[Figure: Select Developer settings]

Select Personal access tokens. Add a token description and the appropriate scopes, then press the ‘Generate token’ button.

[Figure: Add details in the Personal access tokens screen]

Copy the token and store it safely; GitHub displays it only once.

[Figure: Note the token]

1.2. Create a Git Repository:

[Figure: Create a new repository]
[Figure: Note the .git link]

2.1. Git Provider Selection: In Databricks, go to User Settings > Git Integration. Select ‘GitHub’ as the Git provider, paste the token copied earlier, and save.

[Figure: Git provider set-up]

2.2. Individual Notebook Integration with Git: Go to your notebook and select Revision history. By default (if no external Git repository is linked), Databricks manages the notebook's versions itself.

[Figure: Select Revision history]

Click on ‘Git: Not linked’, add the repository link, select the appropriate branch, enter the folder (if any) in ‘Path in Git Repo’, and save. If the inputs are valid, the notebook will be synced with Git.

[Figure: Input the Git details]
[Figure: Add comments and save]

This approach is fine for committing an individual Databricks notebook to Git periodically. However, for checking in many notebooks at once or for Continuous Integration/Continuous Delivery (CI/CD), we should use the Databricks CLI along with the Git command line. The steps are described below.

3.1. Databricks CLI with Connection Profiles:

databricks configure --profile <profile name> --token

[Figure: Configuring the CLI for a connection profile named Development]

Providing a profile name is useful when we want to connect the Databricks CLI to more than one Databricks workspace/instance, e.g. Development and Test.
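
As a rough illustration of what a connection profile is: the legacy Databricks CLI keeps each profile as an INI-style section (with `host` and `token` keys) in `~/.databrickscfg`. The sketch below writes a sample file to a temporary location, so the real configuration is not touched, and lists the profile names it contains. The host URLs and tokens are made-up placeholders, not values from this post.

```shell
#!/bin/sh
set -eu

# Stand-in for ~/.databrickscfg; the real file is left untouched.
cfg="$(mktemp)"

# Two hypothetical profiles, one per workspace (placeholder hosts and tokens).
cat > "$cfg" <<'EOF'
[Development]
host = https://adb-dev.example.azuredatabricks.net
token = dapi-PLACEHOLDER-DEV

[Test]
host = https://adb-test.example.azuredatabricks.net
token = dapi-PLACEHOLDER-TEST
EOF

# A profile name is the text between [ and ] on a section line.
profiles="$(sed -n 's/^\[\(.*\)\]$/\1/p' "$cfg")"
echo "$profiles"
```

Each name listed this way is what would be passed to `--profile` in the commands that follow.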

3.2. Export Databricks Workspace to Local Computer

databricks workspace --profile <profile name> export_dir <Databricks Workspace/folder> <target local folder/repo>
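
A minimal sketch of how the export command might be filled in and wrapped for repeated use. The profile name, workspace folder, and local folder are hypothetical values, and the `run` helper and `DRY_RUN` variable are inventions of this sketch rather than part of the CLI. The `-o` flag asks the legacy CLI's `export_dir` to overwrite existing local files. By default the script only prints the command it would run.

```shell
#!/bin/sh
set -eu

# Hypothetical values; replace with your own profile and paths.
profile="Development"
workspace_dir="/Shared/DatabricksDemo"
local_dir="./DatabricksDemo"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# -o overwrites files that already exist in the local folder.
run databricks workspace --profile "$profile" export_dir -o "$workspace_dir" "$local_dir"
```

Setting `DRY_RUN=0` in the environment would execute the export for real, which requires the CLI to be installed and the profile to be configured.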

[Figure: Export Databricks code to local repo]

3.3. Check in the code from the local folder to GitHub

cd <target local folder/repo>
git init
git add .
git commit -m "first commit"
git remote add origin https://github.com/cprosenjit/DatabricksDemo.git
git push -u origin master

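While pushing the code to the remote origin, you’ll be asked for your user ID and password. GitHub no longer accepts account passwords for Git operations over HTTPS, so enter the personal access token created in step 1.1 in place of the password.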
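
To rehearse the check-in without touching GitHub, the same sequence can be tried against a throwaway local bare repository. This is only a sketch: the temporary paths, the placeholder notebook file, and the demo user identity are all hypothetical, and the bare repository stands in for the real remote URL.

```shell
#!/bin/sh
set -eu

workdir="$(mktemp -d)"            # scratch area, safe to delete afterwards
remote="$workdir/remote.git"      # hypothetical stand-in for the GitHub URL
repo="$workdir/DatabricksDemo"    # stand-in for the exported local folder

git init --quiet --bare "$remote"
mkdir -p "$repo"
printf 'print("hello")\n' > "$repo/demo_notebook.py"   # placeholder notebook

cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master   # name the initial branch explicitly
git add .                                  # stage everything, including dotfiles
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "first commit"
git remote add origin "$remote"
git push --quiet -u origin master

# Confirm that the stand-in remote now holds the commit.
git --git-dir="$remote" log --oneline master
```

Because the remote here is a local path, no credentials are requested; with the real GitHub URL the push would prompt for them.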

[Figure: After successful check in]

Integration of Azure Databricks with Azure DevOps Repos

1.1. Launch Azure DevOps and create a new project under your selected organization. Set ‘Visibility’ (Public or Private) and ‘Version control’ as appropriate. As per Microsoft,

Git is the default version control provider for new projects. You should use Git for version control in your projects unless you have a specific need for centralized version control features in TFVC.

[Figure: Create new project]
[Figure: The project is created and will be listed under the selected ‘My organizations’]

1.2. Go to the Repo and copy the URL.

[Figure: Copy the URL]

2.1. Integrate Azure Databricks with Azure DevOps: In Databricks, go to User Settings > Git Integration and select Azure DevOps Services as the Git provider. No extra authentication needs to be supplied.

[Figure: Select Git provider as Azure DevOps Services]

2.2. Select a notebook, click on ‘Revision history’, enter the Git repo link, select the branch, and enter the path as appropriate.

[Figure: Input the link copied before]
[Figure: Input commit message]
[Figure: Git will be synced]
[Figure: Azure Repo will have the code now]

Any further changes in the code can be manually committed into the Repo.

Pushing individual notebooks to the repository manually is quite laborious, so we would like to use the Databricks CLI to download the code to the developer’s machine and upload it to the repository using the Git command line.

Follow the steps ‘Databricks CLI with Connection Profiles’ and ‘Export Databricks Workspace to Local Computer’ in the previous section.

Once the code is exported to your local directory, use the Git command line to check the code in to Azure Repos.

cd <target local folder/repo>
git init
git add .
git commit -m "first commit"
git remote add origin https://<xxxx/yyyy>/DatabricksDemo/_git/DatabricksDemo
git push -u origin master
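
For repeated check-ins after later exports, one possible refinement is to commit only when the export actually changed something. The sketch below demonstrates the idea in a temporary local repository; the `commit_if_changed` helper, the placeholder notebook file, and the demo user identity are hypothetical and not part of Git or of this post's setup.

```shell
#!/bin/sh
set -eu

repo="$(mktemp -d)"      # throwaway repository standing in for the export folder
cd "$repo"
git init --quiet
printf 'print("v1")\n' > demo_notebook.py   # placeholder exported notebook

# Stage everything, then commit only if the index differs from the last commit.
commit_if_changed() {
  git add .
  if git diff --cached --quiet; then
    echo "no changes"
  else
    git -c user.name="Demo User" -c user.email="demo@example.com" \
        commit --quiet -m "$1"
    echo "committed"
  fi
}

first="$(commit_if_changed "first commit")"     # new file, so a commit is made
second="$(commit_if_changed "second export")"   # nothing new, so it is skipped
echo "$first / $second"
```

In a real workflow the helper would run after each export, followed by a `git push` to the Azure Repos remote whenever a commit was made.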

Written by

Tech enthusiast, Azure Big Data Architect.
