Azure Databricks with Azure Key Vault

Why?
When connecting to any output storage or system from Databricks, we need to provide user IDs/passwords or access keys. These secrets sit in clear text, and anyone with access to the Databricks workspace can see them!

A few examples below:
1. Connection setting to Azure Blob Storage
%scala
spark.conf.set("fs.azure.account.key.<storage_account>.blob.core.windows.net", "<storage_account_access_key in clear text>")
2. Connection setting to Azure SQL DW
%scala
val df = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<server-name>:1433;database=<database_name>;user=<user>;password=<password in clear text>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;")
  .option("tempdir", "wasbs://<container>@<storage_account>.blob.core.windows.net/<directory>")
  .option("forward_spark_azure_storage_credentials", "true")
  .option("query", "SELECT * FROM MyTable WHERE PrimaryKey = 123456")
  .load()
What is the Solution?
There are a couple of solutions:
1. Create a secret in a Databricks-backed scope
Refer to https://docs.azuredatabricks.net/user-guide/secrets/secrets.html for further details.
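For this option, the scope and secret can also be created with the Databricks CLI. A minimal sketch, assuming the CLI is already configured for the workspace and using the placeholder names my-scope and MySecret:
# Create a Databricks-backed secret scope
databricks secrets create-scope --scope my-scope
# Store a secret value in it
databricks secrets put --scope my-scope --key MySecret --string-value "<secret value>"
# Only secret names/metadata are listed, never the values
databricks secrets list --scope my-scope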
2. Create a secret in an Azure Key Vault-backed scope
Azure Key Vault integration with Azure Databricks is in preview mode at the time of this writing.
Step 1: Create a new Key Vault
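The vault can be created through the Azure portal; equivalently, a minimal sketch with the Azure CLI (resource group, vault name, and region below are placeholders):
az keyvault create --name my-key-vault --resource-group my-resource-group --location eastus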



Step 2: Create a Secret
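Add the secret to the vault from the portal, or with the Azure CLI (the vault and secret names are the placeholders used throughout this post):
az keyvault secret set --vault-name my-key-vault --name MySecret --value "<the actual password or access key>"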


Step 3: Create a Secret Scope at Databricks
Open the Databricks Create Secret Scope link: https://<Databricks link>#secrets/createScope
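The Create Secret Scope form asks for a scope name plus the Key Vault's DNS Name (the vault URI) and Resource ID. Both values are shown on the vault's Properties blade, or can be read with the Azure CLI (the vault name is the placeholder from Step 1):
az keyvault show --name my-key-vault --query "{dnsName: properties.vaultUri, resourceId: id}"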

Azure Databricks is now linked with the Azure Key Vault!
Step 4: Use the Secrets from Azure Databricks
We can access the secret value from Azure Key Vault by using the following:
dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "MySecret")
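To confirm the scope is wired up, we can list scopes and secret names from a notebook; only names and metadata are returned, never the values:
%scala
dbutils.secrets.listScopes()                      // should include my-azure-key-vault-scope
dbutils.secrets.list("my-azure-key-vault-scope")  // secret names/metadata only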
So, the connections to the external systems can be changed as follows:
1. Azure Blob Storage:
%scala
spark.conf.set("fs.azure.account.key.<storage_account>.blob.core.windows.net", dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "secret-access-key"))
2. Azure SQL Data Warehouse:
%scala
val df = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", s"""jdbc:sqlserver://<server-name>:1433;database=<database_name>;user=${dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "username")};password=${dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "MySecret")};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;""")
  .option("tempdir", "wasbs://<container>@<storage_account>.blob.core.windows.net/<directory>")
  .option("forward_spark_azure_storage_credentials", "true")
  .option("query", "SELECT * FROM MyTable WHERE PrimaryKey = 123456")
  .load()
Now, users with access to the Databricks notebooks can see only the Azure Key Vault secret names, not the actual values! They can still use the secrets to reach the external systems from their notebooks, but Databricks redacts the actual values in notebook execution output.
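For example, even an explicit attempt to print a secret comes back redacted in the cell output (a quick sanity check, not something to leave in production code):
%scala
// The output shows a redaction marker instead of the secret value
println(dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "MySecret"))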

If we want to apply role-based access control to restrict who can view or access a notebook at all, we should follow: https://docs.microsoft.com/en-us/azure/role-based-access-control/role-assignments-portal
That’s all?
Wait! Databricks maintains a revision history for each notebook, so we need to clean it up to remove any trace of the passwords. Take some care, as this could also remove important version changes.
If we have already integrated Git with Databricks, we need to remove the passwords/access keys from the version control system as well.
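If a secret has already been committed, rotate or regenerate it first, then rewrite the Git history. A simplified sketch using the BFG Repo-Cleaner (passwords.txt listing the leaked literals and the repository name are placeholders; see the BFG documentation for the full procedure):
# Replace every occurrence of the leaked literals across all commits
bfg --replace-text passwords.txt my-repo.git
# Rewriting history requires a force push afterwards
git push --force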

Cautions
Azure Key Vault comes with some extra cost; for details, see: https://azure.microsoft.com/en-us/pricing/details/key-vault/
Update — 25th June 2020
Never pass secrets to another notebook as parameters; fetch and use them locally! Clear-text parameter values can be retrieved later from the job execution history.
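A sketch of the anti-pattern and the fix; the notebook path and parameter name are hypothetical:
%scala
// Anti-pattern: the secret becomes a clear-text parameter, visible in the job run history
dbutils.notebook.run("./ChildNotebook", 60,
  Map("password" -> dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "MySecret")))

// Better: pass no secret; the child notebook fetches it locally
dbutils.notebook.run("./ChildNotebook", 60, Map.empty)
// ...and inside ChildNotebook:
val password = dbutils.secrets.get(scope = "my-azure-key-vault-scope", key = "MySecret")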