As organizations feel the need to migrate to the cloud, they worry about how to protect their applications and data from any kind of unauthorized access (like protecting a castle from enemy attacks!). Since public cloud systems can be reached from anywhere, protecting data from theft has become challenging.
To mitigate this, while designing a big data system we may consider creating data stores and Spark clusters inside our own appropriately locked-down virtual networks — the IAAS model — but then we’ll miss out on all of the latest and greatest features available in PAAS!
In this blog, we’ll see how we can restrict access to our data and Spark environment by employing:
(i) Azure Databricks with Azure Active Directory Conditional Access Policies.
(ii) Azure Blob Storage & Azure Data Lake Storage (Gen2) — Service Endpoints.
Create a Virtual Network & a Subnet
To start, we’ll first create a Virtual Network (VNet) and a public subnet. We’ll create a virtual machine (VM) in the public subnet and use it as a jump box to connect to the Azure Databricks workspace, Azure Blob Storage, or Azure Data Lake Storage Gen2.
We want to connect to our big data components from the jump box only!
To keep this example simple, we haven’t configured Azure Firewall; however, it can be enabled for further protection.
Find below the CIDR ranges we have used for our example.
- VNet — CIDR Range = 10.0.0.0/20
- Subnet-A — CIDR Range = 10.0.1.0/24 (IP Range = 10.0.1.0–10.0.1.255)
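The VNet and subnet above can also be provisioned from the command line. Below is a minimal sketch using the Azure CLI; the resource group, region, and resource names are hypothetical, and the CIDR ranges are the ones listed above.

```shell
# Assumptions: Azure CLI installed and logged in (az login);
# the resource group, region, and names below are hypothetical.
az group create --name rg-bigdata-demo --location eastus

# VNet with the CIDR range from this example
az network vnet create \
  --resource-group rg-bigdata-demo \
  --name vnet-bigdata \
  --address-prefixes 10.0.0.0/20

# Public subnet for the jump box
az network vnet subnet create \
  --resource-group rg-bigdata-demo \
  --vnet-name vnet-bigdata \
  --name Subnet-A \
  --address-prefixes 10.0.1.0/24
```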
Create an Azure Databricks Workspace
Once we have our virtual network ready, we’ll install Azure Databricks on this VNet. Though this feature (installing in our own VNet, also called VNet injection) is currently in preview, it works fine for this example.
Find below the CIDRs we have used:
- databricks-public-subnet — CIDR Range = 10.0.2.0/24 (IP Range = 10.0.2.0–10.0.2.255)
- databricks-private-subnet — CIDR Range = 10.0.3.0/24 (IP Range = 10.0.3.0–10.0.3.255)
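The portal handles the subnet setup for us, but for reference, a workspace with VNet injection can also be created with the Azure CLI’s databricks extension. This is a hedged sketch: all names are hypothetical, and it assumes the two Databricks subnets already exist with the delegation and network security group that VNet injection requires.

```shell
# Assumptions: the Azure CLI "databricks" extension is installed
# (az extension add --name databricks), and the two subnets above
# already exist, delegated to Microsoft.Databricks/workspaces with
# an NSG attached; all names are hypothetical.
az databricks workspace create \
  --resource-group rg-bigdata-demo \
  --name databricks-bigdata \
  --location eastus \
  --sku premium \
  --vnet vnet-bigdata \
  --public-subnet databricks-public-subnet \
  --private-subnet databricks-private-subnet
```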
Once we have created the Databricks service, we can open the Workspace URL from any machine!
Deploying Azure Databricks in your Azure Virtual Network — Databricks Documentation
This section describes how to create an Azure Databricks workspace in the Azure Portal and deploy it in your own…
Create a VM & associate a Public IP
While creating the VNet we have already created a subnet, Subnet-A. Now we’ll provision a Windows VM and associate a public IP with it. We can connect to the VM from outside the virtual network using the public IP.
For this demo, we’re not creating any special Network Security Group for NIC.
Once the VM is created, download the RDP file and connect to the VM (we could use Linux or any other OS as well).
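The jump box itself can be provisioned with a single Azure CLI command. A sketch follows; the names and admin username are hypothetical, and the image alias may differ depending on your CLI version.

```shell
# Hypothetical names; creates a Windows jump box in Subnet-A with a
# public IP so it can be reached over RDP from outside the VNet.
# The image alias may vary by CLI version.
az vm create \
  --resource-group rg-bigdata-demo \
  --name jumpbox-vm \
  --image Win2019Datacenter \
  --vnet-name vnet-bigdata \
  --subnet Subnet-A \
  --public-ip-sku Standard \
  --admin-username azureuser
```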
Once internet access is appropriately configured, we can open the Azure Databricks workspace from the VM as well!
Azure Active Directory — Conditional Access
Now we’ll look into a feature called Conditional Access available under Azure AD.
Create a new Named trusted location
Let’s create a named location and add all of the IP ranges that should be allowed to access Databricks. If we’re behind a NAT, we should add the NAT IP/range.
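Named locations can also be created programmatically through the Microsoft Graph API. A hedged sketch using `az rest` follows; it assumes an account with Conditional Access administration rights, and the IP range shown is a placeholder for your actual NAT/egress IP.

```shell
# Assumptions: signed in with an account that can manage Conditional
# Access; 203.0.113.5/32 is a placeholder for your NAT/egress IP.
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/identity/conditionalAccess/namedLocations" \
  --body '{
    "@odata.type": "#microsoft.graph.ipNamedLocation",
    "displayName": "My Jump Boxes",
    "isTrusted": true,
    "ipRanges": [
      {
        "@odata.type": "#microsoft.graph.iPv4CidrRange",
        "cidrAddress": "203.0.113.5/32"
      }
    ]
  }'
```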
Create a Conditional Access Policy
We need to create a new conditional access policy that applies to all users and groups (unless we want to exclude a specific group).
Next, we need to select AzureDatabricks as the app this policy should apply to.
Under Conditions, select Locations. Only the named location we created should be exempted from this policy.
Once the conditional access policy is applied, let’s try to open the Databricks URL from any outside machine that is not part of the named location IPs (My Jump Boxes): we’ll hit the ‘You cannot access this right now’ page!
Now, log into our VM/jump box (inside Subnet-A) and try again… Azure Databricks opens successfully!
Create a Storage Account with restricted access
In this step we’ll create an Azure Blob Storage account that should be accessible only from Azure Databricks and the jump box/VM, that is, only from the VNet we created earlier.
To achieve this, while creating the storage account, set Allow access from to Selected networks and select the virtual network we have created; in our example, we can select all subnets or, more precisely, Subnet-A & the databricks-public-subnet.
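The same restriction can be applied with the Azure CLI: create the account with a default-deny network rule, enable the Microsoft.Storage service endpoint on the subnet, and then allow that subnet explicitly. A sketch with hypothetical names:

```shell
# Hypothetical names; creates a storage account that denies all
# network traffic by default.
az storage account create \
  --resource-group rg-bigdata-demo \
  --name stbigdatademo \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --default-action Deny

# Enable the Microsoft.Storage service endpoint on the jump box subnet
az network vnet subnet update \
  --resource-group rg-bigdata-demo \
  --vnet-name vnet-bigdata \
  --name Subnet-A \
  --service-endpoints Microsoft.Storage

# Allow traffic from Subnet-A (repeat for databricks-public-subnet)
az storage account network-rule add \
  --resource-group rg-bigdata-demo \
  --account-name stbigdatademo \
  --vnet-name vnet-bigdata \
  --subnet Subnet-A
```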
Once the storage account has been created, we’ll not be able to access it from the Azure portal when launched from outside the selected subnets!
Let’s go to our jump box inside Subnet-A, open the Azure portal and create a couple of containers — input and output.
So, the newly created Storage Account can only be accessed from the VNet and not from outside!
Note that if we want to modify the service endpoints later, we can do so from the storage account’s Firewalls and virtual networks menu item.
For any further details on Azure Storage firewalls and virtual networks, refer:
Create an Azure Data Lake with restricted access
In this step, we’ll create a Data Lake Storage Gen2 (ADLS Gen2) account and will allow access only from the specific VNet, as we did in the previous step.
Similar to the previous example, we can select all subnets or, more precisely, Subnet-A & the databricks-public-subnet.
Once done, we’ll face an Access denied error if we try to access the ADLS Gen2 account from outside the VNet.
If we use Azure Storage Explorer to access the storages, we’ll see the following error.
Databricks code to work with the Blob & ADLS Gen2
Now that we have the infrastructure ready, we can concentrate on the Spark code. Here, we’ll read a CSV file from the Blob input container, calculate and create an output Spark DataFrame, and save it into the Blob output container and the ADLS Gen2 outputCalculation folder.
(The following code snippets are just samples and do not follow best practices.)
Input from Azure Blob Storage
"fs.azure.account.key.<storage-account-name>.blob.core.windows.net", "<storage-account-access-key>")val inputBlobPath: String = s"wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory-name>/<input-file-name>.csv"val dfInputRead =
Perform some calculation
import org.apache.spark.sql.functions.col

val dfOutput = dfInput
  .withColumn("Sum", col("Number1") + col("Number2"))

display(dfOutput)
Output to Azure Blob Storage
val outputBlobPath: String = s"wasbs://<container>@<storage-account-name>.blob.core.windows.net/<output-directory-name>/"

dfOutput.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(outputBlobPath)
Azure Blob Storage - Databricks Documentation
You can read data from public storage accounts without any additional settings. To read data from a private storage…
Output to Azure Data Lake Storage Gen2
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
  "<storage-account-access-key>")

val outputADLSPath: String = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>"

dfOutput.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(outputADLSPath)
So, to conclude:
(i) We have installed Azure Databricks into our own Azure Virtual Network and used Azure Active Directory — Conditional Access to allow access to Databricks from that network only.
(ii) We have used Azure Blob Storage & Azure Data Lake Storage — Gen2 service endpoints to restrict access to the virtual network only.
(iii) Any access to Databricks, Blob or ADLS is done from a virtual machine, i.e. our jump box provisioned inside the network. Access to the virtual machine itself should be limited to trusted on-premises networks only (e.g. our office premises).
Thanks for reading! If you wish, connect with me on LinkedIn.