Restricting Access To Your Big Data System On Azure

A well-protected Castle!

While organizations feel the need to migrate to the cloud, they worry about how to protect their applications and data from unauthorized access (like protecting a castle from enemy attacks!). Because public cloud systems can be reached from anywhere, protecting against data theft has become challenging.

To avoid this, when designing a big data system we may think of creating data stores and Spark clusters inside our own appropriately locked-down virtual networks (the IaaS model), but then we’ll miss out on all of the latest and greatest features available in PaaS!

In this blog, we’ll see how to restrict access to the data and the Spark environment by employing:

(i) Azure Databricks with Azure Active Directory Conditional Access Policies.

(ii) Azure Blob Storage & Azure Data Lake Storage (Gen2) — Service Endpoints.

Architecture we’ll be trying to achieve in this example

Create a Virtual Network & a Subnet

To start with, we’ll create a Virtual Network (VNet) and a public subnet, and then create a virtual machine (VM) in that subnet. We’ll use the VM as a jump box to connect to the Azure Databricks Workspace, Azure Blob Storage, and Azure Data Lake Storage Gen2.

We want to connect to our big data components from the jump box only!

Create a Virtual Network

To keep this example simple, we haven’t configured Azure Firewall; it can be enabled for further protection.

The first subnet is created

Find below the CIDRs we have considered for our example.

  • VNet — CIDR Range = 10.0.0.0/20
  • Subnet-A — CIDR Range = 10.0.1.0/24 (IP Range = 10.0.1.0–10.0.1.255)

Create an Azure Databricks Workspace

Once our virtual network is ready, we’ll deploy Azure Databricks into this VNet. This feature (deploying into our own VNet) is in preview now, but it works fine for this example.

Azure Databricks Service creation — deploy in our own VNet

Find below the CIDRs we have used:

  • databricks-public-subnet — CIDR Range = 10.0.2.0/24 (IP Range = 10.0.2.0–10.0.2.255)
  • databricks-private-subnet — CIDR Range = 10.0.3.0/24 (IP Range = 10.0.3.0–10.0.3.255)

Our VNet: the three subnets created so far
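
As a quick sanity check on the address plan, here is a minimal Scala sketch (not part of the original setup) that confirms each subnet CIDR falls inside the VNet range and that the subnets do not overlap. The Cidr object and its method names are hypothetical helpers written for this illustration; only the CIDR values come from the example above.

```scala
// Hypothetical helper for checking an IPv4 address plan; names are illustrative only.
object Cidr {
  // Convert a dotted-quad IPv4 address to an unsigned 32-bit value held in a Long.
  def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Convert a 32-bit value back to dotted-quad notation.
  def longToIp(value: Long): String =
    Seq(24, 16, 8, 0).map(shift => (value >> shift) & 0xFF).mkString(".")

  // First and last address covered by a CIDR block such as "10.0.1.0/24".
  def range(cidr: String): (Long, Long) = {
    val parts = cidr.split('/')
    val size = 1L << (32 - parts(1).toInt)
    val first = ipToLong(parts(0)) & ~(size - 1)
    (first, first + size - 1)
  }

  // True when every address of `inner` lies inside `outer`.
  def contains(outer: String, inner: String): Boolean = {
    val (outerFirst, outerLast) = range(outer)
    val (innerFirst, innerLast) = range(inner)
    outerFirst <= innerFirst && innerLast <= outerLast
  }

  // True when the two blocks share at least one address.
  def overlaps(a: String, b: String): Boolean = {
    val (aFirst, aLast) = range(a)
    val (bFirst, bLast) = range(b)
    aFirst <= bLast && bFirst <= aLast
  }
}

// CIDR values taken from this example.
val vnet = "10.0.0.0/20"
val subnets = Seq("10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24")
val allInsideVnet = subnets.forall(Cidr.contains(vnet, _))
val anyOverlap = subnets.combinations(2).exists(pair => Cidr.overlaps(pair(0), pair(1)))
```

With these values, allInsideVnet should be true and anyOverlap false, because 10.0.0.0/20 spans 10.0.0.0 to 10.0.15.255.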

Once we have created the Databricks service, we can open the Workspace URL from any machine!


Create a VM & associate a Public IP

While creating the VNet we already created a subnet, Subnet-A. Now we’ll provision a Windows VM there and associate a public IP with it, so that we can connect to the VM from outside the virtual network.

We have selected Windows VM for this demo
We’re keeping the RDP port open so that we can connect from our machines

For this demo, we’re not creating any special Network Security Group for the NIC.

Create the VM inside Subnet-A

Once the VM is created, download the RDP file and connect to the VM (we could use Linux / any other OS version as well).

Note down the Public IP

Once internet access is appropriately configured on the VM, we can open the Azure Databricks Workspace from the VM as well!

Azure Active Directory — Conditional Access

Now we’ll look into a feature called Conditional Access available under Azure AD.

Create a new Named trusted location

Let’s create a Named location and add all of the IP ranges that should be allowed to access Databricks. If we’re behind a NAT, we should give the NAT IP/range.

Create a new Named location
Mark the new location as trusted location
Once the new location has been trusted

Create a Conditional Access Policy

We need to create a new conditional access policy which should be applicable for all users and groups (unless we want to exclude a specific group).

Selected ‘All users’ for our example

Next, we need to select AzureDatabricks as the app this policy should apply to.

Search for AzureDatabricks and select

Under Conditions, select Locations. Only the Named location we created should be excluded from this policy.

Exclude the Named location from applying the policy
Select the Access controls & ‘Block access’
Once all settings are done, enable the policy and create

Once the conditional access policy is applied, let’s try to open the Databricks URL from an outside machine whose IP is not part of the named location (My Jump Boxes). We’ll hit the ‘You cannot access this right now’ page!

Now log in to our VM/jump box (inside Subnet-A) and try again: Azure Databricks opens successfully!
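
To make the effect of this policy concrete, here is a small Scala sketch that models the decision we just configured: block AzureDatabricks for every source IP unless it belongs to the excluded Named location. This is only a simplified mental model, not how Azure AD evaluates policies internally. The BlockPolicy class and the IP addresses (documentation-only ranges) are assumptions made for illustration; in practice the excluded range would be the public IP noted for the jump box.

```scala
// Simplified, hypothetical model of a "block unless in a trusted location" policy.
object ConditionalAccessSketch {
  private def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // True when `ip` falls inside the CIDR block `cidr`.
  def inRange(ip: String, cidr: String): Boolean = {
    val parts = cidr.split('/')
    val size = 1L << (32 - parts(1).toInt)
    val first = ipToLong(parts(0)) & ~(size - 1)
    val addr = ipToLong(ip)
    first <= addr && addr <= first + size - 1
  }

  // One target app, one list of excluded (trusted) ranges, access control = Block.
  final case class BlockPolicy(targetApp: String, excludedRanges: Seq[String]) {
    def isBlocked(app: String, sourceIp: String): Boolean =
      app == targetApp && !excludedRanges.exists(inRange(sourceIp, _))
  }
}

// 203.0.113.10 stands in for the jump box public IP (assumed value).
val policy = ConditionalAccessSketch.BlockPolicy("AzureDatabricks", Seq("203.0.113.10/32"))
val fromJumpBox = policy.isBlocked("AzureDatabricks", "203.0.113.10") // not blocked
val fromOutside = policy.isBlocked("AzureDatabricks", "198.51.100.7") // blocked
```

The sketch also shows why the policy leaves other applications untouched: a request for any app other than the target is never blocked by it.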

Create a Storage Account with restricted access

In this step we’ll create an Azure Storage Account (Blob) that can be accessed only from Azure Databricks and the jump box/VM, that is, only from the VNet we created earlier.

To achieve this, while creating the storage account, set Allow access from to Selected network and select the virtual network we created. In our example, we can select all subnets or, more precisely, Subnet-A & the databricks-public-subnet.

Allow the Databricks subnets (Private & Public) and the Subnet-A to access the storage account

Once the storage account has been created, we’ll not be able to access it from the Azure portal launched from outside the selected subnets!

Let’s go to our jump box inside Subnet-A, open the Azure portal and create a couple of containers: input and output.

Newly created Blob containers

So, the newly created Storage Account can only be accessed from the VNet and not from outside!

Note that if we want to modify the service endpoints later, we can do so from the Firewalls and virtual networks menu item of the storage account.

Amend the settings
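
The behaviour of the storage firewall can be pictured with a tiny Scala model: with access limited to selected networks, a request is allowed only when it arrives from one of the listed subnets. This is a conceptual sketch, not an Azure API. NetworkRules and isAllowed are hypothetical names, and the subnet list follows the three subnets shown in the screenshot caption above.

```scala
// Conceptual, hypothetical model of "allow access from selected networks".
object StorageFirewallSketch {
  final case class NetworkRules(defaultAllow: Boolean, allowedSubnets: Set[String]) {
    // `sourceSubnet` is None for callers outside any VNet, e.g. a laptop on the internet.
    def isAllowed(sourceSubnet: Option[String]): Boolean =
      defaultAllow || sourceSubnet.exists(allowedSubnets.contains)
  }
}

val storageRules = StorageFirewallSketch.NetworkRules(
  defaultAllow = false,
  allowedSubnets = Set("Subnet-A", "databricks-public-subnet", "databricks-private-subnet"))

val jumpBoxAllowed = storageRules.isAllowed(Some("Subnet-A")) // request from the jump box
val internetAllowed = storageRules.isAllowed(None)            // request from outside the VNet
```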

Create an Azure Data Lake with restricted access

In this step, we’ll create a Data Lake Storage Gen 2 (ADLS Gen 2) account and allow access from the specific VNet only, as we did in the previous step.

While creating a Storage Account, select our VNet only

As in the previous example, we can select all subnets or, more precisely, Subnet-A & the databricks-public-subnet.

Once done, we’ll get an Access denied error if we try to access the ADLS Gen 2 account from outside the VM.

If we use Azure Storage Explorer to access the storage accounts, we’ll see the following error.

Databricks code to work with the Blob & ADLS Gen2

Now that the infrastructure is ready, we can concentrate on the Spark code. Here, we’ll read a CSV file from the Blob input folder, compute an output Spark DataFrame, and save it to the Blob output folder and to the ADLS Gen2 outputCalculation folder.

(The following code snippets are just samples and do not follow best practices.)

Input from Azure Blob Storage

// Storage account access key (sample only; avoid hard-coding keys in real notebooks)
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>")

// Path of the input CSV file in the Blob container
val inputBlobPath: String = "wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory-name>/<input-file-name>.csv"

// Read the CSV with a header row and let Spark infer the column types
val dfInput = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load(inputBlobPath)

Perform some calculation

import org.apache.spark.sql.functions.col

// Add a "Sum" column computed from the two numeric input columns
val dfOutput = dfInput
  .withColumn("Sum", col("Number1") + col("Number2"))
display(dfOutput)

Output to Azure Blob Storage

// Output directory in the Blob container
val outputBlobPath: String = "wasbs://<container>@<storage-account-name>.blob.core.windows.net/<output-directory-name>/"

// coalesce(1) writes a single part file
dfOutput.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(outputBlobPath)


Output to Azure Data Lake Storage Gen2

// Access key for the ADLS Gen2 account (note the dfs endpoint)
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
  "<storage-account-access-key>")

val outputADLSPath: String = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>"

dfOutput.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(outputADLSPath)

Output from Azure Storage Explorer installed in my VM (inside Subnet-A)
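
Since the snippets above repeat the same account name in several URIs and configuration keys, one possible tidy-up is to build those strings in one place. The sketch below is an assumption-laden illustration rather than the approach used in this blog: StoragePaths, the account name mystorageacct, and the secret scope and key names in the comment are all hypothetical. The commented line hints at reading the access key with dbutils.secrets.get in a Databricks notebook instead of pasting it in plain text.

```scala
// Hypothetical helpers that build the URIs and config keys used in the snippets above.
object StoragePaths {
  // Spark config key holding the access key of a Blob storage account.
  def blobKeyConf(account: String): String =
    s"fs.azure.account.key.$account.blob.core.windows.net"

  // Spark config key holding the access key of an ADLS Gen2 account.
  def adlsKeyConf(account: String): String =
    s"fs.azure.account.key.$account.dfs.core.windows.net"

  // wasbs:// URI for a path inside a Blob container.
  def wasbs(container: String, account: String, path: String): String = {
    val clean = path.stripPrefix("/")
    s"wasbs://$container@$account.blob.core.windows.net/$clean"
  }

  // abfss:// URI for a path inside an ADLS Gen2 file system.
  def abfss(fileSystem: String, account: String, path: String): String = {
    val clean = path.stripPrefix("/")
    s"abfss://$fileSystem@$account.dfs.core.windows.net/$clean"
  }
}

val inputPath = StoragePaths.wasbs("input", "mystorageacct", "/data/numbers.csv")
val adlsOutputPath = StoragePaths.abfss("output", "mystorageacct", "outputCalculation")

// In a Databricks notebook the key could come from a secret scope (names assumed):
// spark.conf.set(StoragePaths.blobKeyConf("mystorageacct"),
//   dbutils.secrets.get(scope = "demo-scope", key = "storage-access-key"))
```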

Conclusion

So, to conclude:

(i) We have installed Azure Databricks into our own Azure Virtual Network and used Azure Active Directory Conditional Access to allow access to Databricks from that network only.

(ii) We have used Azure Blob Storage & Azure Data Lake Storage Gen2 service endpoints to restrict access to the virtual network only.

(iii) Any access to Databricks, Blob or ADLS goes through a virtual machine, i.e. our jump box provisioned inside the network. Access to the virtual machine should itself be limited to a trusted on-premises network only (e.g. our office premises).

Thanks for reading! If you wish, connect with me on LinkedIn.

Tech enthusiast, Azure Big Data Architect.
