Masking Sensitive Data in Azure Data Lake

Prosenjit Chakraborty
Nov 2, 2020

“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com

Data masking is an important feature for any type of data store, and the reasons are rightly summarized in the extract above. Within the Azure data store tech stack, masking can be achieved easily with Azure SQL Database and Azure Synapse Analytics. However, if we keep sensitive information in Azure Data Lake, there is no built-in feature to obfuscate selected data attributes.

In this blog, we’ll discuss a couple of patterns based on where the masking takes place. We’ll also go through a couple of ways to mask Azure Data Lake data using Azure Data Factory and Apache Spark/Azure Databricks.

Pattern 1 — Mask at the source of data

In this pattern, the data is masked inside the source storage system. Only administrators can see the plain data, whereas unprivileged users get masked records (this is configurable). Microsoft SQL Server, Azure SQL Database, and Azure Synapse Analytics support the Dynamic Data Masking feature; check my previous blog to learn more, and see the sketch after the notes below.

Data masking is done at the source storage system

Advantage: Plain-text data never leaves the source storage system.

Limitation: Not all storage systems support data masking, e.g., Azure Data Lake.
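To make Pattern 1 concrete, here is a minimal sketch of enabling Dynamic Data Masking on an Azure SQL Database table from Python via pyodbc. The server, database, credentials, table, and column names below are placeholders, not something from this post; the ALTER COLUMN … ADD MASKED syntax is the standard T-SQL for this feature.

```python
# A minimal sketch, assuming a hypothetical table dbo.Customers with
# Email and Phone columns; connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"  # placeholder server
    "DATABASE=mydb;UID=admin_user;PWD=<password>"
)
cur = conn.cursor()

# Built-in email() mask: unprivileged users see e.g. aXXX@XXXX.com
cur.execute(
    "ALTER TABLE dbo.Customers "
    "ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()')"
)

# partial() mask: expose only the last 4 digits of the phone number
cur.execute(
    "ALTER TABLE dbo.Customers ALTER COLUMN Phone "
    "ADD MASKED WITH (FUNCTION = 'partial(0,\"XXX-XXX-\",4)')"
)
conn.commit()
```

Users explicitly granted the UNMASK permission (or administrators) still see the plain values; everyone else gets the masked form, which is exactly the configurable behavior described above.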

Pattern 2 — Mask while data copy

If the storage system doesn’t support data masking itself, we can apply this pattern: copy the unmasked data out of the source, mask it, and copy the result into the target. The masking logic should run inside the source storage boundary, i.e., either within the same virtual private network…
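As a taste of what the masking step can look like on Apache Spark/Azure Databricks, here is a minimal PySpark sketch. The container paths and the column names (Email, SSN) are assumptions for illustration: it reads the unmasked records from the lake, obfuscates the sensitive columns in flight, and writes only the masked copy to the target.

```python
# A minimal PySpark sketch of Pattern 2: mask while copying.
# Paths and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-while-copy").getOrCreate()

src = "abfss://raw@mydatalake.dfs.core.windows.net/customers/"     # placeholder
dst = "abfss://masked@mydatalake.dfs.core.windows.net/customers/"  # placeholder

df = spark.read.parquet(src)

masked = (
    df
    # Irreversibly hash the SSN: joins on the column still work,
    # but the original value is not recoverable downstream.
    .withColumn("SSN", F.sha2(F.col("SSN").cast("string"), 256))
    # Keep the first character and the domain of the email,
    # mask the rest of the local part.
    .withColumn(
        "Email",
        F.regexp_replace(F.col("Email"), r"(^.).*(@.*$)", r"$1***$2"),
    )
)

# Only the masked copy ever lands in the target location.
masked.write.mode("overwrite").parquet(dst)
```

Running this job inside the source network boundary, as the pattern requires, means the plain-text records never leave that boundary; only the masked output is exposed to the non-production environment.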
