Masking Sensitive Data in Azure Data Lake

“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com
Data masking is an important feature for any type of data storage, for the reasons rightly mentioned in the extract above. Within the Azure data store tech stack, this is easy to achieve with Azure SQL Database and Azure Synapse Analytics. However, if we keep sensitive information in Azure Data Lake, there is no built-in feature to obfuscate selected data attributes.
In this blog, we'll discuss a couple of patterns based on where the masking happens. We'll also walk through a couple of ways to mask Azure Data Lake data using Azure Data Factory and Apache Spark/Azure Databricks.
Pattern 1 — Mask at the source of data
In this pattern, the data is masked inside the source storage system. Only administrators can see the real data, whereas unprivileged users get masked records (configurable). Microsoft SQL Server, Azure SQL Database and Azure Synapse Analytics support the Dynamic Data Masking feature. Check my previous blog for further details.

Advantage: Plain-text data never leaves the source storage system.
Limitation: Not all storage systems support data masking, e.g. Azure Data Lake.
Pattern 2 — Mask during data copy
If the storage system doesn't support data masking itself, we can apply this pattern: copy the unmasked data from the source, mask it, and write it into the target. The masking logic should run inside the source storage boundary, i.e. within the same virtual private network or the same subscription, where unprivileged users have no access. Here, we'll try a couple of options to copy and mask the data.
Option 1 — Using Azure Data Factory Data Flow

Data Factory doesn't yet have an in-built data masking function, so we can use expression functions such as the following:

- Hashing with crc32(256, Salary)
- Hashing with sha2(256, Salary)
- Overriding with a fixed value, e.g. Salary = 100
- toInteger((Salary*millisecond(currentUTC()))/1000)
- mod(Salary, second(currentUTC()))*1000
Advantage: This option is simple and can be used where we don't need advanced masking functions.
Limitation: The available masking options are very limited.
Option 2 — Using Azure Databricks

First, we'll create a configuration table listing the entities and the attributes we want masked, with an appropriate masking rule for each attribute type. A few examples (a sketch of such a configuration follows this list):
- Phone number: Expose the first three digits, e.g. 123 XXX XXX
- Credit card: Expose the last four digits, e.g. XXXXXXXXXXXX1234
- Email id: Expose the first letter and mask the local part and domain, e.g. aXX@XXXX.com
- Any integer, e.g. salary or patient age: Random integer within selected boundaries, and so on…
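As a minimal sketch, such a configuration could be kept in a small table (a Delta table, for instance) or built inline. The entity, attribute and rule names below are hypothetical:

```python
# Hypothetical masking-rule configuration: which attribute of which
# entity gets which rule. In practice this could live in a Delta table
# maintained by administrators.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

masking_config = spark.createDataFrame(
    [
        ("employee", "phone",  "phone_keep_first3"),
        ("employee", "card",   "card_keep_last4"),
        ("employee", "email",  "email_keep_first_letter"),
        ("employee", "salary", "random_int"),
    ],
    ["entity", "attribute", "rule"],
)
```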

We'll try to implement the masking rules using regular expressions; where a rule can't be expressed as a regex, we can fall back to Spark user-defined functions. A sketch of both approaches follows.
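Here is a minimal sketch of how those rules could look in PySpark. The function names, the assumed 9-digit phone format and the salary range are illustrative assumptions only:

```python
import random

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def mask_phone(col):
    # Keep the first three digits of a 9-digit number: 123 XXX XXX
    return F.regexp_replace(col, r"^(\d{3})\s?\d{3}\s?\d{3}$", "$1 XXX XXX")

def mask_card(col):
    # Mask every digit that is followed by at least four more digits,
    # leaving the last four visible: XXXXXXXXXXXX1234
    return F.regexp_replace(col, r"\d(?=\d{4})", "X")

def mask_email(col):
    # Keep the first letter and the top-level domain: aXX@XXXX.com
    return F.regexp_replace(col, r"^(\w)[^@]*@.*(\.[A-Za-z]+)$", "$1XX@XXXX$2")

# A random integer can't be produced by a regex, so we use a UDF;
# the range here is an arbitrary, illustrative choice.
random_int = F.udf(
    lambda: random.randint(10_000, 200_000), IntegerType()
).asNondeterministic()
```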

Once we have the masking rules defined, we'll read the source data, match its attributes against the configuration table and, wherever an attribute matches, apply the corresponding masking rule.
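A sketch of such a utility method, reusing the hypothetical rule functions and configuration table from above:

```python
# Map rule names from the configuration table to the functions above.
rule_functions = {
    "phone_keep_first3": mask_phone,
    "card_keep_last4": mask_card,
    "email_keep_first_letter": mask_email,
    "random_int": lambda col: random_int(),  # ignores the original value
}

def mask_dataframe(df, entity, config_df):
    """Apply the configured masking rule to every matching column;
    columns without a rule pass through unchanged."""
    rules = {
        row["attribute"]: row["rule"]
        for row in config_df.filter(F.col("entity") == entity).collect()
    }
    for column in df.columns:
        if column in rules:
            df = df.withColumn(column, rule_functions[rules[column]](F.col(column)))
    return df
```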

If we have source data containing the following attributes, e.g.

…we'll pass the source plain-text data along with the masking rules into the utility method. Based on the relevant masking rules, the masking utility will produce a DataFrame with only the sensitive attributes masked.
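A hypothetical end-to-end invocation (the container names, storage account and paths are placeholders, not from the original setup):

```python
# Read raw data from the lake, mask it, and write the masked copy out.
source_df = spark.read.parquet(
    "abfss://raw@<storage-account>.dfs.core.windows.net/employee/"
)
masked_df = mask_dataframe(source_df, "employee", masking_config)
masked_df.write.mode("overwrite").parquet(
    "abfss://masked@<storage-account>.dfs.core.windows.net/employee/"
)
```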

The above logic can be enhanced further to support more masking rules.
Advantage: A wide range of custom data masking logic can be applied.
Limitation: Maintenance of the code / custom framework.
Conclusion
Static data masking permanently replaces sensitive data by altering data at rest, so ideally we shouldn't use it in production; it's more appropriate for producing production-grade data for development. Dynamic data masking, on the other hand, is very flexible, as the data at the source is never changed. However, it won't hide sensitive data from system administrators. If that concerns us, we should look at data encryption rather than data masking.