Masking Sensitive Data in Azure Data Lake

“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com

Data masking is an important feature for any type of data store, for the reasons rightly mentioned in the extract above. If we look at the Azure data store tech stack, masking can be achieved easily with Azure SQL Database and Azure Synapse Analytics. However, if we keep sensitive information in Azure Data Lake, there is no built-in feature to obfuscate selected data attributes.

In this blog, we’ll discuss a couple of patterns based on where the masking takes place. We’ll also go through a couple of ways to mask Azure Data Lake data using Azure Data Factory and Apache Spark/Azure Databricks.

Pattern 1 — Mask at the source of data

In this pattern, the data is masked inside the source storage system itself. Only administrators can see the real data, while unprivileged users get masked records (this is configurable). Microsoft SQL Server, Azure SQL Database and Azure Synapse Analytics support the Dynamic Data Masking feature; check my previous blog for more details.

Figure: Data masking is done at the source storage system.

Advantage: Plain-text data never leaves the source storage system.

Limitation: Not all storage systems support data masking, e.g. Azure Data Lake.

Pattern 2 — Mask while copying data

If the storage system doesn’t support data masking itself, we can apply this pattern: copy the unmasked data from the source, mask it and write it to the target. The masking logic should run inside the source storage boundary, i.e. within the same virtual network or the same subscription, where unprivileged users have no access. Here, we’ll try a couple of options to copy and mask the data.

Option 1 — Using Azure Data Factory Data Flow

Figure: Data masking is done using Azure Data Factory Data Flows.

Data Factory doesn’t yet have a built-in data masking function, so we can improvise with expression functions such as the following:

  • Hashing with crc32(256, Salary)
  • Hashing with sha2(256, Salary)
  • Overriding with a fixed value, e.g. Salary = 100
  • Scrambling with toInteger((Salary*millisecond(currentUTC()))/1000)
  • Scrambling with mod(Salary, second(currentUTC()))*1000

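These expressions mirror Spark SQL built-ins, so as a rough illustration of what they compute, here is a comparable sketch in PySpark (an assumption for illustration only; the Salary column is taken from the examples above):

```python
from pyspark.sql import functions as F

# Rough PySpark equivalents of the hashing/override expressions above.
# sha2/crc32 expect string or binary input, hence the cast.
masked = (
    df
    .withColumn("salary_sha2", F.sha2(F.col("Salary").cast("string"), 256))
    .withColumn("salary_crc32", F.crc32(F.col("Salary").cast("string")))
    .withColumn("salary_fixed", F.lit(100))  # override with a fixed value
)
```
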
Advantage: This option is simple and works where advanced masking functions aren’t needed.

Limitation: The options for data masking are very limited.

Option 2 — Using Azure Databricks

Figure: Data masking is done using Azure Databricks.

First, we’ll create a configuration table listing the entities and the attributes we want masked, with an appropriate masking rule for each attribute type. A few examples follow:

  • Phone number: expose the first three digits, e.g. 123 XXX XXX
  • Credit card: expose the last four digits, e.g. XXXXXXXXXXXX1234
  • Email id: expose the first letter and replace the domain, e.g. aXX@XXXX.com
  • Any integer, e.g. salary or patient age: a random integer within the selected boundaries, and so on…

Figure: A sample configuration table mapping the attributes to their associated masking rules.
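
The configuration table itself appears as an image in the original post; a minimal stand-in built as a small DataFrame (the entity, attribute and rule names here are all hypothetical) could look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical masking configuration: which attribute of which entity
# gets which masking rule.
masking_config = spark.createDataFrame(
    [
        ("employee", "phone_number", "MASK_PHONE"),  # expose first three digits
        ("employee", "credit_card",  "MASK_CARD"),   # expose last four digits
        ("employee", "email_id",     "MASK_EMAIL"),  # expose first letter only
        ("employee", "salary",       "RANDOM_INT"),  # random int within bounds
    ],
    ["entity", "attribute", "masking_rule"],
)
```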

We’ll try to implement the masking rules using regular expressions; where that’s not possible, we can fall back to Spark user-defined functions (UDFs).
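
For instance, the phone, credit card and email rules above map naturally to regexp_replace expressions (a sketch; the column names are assumptions):

```python
from pyspark.sql import functions as F

masked = (
    df
    # Phone: keep the first three characters, mask every later digit
    .withColumn("phone_number",
                F.regexp_replace("phone_number", r"(?<=.{3})\d", "X"))
    # Credit card: mask every digit that still has four digits after it
    .withColumn("credit_card",
                F.regexp_replace("credit_card", r"\d(?=\d{4})", "X"))
    # Email: mask the local part except its first letter, then the domain name
    .withColumn("email_id",
                F.regexp_replace(
                    F.regexp_replace("email_id", r"(?<=.)[^@](?=[^@]*@)", "X"),
                    r"(?<=@)[^.]+(?=\.)", "XXXX"))
)
```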

Figure: Custom masking logic that can’t be handled using regex.
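
For example, the random-integer rule for numeric attributes such as salary can’t be written as a regex, so a small Spark UDF can cover it (a sketch; the bounds are hypothetical):

```python
import random

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def mask_random_int(value, lower=50, upper=500):
    # Replace the real value with a random integer inside the configured
    # boundaries; keep nulls as nulls.
    return None if value is None else random.randint(lower, upper)

# Usage on a single column:
# df = df.withColumn("salary", mask_random_int("salary"))
```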

Once the masking rules are defined, we’ll read the source data, match its attributes against the configuration table and, where they match, apply the corresponding masking rule.

Figure: A sample masking utility code.
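
The utility itself is shown as an image in the original post; a minimal reconstruction of the idea (the rule names match the hypothetical configuration above, and mask_random_int is the UDF from the previous sketch) might be:

```python
from pyspark.sql import DataFrame, functions as F

# Map each rule name from the configuration table to a masking expression.
RULES = {
    "MASK_PHONE": lambda c: F.regexp_replace(c, r"(?<=.{3})\d", "X"),
    "MASK_CARD":  lambda c: F.regexp_replace(c, r"\d(?=\d{4})", "X"),
    "MASK_EMAIL": lambda c: F.regexp_replace(
        F.regexp_replace(c, r"(?<=.)[^@](?=[^@]*@)", "X"),
        r"(?<=@)[^.]+(?=\.)", "XXXX"),
    "RANDOM_INT": lambda c: mask_random_int(c),  # UDF from the sketch above
}

def mask_dataframe(df: DataFrame, config: DataFrame, entity: str) -> DataFrame:
    """Apply the configured masking rule to every matching attribute of df."""
    rules = {row["attribute"]: row["masking_rule"]
             for row in config.filter(F.col("entity") == entity).collect()}
    for column in df.columns:
        rule = rules.get(column)
        if rule in RULES:  # only the sensitive attributes are changed
            df = df.withColumn(column, RULES[rule](F.col(column)))
    return df
```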

Suppose we have source data containing the following attributes:

Figure: Plain-text source data.
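
Since the sample data is an image in the original post, here is a hypothetical equivalent, reusing the spark session from earlier:

```python
source_df = spark.createDataFrame(
    [("John Doe", "123 456 789", "1234567812341234", "john.doe@gmail.com", 120)],
    ["name", "phone_number", "credit_card", "email_id", "salary"],
)
```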

…we’ll pass the source plain-text data along with the masking rules into the utility method. Based on the relevant masking rules, the utility will produce a DataFrame in which only the sensitive attributes have been masked.

Figure: The plain-text values have been masked after applying the masking logic.
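
With the hypothetical names above, the end-to-end call would look roughly like this:

```python
masked_df = mask_dataframe(source_df, masking_config, entity="employee")
masked_df.show(truncate=False)
# name stays as-is; the sensitive attributes come back as e.g.
# phone_number: 123 XXX XXX, credit_card: XXXXXXXXXXXX1234,
# email_id: jXXXXXXX@XXXX.com, salary: a random integer within the bounds
```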

The above logic can be enhanced further to support more masking rules.

Advantage: A wide range of custom data masking logic can be applied.

Limitation: The custom code/framework has to be maintained.

Conclusion

Static data masking permanently replaces sensitive data by altering it at rest, so ideally we shouldn’t apply it to production data in place; it is better suited to producing production-grade data for development and testing. Dynamic data masking, on the other hand, is very flexible, as the data at the source is never changed. However, it does not hide sensitive data from system administrators. If that concerns us, we should look at data encryption rather than data masking.

Thanks for reading!! If you have enjoyed it, Clap & Share!! To see similar posts, follow me on Medium & LinkedIn.

