Marking & Masking Sensitive Data with Azure SQL Database

Handling sensitive data is a common enterprise use case; however, keeping it secure in a cloud environment is more challenging than on-premises. If the data is not protected well, enterprises run the risk of breaching sensitive information, causing financial and reputational loss.

We can keep confidential records in a cloud data lake and restrict access using RBAC (role-based access control) and ACLs (access control lists); however, those controls restrict the data asset in its entirety, so we cannot read even the non-sensitive attributes. Along with that, we need a way to mark data elements with appropriate sensitivity labels and track their usage, which the data lake does not support.

In this blog we’ll explore Data Discovery & Classification and Dynamic Data Masking, two important security features available in Azure SQL Database & Azure Synapse Analytics. We’ll also combine the two to implement a robust solution.

Data Discovery & Classification

Using this feature we can mark data by applying appropriate sensitivity-classification labels and get visibility into the classification state from the Azure portal. Refer here for further details.

Azure by default provides the following sensitivity labels:

Default sensitivity labels: Public, General, Confidential, Confidential - GDPR, Highly Confidential, Highly Confidential - GDPR.
We can create custom labels based on our organization's requirements.

By default, the following information types are provided:

Information Types.
We can create custom information types based on our requirements.
Overview of data classification & information type distribution at column level for a database.
Add/update data classification from Azure portal.
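
Classifications can also be added or removed with T-SQL instead of the portal. A minimal sketch, assuming the AdventureWorksLT sample table SalesLT.Customer that we use later in this post:

-- Label a column and tag its information type
ADD SENSITIVITY CLASSIFICATION TO SalesLT.Customer.EmailAddress
WITH (LABEL = 'Highly Confidential', INFORMATION_TYPE = 'Contact Info');

-- Remove the classification if it is no longer needed
DROP SENSITIVITY CLASSIFICATION FROM SalesLT.Customer.EmailAddress;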

Retrieving the sensitivity and information type using T-SQL:

-- Execute in the target database
SELECT s.name AS schema_name, ao.name AS table_name, 
ac.name AS column_name, t.name AS column_type,
sc.label, sc.information_type
FROM sys.sensitivity_classifications sc
INNER JOIN sys.all_objects ao
ON ao.object_id = sc.major_id
INNER JOIN sys.schemas s
ON s.schema_id = ao.schema_id
INNER JOIN sys.all_columns ac
ON sc.major_id = ac.object_id
AND sc.minor_id = ac.column_id
INNER JOIN sys.systypes t
ON t.xtype = ac.system_type_id
AND t.xusertype = ac.user_type_id;

Dynamic Data Masking

This feature masks confidential data from non-privileged users at query time. Different masking rules (functions) are available and can be applied to database table columns. For further details, refer here.
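
Masking rules can be applied with T-SQL as well as from the portal. A small sketch, again assuming the SalesLT.Customer table from the AdventureWorksLT sample:

-- Expose only the first letter of the address (email() masking function)
ALTER TABLE SalesLT.Customer
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');

-- Show the first character of the phone number and mask the rest
ALTER TABLE SalesLT.Customer
ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(1, "XXXXXXXXX", 0)');

-- Remove a mask
ALTER TABLE SalesLT.Customer
ALTER COLUMN Phone DROP MASKED;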

Manually masking columns from the Azure portal.

We can use T-SQL to retrieve masked columns:

SELECT s.name AS schema_name, ao.name AS table_name, 
ac.name AS column_name, mc.is_masked, mc.masking_function
FROM sys.masked_columns mc
INNER JOIN sys.all_objects ao
ON mc.object_id = ao.object_id
INNER JOIN sys.schemas s
ON s.schema_id = ao.schema_id
INNER JOIN sys.all_columns ac
ON ac.object_id = mc.object_id
AND ac.column_id = mc.column_id;

Though the data isn’t masked for the administrator,

Execute as an admin.

non-privileged users see masked records!

Execute as a non-privileged user.

The script used to create the non-privileged user for today's discussion:

-- In master database:
CREATE LOGIN reader WITH PASSWORD = '<your password>';
CREATE USER readerUser FROM LOGIN reader;
-- In target database:
CREATE USER readerUser FROM LOGIN reader;
EXEC sp_addrolemember 'db_datareader', 'readerUser';
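
To verify the masking from a single session, we can impersonate the reader user instead of opening a separate connection. A quick sketch, assuming readerUser from the script above and masked columns on SalesLT.Customer:

EXECUTE AS USER = 'readerUser';
-- Masked columns now return masked values
SELECT TOP 10 FirstName, LastName, EmailAddress, Phone
FROM SalesLT.Customer;
REVERT;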

Unification of Features

When we classify data, one of the important intentions is to restrict access to, and mask, the sensitive data for non-privileged users. Generally, an organization decides the masking policy/functions for each classification level. As a very simple example, columns labeled General or Public can stay unmasked, while columns labeled Confidential or Highly Confidential (including their GDPR variants) get the default() masking function.

Refer here for the full list of data masking functions.

Once we decide the mapping, we can dynamically apply the appropriate masking functions based on the assigned data classification labels and information types.

-- View to show data classification information
CREATE OR ALTER VIEW view_classification
AS
SELECT s.name AS schema_name, ao.name AS table_name,
ac.name AS column_name, t.name AS column_type,
sc.label, sc.information_type
FROM sys.sensitivity_classifications sc
INNER JOIN sys.all_objects ao
ON ao.object_id = sc.major_id
INNER JOIN sys.schemas s
ON s.schema_id = ao.schema_id
INNER JOIN sys.all_columns ac
ON sc.major_id = ac.object_id
AND sc.minor_id = ac.column_id
INNER JOIN sys.systypes t
ON t.xtype = ac.system_type_id
AND t.xusertype = ac.user_type_id;
GO
-- View to show data masking information
CREATE OR ALTER VIEW view_masking
AS
SELECT s.name AS schema_name, ao.name AS table_name,
ac.name AS column_name, mc.is_masked, mc.masking_function
FROM sys.masked_columns mc
INNER JOIN sys.all_objects ao
ON mc.object_id = ao.object_id
INNER JOIN sys.schemas s
ON s.schema_id = ao.schema_id
INNER JOIN sys.all_columns ac
ON ac.object_id = mc.object_id
AND ac.column_id = mc.column_id;
GO
-- ** A simplified logic to dynamically add/drop masking based on set classification **
DECLARE
@_cls_label VARCHAR(MAX),
@_msk_is_masked INT,
@cls_schema_name VARCHAR(MAX),
@cls_table_name VARCHAR(MAX),
@cls_column_name VARCHAR(MAX),
@cls_column_type VARCHAR(MAX),
@cls_label SQL_VARIANT,
@msk_schema_name VARCHAR(MAX),
@msk_table_name VARCHAR(MAX),
@msk_column_name VARCHAR(MAX),
@msk_is_masked BIT,
@msk_masking_function VARCHAR(MAX),
@query VARCHAR(MAX);
DECLARE cursor_columns CURSOR
FOR
SELECT
cls.schema_name AS cls_schema_name,
cls.table_name AS cls_table_name,
cls.column_name AS cls_column_name,
cls.column_type as cls_column_type,
cls.label AS cls_label,
COALESCE(msk.schema_name, 'NA') AS msk_schema_name,
COALESCE(msk.table_name, 'NA') AS msk_table_name,
COALESCE(msk.column_name, 'NA') AS msk_column_name,
COALESCE(msk.is_masked, 0) AS msk_is_masked,
COALESCE(msk.masking_function, 'NA') AS msk_masking_function
FROM view_classification cls
FULL OUTER JOIN view_masking msk
ON msk.schema_name = cls.schema_name
AND msk.table_name = cls.table_name
AND msk.column_name = cls.column_name;
OPEN cursor_columns;
FETCH NEXT FROM cursor_columns INTO
@cls_schema_name,
@cls_table_name,
@cls_column_name,
@cls_column_type,
@cls_label,
@msk_schema_name,
@msk_table_name,
@msk_column_name,
@msk_is_masked,
@msk_masking_function;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @_cls_label = CONVERT(VARCHAR(MAX), @cls_label);
SET @_msk_is_masked = CONVERT(INT, @msk_is_masked);
IF (@_cls_label IN ('General', 'Public')
AND @_msk_is_masked = 1)
BEGIN
SET @query = 'ALTER TABLE ' +
@cls_schema_name + '.' + @cls_table_name
+ ' ALTER COLUMN ' + @cls_column_name
+ ' DROP MASKED';
EXECUTE (@query);
END
ELSE IF (@_cls_label IN
('Confidential', 'Confidential - GDPR')
AND @_msk_is_masked = 0)
BEGIN
SET @query = 'ALTER TABLE ' +
@cls_schema_name + '.' + @cls_table_name
+ ' ALTER COLUMN ' + @cls_column_name
+ ' ADD MASKED WITH (FUNCTION = ''default()'');';
EXECUTE (@query);
END
ELSE IF (@_cls_label IN
('Highly Confidential', 'Highly Confidential - GDPR')
AND @_msk_is_masked = 0)
BEGIN
SET @query = 'ALTER TABLE ' +
@cls_schema_name + '.' + @cls_table_name
+ ' ALTER COLUMN ' + @cls_column_name
+ ' ADD MASKED WITH (FUNCTION = ''default()'');';
EXECUTE (@query);
END
ELSE
PRINT 'No change - Ignore'
FETCH NEXT FROM cursor_columns INTO
@cls_schema_name,
@cls_table_name,
@cls_column_name,
@cls_column_type,
@cls_label,
@msk_schema_name,
@msk_table_name,
@msk_column_name,
@msk_is_masked,
@msk_masking_function;
END
CLOSE cursor_columns;
DEALLOCATE cursor_columns;

The above logic can be enhanced further to include different classification labels, masking functions, and information types. It can be scheduled using Azure Data Factory, or we can try running it as a database trigger.
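
One way to schedule it from Azure Data Factory is to wrap the logic in a stored procedure and call it with a Stored Procedure activity. A minimal sketch, using a hypothetical procedure name:

CREATE OR ALTER PROCEDURE dbo.usp_sync_masking_with_classification
AS
BEGIN
    SET NOCOUNT ON;
    -- Place the cursor-based add/drop masking logic from above here,
    -- so the reconciliation runs as one schedulable unit.
END;
GO
-- What the Data Factory Stored Procedure activity would execute:
EXEC dbo.usp_sync_masking_with_classification;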

Sensitive Data Access

Once we classify our data and automate the masking functions as discussed above, we’re free to use the masked datasets alongside non-masked datasets.

As an example, if we use Apache Hive (here, via Spark SQL), a table created over the masked SQL Database table (SalesLT.Customer), along with other tables based on non-sensitive data in the data lake, forms a transparent layer for developers/users.

%sql
CREATE TABLE SalesLT_Customer
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:sqlserver://<SQL Server Name>:1433;database=<SQL Database Name>;user=<reader user id>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;',
  database '<SQL Database Name>',
  dbtable 'SalesLT.Customer'
);
-- create other Hive external tables with data in datalake...
Developers are free to use the Customer table now!
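
For instance, the masked table can be joined with data-lake tables in the same notebook. A sketch, assuming a hypothetical external table sales_orders built on non-sensitive files in the data lake:

%sql
-- EmailAddress and Phone come back masked because the JDBC connection uses the reader login
SELECT c.CustomerID, c.CompanyName, c.EmailAddress, o.OrderTotal
FROM SalesLT_Customer c
JOIN sales_orders o
  ON o.CustomerID = c.CustomerID;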

[If we want to further restrict the environment we can try this]

Monitoring Access to Classified Data

We can configure a Log Analytics workspace and monitor the users & their IPs accessing the classified data.

Creation of a Log Analytics workspace.
Configure the SQL Database to audit logs into the Log Analytics workspace.
Log Analytics workspace — Overview screen.
To monitor users & their IPs accessing classified data.
To monitor users & their IPs accessing various information types.

High Level Architecture

If we draw an end-to-end architecture it looks as follows. We can keep non-sensitive data in the data lake and sensitive (as well as non-sensitive) data in Azure SQL Database/Synapse Analytics.

However, we need to be mindful not to store sensitive data in a staging location without appropriate protection!

Conclusion

  1. By tying the Data Classification and Dynamic Data Masking features together we achieve the best of both: masking the classified data and tracking its usage.
  2. Azure Data Lake by itself doesn’t provide a masking feature. If we have an Apache Hive-based warehouse, we can try Apache Ranger; even in that case, we have to restrict direct access to the data lake.
  3. The Dynamic Data Masking feature is a must when copying sensitive data from production to non-production environments.
  4. Proper periodic auditing is a must if we’re dealing with confidential records.
  5. Azure Data Catalog Gen 2 (in private preview now) is going to be a great product for classifying data lake assets along with other data stores.

Thanks for reading!! If you enjoyed it, clap & share!! To see similar posts, follow me on Medium & LinkedIn.
