Improving Resiliency with Databricks Delta Lake & Azure

Resiliency is one of the most important aspects we should consider while creating a data lake. Azure Storage provides some great features to improve resiliency. On top of these, Databricks Delta Lake adds a cool feature called time travelling that makes the lake more resilient and easier to recover.

In this blog, we'll discuss a few features that help protect our data from corruption or deletion and make it easy to restore in case of any issues.

Right Access Permission

The first thing to consider is providing the right access. Only the resource administrator should have owner access, developers should have read access, and applications can have contributor access. This way, data can only be deleted by the resource administrator or by a controlled process, e.g. by Databricks or by Azure Data Factory pipelines.

Accidental Delete Protection

To avoid any accidental deletion, we should always add a delete lock on our data lake.

[Image: Adding a ‘Delete’ lock on the Storage Account.]

If someone tries to delete the storage account by mistake, they'll get a prompt to remove the lock first!

[Image: Accidental deletion is prevented.]

Delta Lake Time Travelling

Delta Lake time travelling is a great feature and should be used in case of any data corruption in the Delta Lake (e.g. a wrong data ingestion or a faulty update procedure). Find below a short example:

// adding records for the first time
val studentDF = Seq(
  (1, "Prosenjit"),
  (2, "Abhijit"),
  (3, "Aadrika")
).toDF("id", "name")
studentDF.write.format("delta").mode("overwrite").save("/mnt/mydeltalake/Student")

// updating with a new record
val studentDF2 = Seq(
  (4, "Ananya")
).toDF("id", "name")
studentDF2.write.format("delta").mode("append").save("/mnt/mydeltalake/Student")

// creating an external table of type Delta for easy access
spark.sql("CREATE TABLE Student USING DELTA LOCATION '/mnt/mydeltalake/Student'")
[Image: ‘display’ing the Student Delta table after insertions.]

Now, let's delete a record.

spark.sql ("DELETE FROM Student WHERE id = 1")
val studentDF3 = spark.sql("SELECT * FROM Student")
display (studentDF3)
[Image: ‘display’ing the Student Delta table after the deletion.]

[Image: The Delta table history tracks all of the changes.]
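
For reference, the history shown above can be listed with the DESCRIBE HISTORY command; a minimal sketch against the Student table created earlier:

// listing every commit (writes, deletes, etc.) recorded in the Delta transaction log
display(spark.sql("DESCRIBE HISTORY Student"))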

We can retrieve the deleted records by simply travelling back in time and loading the right snapshot.

// loading the snapshot taken just before the deletion
val historical_studentDF = spark.read.format("delta")
  .option("timestampAsOf", "2020-04-15 18:12:26")
  .load("/mnt/mydeltalake/Student")
display(historical_studentDF)

// re-inserting only the deleted record from that snapshot
spark.sql("INSERT INTO Student SELECT * FROM Student TIMESTAMP AS OF '2020-04-15 18:12:26' WHERE id = 1")
val studentDF4 = spark.sql("SELECT * FROM Student")
display(studentDF4)
[Image: ‘display’ing the table after recovering the deleted record.]

Restoring records by time travelling can help when the data were deleted or updated by a Spark application.
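
If the exact timestamp isn't handy, the same snapshot can also be loaded by its version number; a minimal sketch (the version value below is illustrative and should be read from the table history):

// loading a snapshot by version number instead of timestamp
val versionedStudentDF = spark.read.format("delta")
  .option("versionAsOf", 2) // hypothetical version; check DESCRIBE HISTORY for the right one
  .load("/mnt/mydeltalake/Student")
display(versionedStudentDF)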

But what will happen if someone, or some application, removes the underlying data files by mistake?!

[Image: Someone/some application can accidentally delete any data file!]

Delta Lake will not be able to track such changes, so it will not be able to recover the records! We can run FSCK REPAIR TABLE, but that will only repair the transaction log (removing the entries that point to the missing files), not bring the data back.
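
For reference, a minimal sketch of running that repair from a notebook:

// FSCK removes transaction-log entries that point to files no longer present in storage;
// it does NOT recover the data held in those files
spark.sql("FSCK REPAIR TABLE Student")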

Azure Storage Blob Soft Delete Feature

Azure Storage supports the soft delete feature for blobs. Deleted blobs are retained for a configurable number of days. If our Delta Lake is created on Azure Blob Storage, we can take advantage of this feature.

[Image: Enable the ‘Blob soft delete’ feature & set the retention period.]

Any deleted blob can be undeleted very easily.

[Image: We can undelete as soon as we detect the issue.]

Once restored, we can query the Delta Lake table and it'll return the records without any further repair.
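
A quick sanity check after the undelete; a minimal sketch:

// the table reads normally again once the underlying files are back in place;
// no FSCK or other repair is needed because the transaction log never changed
display(spark.sql("SELECT * FROM Student"))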

Azure Data Factory Periodic Backup

As the soft delete feature is not yet supported for Azure Data Lake Storage Gen2 at the time of this writing (refer here for the list of supported features), we can implement an Azure Data Factory pipeline to copy the Delta Lake directories to another location, either in the same region or in a separate region.

[Image: Periodic Backups & on-demand restore by ADF pipelines.]

Find below a simple ADF Copy activity definition. We should preserve the source hierarchy and the source attributes.

{
    "name": "Delta_Lake_Backup",
    "properties": {
        "activities": [
            {
                "name": "Delta Lake Backup",
                "type": "Copy",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "BinarySource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "recursive": true
                        }
                    },
                    "sink": {
                        "type": "BinarySink",
                        "storeSettings": {
                            "type": "AzureBlobStorageWriteSettings",
                            "copyBehavior": "PreserveHierarchy"
                        }
                    },
                    "enableStaging": false,
                    "preserve": [
                        "Attributes"
                    ]
                },
                "inputs": [
                    {
                        "referenceName": "mydeltalake",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "mydeltalakebackups",
                        "type": "DatasetReference"
                    }
                ]
            }
        ],
        "annotations": []
    }
}
[Image: Once the Delta Lake has been copied.]

We can then connect to the copied snapshots, read the data and, if required, track the changes using the transaction logs.
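
A minimal sketch of reading one of the copied snapshots, assuming the backup container is mounted at /mnt/mydeltalakebackups (the mount point and path here are illustrative):

// reading the backed-up Delta directory directly from the backup location
val backupStudentDF = spark.read.format("delta").load("/mnt/mydeltalakebackups/Student")
display(backupStudentDF)

// the copied _delta_log still carries the full history of changes
display(spark.sql("DESCRIBE HISTORY delta.`/mnt/mydeltalakebackups/Student`"))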

For any on-demand backup, we can try the cloning feature of Azure Storage Explorer.

Points to note:

  • Retaining Delta Lake data by taking periodic snapshots will consume extra space. The amount of storage and its cost will depend on our backup frequency, the size of our Delta Lake and whether we're transferring the data to another region.
  • If we back up our data into Azure Blob Storage, we can use Lifecycle Management to delete the data after the retention period.
  • Lifecycle Management for Azure Data Lake Storage Gen2 is yet to be fully supported. Until then, we can use the ADF Delete activity to clear the old snapshots.

Azure Disaster Recovery Feature

In case of a severe disaster, the whole region containing our Delta Lake may go down. If we set our replication to Geo-redundant storage (GRS) or Read-access geo-redundant storage (RA-GRS) and the primary region suffers an outage, the secondary region will serve as a redundant source of our Delta Lake, possibly with some data loss (refer here for more about Last Sync Time, which helps estimate the amount of data loss).

Unless we use RA-GRS, the Delta Lake in the secondary region will not be accessible until Microsoft declares a disaster and fails over to the secondary. So, we may want to create our own backup solution if the Azure-provided redundancy doesn't suit our purpose.

Points to note:

  • In case we want to implement our own backup/redundancy solution by copying the Delta Lake data into another region, compare the solution cost (e.g. two LRS locations + ADF pipeline run time + approximate data transfer-out cost from the primary region) with the Azure Storage GRS/RA-GRS cost w.r.t. the benefits.
  • In case of an outage, we may need to access the Delta Lake from our secondary region. Azure Databricks needs to be pre-configured there as part of our disaster recovery readiness process; refer here for the steps to follow (a small mounting sketch follows below).
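
To illustrate that readiness step, here is a minimal sketch of mounting the storage from the secondary-region Databricks workspace; the container, storage account, secret scope and key names below are placeholders, not values from this setup:

// mounting the Delta Lake storage in the secondary-region Databricks workspace
// (container, account and secret names are placeholders)
dbutils.fs.mount(
  source = "wasbs://mycontainer@mysecondaryaccount.blob.core.windows.net/",
  mountPoint = "/mnt/mydeltalake",
  extraConfigs = Map(
    "fs.azure.account.key.mysecondaryaccount.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "my-scope", key = "storage-account-key")
  )
)

// once mounted, the Delta tables can be read exactly as before
display(spark.read.format("delta").load("/mnt/mydeltalake/Student"))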

Conclusion

We have seen a few ways to make our data lake more resilient with Databricks Delta Lake and some Azure features. We should select the options based on our application's criticality and budget.

Thanks for reading. To see similar posts, follow me on Medium & LinkedIn. If you have enjoyed this, don’t forget to Clap & Share!!
