Resiliency is one of the most important aspects to consider when building a data lake. Azure Storage provides several features that improve resiliency, and on top of these, Databricks Delta Lake adds a powerful feature called time travel that makes the lake more resilient and easier to recover.
In this blog, we'll discuss a few features that help protect our data from corruption or deletion and make it easy to restore in case of any issues.
Right Access Permission
The first thing to consider is granting the right access. Only the resource administrator should have Owner access, developers should have Read access, and applications can have Contributor access. This way, data can only be deleted by the resource administrator or by a trusted process, e.g. Databricks or an Azure Data Factory pipeline.
Accidental Delete Protection
To avoid accidental deletion, we should always add a delete lock on our data lake.
If someone tries to delete it by mistake, they'll get a prompt to remove the lock first!
Delta Lake Time Travelling
Delta Lake time travel is a great feature and should be used in case of any data corruption in the Delta Lake (e.g. caused by wrong data ingestion or a faulty update procedure). Find below a short example:
import org.apache.spark.sql.SaveMode

// adding records for the first time
// (the original sample rows were lost — these (id, name) pairs are placeholders)
val studentDF = Seq(
  (1, "John"),
  (2, "Jane")
).toDF("id", "name")

studentDF.write.format("delta").mode("overwrite").save("/mnt/mydeltalake/Student")

// updating with a new record (placeholder value)
val studentDF2 = Seq(
  (3, "Sam")
).toDF("id", "name")

studentDF2.write.format("delta").mode("append").save("/mnt/mydeltalake/Student")

// creating an external table of type Delta for easy access
spark.sql("CREATE TABLE Student USING DELTA LOCATION '/mnt/mydeltalake/Student'")
Now, we have deleted a record.
spark.sql("DELETE FROM Student WHERE id = 1")
val studentDF3 = spark.sql("SELECT * FROM Student")
We can retrieve the deleted records simply by travelling back in time and loading the right snapshot.
val historical_studentDF = spark.read.format("delta")
  .option("timestampAsOf", "2020-04-15 18:12:26")
  .load("/mnt/mydeltalake/Student")

display(historical_studentDF)

spark.sql("INSERT INTO Student SELECT * FROM Student TIMESTAMP AS OF \"2020-04-15 18:12:26\"")

val studentDF4 = spark.sql("SELECT * FROM Student")
display(studentDF4)
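To find the right timestamp or version to travel back to, we can inspect the table's history. A minimal sketch, assuming the same Student table as above (this requires a Databricks/Delta environment to run):

```scala
// list the table's versions, timestamps and operations so we can pick a snapshot
val history = spark.sql("DESCRIBE HISTORY Student")
display(history)

// time travel can also target a version number instead of a timestamp
val versionedDF = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/mnt/mydeltalake/Student")
```

Version numbers are handy when we know which operation to undo; timestamps are handy when we know roughly when the corruption happened.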
Restoring records by time travelling helps when the data were deleted or updated by a Spark application.
But what will happen if someone, or some application, removes the underlying data files by mistake?!
Delta Lake will not be able to track the changes, so it will not be able to recover the records! We can run FSCK REPAIR TABLE, but that will only repair the transaction log.
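For completeness, a sketch of that repair step, again assuming the Student table from the earlier example (note it only removes entries for the missing files from the transaction log; it does not bring the data back):

```scala
// remove file entries from the transaction log that can no longer be found on storage
spark.sql("FSCK REPAIR TABLE Student")
```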
Azure Storage Blob Soft Delete Feature
Azure Storage supports the soft delete feature for blobs: deleted blobs are retained for a configurable number of retention days. If our Delta Lake is created on Azure Blob Storage, we can take advantage of this feature.
Any deleted blob can be undeleted very easily.
Once restored, we can query the Delta Lake table and it'll return the records without any further repair.
Azure Data Factory Periodic Backup
As the soft delete feature is not yet supported for Azure Data Lake Storage Gen2 at the time of this writing (refer here for the list of features), we can implement an Azure Data Factory pipeline to copy the Delta Lake directories to another location, either in the same region or in a separate region.
Find below a simple ADF Copy activity definition. We should preserve the source hierarchy and source attributes.
"name": "Delta Lake Backup",
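A fuller sketch of what such a Copy activity could look like — the dataset names and the binary-copy settings below are assumptions for illustration, not the original pipeline:

```json
{
  "name": "Delta Lake Backup",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "BinarySource",
      "storeSettings": { "type": "AzureBlobFSReadSettings", "recursive": true }
    },
    "sink": {
      "type": "BinarySink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "PreserveHierarchy"
      }
    },
    "preserve": [ "Attributes" ]
  },
  "inputs": [ { "referenceName": "DeltaLakeSource", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "DeltaLakeBackup", "type": "DatasetReference" } ]
}
```

`PreserveHierarchy` keeps the directory layout (including the `_delta_log` folder) intact, which is what lets us read the copy as a Delta table later.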
We can then connect to the copied snapshots, read the data and, if required, track the changes using the transaction logs.
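Reading such a copied snapshot is then just a matter of pointing Spark at the backup path. A sketch — the backup mount path below is hypothetical:

```scala
// the backup location is an assumption — use wherever the ADF pipeline copied the directories
val backupDF = spark.read.format("delta")
  .load("/mnt/mydeltalakebackup/Student")

display(backupDF)
```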
For an on-demand backup, we can try the cloning feature of Azure Storage Explorer.
Points to note:
- Retaining Delta Lake data by taking periodic snapshots will consume extra space. The amount of storage, and its cost, will depend on our backup frequency, the size of our Delta Lake and whether we're transferring the data to another region.
- If we back up our data into Azure Blob Storage, we can use Lifecycle Management to delete the data after the retention period.
- Lifecycle Management for Azure Data Lake Storage Gen2 is not yet fully supported. Until then, we can use the ADF Delete activity to clear old snapshots.
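For the Blob Storage case, a minimal lifecycle management rule that expires old backup blobs could look like the sketch below — the container prefix and the 30-day retention value are assumptions:

```json
{
  "rules": [
    {
      "name": "expire-delta-backups",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "deltalakebackup/" ]
        },
        "actions": {
          "baseBlob": {
            "delete": { "daysAfterModificationGreaterThan": 30 }
          }
        }
      }
    }
  ]
}
```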
Azure Disaster Recovery Feature
In case of a severe disaster, the whole region containing our Delta Lake may go down. If we set our replication to geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS) and the primary region suffers an outage, the secondary region can serve as a redundant source of our Delta Lake, with some potential data loss (refer here to learn more about using Last Sync Time to estimate the amount of data loss).
The Delta Lake in the secondary region will not be accessible (read-only with RA-GRS) unless Microsoft declares a disaster and fails over to the secondary. So we may want to create our own backup solution if the Azure-provided redundancy doesn't suit our purpose.
Points to note:
- If we want to implement our own backup/redundancy solution by copying the Delta Lake data into another region, we should compare the solution cost (e.g. two LRS locations + ADF pipeline run time + approximate data transfer-out cost from the primary region) with the Azure Storage GRS/RA-GRS cost, weighing the benefits.
- In case of an outage, we may need to access the Delta Lake in our secondary region. Azure Databricks needs to be pre-configured as part of our disaster recovery readiness process. Refer here for the steps to follow.
We have seen a few steps to make our data lake more resilient with Databricks Delta Lake and some Azure features. We should select the options based on our application's criticality and budget.