Azure Purview: Cataloging Delta Lake Assets Using the Apache Atlas API
Azure Purview, one of the latest tools from Microsoft, helps govern a customer’s data lake and integrates well with various Azure services. Its support for the Apache Atlas API makes it easy to extend the data governance service to non-Azure components as well. In my earlier blog, we saw how to leverage the API to catalog Apache Hive assets and capture their lineage. In this blog, we’ll see how to register Delta Lake assets in Purview.
Scanning Azure Data Lake already identifies the schema of Delta Lake tables. Find a few screenshots below.
This should be fine for most cases. However, there may be specific use cases where we need to take advantage of Delta Lake metadata to catalog Delta assets explicitly and to store their lineage information. To achieve this, we need to create new type definitions that support Delta Lake.
To contain Delta assets, we’ll create three entity types:
- delta_db: to store a Delta Lake database.
- delta_table: to store Delta Lake tables.
- delta_process: to store lineage information, i.e. the relationship between an output Delta table and its one or more input Delta tables, along with further relationship data.
To contain table columns, we’ll use the existing tabular_schema entity type.
Below is the entity relationship model I have considered.
Next, we need to create the Delta Lake types.
Create Type Definitions for Delta
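A minimal sketch of what such a type-definition payload could look like is shown below. It follows the general shape of an Atlas v2 typedefs document, which is POSTed to `/api/atlas/v2/types/typedefs`. The attribute names (`location`, `format`, `operation`), the choice of super types, and the `delta_db_tables` relationship are assumptions made for illustration, not the exact definitions used in this blog.

```python
import json


def attribute(name, type_name="string", optional=True):
    """Build one Atlas attribute definition."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": False,
    }


def entity_def(name, super_type, attributes):
    """Build one Atlas entity type definition."""
    return {
        "category": "ENTITY",
        "name": name,
        "superTypes": [super_type],
        "typeVersion": "1.0",
        "attributeDefs": attributes,
    }


def build_delta_typedefs():
    """Assemble the typedefs document for the three Delta types."""
    return {
        "entityDefs": [
            # Attribute names below are illustrative assumptions.
            entity_def("delta_db", "DataSet", [attribute("location")]),
            entity_def("delta_table", "DataSet",
                       [attribute("location"), attribute("format")]),
            # Deriving from Process gives the type inputs/outputs for lineage.
            entity_def("delta_process", "Process", [attribute("operation")]),
        ],
        "relationshipDefs": [
            {
                # Hypothetical relationship: a database contains its tables.
                "category": "RELATIONSHIP",
                "name": "delta_db_tables",
                "relationshipCategory": "COMPOSITION",
                "endDef1": {"type": "delta_db", "name": "tables",
                            "isContainer": True, "cardinality": "SET"},
                "endDef2": {"type": "delta_table", "name": "db",
                            "isContainer": False, "cardinality": "SINGLE"},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_delta_typedefs(), indent=2))
```

Building the payload as a plain dictionary keeps it easy to inspect or store as JSON before sending it with whichever HTTP client and authentication method your environment uses.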
In case we need to delete the newly created types, we can use the following request structure.
Delete Type Definitions for Delta
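As a rough sketch, the delete request can be prepared as below. It targets the Atlas v2 `DELETE /api/atlas/v2/types/typedefs` endpoint, with a body that names the types to remove. The base URL and token are hypothetical placeholders, and identifying each type by name alone is an assumption; the request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute your own Atlas or Purview endpoint.
ATLAS_BASE = "https://example.invalid/api/atlas/v2"


def build_delete_typedefs_request(type_names, token, base_url=ATLAS_BASE):
    """Prepare (but do not send) a DELETE request for the given entity types."""
    body = {
        "entityDefs": [{"category": "ENTITY", "name": n} for n in type_names]
    }
    return urllib.request.Request(
        url=f"{base_url}/types/typedefs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


req = build_delete_typedefs_request(
    ["delta_process", "delta_table", "delta_db"], token="<access-token>"
)
print(req.get_method(), req.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(req)
```

Types that are still referenced by existing entities or relationship definitions typically cannot be removed, so those references would likely need to be cleaned up first.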
Once the types have been defined in Atlas or Purview, we can catalog Delta Lake databases and tables. A sample asset creation request looks like the one below:
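A minimal sketch of such a request body, shaped for the Atlas v2 `POST /api/atlas/v2/entity/bulk` endpoint, might look like this. The qualified-name scheme (`delta://example/...`) and the `db` relationship end on delta_table are assumptions made for illustration, and table columns are left out for brevity.

```python
import json


def build_delta_assets(db_name, table_name, prefix="delta://example"):
    """Build an entity/bulk payload holding one database and one table."""
    db_qn = f"{prefix}/{db_name}"
    table_qn = f"{db_qn}/{table_name}"
    database = {
        "typeName": "delta_db",
        # Negative GUIDs are placeholders that Atlas resolves on creation.
        "guid": "-1",
        "attributes": {"qualifiedName": db_qn, "name": db_name},
    }
    table = {
        "typeName": "delta_table",
        "guid": "-2",
        "attributes": {"qualifiedName": table_qn, "name": table_name},
        # Assumes a relationship end named "db"; this name is hypothetical.
        "relationshipAttributes": {"db": {"guid": "-1"}},
    }
    return {"entities": [database, table]}


# "sales" is a made-up database name used only for this example.
payload = build_delta_assets("sales", "DAILY_PRODUCTION")
print(json.dumps(payload, indent=2))
```

Referring to the database by its placeholder GUID lets both entities be created in a single call, without looking up the database first.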
Once the above request succeeds, the new type, Delta Lake, will appear under Custom source types.
We’ll use the newly created delta_process entity type to store the lineage information of joining two tables, 2LIS_03_BF and 0MATERIAL_ATTR, to produce the resultant table, DAILY_PRODUCTION.
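The lineage for this join can be sketched as a single delta_process entity whose `inputs` and `outputs` attributes (inherited from the Atlas Process type) point at the tables by qualified name. The process name, the `sales` database, and the qualified-name scheme below are hypothetical; only the three table names come from the example above.

```python
import json


def table_ref(qualified_name):
    """Reference an existing delta_table by its unique qualifiedName."""
    return {
        "typeName": "delta_table",
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_lineage(process_name, inputs, output, prefix="delta://example/sales"):
    """Build an entity/bulk payload with one process linking inputs to an output."""
    process = {
        "typeName": "delta_process",
        "guid": "-1",  # placeholder GUID resolved by Atlas on creation
        "attributes": {
            "qualifiedName": f"{prefix}/process/{process_name}",
            "name": process_name,
            "inputs": [table_ref(f"{prefix}/{t}") for t in inputs],
            "outputs": [table_ref(f"{prefix}/{output}")],
        },
    }
    return {"entities": [process]}


payload = build_lineage(
    "join_daily_production",  # hypothetical process name
    inputs=["2LIS_03_BF", "0MATERIAL_ATTR"],
    output="DAILY_PRODUCTION",
)
print(json.dumps(payload, indent=2))
```

The referenced tables must already exist in the catalog with matching qualified names for the references to resolve.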
There is no official plugin available for Databricks to register Delta Lake assets automatically in Atlas or Purview. However, we can create a custom process to scan Delta Lake databases or files and catalog them in Purview. This also means maintaining the entity life cycle ourselves, i.e. clearing catalog entries when a Delta table is dropped and updating them when a table schema changes. Still, we can automate the scan and log the entities under the appropriate types.
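One way to think about the life-cycle part of such a custom process is as a comparison between what the catalog currently holds and what the scan finds. The sketch below is only an illustration of that idea: the function name, the input shape (table name mapped to a list of column names), and the sample tables and columns are all made up, and fetching the two inputs from Purview and from Delta Lake is left out.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    create: List[str]  # tables found by the scan but missing in the catalog
    update: List[str]  # tables whose column list has changed
    delete: List[str]  # cataloged tables that no longer exist


def plan_sync(cataloged: Dict[str, List[str]],
              current: Dict[str, List[str]]) -> SyncPlan:
    """Compare cataloged tables with scanned tables and decide what to do."""
    create = sorted(t for t in current if t not in cataloged)
    delete = sorted(t for t in cataloged if t not in current)
    update = sorted(
        t for t in current if t in cataloged and current[t] != cataloged[t]
    )
    return SyncPlan(create, update, delete)


# Made-up sample data: table name -> column names.
cataloged = {"DAILY_PRODUCTION": ["plant", "qty"], "OLD_TABLE": ["id"]}
current = {"DAILY_PRODUCTION": ["plant", "qty", "uom"], "0MATERIAL_ATTR": ["matnr"]}

plan = plan_sync(cataloged, current)
print(plan)
```

Each list in the plan would then map to the corresponding catalog call: creating entities, updating them, or deleting them.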