Azure Purview — Cataloging Delta Lake Assets using Apache Atlas API

Azure Purview, one of the latest tools delivered by Microsoft, helps govern a customer’s data lake and integrates well with various Azure services. Its support for the Apache Atlas API makes it easy to extend data governance to non-Azure components as well. In my earlier blog, we saw how to leverage the API to catalog Apache Hive assets and record their lineage. In this blog, we’ll see how to register Delta Lake assets in Purview.

Scanning Azure Data Lake identifies the Delta Lake table schema. A few screenshots are shown below.

A scan has discovered the ADLS folder and files containing records in Delta format.
The scan correctly identifies the Delta table schema.
The other Delta tables/files have also been discovered and placed under the ‘Related’ tab.

Though this should be fine for most cases, there may be specific use cases where we need to take advantage of Delta Lake metadata to catalog Delta assets explicitly and store their lineage information. To achieve this, we need to create new type definitions that support Delta Lake.

Registering non-Azure resources in Purview.

To contain Delta assets, we’ll create three entity types:

  • delta_db: stores a Delta Lake database.
  • delta_table: stores a Delta Lake table.
  • delta_process: stores lineage information, i.e. the relationship between an output Delta table and its one or more input Delta tables, plus further relationship data.

To contain table columns, we’ll use the existing entity type tabular_schema.

The entity relationships I have considered are shown below.

Next, we need to create the Delta Lake types.

Create Type Definitions for Delta

POST https://{{catalog_end_point}}/api/atlas/v2/types/typedefs
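
As a rough illustration only, a request body for the three types might be assembled as in the following Python sketch. The choice of DataSet and Process as super types, the attribute names (location, format), the serviceType value, and the relationship name delta_db_tables with its end names are all assumptions made for this example, not necessarily the definitions used for this post. The link to tabular_schema for columns is left out for brevity.

```python
import json


def string_attr(name):
    """Optional, single-valued string attribute definition."""
    return {
        "name": name,
        "typeName": "string",
        "isOptional": True,
        "cardinality": "SINGLE",
        "valuesMinCount": 0,
        "valuesMaxCount": 1,
        "isUnique": False,
        "isIndexable": False,
    }


def entity_def(name, super_type, attribute_names):
    """Entity type definition derived from an Atlas base type."""
    return {
        "category": "ENTITY",
        "name": name,
        "superTypes": [super_type],
        "serviceType": "Delta Lake",  # assumed grouping label
        "typeVersion": "1.0",
        "attributeDefs": [string_attr(a) for a in attribute_names],
    }


def build_delta_typedefs():
    """Assemble a typedefs payload for delta_db, delta_table and delta_process."""
    return {
        "entityDefs": [
            entity_def("delta_db", "DataSet", ["location"]),
            entity_def("delta_table", "DataSet", ["location", "format"]),
            # Process already carries the inputs/outputs attributes used for lineage.
            entity_def("delta_process", "Process", []),
        ],
        "relationshipDefs": [
            {
                "category": "RELATIONSHIP",
                "name": "delta_db_tables",  # hypothetical relationship name
                "relationshipCategory": "COMPOSITION",
                "endDef1": {"type": "delta_db", "name": "tables",
                            "isContainer": True, "cardinality": "SET"},
                "endDef2": {"type": "delta_table", "name": "db",
                            "isContainer": False, "cardinality": "SINGLE"},
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the POST body.
    print(json.dumps(build_delta_typedefs(), indent=2))
```

The printed JSON would be sent as the body of the POST request, with an Azure AD bearer token in the Authorization header.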

In case we need to delete the newly created types, we can use the following request structure.

Delete Type Definitions for Delta

DELETE https://{{catalog_end_point}}/api/atlas/v2/types/typedefs
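
A minimal sketch of preparing that DELETE call with Python’s standard library is shown here. It assumes the body uses the same typedefs structure as the create request and that authentication is done with an Azure AD bearer token; the function name and the sample values are hypothetical. Atlas generally refuses to delete a type while entities of that type still exist, so the assets would need to be removed first.

```python
import json
import urllib.request


def build_typedef_delete_request(catalog_end_point, typedefs, token):
    """Prepare (but do not send) the bulk typedef DELETE request."""
    url = f"https://{catalog_end_point}/api/atlas/v2/types/typedefs"
    return urllib.request.Request(
        url,
        data=json.dumps(typedefs).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="DELETE",
    )


if __name__ == "__main__":
    # Trimmed body for illustration; in practice reuse the creation payload.
    body = {"entityDefs": [{"name": "delta_process"},
                           {"name": "delta_table"},
                           {"name": "delta_db"}]}
    request = build_typedef_delete_request("example.invalid", body, "dummy-token")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```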

Once the types have been defined in Atlas or Purview, we can catalog Delta Lake databases and tables. A sample asset creation request looks like this:

Create Assets

POST https://{{catalog_end_point}}/api/atlas/v2/entity/
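
For illustration, entity payloads for a database and one of its tables could be built as in the sketch below. The delta:// qualifiedName scheme, the sample database name, and the db relationship attribute on delta_table are assumptions for this example; the relationship attribute only works if a matching relationship end was defined with the types. Columns attached through tabular_schema are omitted.

```python
import json


def build_db_entity(db_name, description=""):
    """Entity payload for a delta_db asset."""
    return {
        "entity": {
            "typeName": "delta_db",
            "attributes": {
                "qualifiedName": f"delta://{db_name}",  # hypothetical naming scheme
                "name": db_name,
                "description": description,
            },
        }
    }


def build_table_entity(db_name, table_name, description=""):
    """Entity payload for a delta_table asset that points at its database."""
    return {
        "entity": {
            "typeName": "delta_table",
            "attributes": {
                "qualifiedName": f"delta://{db_name}/{table_name}",
                "name": table_name,
                "description": description,
            },
            # Assumes a relationship end named "db" exists on delta_table.
            "relationshipAttributes": {
                "db": {
                    "typeName": "delta_db",
                    "uniqueAttributes": {"qualifiedName": f"delta://{db_name}"},
                }
            },
        }
    }


if __name__ == "__main__":
    # "demo_db" is a made-up database name; post the database before the table.
    print(json.dumps(build_db_entity("demo_db"), indent=2))
    print(json.dumps(build_table_entity("demo_db", "DAILY_PRODUCTION"), indent=2))
```

Referencing the database by its unique qualifiedName rather than by GUID keeps the two requests independent, as long as the database entity is created first.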

Once the above request succeeds, the new type, Delta Lake, will appear under Custom source types.

Home > Browse assets
Searching Delta Lake assets shows the newly created delta_db.
One of the Delta tables, with table metadata added as entity attributes.
Table schema information, along with column descriptions retrieved from the Delta table metadata.
The ‘Related’ tab shows all other related entities.

Lineages

We’ll use the newly created delta_process entity type to store the lineage of a join between two tables, 2LIS_03_BF and 0MATERIAL_ATTR, that produces the resultant table, DAILY_PRODUCTION.

POST https://{{catalog_end_point}}/api/atlas/v2/entity/
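
A hedged sketch of such a lineage payload follows. It relies on delta_process inheriting the inputs and outputs attributes from the Atlas Process type, and it refers to the tables by qualifiedName. The database name, the process name, and the delta:// naming scheme are hypothetical; the qualified names must match whatever was used when the tables were cataloged.

```python
import json


def table_ref(db_name, table_name):
    """Reference an existing delta_table entity by its unique qualifiedName."""
    return {
        "typeName": "delta_table",
        "uniqueAttributes": {
            "qualifiedName": f"delta://{db_name}/{table_name}"  # hypothetical scheme
        },
    }


def build_process_entity(db_name, process_name, input_tables, output_table):
    """Entity payload for a delta_process linking input tables to one output table."""
    return {
        "entity": {
            "typeName": "delta_process",
            "attributes": {
                "qualifiedName": f"delta://{db_name}/process/{process_name}",
                "name": process_name,
                "inputs": [table_ref(db_name, t) for t in input_tables],
                "outputs": [table_ref(db_name, output_table)],
            },
        }
    }


if __name__ == "__main__":
    payload = build_process_entity(
        "demo_db",                  # made-up database name
        "join_daily_production",    # made-up process name
        ["2LIS_03_BF", "0MATERIAL_ATTR"],
        "DAILY_PRODUCTION",
    )
    print(json.dumps(payload, indent=2))
```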

Closing Note

There is no official plugin available for Databricks to register Delta Lake assets automatically in Atlas or Purview. However, we can build a custom process that scans Delta Lake databases or files and catalogs them in Purview. This also brings the maintenance of the entity life cycle, i.e. removing catalog entries when a Delta table is dropped and updating them when a table schema changes. Even so, we can automate the scan and log the entities under the appropriate types.
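
The life-cycle part of such a custom process can be reduced to comparing what is already cataloged with what a fresh scan finds. The sketch below shows only that comparison step, under the assumption that both sides are available as a mapping from table name to a list of (column, type) pairs; how those mappings are obtained from Purview and from Delta Lake is not covered, and the function and table names are made up.

```python
def plan_catalog_sync(cataloged, scanned):
    """Decide which tables to create, update or delete in the catalog.

    Both arguments map a table name to a list of (column, type) pairs.
    """
    cataloged_names = set(cataloged)
    scanned_names = set(scanned)
    return {
        # Found by the scan but not yet in the catalog.
        "create": sorted(scanned_names - cataloged_names),
        # Present on both sides with a different schema.
        "update": sorted(
            name for name in scanned_names & cataloged_names
            if scanned[name] != cataloged[name]
        ),
        # In the catalog but no longer found (for example, a dropped table).
        "delete": sorted(cataloged_names - scanned_names),
    }


if __name__ == "__main__":
    cataloged = {
        "orders": [("id", "bigint")],
        "customers": [("id", "bigint")],
        "legacy": [("x", "string")],
    }
    scanned = {
        "orders": [("id", "bigint"), ("amount", "double")],
        "customers": [("id", "bigint")],
        "daily_production": [("day", "date")],
    }
    print(plan_catalog_sync(cataloged, scanned))
    # {'create': ['daily_production'], 'update': ['orders'], 'delete': ['legacy']}
```

Each entry in the resulting plan would then map to an entity create, update, or delete request against the catalog.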

Thanks for reading! If you enjoyed this post, clap and share it! To see similar posts, follow me on Medium and LinkedIn.

Tech enthusiast, Azure Big Data Architect.
