Azure Purview — Cataloging Hive Assets using Apache Atlas API
Azure Purview (currently in preview) is a unified data governance service that supports automated data discovery, lineage identification and data classification across various Azure services, as well as on-premises and multi-cloud systems. For systems that Purview doesn’t support directly, it offers integration through the Apache Atlas REST APIs.
Suppose Apache Hive is our organization’s central data warehousing solution and we create our data assets as external tables, i.e. the data itself is kept in Azure Data Lake. Purview can scan the data files and extract the schema information. However, it will not be able to extract the metadata stored in the Hive metastore database.
In this blog we’ll discuss how we can create, update, search and delete Hive assets from Azure Purview catalog using Apache Atlas APIs.
The endpoints we’re going to use in this blog are described in the REST API documentation.
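As a rough orientation, the four operations mentioned above can be mapped to HTTP methods and paths. The sketch below assumes the Apache Atlas v2 path layout (`/api/atlas/v2/...`); the subset that Purview exposes may differ, so each path should be checked against the REST API documentation. The `build_request` helper and the `https://example.invalid` base URL are hypothetical names used only for illustration.

```python
from urllib.parse import quote

# Assumed Apache Atlas v2 paths. Verify each one against the REST API
# documentation before relying on it; Purview may expose a different subset.
ATLAS_OPERATIONS = {
    "create_or_update": ("POST", "/api/atlas/v2/entity"),
    "get": ("GET", "/api/atlas/v2/entity/guid/{guid}"),
    "search": ("POST", "/api/atlas/v2/search/basic"),
    "delete": ("DELETE", "/api/atlas/v2/entity/guid/{guid}"),
}


def build_request(base_url, operation, **path_params):
    """Return (http_method, full_url) for a named catalog operation.

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    method, template = ATLAS_OPERATIONS[operation]
    # Percent-encode path parameters so they cannot alter the URL structure.
    safe_params = {key: quote(str(value), safe="") for key, value in path_params.items()}
    return method, base_url.rstrip("/") + template.format(**safe_params)


# Placeholder base URL; substitute the catalog endpoint of your own account.
method, url = build_request("https://example.invalid/", "delete", guid="abc-123")
print(method, url)
```

Keeping the method and path in one table makes it easy to correct a single entry if the documentation lists a different path for Purview.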
Hive assets & relationships defined in Atlas
The following Hive types are already defined in Atlas:
- ENUM: hive_principal_type
- STRUCT: hive_order, hive_serde
- ENTITY: hive_column, hive_table, hive_process, hive_column_lineage…
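A list like the one above can be derived from a type-definitions response. The sketch below assumes the response follows the Apache Atlas `AtlasTypesDef` shape, with `enumDefs`, `structDefs` and `entityDefs` arrays whose items carry a `name` field (in Apache Atlas this is what `GET /api/atlas/v2/types/typedefs` returns). The function name and the sample payload are hypothetical and deliberately incomplete.

```python
def hive_types_by_category(typedefs, prefix="hive_"):
    """Group type names starting with `prefix` by Atlas type category.

    `typedefs` is assumed to be a dict in the AtlasTypesDef shape.
    """
    # Category label used in the list above -> key assumed in the response.
    categories = {"ENUM": "enumDefs", "STRUCT": "structDefs", "ENTITY": "entityDefs"}
    grouped = {}
    for label, key in categories.items():
        names = [definition.get("name", "") for definition in typedefs.get(key, [])]
        grouped[label] = sorted(name for name in names if name.startswith(prefix))
    return grouped


# Hand-written sample standing in for a real response (not actual API output).
sample_typedefs = {
    "enumDefs": [{"name": "hive_principal_type"}],
    "structDefs": [{"name": "hive_serde"}, {"name": "hive_order"}],
    "entityDefs": [{"name": "hive_table"}, {"name": "hive_column"}, {"name": "DataSet"}],
}

for category, names in hive_types_by_category(sample_typedefs).items():
    print(f"{category}: {', '.join(names)}")
```

Filtering by the `hive_` prefix is a convenience based on the naming seen in the list above, not a rule enforced by Atlas.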