Azure Purview — Cataloging Hive Assets using Apache Atlas API

zure Purview (currently in preview) is a unified data governance service which supports automated data discovery, lineage identification and data classification across various Azure services, even on-premises and other multi-cloud systems. It supports integration via Apache Atlas Rest APIs for any other systems which Purview doesn’t directly support.

If we have Apache Hive as our organizational central data warehousing solution and we create our data assets as external tables i.e. keeping the data into Azure Data Lake, Purview can scan the data files and can take out the schema information. However, it’ll not be able to extract the metadata information stored in Hive metastore database.

In this blog we’ll discuss how we can create, update, search and delete Hive assets from Azure Purview catalog using Apache Atlas APIs.

We can refer here for the Rest API documentation that we’re going to use in this blog.

Hive assets & relationships defined in Atlas

The following hive types have already been defined in Atlas:

  • ENUM: hive_principal_type
  • STRUCT: hive_order, hive_serde
  • ENTITY: hive_column, hive_table, hive_process, hive_column_lineage, hive_db, hive_storagedesc
  • RELATIONSHIP: hive_table_partitionkeys, hive_process_column_lineage, hive_table_storagedesc, hive_table_columns, hive_table_db

The hive entities follow the below parent-child relationship:

First step — Retrieve an access token

We need an access token to call the catalog APIs. Follow the steps mentioned below to create a service principal and configure Purview to trust the service principal.

Once the above steps are done, we’ll be able to call the Microsoft API to get the access token.

POSThttps://login.microsoftonline.com/<subscription_id>/oauth2/token?api-version=1.0grant_type = client_credentials
resource = https://management.core.windows.net/ or <resource id>
client_id = <service principal client id>
client_secret = <service principal client secret>

The above request should return a token (example response below), we need to access the Atlas APIs.

{
"token_type": "Bearer",
"expires_in": "3599",
"ext_expires_in": "3599",
"expires_on": "1607252128",
"not_before": "1607248228",
"resource": "<resource id>",
"access_token": "<access token>"
}

Creating & searching Hive assets

Create a Catalog Entity

POST
https://{{catalog_end_point}}/api/atlas/v2/entity/

Request structure

{
"entity": {
"status": "ACTIVE",
"createTime": <create time in timestamp>,
"updateTime": <update time in timestamp>,
"createdBy": "<asset creator process or name/id>",
"version": 1,
"attributes": {
"name": "<hive table name>",
"db": {
"guid": "<hive db guid>",
"typeName": "hive_db"
},
"qualifiedName": "<hive table qualified name>",
"createTime": <create time in timestamp>,
"lastAccessTime": <last access time in timestamp>,
"comment": "<hive table comment>",
"sd": {
"guid": "<hive storage guid>",
"typeName": "hive_storagedesc"
},
"tableType": "<hive table type e.g. MANAGED or EXTERNAL>",
"parameters": {
"location": "<Data Lake location>"
},
"location": "<Data Lake location>",
"description": "<hive table description>",
"owner": "<owner name>",
"columns": [
{
"guid": "<hive column1 guid>",
"typeName": "hive_column"
},
{
"guid": "<hive column2 guid>",
"typeName": "hive_column"
}
]
},
"typeName": "hive_table",
"guid": "<hive table guid>"
},
"referredEntities": {
"<hive db guid>": {
"status": "ACTIVE",
"updateTime": <update time in timestamp>,
"updatedBy": "Unknown",
"version": 1,
"attributes": {
"owner": "<owner name>",
"ownerType": "ROLE",
"qualifiedName": "<hive db qualified name>",
"clusterName": "<cluster name>",
"name": "<hive db name>",
"description": "<hive db description>",
"parameters": {<anything in key value format>},
"location": "<hive db location>"
},
"typeName": "hive_db",
"guid": "<hive db guid>"
},
"<hive storage guid>": {
"status": "ACTIVE",
"updateTime": <update time in timestamp>,
"createdBy": "<asset creator process or name/id>",
"version": 1,
"attributes": {
"qualifiedName": "<hive storage qualified name>",
"inputFormat": "<data format>",
"outputFormat": "<data format>",
"compressed": "<data compression technology>",
"serdeInfo": {<anything in key value format>},
"comment": "<data storage comment>"
},
"typeName": "hive_storagedesc",
"guid": "<hive storage guid>"
},
"<hive column1 guid>": {
"typeName": "hive_column",
"attributes": {
"owner": "<owner name>",
"qualifiedName": "<hive column1 qualified name>",
"name": "<column name>",
"type": "<column type>",
"comment": "<column comment>",
"description": "<column description>",
"position": <column position>
},
"status": "ACTIVE",
"updateTime": <update time in timestamp>,
"createdBy": "<asset creator process or name/id>",
"version": 1,
"guid": "<hive column1 guid>"
},
"<hive column2 guid>": {
"typeName": "hive_column",
"attributes": {
"owner": "<owner name>",
"qualifiedName": "<hive column2 qualified name>",
"name": "<column name>",
"type": "<column type>",
"comment": "<column comment>",
"description": "<column description>",
"position": <column position>
},
"status": "ACTIVE",
"updateTime": <update time in timestamp>,
"createdBy": "<asset creator process or name/id>",
"version": 1,
"guid": "<hive column2 guid>"
}
}
}
Once a Hive table has been cataloged into Purview.
We can include classification & glossary terms while creating entities.
‘Related’ tab will depict all the related entities.

While creating entities, we should remember about the following two terms:

  1. GUID: When we send a create entity request we supply a GUID. This can be any string, unique to the request. We generally follow a naming convention like “-<any long string>” e.g. “guid”: “-sales_db.Customer_Orders@productionhive”. In the return response catalog will also send another GUID (like “guid”: “a7b1fa36–3952–497a-847f-a05e25058028”) after successful creation of the entity. GUID once assigned is not changed unless we delete the asset and recreate. We can query the catalog using a GUID or by a qualified name.
  2. Qualified Name: We also supply this when creating an entity. This should be unique across the catalog.

Fetch an Catalog Entity using GUID

GET https://{{catalog_end_point}}/api/atlas/v2/entity/guid/{guid}

Fetch an Catalog Entity using Qualified Name

GET https://{{catalog_end_point}}/api/atlas/v2/entity/uniqueAttribute/type/<entity_type>?attr%3AqualifiedName=<entity_qualified_name>

Advanced Search

POST https://{{catalog_end_point}}/api/atlas/v2/search/advanced

Request structure

{
"keywords": "Customer",
"filter": {
"or": [
{
"typeName": "hive_db",
"includeSubTypes": false
},
{
"typeName": "hive_table",
"includeSubTypes": false
}
]
},
"limit": 10
}
Searching with the entity name.

Delete an Entity

DELETE
https://{{catalog_end_point}}/api/atlas/v2/entity/uniqueAttribute/type/<entity_type>?attr%3AqualifiedName=<entity_qualified_name>

Maintaining Lineages

Purview catalog doesn’t only catalog entities, it can store lineages. Let’s assume, we have two Hive tables — Customer & Order and we then create another composite table called Customer_Orders by joining the first two tables.

For Hive, we have entity type — hive_process to store lineage information. Find below a sample request:

{
"entity": {
"status": "ACTIVE",
"updateTime": <update time in timestamp>,
"createdBy": "<asset creator process or name/id>",
"version": 1,
"attributes": {
"owner": "<owner name>",
"userName": "<user name>",
"qualifiedName": "<hive process qualified name>",
"name": "<a lineage name>",
"operationType": "<type of the operation>",
"startTime": <start time in timestamp or any default>,
"endTime": <end time in timestamp or any default>,
"queryText": "<query text e.g. the join query>",
"queryPlan": "<query plan or, default e.g. UNKNOWN>",
"queryId": "<query plan or, default e.g. UNKNOWN>"
},
"typeName": "hive_process",
"relationshipAttributes": {
"outputs": [
{
"guid": "<previously catalog returned GUID of Customer_Orders hive table>",
"typeName": "hive_table"
}
],
"inputs": [
{
"guid": "<previously catalog returned GUID of Customer hive table>",
"typeName": "hive_table"
},
{
"guid": "<previously catalog returned GUID of Order hive table>",
"typeName": "hive_table"
}
]
}
}
}

Once the entity of hive_process has been created, we can search it in Purview UI.

The Hive Process describing the details.
The ‘Lineage’ tab shows the inputs & output from the process.
We can jump to any input/output assets from this screen.

Conclusion

For any custom integration with data catalog, we have to maintain the entity life-cycle i.e. clearing catalog in case of dropping of a hive table or updating catalog in case of hive table schema changes. However, it’ll be worth to catalog our organizational data and process assets into a centralize catalog like Azure Purview.

Further reading:

Thanks for reading!! If you have enjoyed, Clap & Share it!! To see similar posts, follow me on Medium & LinkedIn.

Tech enthusiast, Azure Big Data Architect.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store