A Databricks workspace contains an internal Hive metastore, accessible by all of its clusters, to persist table metadata. However, Databricks can also connect to an external Hive metastore instead of using its own.

An external Hive metastore can be connected either through the Hive Thrift service or by connecting directly to the underlying metastore database. For the Thrift approach, add the following to the cluster's Spark config (Configuration > Advanced Options > Spark > Spark Config):

spark.hadoop.hive.metastore.uris thrift://<hive-thrift-server-connection-url>:<thrift-server-port>

For a direct connection to the metastore database, add these entries instead, in the same Spark Config location (note the spark.hadoop. prefix, which is required for these keys to propagate to the Hive client):

spark.hadoop.javax.jdo.option.ConnectionURL <hive-metastore-db-jdbc-connection-string>
spark.hadoop.javax.jdo.option.ConnectionDriverName <hive-metastore-db-jdbc-driver-class>
spark.hadoop.javax.jdo.option.ConnectionUserName {{secrets/<my-secret-scope>/<hive-conn-userid-key-name>}}
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/<my-secret-scope>/<hive-conn-pass-key-name>}}

If we want to read data from ADLS Gen2, we can append the Spark config with:

fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.endpoint…
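
Once the cluster comes up with this configuration, a quick sanity check from a notebook confirms the external metastore is visible; the database and table names below are hypothetical placeholders:

# `spark` is the SparkSession predefined in Databricks notebooks.
# If the external metastore is wired up correctly, its databases and
# tables show up here instead of the workspace-internal ones.
spark.sql("SHOW DATABASES").show()

# "sales" and "orders" are hypothetical; substitute objects that exist
# in your external metastore.
spark.sql("SHOW TABLES IN sales").show()
spark.table("sales.orders").printSchema()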



Feature stores have been used for a few years now to manage machine learning data/features. Google’s Feast, an open source feature store, and Uber’s Michelangelo, its very own machine learning platform with a feature data management layer, often inspire other companies to implement or buy a centralized feature storage service, but they often get lost in implementation complexities or budget constraints. Along with that, as enterprises increasingly embrace managed Spark services like Databricks, it often becomes challenging to integrate the managed Spark environment with a feature store implementation.

Recently (at the time of this writing) Databricks…


In our previous blog, we talked about the different MLflow components and concentrated on tracking, managing models, and deploying them into the Model Registry. In this blog, we’ll talk about the Databricks AutoML feature and MLflow Model Serving.

AutoML

Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. …
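
As a rough sketch of what this looks like from the AutoML Python API (the training table and label column are hypothetical placeholders):

from databricks import automl

# Hypothetical training table with a binary "churn" label column;
# `spark` is the session predefined in Databricks notebooks.
train_df = spark.table("ml.churn_training")

# AutoML prepares the data, runs a set of trials and records each one
# to MLflow, returning a summary that points at the generated notebooks.
summary = automl.classify(
    dataset=train_df,
    target_col="churn",
    timeout_minutes=30,
)

print(summary.best_trial.model_path)    # MLflow URI of the best model
print(summary.best_trial.notebook_url)  # generated source for that trial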


Databricks is one of the top choices among data scientists for running their ML code. To help them manage their code and models, MLflow has been integrated with Databricks.

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. Azure Databricks provides a fully managed and hosted version of MLflow, integrated with enterprise security features, high availability, and other Azure Databricks workspace features.

Find below the components of MLflow, along with a few other important related components.
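
A minimal tracking example gives a feel for the core API; on Databricks the run lands in the managed, workspace-integrated tracking server without any extra setup:

import mlflow

# Record parameters, metrics and artifacts against a single run.
with mlflow.start_run(run_name="demo"):
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)

    # Any local file can be attached to the run as an artifact.
    with open("notes.txt", "w") as f:
        f.write("trained with alpha=0.5")
    mlflow.log_artifact("notes.txt")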



Scikit-learn is a very popular, open source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.

In this blog, we’ll use 8 clustering methods (unsupervised machine learning models) on the Iris Plants database (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we have limited our discussion to using the Scikit-learn clustering methods only; a minimal example follows the field list below.

The database contains the following details:

1. Sepal Length in cm
2. Sepal Width in cm
3. Petal Length in cm
4. Petal Width in cm
5. Class: Iris Setosa or…
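
To give a flavor of the pattern, here is a minimal sketch of one of the eight methods (KMeans) on the same data; the remaining methods follow the same fit/predict shape, differing mainly in their hyper-parameters:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the Iris data bundled with scikit-learn (the same measurements
# as the UCI Iris Plants database listed above).
iris = datasets.load_iris()
X = iris.data  # sepal/petal length and width, in cm

# Three clusters to match the three Iris classes.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

# A rough quality measure that needs no ground-truth labels.
print(silhouette_score(X, labels))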


Azure Purview, one of the latest tools delivered by Microsoft, helps to properly govern a customer’s data lake and integrates well with various Azure services. Its support for the Apache Atlas API can easily extend the data governance service to various non-Azure components as well. In my earlier blog, we saw how we can leverage the API to catalog Apache Hive assets and capture their lineage. In this blog, we’ll see how we can register Delta Lake assets into Purview.

Scanning Azure Data Lake identifies the Delta Lake table schema. Find below a few screenshots.
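
For assets that scanning doesn’t pick up automatically, entities can be pushed through Purview’s Atlas-compatible API. Below is a minimal sketch using the open source PyApacheAtlas library; the entity type name, qualified name, and service principal values are all hypothetical placeholders, not the exact steps of this post:

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity

# Hypothetical service principal with access to the Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Register a Delta Lake table as a catalog entity; the type name below
# stands in for whichever typedef your Purview collection actually uses.
entity = AtlasEntity(
    name="sales_orders",
    typeName="azure_datalake_gen2_path",
    qualified_name="https://<storage>.dfs.core.windows.net/lake/sales_orders",
    guid=-1,
)
client.upload_entities(batch=[entity])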


How about a classifier to classify human characters into Good, Bad or Ugly!

Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning libraries and utilities. As it uses Spark Core for parallel computing, it is really useful for applying these algorithms to big data sets.

In this blog, we’ll use 9 well-known classifiers to classify the Banknote dataset (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we have limited our discussion to using the Spark MLlib classification methods only. We haven’t included further model optimization/hyper-parameter tuning, which can be part of further detailed discussions.
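
As a sketch of the per-classifier pattern, here is one of the nine (random forest); the file path and header handling are hypothetical assumptions about the local copy of the data:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Hypothetical load of the Banknote authentication CSV; columns are
# variance, skewness, curtosis, entropy and a 0/1 class label.
df = spark.read.csv("/data/banknote.csv", header=True, inferSchema=True)

# MLlib expects the inputs packed into a single vector column.
assembler = VectorAssembler(
    inputCols=["variance", "skewness", "curtosis", "entropy"],
    outputCol="features",
)
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = RandomForestClassifier(labelCol="class", featuresCol="features").fit(train)
auc = BinaryClassificationEvaluator(labelCol="class").evaluate(model.transform(test))
print(f"AUC: {auc:.3f}")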

The database contains…


Classifier comparison using Scikit Learn

Scikit-learn is a very popular, open source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.

In this blog, we’ll use 10 well-known classifiers to classify the Pima Indians Diabetes dataset (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we have limited our discussion to using the Scikit-learn classification methods only. We haven’t included further model optimization/hyper-parameter tuning, which can be part of further detailed discussions. A minimal example follows the field list below.

The database contains the following details:

1. Number of times pregnant
2. Plasma glucose concentration…
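
A minimal sketch of the per-classifier pattern, shown with logistic regression; the local CSV path is a hypothetical placeholder:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical local copy of the Pima Indians Diabetes CSV; the last
# column is the 0/1 outcome, the rest are the clinical measurements.
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
X, y = df.iloc[:, :-1], df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic regression is one of the ten classifiers compared here.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))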


Azure Purview (currently in preview) is a unified data governance service that supports automated data discovery, lineage identification, and data classification across various Azure services, and even across on-premises and other multi-cloud systems. For any other systems Purview doesn’t directly support, it offers integration via the Apache Atlas REST APIs.

If we have Apache Hive as our organization’s central data warehousing solution and we create our data assets as external tables, i.e. keeping the data in Azure Data Lake, Purview can scan the data files and extract the schema information. …
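
For systems Purview doesn’t scan natively, the Atlas REST surface can be called directly. A minimal sketch, assuming a service principal with access to the Purview account (all identifiers below are placeholders):

import requests
from azure.identity import ClientSecretCredential

# Hypothetical service principal; Purview exposes the Atlas v2 REST
# API under the account's catalog endpoint.
cred = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
token = cred.get_token("https://purview.azure.net/.default").token

# List the type definitions known to the catalog as a connectivity check.
resp = requests.get(
    "https://<purview-account>.purview.azure.com/catalog/api/atlas/v2"
    "/types/typedefs",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.status_code)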


“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com

Data masking is an important feature for any type of data storage, and the reasons are rightly mentioned in the…
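
As a trivial illustration of the idea (not necessarily the approach this post goes on to describe), masking sensitive columns in PySpark might look like:

from pyspark.sql import functions as F

# Hypothetical customer records; mask all but the last four digits of
# the card number and hash the email, so non-production users never
# see the real values.
df = spark.createDataFrame(
    [("4111111111111111", "alice@example.com")],
    ["card_number", "email"],
)

masked = df.select(
    F.concat(F.lit("************"), F.substring("card_number", 13, 4))
        .alias("card_number_masked"),
    F.sha2(F.col("email"), 256).alias("email_hash"),
)
masked.show(truncate=False)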

Prosenjit Chakraborty

Tech enthusiast, Azure Big Data Architect.
