Azure Purview, one of the latest tools delivered by Microsoft, helps govern a customer's Data Lake and integrates well with various Azure services. Its support for the Apache Atlas API makes it easy to extend the data governance service to various non-Azure components as well. In my earlier blog, we saw how to leverage the API to catalog Apache Hive assets and capture their lineage. In this blog, we’ll see how we can register Delta Lake assets in Purview.
Scanning Azure Data Lake identifies the Delta Lake table schema. A few screenshots are shown below.
Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning libraries and utilities. Because it uses Spark Core for parallel computing, it is really useful for applying the algorithms to big data sets.
In this blog, we’ll use 9 well-known classifiers to classify the Banknote dataset (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we have limited our discussion to using the Spark MLlib classification methods only. We haven’t included further model optimization or hyper-parameter tuning, which can be part of a more detailed discussion.
The database contains…
Scikit Learn is a very popular open source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.
In this blog, we’ll use 10 well-known classifiers to classify the Pima Indians Diabetes dataset (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we have limited our discussion to using the Scikit Learn classification methods only. We haven’t included further model optimization or hyper-parameter tuning, which can be part of a more detailed discussion.
The database contains the following details:
1. Number of times pregnant
2. Plasma glucose…
Azure Purview (currently in preview) is a unified data governance service that supports automated data discovery, lineage identification and data classification across various Azure services, as well as on-premises and multi-cloud systems. It supports integration via the Apache Atlas REST APIs for any systems that Purview doesn’t directly support.
If Apache Hive is our organization's central data warehousing solution and we create our data assets as external tables, i.e. keeping the data in Azure Data Lake, Purview can scan the data files and extract the schema information. …
“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com
Data masking is an important feature for any type of data storage, and the reasons are rightly mentioned in the…
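To make the idea of obfuscating real data concrete, here is a minimal sketch in plain Python. It is not the approach of any particular product or of the post excerpted above; the function names (`mask_email`, `mask_card_number`, `pseudonymize`) and the masking rules are hypothetical, chosen only for illustration.

```python
import hashlib


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognisable address: hide everything.
        return "*" * len(email)
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def mask_card_number(number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in number)
    seen = 0
    masked = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > total_digits - visible else "*")
        else:
            masked.append(ch)
    return "".join(masked)


def pseudonymize(value: str, salt: str) -> str:
    """Return a deterministic token so masked rows can still be joined.

    The salt is an assumption of this sketch; it must be kept secret.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


if __name__ == "__main__":
    print(mask_email("jane.doe@example.com"))
    print(mask_card_number("4111-1111-1111-1234"))
    print(pseudonymize("customer-42", salt="demo-salt"))
```

Partial masking keeps the value recognisable for testing (format and length survive), while the salted hash gives a stable substitute where only equality between records matters.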
Azure Synapse has brought together Microsoft’s enterprise data warehousing with Big Data analytics. The Azure Synapse Workspace provides a unique experience by unifying various components into a common, user-friendly interface.
Azure Analysis Services (AAS) is an enterprise-grade semantic analytical data modelling tool for BI & reporting. In this short blog, we’ll document the steps required to configure AAS and create a sample report using Power BI Desktop.
To start with, we’ll provision an Azure SQL Database to host the data. AAS supports various data sources; the full list is here.
We’ll create an Azure SQL Database Server and a database.
Handling sensitive data is a common use case for an enterprise; however, keeping it secure in a cloud environment is more challenging than on-premises. If the data is not protected well, enterprises run the risk of a breach of sensitive information, causing financial and reputational loss.
We can keep confidential records in a cloud data lake and restrict access using RBAC (role-based access control) and ACLs (access control lists); however, those restrict the data asset in its entirety, and we will not be able to read the non-sensitive attributes. …
When we work on highly connected data sets such as social networks, world travel routes, or material traceability for the manufacturing & distribution industry, a robust graph database is a must to store the data. On the other hand, we need a big data processing tool to handle large datasets.
In this blog, we’ll show how to store a highly connected data set in the well-known graph database Neo4j and process the data using Apache Spark.
We’ll use graph data about air-routes from here. air-routes-latest-nodes.csv contains details about airports, countries and continents. air-routes-latest-edges.csv …
Resiliency is one of the most important aspects we should consider while creating a data lake. Azure Storage provides some great features to improve resiliency. On top of these, Databricks Delta Lake adds a cool feature called time travel that makes the lake more resilient and easily recoverable.
In this blog, we’ll discuss a few features that help protect our data from corruption or deletion and help restore it easily in case of any issues.
The first thing we will consider is providing the right access. Only the resource administrator should have owner access; developers should have read access…
Tech enthusiast, Azure Big Data Architect.