
“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com

Data masking is an important feature for any type of data store, and the reasons are rightly mentioned in the above extract. If we look at the Azure data store tech stack, this can be achieved easily using Azure SQL Database and Azure Synapse Analytics. However, if we're keeping any sensitive information in Azure Data Lake, we don't have any inbuilt feature to obfuscate selective data attributes. …
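In the absence of an inbuilt feature, we can obfuscate the sensitive attributes ourselves while processing the files, for example with Spark. Below is a minimal PySpark sketch; the column names and storage paths are hypothetical placeholders, not taken from any real data set.

```python
# Minimal sketch: masking selective attributes before landing data in a
# non-production zone of the lake. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-sensitive-columns").getOrCreate()

df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/customers")

masked = (
    df
    # One-way hash for identifiers that only need to stay joinable
    .withColumn("email", F.sha2(F.col("email"), 256))
    # Partial redaction for display-style attributes (keep last 4 digits)
    .withColumn("phone", F.concat(F.lit("XXX-XXX-"), F.col("phone").substr(-4, 4)))
)

masked.write.mode("overwrite").parquet(
    "abfss://masked@mydatalake.dfs.core.windows.net/customers"
)
```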


Azure Synapse has brought together Microsoft's enterprise data warehousing and Big Data analytics. And the Azure Synapse Workspace provides a unique experience by unifying the various components into a common, user-friendly interface.

The four main components of Azure Synapse.

In this blog, we’ll evaluate the main components to some extent and draw a simplified architecture.

Setting up an Azure Synapse Analytics Workspace

To start with, we’ll first create a workspace using the Azure portal.

Input the Workspace details and primary ADLS Gen 2 account information.


Azure Analysis Services (AAS) is an enterprise-grade semantic data modelling tool for BI & reporting. In this short blog, we’ll document the steps required to configure AAS and create a sample report using Power BI Desktop.

To start with, we’ll first provision an Azure SQL Database to host the data. AAS supports various data sources; the full list is here.

Create a SQL Database

We’ll create an Azure SQL Database Server and a database.


Handling sensitive data is a common use case for an enterprise; however, keeping it secure in a cloud environment is more challenging than on-premises. If not protected well, enterprises run the risk of breaching sensitive information, causing financial and reputational loss.

We can keep confidential records in a cloud data lake and restrict access using RBAC (role-based access control) and ACLs (access control lists); however, those restrict the data asset in its entirety, and we won’t be able to read even the non-sensitive attributes. …


When we work with highly connected data sets such as social networks, world travel routes, or material traceability for the manufacturing & distribution industry, a robust graph database is a must to store the data. On the other hand, we need a big data processing tool to handle large datasets.

In this blog, we’ll show how to store a highly connected data set in the very well-known graph database Neo4j, along with processing the data using Apache Spark.

Download Graph Data

We’ll use graph data about air-routes from here. air-routes-latest-nodes.csv contains details about airports, countries and continents; air-routes-latest-edges.csv describes the routes that connect them. …
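To give a flavour of the Spark side, here is a minimal sketch that reads the node file and pushes airport nodes to Neo4j through the Neo4j Spark Connector (org.neo4j.spark). The connection URL, credentials and the assumption that the file marks row types in a “~label” column are placeholders, and the connector options may vary by version.

```python
# Sketch, assuming the Neo4j Spark Connector is on the classpath;
# endpoint, credentials and option names may differ in your setup.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("air-routes-to-neo4j").getOrCreate()

# Load the downloaded node file (airports, countries, continents)
nodes = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/air-routes-latest-nodes.csv")
)

# Keep only the airport rows (assumes a "~label" column in the file)
airports = nodes.filter(F.col("~label") == "airport")

# Create :Airport nodes in Neo4j
(
    airports.write
    .format("org.neo4j.spark.DataSource")
    .mode("Append")
    .option("url", "bolt://localhost:7687")                # placeholder endpoint
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "<password>")
    .option("labels", ":Airport")
    .save()
)
```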


Resiliency is one of the most important aspects we should consider while creating a data lake. Azure Storage provides some great features to improve resiliency. On top of these, Databricks Delta Lake can add a cool feature called time travel to make the lake more resilient and easily recoverable.

In this blog, we’ll discuss a few features which help protect our data from corruption/deletion and make it easy to restore in case of any issues.
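As a quick illustration of the time travel feature mentioned above, here is a minimal PySpark sketch; the Delta table path is a hypothetical placeholder.

```python
# Minimal Delta Lake time travel sketch; the table path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

path = "abfss://curated@mydatalake.dfs.core.windows.net/sales_delta"

# Read the table as it looked at an earlier version ...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# ... or as of a timestamp
as_of = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-01-01 00:00:00")
    .load(path)
)

# Recover from an accidental bad write by overwriting with the old version
v0.write.format("delta").mode("overwrite").save(path)
```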

Right Access Permission

The first thing we will consider is providing the right access. Only the resource administrator should have owner access, developers should have read access, and applications can have contributor access. This way, data can only be deleted by the resource administrator or by a process e.g. …


In our earlier blog, we discussed how to connect a device to Azure IoT Hub using the Azure IoT SDK. In this blog, we’ll connect a KEPServerEX instance deployed at a factory to Azure IoT Hub.

KEPServerEX is the industry’s leading connectivity platform that provides a single source of industrial automation data to all of your applications. The platform design allows users to connect, manage, monitor, and control diverse automation devices and software applications through one intuitive user interface — source

We’ll also use the ThingWorx Manufacturing App in today’s discussion; it is not required to configure KEPServerEX. …


IoT (Internet of Things) and IIoT (Industrial Internet of Things) are words that readily come to mind when we buy a new smart electronic appliance, drive a new car, or think about a sophisticated manufacturing plant.

To make the Things really intelligent, we not only capture the events they generate; we also analyze the events, predict the future, visualize the details, and act based on the analysis & prediction.


Databricks has now become a default choice of service for big data computation in Azure, on its own merit. As more and more clients embrace it (and Apache Spark) for their versatile use cases, some have started complaining about the hefty Azure bill they’re getting and Azure Databricks’ contribution to it!

Though cloud services have brought infrastructure & service provisioning time down from months to seconds, appropriate governance & controls have become all the more important.

So, instead of blaming the cloud services (here, Databricks), why not learn the cost optimization techniques and spend money based only on our business needs? …
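One of the simplest controls is to make sure clusters scale with the workload and shut themselves down when idle. The sketch below (an illustration, not from the original post) creates such a cluster through the Databricks Clusters REST API; the workspace URL, token and node sizes are placeholders.

```python
# Sketch: create a cost-aware cluster via the Databricks Clusters API 2.0.
# Workspace URL, token, runtime and node type are placeholder values.
import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "cost-aware-etl",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 1, "max_workers": 4},  # scale with the workload
    "autotermination_minutes": 20,                       # shut down idle clusters
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())
```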


Data quality is an important aspect whenever we ingest data. In a big data scenario this becomes very challenging considering the high volume, velocity & variety of data. Incomplete or wrong data can lead to more false predictions by a machine learning algorithm, we may lose opportunities to monetize our data because of data issues, and the business can lose its confidence in the data.

Apache Spark has nowadays become a default technology for big data ingestion & transformation. …
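As a flavour of what such checks can look like in Spark, here is a minimal PySpark sketch; the file paths, column names and validation rules are hypothetical placeholders.

```python
# Sketch: simple data-quality checks during ingestion with PySpark.
# Paths, columns and rules are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

orders = spark.read.option("header", "true").csv("/landing/orders.csv")

# Profile completeness: count nulls per column
null_counts = orders.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in orders.columns]
)
null_counts.show()

# Separate valid and invalid records instead of silently dropping them
is_valid = F.col("order_id").isNotNull() & (F.col("amount").cast("double") > 0)
orders.filter(is_valid).write.mode("overwrite").parquet("/curated/orders")
orders.filter(~is_valid).write.mode("overwrite").parquet("/quarantine/orders")
```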

About

Prosenjit Chakraborty

Tech enthusiast, Azure Big Data Architect.
