Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning algorithms and utilities. Because it uses Spark Core for parallel computing, it is well suited to running these algorithms on big data sets.
In this blog, we’ll use nine well-known classifiers to classify the Banknote dataset (download it from here; for details, refer here). Instead of going deep into the algorithms or mathematical details, we limit our discussion to using the Spark MLlib classification methods. Model optimization and hyper-parameter tuning are not covered here and can be part of a more detailed discussion.
The dataset contains the following…
Scikit-learn is a very popular open-source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.
In this blog, we’ll use ten well-known classifiers to classify the Pima Indians Diabetes dataset (download it from here; for details, refer here). Instead of going deep into the algorithms or mathematical details, we limit our discussion to using the scikit-learn classification methods. Model optimization and hyper-parameter tuning are not covered here and can be part of a more detailed discussion.
The dataset contains the following attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
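The comparison described above can be sketched as follows. This is a minimal illustration, not the blog's full ten-classifier code: it uses only four of the classifiers, and since the Pima CSV isn't bundled here, it substitutes a synthetic 8-feature dataset (one feature per attribute listed above) so the snippet is self-contained.

```python
# A hedged sketch of the classifier comparison; the real post loads the Pima CSV,
# here we substitute a synthetic 8-feature dataset so the snippet runs standalone.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Stand-in for the 8 Pima attributes (pregnancies, glucose, ..., age).
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale features; fit the scaler on the training split only to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

The same fit/predict/score loop extends naturally to the remaining classifiers (SVM, k-NN, gradient boosting, etc.) since scikit-learn estimators share a common interface.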
Azure Purview (currently in preview) is a unified data governance service that supports automated data discovery, lineage identification and data classification across various Azure services, as well as on-premises and multi-cloud systems. For systems Purview doesn’t support directly, it offers integration via the Apache Atlas REST APIs.
If Apache Hive is our organization’s central data warehousing solution and we create our data assets as external tables, i.e. keeping the data in Azure Data Lake, Purview can scan the data files and extract the schema information. …
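To give a feel for the Atlas-style integration mentioned above, here is a hedged sketch of a minimal entity payload for a Hive external table, built with only the standard library. The `qualifiedName`, database/table names, and the endpoint shown in the comment are illustrative assumptions, not values from the post; `hive_table` is a standard Apache Atlas typedef.

```python
import json

# Sketch: a minimal Apache Atlas-style entity payload for a Hive external table.
# The qualifiedName and the endpoint in the comment below are assumptions.
entity = {
    "entity": {
        "typeName": "hive_table",  # standard Atlas typedef for Hive tables
        "attributes": {
            "qualifiedName": "sales_db.customer@hive-cluster",  # hypothetical
            "name": "customer",
            "tableType": "EXTERNAL_TABLE",
            "owner": "data-eng",
        },
    }
}
payload = json.dumps(entity)
print(payload)
# This payload would be POSTed to the catalog's Atlas v2 entity endpoint, e.g.
#   POST https://{account}.catalog.purview.azure.com/api/atlas/v2/entity
# with an Azure AD bearer token; the exact details depend on your Purview setup.
```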
“Sensitive data is a part of every large organization’s normal business practice. Allowing sensitive data from production applications to be copied and used for development and testing environments increases the potential for theft, loss or exposure — thus increasing the organization’s risk. Data masking is emerging as a best practice for obfuscating real data so it can be safely used in non-production environments. This helps organizations meet compliance requirements for PCI, HIPAA, GLBA and other data privacy regulations.” — CIO.com
Data masking is an important feature for any type of data store, for the reasons rightly stated in the extract above. Looking at the Azure data store tech stack, masking is easily achieved with Azure SQL Database and Azure Synapse Analytics. However, if we keep sensitive information in Azure Data Lake, there is no built-in feature to obfuscate selected data attributes. …
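One common way to fill that gap, sketched below under assumptions, is to deterministically hash the sensitive attributes before (or while) landing the files in the lake. The file content and column names here are hypothetical; the real post's approach may differ.

```python
import csv
import hashlib
import io

def mask(value: str, salt: str = "demo-salt") -> str:
    """Deterministically obfuscate a value with salted SHA-256 (12 hex chars)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# Hypothetical customer extract; in practice this would be a file in the lake.
raw = "customer_id,email,balance\n1,alice@example.com,120\n2,bob@example.com,340\n"

masked_rows = []
for row in csv.DictReader(io.StringIO(raw)):
    row["email"] = mask(row["email"])  # obfuscate the sensitive attribute only
    masked_rows.append(row)
print(masked_rows)
```

Because the hash is deterministic, the masked column still supports joins and group-bys in non-production environments, which is usually the point of masking rather than outright deletion.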
Azure Synapse brings together Microsoft’s enterprise data warehousing and big data analytics, and the Azure Synapse Workspace provides a unique experience by unifying the various components into a common, user-friendly interface.
Azure Analysis Services (AAS) is an enterprise-grade semantic analytical data modelling tool for BI and reporting. In this short blog, we’ll document the steps required to configure AAS and create a sample report using Power BI Desktop.
To start with, we’ll first provision an Azure SQL Database to host the data. AAS supports various data sources; the full list is here.
We’ll create an Azure SQL Database Server and a database.
Handling sensitive data is a common enterprise use case; however, keeping it secure in a cloud environment is more challenging than on-premises. If it is not well protected, enterprises run the risk of breaching sensitive information, causing financial and reputational loss.
We can keep confidential records in a cloud data lake and restrict access using RBAC (role-based access control) and ACLs (access control lists); however, those restrict the data asset in its entirety, so we cannot read even the non-sensitive attributes. …
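Since RBAC and ACLs work at file and folder granularity, one workaround is to split sensitive and non-sensitive attributes into separate files landed under folders with different ACLs. The sketch below, with hypothetical column names, keeps the join key in both files so the halves can be recombined by authorized readers.

```python
import csv
import io

SENSITIVE = {"ssn", "salary"}  # hypothetical sensitive attributes

raw = "emp_id,name,ssn,salary\n1,Alice,111-22-3333,90000\n2,Bob,444-55-6666,80000\n"
rows = list(csv.DictReader(io.StringIO(raw)))

public_cols = [c for c in rows[0] if c not in SENSITIVE]
secure_cols = ["emp_id"] + sorted(SENSITIVE)  # keep the join key in both files

def to_csv(cols, rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

public_file = to_csv(public_cols, rows)  # land under a broadly readable folder
secure_file = to_csv(secure_cols, rows)  # land under an ACL-restricted folder
print(public_file)
```

With this split, a reader granted access only to the public folder can still work with the non-sensitive attributes, which plain asset-level RBAC does not allow.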
When we work on highly connected data sets such as social networks, world travel routes, or material traceability for the manufacturing and distribution industry, a robust graph database is a must for storing the data. We also need a big data processing tool to handle such large datasets.
In this blog, we’ll show how to store a highly connected data set in the well-known graph database Neo4j and process the data using Apache Spark.
We’ll use graph data about air-routes from here. air-routes-latest-nodes.csv contains details about airports, countries and continents. air-routes-latest-edges.csv …
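As a rough illustration of the load step, the sketch below turns air-routes-style node rows into idempotent Cypher `MERGE` statements. The column names are a simplified stand-in for the real file (which has more columns), and in the actual post Spark and the Neo4j connector would do this work at scale.

```python
import csv
import io

# Simplified stand-in for air-routes-latest-nodes.csv; the real file has more columns.
nodes_csv = (
    "id,label,code,desc\n"
    "1,airport,ATL,Hartsfield-Jackson Atlanta\n"
    "2,airport,LHR,London Heathrow\n"
)

statements = []
for row in csv.DictReader(io.StringIO(nodes_csv)):
    # MERGE makes the load idempotent: re-running it won't duplicate nodes.
    statements.append(
        f"MERGE (a:Airport {{code: '{row['code']}'}}) "
        f"SET a.desc = '{row['desc']}'"
    )
print(statements)
# Each statement would then be executed through the Neo4j driver, e.g.
#   with driver.session() as s: s.run(stmt)
# (parameterized queries are preferred in real code to avoid injection).
```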
Resiliency is one of the most important aspects to consider while creating a data lake. Azure Storage provides some great features to improve resiliency, and on top of these, Databricks Delta Lake adds a feature called time travel that makes the lake more resilient and easier to recover.
In this blog, we’ll discuss a few features that help protect our data from corruption or deletion and make it easy to restore in case of any issue.
The first thing to consider is granting the right access: only the resource administrator should have owner access, developers should have read access, and applications can have contributor access. This way, data can only be deleted by the resource administrator or by a process, e.g. …
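The time-travel feature mentioned above boils down to a few Spark SQL commands that Delta Lake supports. The snippet below only composes them as strings so it is runnable anywhere; the table name is a hypothetical example, and in Databricks these would be executed via `spark.sql(...)`.

```python
# Sketch: the Delta Lake time-travel commands we'd run in Databricks (Spark SQL).
# The table name "lake.sales" is a hypothetical example.
table = "lake.sales"

history_sql = f"DESCRIBE HISTORY {table}"              # list versions and timestamps
as_of_sql = f"SELECT * FROM {table} VERSION AS OF 5"   # read an older snapshot
restore_sql = f"RESTORE TABLE {table} TO VERSION AS OF 5"  # roll back a bad write

for sql in (history_sql, as_of_sql, restore_sql):
    print(sql)
```

A typical recovery flow is: inspect `DESCRIBE HISTORY` to find the last good version, verify it with a `VERSION AS OF` query, then `RESTORE` the table to that version.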
In our earlier blog, we discussed how to connect a device to Azure IoT Hub using the Azure IoT SDK. In this blog, we’ll connect a KEPServerEX instance deployed at a factory to Azure IoT Hub.
KEPServerEX is the industry’s leading connectivity platform that provides a single source of industrial automation data to all of your applications. The platform design allows users to connect, manage, monitor, and control diverse automation devices and software applications through one intuitive user interface — source
We’ll also use the ThingWorx Manufacturing App in today’s discussion, though it is not required to configure KEPServerEX. …
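To make the data flow concrete, here is a hedged sketch of a device-to-cloud telemetry payload similar in shape to what KEPServerEX's IoT Gateway publishes; the tag names and exact field layout are illustrative assumptions, not the gateway's documented format.

```python
import json
import time

# Sketch of a telemetry message roughly in the shape a KEPServerEX IoT Gateway
# agent might publish; tag names and field layout are illustrative assumptions.
message = {
    "timestamp": int(time.time() * 1000),  # epoch milliseconds
    "values": [
        {"id": "Channel1.Device1.Temperature", "v": 72.4, "q": True},
        {"id": "Channel1.Device1.Pressure", "v": 101.3, "q": True},
    ],
}
payload = json.dumps(message)
print(payload)
# On the cloud side this arrives as an IoT Hub device-to-cloud message; sending
# from custom code would use the Azure IoT device SDK from the earlier blog.
```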