Azure Synapse Analytics Workspace: the tool that unifies your Big Data components

Azure Synapse brings together Microsoft’s enterprise data warehousing and Big Data analytics. The Azure Synapse workspace provides a unified experience by bringing the various components into a common, user-friendly interface.

In this blog, we’ll walk through the main components and sketch a simplified architecture.

Setting up an Azure Synapse Analytics Workspace

To start with, we’ll create a workspace from the Azure portal.

Glimpse of the Workspace

Once the workspace has been created, we’ll use the Workspace web URL to launch it in another browser window.

The Synapse Analytics workspace supports three types of ‘databases’: SQL pool, SQL on-demand and Spark pool. The Data > Workspace tab contains the database details. We’ll explore each type in this blog.

In the Data > Linked tab, we can add Azure Cosmos DB (SQL and MongoDB APIs) and Azure Data Lake Storage Gen2 as linked data sources. The primary storage account is configured automatically.

The Develop tab contains all of our SQL scripts, Spark notebooks, Spark job definitions and data flows.

The Orchestrate tab contains the Data Factory pipelines and Copy Data activities.

The Monitor tab lists all pipeline, Spark and SQL runs. Historical runs can be filtered by time range.

Use the Manage tab to manage security settings (e.g. access control for the workspace), pipeline triggers, integration runtimes, linked services and analytical pools.

Configure the Workspace

Apache Spark Pool

The Synapse workspace supports creating and managing Apache Spark pools and running Spark queries against your big data. The pool configuration supports many advanced features; for details, refer here.

Once we have a pool set up, we can create a notebook to run our Spark code.

For this blog, we’ll simply load records from the data lake into a DataFrame (df), aggregate them by region and write the result to another data lake location.

from pyspark.sql.functions import col

# 'df' is the source DataFrame, assumed to have been loaded from the data lake earlier.
# Keep the known regions and count the number of countries per region.
grpByRegionDf = df \
    .filter(col('region').isin(
        'Europe', 'Asia', 'Africa',
        'Polar', 'Americas', 'Oceania')) \
    .groupBy("region") \
    .agg({"name": "count"}) \
    .withColumnRenamed("count(name)", "total_countries")

# Write the aggregated result to another data lake location as Parquet.
grpByRegionDf.write.parquet(
    'abfss://myfilesystem@myorgdatalake.dfs.core.windows.net/Group_By_Region',
    mode='overwrite')

SQL Pool

The SQL pool is a T-SQL based compute engine with storage capabilities. We can create any number of SQL pools from the Synapse Analytics workspace. For further details, refer here.
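As a flavour of what that T-SQL surface looks like, below is a minimal sketch of a distributed table definition in a SQL pool; the table name is illustrative, and HASH distribution with a clustered columnstore index is just one common choice for large tables.

-- Minimal sketch of a SQL pool table (illustrative name).
-- HASH(region) spreads rows across the pool's distributions;
-- a clustered columnstore index is the typical storage for large tables.
CREATE TABLE [dbo].[region_group_distributed]
(
    region VARCHAR(50) NOT NULL,
    total_countries BIGINT NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(region),
    CLUSTERED COLUMNSTORE INDEX
)
GO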

A SQL pool supports three ways to load data: PolyBase, the COPY command and Bulk Insert. We’ll use the first two options in this blog.

Load data using COPY statement

With the following steps, we’ll copy the Parquet files (with Snappy compression) written by the Spark pool into a SQL pool table.

  • Create an external file format (refer here for more details):
CREATE EXTERNAL FILE FORMAT generic_parquet_format
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
  • Create a table:
CREATE TABLE [dbo].[region_group_parquet]
(
    region VARCHAR(50) NOT NULL,
    total_countries BIGINT NOT NULL
)
GO

  • Load data from the Parquet file(s) using the COPY statement:

COPY INTO [dbo].[region_group_parquet]
FROM 'https://myorgdatalake.dfs.core.windows.net/myfilesystem/Group_By_Region/*.parquet'
WITH (
    FILE_FORMAT = generic_parquet_format,
    CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage account secret key>')
)

  • View the table:

SELECT * FROM [dbo].[region_group_parquet]

SQL on-demand

SQL on-demand is a serverless query service that lets us run SQL queries over files of different formats stored in Azure Storage.

To read files directly from the Azure Storage account, SQL on-demand supports several authorization methods (refer here for more details). In this blog, we’ll connect using a Shared Access Signature (SAS).

We can query the external data using OPENROWSET (a sketch follows the steps below) or create an external table to map the records.

Step 1: Create Master Key

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<input a password>'
GO

Step 2: Create database scoped credential

CREATE DATABASE SCOPED CREDENTIAL [SasToken]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<shared access signature without the leading ? character>';
GO

Step 3: Create external file format

CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] WITH ( FORMAT_TYPE = PARQUET)
GO

Step 4: Create external data source

CREATE EXTERNAL DATA SOURCE myfilesystem
WITH (
    LOCATION = 'https://myorgdatalake.dfs.core.windows.net/myfilesystem',
    CREDENTIAL = SasToken
)
GO

Step 5: Create external table

CREATE EXTERNAL TABLE dbo.Group_By_Region
(
    [region] varchar(50),
    [total_countries] bigint
)
WITH (
    LOCATION = '/Group_By_Region/*.parquet',
    DATA_SOURCE = [myfilesystem],
    FILE_FORMAT = [SynapseParquetFormat]
)
GO

Step 6: View external table

SELECT * FROM dbo.Group_By_Region
GO
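Alternatively, instead of mapping an external table, the same Parquet files can be queried ad hoc with OPENROWSET against the external data source created in Step 4. A minimal sketch (the alias is arbitrary):

-- Ad-hoc query over the same Parquet files, reusing the 'myfilesystem'
-- external data source (and its SAS credential) created in Step 4.
SELECT r.region, r.total_countries
FROM OPENROWSET(
    BULK 'Group_By_Region/*.parquet',
    DATA_SOURCE = 'myfilesystem',
    FORMAT = 'PARQUET'
) AS r
GO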

Refer here for more options and query syntax.

Create a Pipeline

Now that we have seen SQL pool, SQL on-demand and Spark pool, we’ll create an end-to-end data load and transformation pipeline using a Synapse pipeline. We’ll use the Spark pool to transform the data into Parquet and load it into the SQL pool using PolyBase.

Synapse pipelines are essentially Azure Data Factory integrated into the Synapse workspace.

For more on copying data using PolyBase, refer here.
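For intuition, here is a rough, hand-rolled sketch of what a PolyBase-style load into the dedicated SQL pool looks like, roughly what the pipeline’s copy activity handles for us under the hood. The external data source name is an assumption for illustration, the file format reuses the generic_parquet_format created earlier, and depending on how storage access is set up a database scoped credential may also be needed.

-- Hypothetical external data source over the data lake (name assumed);
-- PolyBase in the dedicated SQL pool requires TYPE = HADOOP.
-- A database scoped credential may also be required, depending on the storage auth.
CREATE EXTERNAL DATA SOURCE [sqlpool_datalake]
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://myfilesystem@myorgdatalake.dfs.core.windows.net'
)
GO

-- Map the Parquet folder written by the Spark pool as an external table.
CREATE EXTERNAL TABLE [dbo].[region_group_external]
(
    region VARCHAR(50),
    total_countries BIGINT
)
WITH (
    LOCATION = '/Group_By_Region/',
    DATA_SOURCE = [sqlpool_datalake],
    FILE_FORMAT = [generic_parquet_format]
)
GO

-- CTAS materialises the external data into a distributed columnstore table.
CREATE TABLE [dbo].[region_group_polybase]
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT region, total_countries FROM [dbo].[region_group_external]
GO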

Architecture

[Simplified architecture diagram of the workspace: Azure Data Lake Storage Gen2, Spark pool, SQL pool, SQL on-demand and Synapse pipelines.]

Conclusion

In this blog, we have seen some of the new components Microsoft has added to enrich the data life-cycle around the former Azure SQL DW. Though some features are still being improved, they are worth trying out, especially if we’re already using SQL DW/SQL pool.

Thanks for reading! If you enjoyed this post, clap and share it! To see similar posts, follow me on Medium and LinkedIn.

