Azure Synapse Analytics Workspace: the tool that unifies your Big Data components

Azure Synapse brings together Microsoft's enterprise data warehousing and Big Data analytics, and the Azure Synapse Workspace provides a unique experience by unifying the various components into a common, user-friendly interface.

The four main components of Azure Synapse.

In this blog, we'll walk through the main components and draw a simplified architecture.

Setting up an Azure Synapse Analytics Workspace

To start, we'll create a workspace using the Azure portal.

Input the Workspace details and primary ADLS Gen 2 account information.
Provide credentials for the SQL pools and input other security & networking settings.
Review the summary screen and proceed to create the workspace.
The workspace has been created; however, no pools are provisioned yet.

Glimpse of the Workspace

Once the workspace has been created, we’ll use the Workspace web URL to launch it in another browser window.

A single-view workspace screen.
Data — Workspace tab

Synapse Analytics Workspace supports three types of 'databases': SQL pool, SQL on-demand and Spark pool. The Data > Workspace tab lists the database details. We'll explore each type in this blog.

Data — Linked tab

In the Data > Linked tab, we can add Azure Cosmos DB (SQL and MongoDB APIs) and Azure Data Lake Storage Gen2. The primary storage account is configured automatically.

Develop tab

The Develop tab contains all of our SQL scripts, Spark notebooks, job definitions and data flows.

Orchestrate tab

This tab contains the Data Factory pipelines and Copy Data activities.

Monitor tab

The Monitor tab shows all pipeline runs as well as Spark and SQL runs. Historic runs can be filtered by time range.

Manage tab

Use the Manage tab to manage security settings (e.g. access control for granting access to the workspace), pipeline triggers, integration runtimes, linked services and analytics pools.

Configure the Workspace

Apache Spark Pool

The Synapse workspace supports creating and managing Apache Spark pools and running Spark queries against your big data. The pool configuration supports many advanced features; for details, refer here.

Go to Manage > Apache Spark pools and create a pool based on our needs.
The 'Additional settings' tab. The Spark pool supports Delta Lake and .NET for Apache Spark.
We can create or manage Spark pools from the 'Manage' tab.
A Spark pool can be created from the Azure portal as well.

Once we have a pool set up, we can create a notebook to run our Spark code.

Notebooks can be run interactively or by any pipeline.

For this blog, we'll simply load the records from the data lake (the source path and format in the snippet below are illustrative), aggregate them by region and store the result in another data lake location.

from pyspark.sql.functions import col

# 'spark' is the session that Synapse notebooks provide by default.
# Read the source records from the data lake (path and format shown here are illustrative).
df = spark.read.json('abfss://myfilesystem@myorgdatalake.dfs.core.windows.net/countries/')

# Keep the known regions, count countries per region and rename the aggregate column.
grpByRegionDf = df \
    .filter(col('region').isin([
        'Europe', 'Asia', 'Africa',
        'Polar', 'Americas', 'Oceania'])) \
    .groupBy("region") \
    .agg({"name": "count"}) \
    .withColumnRenamed("count(name)", "total_countries")

# Write the aggregated result back to another data lake location as Parquet.
grpByRegionDf.write.parquet('abfss://myfilesystem@myorgdatalake.dfs.core.windows.net/Group_By_Region', mode='overwrite')

SQL Pool

The SQL pool is a T-SQL based compute engine with storage capabilities. We can create any number of SQL pools from the Synapse Analytics workspace. For further details, refer here.

We can create a SQL pool from the Azure portal.
Input the 'Basic' configurations.
Select 'Additional settings' if required.
The SQL pool can be paused and resumed from the portal.
It's easy to create and maintain new scripts inside the workspace.
The linked service to the SQL pool is already created with a 'DBName' parameter; we only need to input our pool name.

The SQL pool supports three ways to load data: PolyBase, the COPY command and bulk insert. We'll use the first two options in this blog.

Load data using COPY statement

With the following steps, we will copy a Parquet file (with Snappy compression) into a SQL Pool table.

  • Create an external file format (refer here for more details):
CREATE EXTERNAL FILE FORMAT generic_parquet_format  
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
  • Create a table:
CREATE TABLE [dbo].[region_group_parquet]
(
region VARCHAR(50) NOT NULL,
total_countries BIGINT NOT NULL
)
GO

Load data from the Parquet file(s) using the COPY statement:

COPY INTO [dbo].[region_group_parquet]
FROM 'https://myorgdatalake.dfs.core.windows.net/myfilesystem/Group_By_Region/*.parquet'
WITH (
FILE_FORMAT = generic_parquet_format,
CREDENTIAL=(IDENTITY= 'Storage Account Key', SECRET='<storage account secret key>')
)

View the table using:

SELECT * FROM [dbo].[region_group_parquet]

SQL on-demand

This is a serverless query service that enables us to run SQL queries on files of different types placed in Azure Storage.

SQL on-demand appears under the SQL pools section.

To read files directly from the Azure Storage account, SQL on-demand supports different authorization methods (refer here for more details). In this blog, we'll use a Shared Access Signature (SAS) to connect.

We can query the external data using OPENROWSET (a quick sketch follows Step 6 below) or create an external table to map the records.

Step 1: Create Master Key

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<input a password>'
GO

Step 2: Create a database scoped credential

CREATE DATABASE SCOPED CREDENTIAL [SasToken]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<shared access signature without the leading ? character>';
GO

Step 3: Create external file format

CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] WITH ( FORMAT_TYPE = PARQUET)
GO

Step 4: Create external data source

CREATE EXTERNAL DATA SOURCE myfilesystem
WITH (
LOCATION = 'https://myorgdatalake.dfs.core.windows.net/myfilesystem',
CREDENTIAL = SasToken
)
GO

Step 5: Create external table

CREATE EXTERNAL TABLE dbo.Group_By_Region ( [region] varchar(50), [total_countries] bigint ) 
WITH (
LOCATION = '/Group_By_Region/*.parquet',
DATA_SOURCE = [myfilesystem],
FILE_FORMAT = [SynapseParquetFormat]
)
GO

Step 6: View external table

SELECT * FROM dbo.Group_By_Region
GO
Viewing the Parquet data using SQL from the workspace.
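As mentioned before Step 1, the same files can also be queried ad hoc with OPENROWSET instead of an external table. A minimal sketch, reusing the [myfilesystem] data source created in Step 4 (the column schema is inferred from the Parquet files):

-- Ad hoc query over the Parquet output, authorized via the SAS-backed data source
SELECT *
FROM OPENROWSET(
    BULK 'Group_By_Region/*.parquet',
    DATA_SOURCE = 'myfilesystem',
    FORMAT = 'PARQUET'
) AS [result]
GO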

Refer here for more options and query syntax.

Create a Pipeline

Now that we have seen the SQL pool, SQL on-demand and the Spark pool, we'll create an end-to-end data load and transformation pipeline using a Synapse pipeline. We'll use the Spark pool to transform the data into Parquet and load it into the SQL pool using PolyBase.

A Synapse pipeline is Azure Data Factory integrated into the Synapse workspace.

This pipeline brings data from an external API and finally loads it into the Spark pool.

For more on copying data using PolyBase, refer here.
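For reference, the PolyBase route can also be scripted directly against the SQL pool. The following is a hedged sketch only: the credential and object names (msi_cred, polybase_datalake, region_group_ext, region_group_polybase) are illustrative, it assumes the workspace managed identity has Storage Blob Data Reader access on the data lake, and it reuses the generic_parquet_format created earlier. A database master key must already exist in the SQL pool (see the CREATE MASTER KEY step above).

-- Database scoped credential based on the workspace managed identity (illustrative name)
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity';
GO

-- External data source over the data lake; TYPE = HADOOP enables PolyBase
CREATE EXTERNAL DATA SOURCE polybase_datalake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://myfilesystem@myorgdatalake.dfs.core.windows.net',
    CREDENTIAL = msi_cred
);
GO

-- External table over the Parquet output written by the Spark job
CREATE EXTERNAL TABLE [dbo].[region_group_ext]
(
    region VARCHAR(50),
    total_countries BIGINT
)
WITH (
    LOCATION = '/Group_By_Region/',
    DATA_SOURCE = polybase_datalake,
    FILE_FORMAT = generic_parquet_format
);
GO

-- Load the data into a distributed table with CTAS
CREATE TABLE [dbo].[region_group_polybase]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM [dbo].[region_group_ext];
GO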

Architecture

A simplified architecture using Synapse components.

Conclusion

In this blog, we have seen some of the new components Microsoft has added to enrich the data life cycle around the former Azure SQL DW. Though some features are still being improved, it's worth trying them out, especially if you're already using SQL DW/SQL pool.

Thanks for reading! If you enjoyed it, clap and share! To see similar posts, follow me on Medium and LinkedIn.
