Azure Synapse Analytics Workspace: the tool that unifies your Big Data components

Azure Synapse brings together Microsoft’s enterprise data warehousing and Big Data analytics, and the Azure Synapse workspace provides a unique experience by unifying the various components into a common, user-friendly interface.

In this blog, we’ll take a brief look at the main components and sketch a simplified architecture.
Setting up an Azure Synapse Analytics Workspace
To start with, we’ll create a workspace using the Azure portal.




Glimpse of the Workspace
Once the workspace has been created, we’ll use the Workspace web URL to launch it in another browser window.


The Synapse Analytics workspace supports three types of ‘Databases’: SQL pool, SQL on-demand and Spark pool. The Data > Workspace tab contains the database details. We’ll explore each type briefly in this blog.

In the Data > Linked tab, we can add Azure Cosmos DB (SQL and MongoDB APIs) and Azure Data Lake Storage Gen2. The primary storage account is configured automatically.

The Develop tab contains all of our SQL scripts, Spark notebooks, Spark job definitions and data flows.

The Integrate tab contains the Data Factory pipelines and copy data activities.

The Monitor tab shows all of the pipeline runs along with the Spark and SQL runs. Historic runs can be filtered by time range.

Use the Manage tab to manage security settings (e.g. access control for granting access to the workspace), pipeline triggers, integration runtimes, linked services and analytics pools.
Configure the Workspace
Apache Spark Pool
The Synapse workspace supports creating and managing Apache Spark pools and running Spark queries against your big data. The pool configuration offers several advanced features; for details, refer here.




Once we have a pool set up, we can create a notebook to run our Spark code.

For this blog, we’ll simply load the records from the data lake, aggregate them by region and store the result in another data lake location.
from pyspark.sql.functions import col

# Source DataFrame loaded from the data lake (this source path is an assumption for illustration)
df = spark.read.parquet('abfss://myfilesystem@myorgdatalake.dfs.core.windows.net/Countries')

# Count the countries per region and write the result back to the data lake as Parquet
grpByRegionDf = df \
    .filter(col('region').isin(
        ['Europe', 'Asia', 'Africa', 'Polar', 'Americas', 'Oceania'])) \
    .groupBy("region") \
    .agg({"name": "count"}) \
    .withColumnRenamed("count(name)", "total_countries")

grpByRegionDf.write.parquet('abfss://myfilesystem@myorgdatalake.dfs.core.windows.net/Group_By_Region', mode='overwrite')
SQL Pool
The SQL pool is a T-SQL-based compute engine with its own storage. We can create any number of SQL pools from the Synapse Analytics workspace. For further details, refer here.
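To get a feel for what the SQL pool offers beyond plain T-SQL, the following is a minimal sketch of a table definition that makes the distribution and index choices explicit; the table and column names are illustrative and not part of this blog’s dataset.
CREATE TABLE [dbo].[sales_fact]
(
sale_id BIGINT NOT NULL,
region VARCHAR(50) NOT NULL,
amount DECIMAL(18,2) NOT NULL
)
WITH (
DISTRIBUTION = HASH(region), -- co-locate rows with the same region on one distribution
CLUSTERED COLUMNSTORE INDEX -- the default storage format for analytics workloads
)
GO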






The SQL pool supports three ways to load data: PolyBase, the COPY statement and bulk insert. We’ll use the first two options in this blog.
Load data using COPY statement
With the following steps, we will copy a Parquet file (with Snappy compression) into a SQL Pool table.
- Create an external file format (refer here for more details):
CREATE EXTERNAL FILE FORMAT generic_parquet_format
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
- Create a table:
CREATE TABLE [dbo].[region_group_parquet]
(
region VARCHAR(50) NOT NULL,
total_countries BIGINT NOT NULL
)
GO
- Load data from the Parquet file(s) using the COPY statement:
COPY INTO [dbo].[region_group_parquet]
FROM 'https://myorgdatalake.dfs.core.windows.net/myfilesystem/Group_By_Region/*.parquet'
WITH (
FILE_FORMAT = generic_parquet_format,
CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage account secret key>')
)
- View the table using:
SELECT * FROM [dbo].[region_group_parquet]
SQL on-demand
This is a serverless query service that enables us to run SQL queries on files of different types placed in Azure Storage.

To read files directly from the Azure Storage account, SQL on-demand supports several authorization methods (refer here for more details). In this blog we’ll use a Shared Access Signature (SAS) to connect.
We can query the external data using OPENROWSET or create an external table to map the records. The steps below set up an external table; an OPENROWSET alternative is shown after them.
Step 1: Create Master Key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<input a password>'
GO
Step 2: Create database scoped credential
CREATE DATABASE SCOPED CREDENTIAL [SasToken]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<shared access signature without the leading ? character>';
GO
Step 3: Create external file format
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] WITH ( FORMAT_TYPE = PARQUET)
GO
Step 4: Create external data source
CREATE EXTERNAL DATA SOURCE myfilesystem
WITH (
LOCATION = 'https://myorgdatalake.dfs.core.windows.net/myfilesystem',
CREDENTIAL = SasToken
)
GO
Step 5: Create external table
CREATE EXTERNAL TABLE dbo.Group_By_Region ( [region] varchar(50), [total_countries] bigint )
WITH (
LOCATION = '/Group_By_Region/*.parquet',
DATA_SOURCE = [myfilesystem],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
Step 6: View external table
SELECT * FROM dbo.Group_By_Region
GO
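Alternatively, the same files can be queried ad hoc with OPENROWSET, without creating an external table. Here is a minimal sketch that reuses the data source created in Step 4:
SELECT *
FROM OPENROWSET(
BULK 'Group_By_Region/*.parquet',
DATA_SOURCE = 'myfilesystem',
FORMAT = 'PARQUET'
) AS [result]
GO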

Refer here for more options and query syntax.
Create a Pipeline
Now that we have seen the SQL pool, SQL on-demand and the Spark pool, we’ll create an end-to-end data load and transformation pipeline using a Synapse pipeline. We’ll use the Spark pool to transform the data into Parquet and load it into the SQL pool using PolyBase.
A Synapse pipeline is essentially Azure Data Factory integrated into the Synapse workspace.

For more on copying data using PolyBase, refer here.
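For reference, a PolyBase load is roughly equivalent to exposing the Parquet output as an external table and then loading it with CTAS. The sketch below illustrates that pattern under those assumptions; the external data source and table names are illustrative and are not created elsewhere in this blog, and the copy activity generates its own staging objects rather than these exact statements.
-- External data source over the data lake (TYPE = HADOOP enables PolyBase in the SQL pool)
CREATE EXTERNAL DATA SOURCE datalake_hadoop
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://myfilesystem@myorgdatalake.dfs.core.windows.net'
-- add CREDENTIAL = <database scoped credential> if the storage account is not publicly accessible
)
GO
-- External table over the Parquet files written by the Spark notebook
CREATE EXTERNAL TABLE [dbo].[region_group_external]
(
region VARCHAR(50),
total_countries BIGINT
)
WITH (
LOCATION = '/Group_By_Region/',
DATA_SOURCE = datalake_hadoop,
FILE_FORMAT = generic_parquet_format -- the file format created in the COPY section above
)
GO
-- Load the data into a distributed SQL pool table via CTAS
CREATE TABLE [dbo].[region_group_polybase]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT region, total_countries FROM [dbo].[region_group_external]
GO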
Architecture

Conclusion
In this blog, we have seen some of the new components Microsoft has added to enrich the data life-cycle around the former Azure SQL DW. Though some features are still being improved, it’s worth trying them out, especially if you’re already using SQL DW/SQL pool.