Azure DevOps Pipeline Setup for Azure Data Factory (v2)

Azure Data Factory (v2) is a very popular managed Azure service, used heavily in scenarios ranging from simple to complex ETL (extract-transform-load), ELT (extract-load-transform) and data integration.

On the other hand, Azure DevOps has become a robust toolset for collaboration and for building CI/CD pipelines.

In this blog, we’ll see how we can implement a DevOps pipeline with ADFv2.

Data factories vs Data factory (v2)

Data factories

  • Contains a set of Data factory (v1 and/or v2) instances.

Data factory (v2)

  • Contains a set of pipelines, datasets, linked services, triggers and integration runtimes.
  • Each instance is associated with a resource group. Multiple instances can be associated with a single resource group.
  • Each instance is associated with a single location (e.g. East US, UK South, Central India etc.).
  • Access control is at the instance level: users who have access to a Data factory instance can access all of its pipelines.
  • Associated with a single Git/Azure Repos repository.
Data factories & individual factory instances

When to create multiple pipelines in a single Data factory (v2) instance?

  • The pipelines belong to a single business unit/application/project.
  • Common development & support teams can access all of the pipelines.
  • The datasets or linked services can/should be reused.
  • The Data factory (v2) instance and all other associated Azure services are in the same location (the Azure services can be in different locations, but ideally shouldn't be).

When to create pipelines in multiple Data factory (v2) instances?

  • Need to create pipelines for different business units (BU).
  • Developers/members of one BU shouldn't have access to the pipelines of another BU.
  • Developers/members of one BU shouldn't have access to the Azure services used by another BU.
  • Multiple support teams monitor multiple Data factory instances.

Problems

  • A single support team monitoring multiple Data factory instances needs to open the monitoring screen of each instance separately! Otherwise, they need to develop a tool that monitors the pipelines of multiple factory instances in a consolidated view.
  • One Data factory instance can't share its datasets or linked services with another.

Azure DevOps Project Creation

To start with the actual implementation, let’s create an Azure DevOps project.

We have created a new project under one of ‘My organizations’.

By default, only Boards is activated, so we have to turn on at least the following services.

Select the project > Project settings > Overview > Azure DevOps services.

Azure Data Factory (v2) Creation & Integration with DevOps

Now, we’ll create an ADFv2 instance & integrate it with Azure Repos. From the New data factory screen, we can select Enable Git & input the repo details.

Alternatively, we can skip Enable Git and attach the repository later.

Select ‘Set up Code Repository’.

The Repository Settings screen will open. Add the required details.

Note: We may find that the Azure DevOps Account (above, in red) is not populated! In that case, select the ‘Select a different directory’ checkbox and try again.

If the Azure DevOps Account is still not populated, log in to https://dev.azure.com, select the organization > Organization Settings > Azure Active Directory and connect to the directory. Once you sign out and sign back in, the following screen will be displayed.

Open Azure DevOps > select the organization > Organization Settings > Azure Active Directory.
Select the Azure DevOps Account, Project Name, Git repository name, Collaboration branch & Save.

Once done, the ADFv2 will be backed by the configured Azure Repos and we can start building data factory pipelines.

Created a very simple pipeline with a couple of datasets & linked services.

All of the changes will be saved to the configured Azure Repos branch.
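Each component is stored in the repository as an individual JSON file (pipelines, datasets and linked services in their own folders). A minimal sketch of what such a pipeline JSON looks like; the pipeline and activity names here are illustrative, not taken from the screenshots:

```json
{
    "name": "MyFirstPipeline",
    "properties": {
        "activities": [
            {
                "name": "Wait1",
                "type": "Wait",
                "typeProperties": {
                    "waitTimeInSeconds": 5
                }
            }
        ],
        "annotations": []
    }
}
```

Every Save from the ADFv2 UI commits this kind of component JSON to the working branch; no ARM template is generated at this point.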

Azure Repos.

Points to note:

  • There are two options while designing ADFv2 pipelines in the UI: the Data Factory live mode & the Azure DevOps GIT mode.
  • If the live mode is selected, we have to Publish the pipeline to save it. If we don’t publish and only test the pipeline in Debug mode, there is a chance of losing the code if the browser/ADFv2 UI is closed by mistake!
  • On the other hand, if we configure Azure DevOps GIT, the design/code is saved into the code repository, and once we’re happy we can publish. So, we should always configure the code repository while designing in ADFv2.
  • It’s not recommended to switch between live mode and Git mode, as changes made in live mode may not be visible in Git mode.

A couple more things we should be aware of before we start with the DevOps pipeline…

adf_publish vs master (or Collaboration branch)

Once we set up Azure Repos for a Data factory (v2), a couple of branches are created: adf_publish & master (usually the latter is the collaboration branch, though we can select any other).

Select the organization > Project > Repos > Branches.

The adf_publish branch contains the ARM template and parameters JSONs. It is updated once we Publish from the ADFv2 UI.
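For reference, a trimmed sketch of the generated parameters file (ARMTemplateParametersForFactory.json). The factory name is the key parameter that lets the same template target different environments; the value shown here is illustrative, and real files usually also carry parameters for linked service connection details:

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "my-dev-adf"
        }
    }
}
```

Overriding factoryName (and any environment-specific parameters) at deployment time is what makes the same adf_publish template reusable for QA and production.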

Azure Repos > Files > Select adf_publish.

The master branch is updated (unless we create another branch, e.g. a feature branch) once we Save from the UI.

Azure Repos > Files > Select master.

Branch Policies

This is not mandatory; however, we should associate appropriate policies with the master branch.

Select organization > project > Repos > Branches > Branch policies.

To prevent unreviewed code from being checked into master, we can enforce a minimum number of reviewers to review the code.

Under Project Settings > Policies will be selected.

We can notify a set of reviewers (e.g. a coding governance/review team) automatically as soon as a check-in request is raised for that particular project branch. We can mandate that the reviewers approve before we proceed further.

Refer here for more details on the Branch policies.

Now, the main part of this topic: we’ll set up the required development branch & construct a DevOps pipeline.

A simplified workflow

Create a Feature Branch from Master & Add a new ADFv2 Pipeline

We have a master branch with only one ADFv2 pipeline.

From the ADFv2 UI, we’ll create a new feature branch. The same branch will be created in Azure Repos.

We’ll create a second pipeline in the UI and Save All once done.

The code will be updated in the Azure Repos.

Repos > Files > select the feature branch from the drop down.

Point to note:

The adf_publish branch, which contains the ARM template & parameters, will not be updated when we save. Publish from the Data factory UI will not work from a feature branch!

Create a Pull Request

Once we’re happy with the ADFv2 pipeline/datasets etc., we would like to deploy them to the next environment and push the code into the master branch (for the sake of simplicity, we’re pushing the code from the development feature branch directly to master).

After the changes are done, we’ll create a pull request and select the reviewers.

Create a new pull request from the feature to master, select the reviewers & select the work items we have already created in Azure DevOps Boards.
The reviewer(s) will receive an auto-email. Press ‘View pull request’.
‘View pull request’ will open the changes in Azure DevOps portal. Select ‘Approve’/’Reject’ etc. Reviewer(s) may not want to ‘Complete’ the request.
The developer checks the ‘Pull requests’ status…
…and the comments from the reviewer(s). Once happy, developer can ‘Complete’ the request.

Code Merge into Master

Once the code is reviewed, merge it into the collaboration branch (here, the master branch). Check the Merge type; full details are here.

Input merge comment, merge type & post-completion options.

Point to note:

Squash merging keeps our default branch history (here, master) clean and easy to follow. When squash merging, it’s good practice to delete the feature branch after the merge.

Once the code merge is completed, ‘master’ will reflect the changes (component JSONs only).
Go back to the ADFv2 UI & ‘Publish’. Only now will the pending changes be published.
Once the code publishing is completed, adf_publish will reflect the changes (ARM template & parameter JSONs).

Release Pipeline

Next, we’ll create a release pipeline which will be triggered once adf_publish is updated. We haven’t created a build pipeline, as we’ll simply deploy the ARM template from the adf_publish branch.

Select the project > Pipelines > Releases.
Configure the Artifact. Select Source type as Azure Repos Git. Select adf_publish as the ‘Default branch’. Add a next stage, called ‘QA’.
Select the agent. Create a task ‘Azure Deployment: Create / Update Resource Group’

Points to note:

  • Check the ARM template deployment mode (Complete mode vs Incremental mode) while configuring the pipeline. Choosing Complete mode will overwrite our existing pipelines/datasets/everything, which may create issues in a running production environment, while Incremental mode will only update the changed resources. For full details, refer here.
  • We have selected a Microsoft-hosted agent for our job to run on. For more on Azure Pipelines agents, or whether to use a self-hosted agent instead, refer here.
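We built the release through the classic UI, but the same deployment step can be sketched as a YAML pipeline using the AzureResourceManagerTemplateDeployment task. This is a hedged sketch only: the service connection name, resource group, factory name, location and file paths below are assumptions for illustration, not values the UI generates for us:

```yaml
# Run whenever the publish branch is updated (continuous deployment)
trigger:
  branches:
    include:
      - adf_publish

pool:
  vmImage: 'ubuntu-latest'        # Microsoft-hosted agent

variables:
  resourceGroup: 'rg-adf-qa'      # assumed QA resource group name
  factoryName: 'my-qa-adf'        # assumed QA factory name

steps:
  - task: AzureResourceManagerTemplateDeployment@3
    inputs:
      deploymentScope: 'Resource Group'
      azureResourceManagerConnection: 'my-service-connection'   # assumed service connection
      subscriptionId: '$(subscriptionId)'
      resourceGroupName: '$(resourceGroup)'
      location: 'East US'
      templateLocation: 'Linked artifact'
      # adf_publish stores the templates under a folder named after the source factory
      csmFile: '$(Build.SourcesDirectory)/my-dev-adf/ARMTemplateForFactory.json'
      csmParametersFile: '$(Build.SourcesDirectory)/my-dev-adf/ARMTemplateParametersForFactory.json'
      overrideParameters: '-factoryName $(factoryName)'         # retarget the template per environment
      deploymentMode: 'Incremental'   # leaves resources not in the template untouched
```

The same deployment can also be run ad hoc with the Azure CLI, e.g. `az deployment group create --resource-group rg-adf-qa --template-file ARMTemplateForFactory.json --parameters ARMTemplateParametersForFactory.json --mode Incremental` (names assumed), which is handy for smoke-testing the exported template before wiring up the pipeline.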
Create the pipeline variables which will be referenced by our pipeline.
Set the continuous deployment trigger so the pipeline is triggered as soon as adf_publish is updated.
Now, create the QA & production ADFv2 instances, Storage accounts, Azure Key Vaults (for this example).

Point to note:

The QA & production ADFv2 instances, region-specific storage accounts and Azure Key Vaults need to be created and configured (e.g. ADFv2 Managed Identity Application ID access to Key Vault) beforehand. If we need to create/configure all of these during pipeline execution, we need to add the appropriate tasks.

Once the QA stage is created, we can create a release to test.
Once the manual release is tested successfully, we can move forward to create the production deployment stage.

What we want:

  1. QA deployment should be done as soon as we publish the changes into the repository, here: adf_publish.
  2. Production deployment should be done once all of the pre-production deployment activities are completed.
Enable the pre-deployment approvals.
Add the approvers & select other options based on the requirement.

This will complete the QA & Production stages configurations. Now, we need to configure the Continuous deployment trigger.

Enable the ‘Continuous deployment trigger’ on the ‘adf_publish’.

Points to note:

  • Configure the QA stage pre-deployment approvals e.g. approval from QA environment owner, development lead.
  • Configure the QA stage post-deployment approvals e.g. approval from the QA test lead.
  • Configure the QA stage post-deployment gates e.g. no alerts for the QA ADFv2 pipelines from the Azure Monitor.
  • Configure the production stage pre-deployment approvals e.g. approvals from the production environment support lead, project owner.

Please refer here for further information on approvals & gates.

Deployment In Action

Once the above steps are done, the DevOps pipeline will be triggered whenever a change is pushed into adf_publish. In our case, we only have a pre-deployment approval on the production stage, so the production deployment will be paused pending manual approval.

The approver will receive an email. Once they click ‘View approval’, the Azure DevOps UI will open.
The approver needs to approve/reject the production deployment request.
If approved, the production deployment will start.
Production deployment is completed.
The production ADFv2 instance will have all of the artifacts deployed.

Pipeline Amendments

We may need to update existing ADFv2 pipelines or add new ones after we go live. We need to create a feature branch and make the changes there.

In the following example, we have made the following changes:

  1. A new pipeline MyThirdPipeline is created by cloning from MySecondPipeline.
  2. MySecondPipeline is renamed to MySecondPipelineUpdated, with waitTimeInSeconds changed from 5 to 10.

However, the file changes shown below are not in the order in which we made them! For this example, though, it does no harm.

The code changes show ‘MySecondPipelineUpdated’ as a new pipeline!
While publishing, the correct order of the changes is also not reflected.
After the changes are pushed into production, we find two new pipelines are created with the older one untouched!
Even the run history is preserved.

Points to note:

  • Earlier, we selected the deployment mode as Incremental rather than Complete.
  • If a pipeline is renamed in the development environment, the old pipeline will not be renamed in production; instead, a new one will be created.
  • The production run history will be maintained unless we’re overwriting.
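This rename behaviour follows from how ARM deployments work: resources are identified by name, so in the generated template the renamed pipeline simply appears as a resource with a new name, and in Incremental mode the resource carrying the old name is left untouched rather than deleted. Schematically, the relevant entry in ARMTemplateForFactory.json looks like the sketch below (structure abbreviated; the activity name is assumed):

```json
{
    "name": "[concat(parameters('factoryName'), '/MySecondPipelineUpdated')]",
    "type": "Microsoft.DataFactory/factories/pipelines",
    "apiVersion": "2018-06-01",
    "properties": {
        "activities": [
            {
                "name": "Wait1",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 10 }
            }
        ]
    }
}
```

Since nothing in the template references the old name MySecondPipeline, an Incremental deployment has no instruction to remove it; cleaning up orphaned resources needs either Complete mode (with its risks noted above) or an explicit cleanup step.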

Conclusion

  • The master/main branch should always reflect what we have in production. If it doesn’t, we should sync it as a priority or deploy the individual JSONs manually (not the ARM template) until it is synced.
  • We should setup the QA/staging/pre-production environment to validate the changes done by the DevOps pipeline before pushing the changes into production.
  • Prefer the Incremental deployment mode over Complete to preserve the production pipeline run history (unless we have a different requirement).
  • All of the Azure services should follow appropriate naming conventions so Azure DevOps pipeline variables can be used to point to the development, QA & production environments accordingly.
  • It’s recommended to use Azure Key Vault to store any connection strings, keys or secrets.
  • In case we want to manage individual ADFv2 pipelines with a CI-CD process, you can refer to this (out of scope for this blog).

Hope you have enjoyed creating an Azure DevOps pipeline for ADFv2. We have tried to explain the different aspects with a simple example; actual pipelines may be much more complex depending on requirements!

Thanks for reading. In case you want to share your case studies or want to connect, please ping me via LinkedIn.

Written by

Tech enthusiast, Azure Big Data Architect.
