Azure DevOps Pipeline Setup for Azure Data Factory (v2)
Azure Data Factory (v2) is a popular managed Azure service, used heavily in scenarios ranging from simple to complex ETL (extract-transform-load), ELT (extract-load-transform) and data integration.
On the other hand, Azure DevOps has become a robust toolset for collaboration and for building CI/CD pipelines.
In this blog, we’ll see how we can implement a DevOps pipeline with ADFv2.
Data factories vs Data factory (v2)
Data factories
- Contains a set of Data factory (v1 and/or v2) instances.
Data factory (v2)
- Contains a set of pipelines, datasets, linked services, triggers, integration runtimes.
- Each instance is associated with a resource group. Multiple instances can be associated with a single resource group.
- Each instance is associated with a single location (e.g. East US, UK South, Central India etc.)
- Instance-level access control: users with access to a Data factory instance can access all of the pipelines inside it.
- Associated with a single Git/Azure Repo.
When to create multiple pipelines in a single Data factory (v2) instance?
- Pipelines belong to a single business unit/same application/same project.
- Common development & support teams can access all of the pipelines.
- The pipelines can/should reuse datasets or linked services.
- The Data factory (v2) instance and all other associated Azure services are in the same location (the Azure services can be in different locations, but ideally shouldn’t be).
When to create pipelines in multiple Data factory (v2) instances?
- Need to create pipelines for different business units (BU).
- Developers/members of one BU shouldn’t have access to the pipelines of another BU.
- Developers/members of one BU shouldn’t have access to the Azure services used by another BU.
- Multiple support teams monitor separate Data factory instances.
- A single support team monitoring multiple Data factory instances has to open the monitoring screen of each instance separately! Otherwise, it needs to develop a tool to monitor pipelines across multiple factory instances in a consolidated view.
- One Data factory instance can’t share the datasets / linked services with another one.
Azure DevOps Project Creation
To start with the actual implementation, let’s create an Azure DevOps project.
By default, only Boards is activated, so we have to turn on at least the following services (Repos and Pipelines).
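As a sketch, the project creation step can also be done from the Azure CLI with the azure-devops extension; the organization URL and project name below are illustrative assumptions (services such as Repos and Pipelines are then toggled under Project settings):

```shell
# Assumes the Azure CLI is installed and logged in; names are illustrative.
az extension add --name azure-devops
az devops project create \
  --organization https://dev.azure.com/my-org \
  --name adf-cicd-demo \
  --visibility private
```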
Azure Data Factory (v2) Creation & Integration with DevOps
Now, we’ll create an ADFv2 instance and integrate it with Azure Repos. From the New data factory screen, we can select Enable Git and input the Repo details.
Otherwise, we can skip Enable Git and attach the repository later.
The Repository Settings screen will be opened. Add the required details.
Note: We may find that the Azure DevOps Account (above in red) is not populated! In that case, select the ‘Select a different directory’ checkbox and try again.
If the Azure DevOps Account is still not populated, log in to https://dev.azure.com, select the organization > Organization Settings > Azure Active Directory and connect to the directory. Once you sign out and sign in, the following screen will be displayed.
Once done, the ADFv2 will be backed by the configured Azure Repos and we can start building data factory pipelines.
All of the changes will be saved to the configured Azure Repos branch.
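For reference, the same creation and Git attachment can be sketched with the Azure CLI datafactory extension. All names and the subscription id below are illustrative placeholders, and the `configure-factory-repo` parameter keys should be verified against the current extension docs:

```shell
# Assumes the Azure CLI "datafactory" extension; all names are assumptions.
az extension add --name datafactory
az datafactory create \
  --resource-group my-adf-rg \
  --name my-adf-dev \
  --location eastus

# Attach the Azure Repos Git repository after creation.
az datafactory configure-factory-repo \
  --location eastus \
  --factory-resource-id "/subscriptions/<subscription-id>/resourceGroups/my-adf-rg/providers/Microsoft.DataFactory/factories/my-adf-dev" \
  --factory-vsts-configuration account-name=my-org project-name=adf-cicd-demo \
      repository-name=adf-cicd-demo collaboration-branch=master root-folder=/
```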
Points to note:
- There are two options while designing ADFv2 pipelines in UI — the Data Factory live mode & Azure DevOps GIT mode.
- If live mode is selected, we have to Publish the pipeline to save it. If we don’t publish and only test the pipeline in Debug mode, we risk losing the work if the browser/ADFv2 UI is closed by mistake!
- On the other hand, if we configure Azure DevOps Git, the design/code is saved into the code repository, and once we’re happy we can publish. So, we should always configure a code repository while designing in ADFv2.
- It’s not recommended to switch between live mode and Git mode, as changes made in live mode may not be visible in Git mode.
A couple more things we should be aware of before we start with the DevOps pipeline…
adf_publish vs master (or Collaboration branch)
Once we set up Azure Repos for a Data factory (v2), a couple of branches are created — adf_publish & master (usually this is the collaboration branch, though we can select any other).
The adf_publish branch contains the ARM template and parameters JSONs. It is updated once we Publish from the ADFv2 UI.
The master branch (unless we create another branch, e.g. a feature branch) is updated once we Save from the UI.
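A typical layout of the adf_publish branch after a Publish looks like this (the factory-name folder is illustrative; the two JSON file names are the ones ADFv2 generates):

```
adf_publish
└── my-adf-dev/
    ├── ARMTemplateForFactory.json            # full factory ARM template
    └── ARMTemplateParametersForFactory.json  # parameter values for the template
```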
Though not mandatory, we should associate appropriate policies with the master branch.
To prohibit un-reviewed code from being checked into master, we can enforce a minimum number of reviewers.
We can notify a set of reviewers (e.g. a coding governance / review team) automatically as soon as a check-in request is raised for that branch, and mandate that the reviewers approve before we proceed further.
Refer here for more details on the Branch policies.
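As a sketch, the minimum-reviewer policy can also be set from the Azure CLI (azure-devops extension); the organization, project and repository id below are assumptions:

```shell
# Illustrative: require two reviewers on master before a PR can complete.
az repos policy approver-count create \
  --organization https://dev.azure.com/my-org \
  --project adf-cicd-demo \
  --repository-id <repo-guid> \
  --branch master \
  --minimum-approver-count 2 \
  --creator-vote-counts false \
  --blocking true \
  --enabled true
```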
Now, the main part of this topic — we’ll setup the required development branch & construct a DevOps pipeline.
Create a Feature Branch from Master & Add a new ADFv2 Pipeline
We have a master branch with only one ADFv2 pipeline.
From the ADFv2 UI, we’ll create a new feature branch. The same will be created in the Azure Repos.
We’ll create a second pipeline in the UI and Save All once done.
The code will be updated in the Azure Repos.
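The same feature branch can equivalently be created with plain git against the Azure Repos remote; the branch name below is illustrative:

```shell
# Branch from an up-to-date master and publish the branch to Azure Repos.
git checkout master
git pull
git checkout -b feature/my-second-pipeline
git push -u origin feature/my-second-pipeline
```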
Point to note:
The adf_publish branch, which contains the ARM template & parameters, will not be updated when we save. Publish from the Data factory UI will not work from a feature branch!
Create a Pull Request
Once we’re happy with the ADFv2 pipeline/datasets etc., we would like to deploy this to the next environment and push the code into the master branch (for the sake of simplicity we’re pushing the code from the development feature branch to master).
After the changes are done, we’ll create a pull request and select the reviewers.
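The pull request can also be raised from the Azure CLI; repository, branch and reviewer values below are assumptions:

```shell
# Illustrative: raise a PR from the feature branch into master with reviewers.
az repos pr create \
  --repository adf-cicd-demo \
  --source-branch feature/my-second-pipeline \
  --target-branch master \
  --title "Add MySecondPipeline" \
  --reviewers reviewer1@contoso.com reviewer2@contoso.com
```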
Code Merge into Master
Once the code is reviewed, merge it with the collaboration branch (here, master branch). Check the Merge type, full details are here.
Point to note:
Squash merging keeps our default branch history (here, master) clean and easy to follow. When squash merging, it’s good practice to delete the feature branch after the merge.
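For reference, completing the pull request with a squash merge (and deleting the source branch) can be sketched from the Azure CLI; the PR id below is an assumption:

```shell
# Illustrative: complete PR #42 as a squash merge and clean up the branch.
az repos pr update \
  --id 42 \
  --status completed \
  --squash true \
  --delete-source-branch true
```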
Next, we’ll create a release pipeline which will be triggered once adf_publish is updated. We haven’t selected a build pipeline, as we’ll simply deploy the ARM template from the adf_publish branch.
Points to note:
- Check the ARM template deployment mode (Complete mode vs Incremental mode) while configuring the pipeline. Choosing Complete mode will remove or overwrite existing pipelines/datasets/resources not present in the template, which may cause issues in a running production environment, while Incremental mode only updates the changed resources. For full details, refer here.
- We have selected a Microsoft-hosted agent for our job to run. For more on Azure Pipelines Agents or whether to use Self-hosted agent instead, refer here.
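The ARM deployment that the release task performs can be sketched in CLI form; the resource group, file paths and factory name are assumptions — note the `--mode Incremental` flag discussed above:

```shell
# Illustrative: deploy the ADFv2 ARM template from adf_publish to the QA factory.
az deployment group create \
  --resource-group my-adf-qa-rg \
  --mode Incremental \
  --template-file my-adf-dev/ARMTemplateForFactory.json \
  --parameters my-adf-dev/ARMTemplateParametersForFactory.json \
  --parameters factoryName=my-adf-qa
```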
Point to note:
The QA & production ADFv2 instances, region-specific storage accounts and Azure Key Vaults need to be created and configured (e.g. granting the ADFv2 Managed Identity access to Key Vault) beforehand. If we need to create/configure all of these during pipeline execution, we need to add the appropriate tasks.
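One of those pre-steps — granting the QA factory’s managed identity access to Key Vault secrets — can be sketched like this; the resource names below are assumptions:

```shell
# Illustrative: look up the factory's managed identity and grant it secret access.
principalId=$(az datafactory show \
  --resource-group my-adf-qa-rg --name my-adf-qa \
  --query identity.principalId --output tsv)

az keyvault set-policy \
  --name my-qa-keyvault \
  --object-id "$principalId" \
  --secret-permissions get list
```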
What we want —
- QA deployment should be done as soon as we publish the changes into the repository, here: adf_publish.
- Production deployment should be done once all of the pre-production deployment activities are completed.
This will complete the QA & Production stages configurations. Now, we need to configure the Continuous deployment trigger.
Points to note:
- Configure the QA stage pre-deployment approvals e.g. approval from QA environment owner, development lead.
- Configure the QA stage post-deployment approvals e.g. approval from the QA test lead.
- Configure the QA stage post-deployment gates e.g. no alerts for the QA ADFv2 pipelines from the Azure Monitor.
- Configure the production stage pre-deployment approvals e.g. approvals from the production environment support lead, project owner.
Please refer here for further information on approvals & gates.
Deployment In Action
Once the above steps are done, any change pushed into adf_publish will trigger the DevOps pipeline. In our case, we only have a production-stage pre-deployment approval, so the production deployment will pause pending manual approval.
We may need to update or add new ADFv2 pipelines after we go live. We need to create a feature branch and do the changes.
In the following example, we have made these changes:
- A new pipeline MyThirdPipeline is created by cloning from MySecondPipeline.
- MySecondPipeline is renamed to MySecondPipelineUpdated, with waitTimeInSeconds changed from 5 to 10.
Note that the file changes shown below are not listed in the order we made them; for this example, that does no harm.
Points to note:
- Earlier we selected the deployment mode as Incremental rather than Complete.
- If a pipeline is renamed in the development environment, the old pipeline will not be renamed in production; instead, a new one will be created.
- Production run history will be maintained unless we’re overwriting.
- The master/main branch should always reflect what we have in production. If that is not the case, we should sync it as a priority or deploy individual JSONs manually (not the ARM template) until it is synced.
- We should setup the QA/staging/pre-production environment to validate the changes done by the DevOps pipeline before pushing the changes into production.
- Prefer deployment mode as Incremental than Complete to preserve the production pipeline run history (unless we have a different requirement).
- All of the Azure services should follow appropriate naming conventions, so that Azure DevOps pipeline variables can be used to point to the development, QA & production environments accordingly.
- It’s recommended to use Azure Key Vault to store any connection strings, keys or secrets.
- In case we want to manage individual ADFv2 pipelines with a CI-CD process, you can refer to this (out of scope for this blog).
Hope you have enjoyed creating an Azure DevOps pipeline for ADFv2. We have tried to explain different aspects with a simple example; however, real pipelines may be much more complex depending on requirements!
Thanks for reading. In case you want to share your case studies or want to connect, please ping me via LinkedIn.