In my previous blog, we created an Azure Data Factory instance and discussed the benefits of using it. We discussed how Azure Data Factory allows users to transform their raw data into actionable insights. In this blog, I shall introduce you to the major Azure Data Factory components that make this transformation possible. So, let us start exploring the Data Factory components.
The following are the Azure Data Factory components that I shall be covering. These components transform your source data and make it consumable by the end product.
- Pipelines
- Activities
- Data flows
- Datasets
- Linked Services
- Integration Runtimes
- Triggers
Please Note: It is essential that you have already created a data factory instance to follow this blog. If you do not know how to create one, please refer to the following blog:
Introduction to Azure Data Factory
To understand the Azure Data Factory components, we shall be looking at an example comprising a pipeline, two datasets, and one data flow, as shown below.
Pipelines
A pipeline is defined as a logical group of activities that together perform a piece of work. A Data Factory can contain multiple pipelines, with each pipeline having multiple activities. The activities inside a pipeline can be structured to run sequentially or in parallel, depending on your needs. Pipelines allow users to easily manage and schedule multiple activities together.
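To make this concrete, here is a minimal sketch of defining and running a pipeline with the azure-mgmt-datafactory Python SDK (the same components can also be built in the visual authoring UI). The subscription, resource group, factory, and pipeline names are hypothetical placeholders, and the single Wait activity stands in for real work; the later sketches in this blog reuse the adf_client, RESOURCE_GROUP and FACTORY_NAME defined here.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# Hypothetical names -- substitute your own subscription, resource group and factory.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"
FACTORY_NAME = "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A pipeline is a named, ordered collection of activities. This one contains
# a single Wait activity purely for illustration.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)],
    parameters={},
)
adf_client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "DemoPipeline", pipeline)

# Kick off a run on demand; triggers (covered later) automate this step.
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "DemoPipeline")
print(run.run_id)
```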
Activities
Activities are the individual processing steps inside a pipeline, each performing a single task. They fall into three categories: data movement, data transformation, and control activities. Activities can be executed either sequentially or in parallel.
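For example, a copy activity moves data from one dataset to another. The sketch below only constructs the activity object; the dataset names "InputDataset" and "OutputDataset" are hypothetical and would be defined in the factory as shown in the Datasets section further down.

```python
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
)

# A data movement (copy) activity: read from one dataset, write to another.
copy_activity = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDataset")],
    source=BlobSource(),  # how to read from the source blob store
    sink=BlobSink(),      # how to write to the destination blob store
)

# Adding copy_activity to a PipelineResource's "activities" list (as in the
# pipeline sketch above) makes it run whenever that pipeline runs.
```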
Data flows
These are special types of activities that allow data engineers to develop data transformation logic visually, without writing code. They are executed inside the Azure Data Factory pipeline on an Azure Databricks cluster for scaled-out processing using Spark. Azure Data Factory controls all of the data flow execution and code translation, so data flows handle large amounts of data with ease.
Datasets
For the movement or transformation of data, you need to specify the configuration of the data itself. A dataset does exactly that: it can point to a database table, a file name, a folder structure, etc. Moreover, every dataset refers to a linked service.
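Here is a minimal sketch of registering a dataset that points at a blob file, again using the Python SDK. The linked service name "AzureStorageLinkedService", the folder path, and the file name are hypothetical placeholders; adf_client, RESOURCE_GROUP and FACTORY_NAME are the ones set up in the pipeline sketch above.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

# A dataset points at concrete data (here, a CSV file in a blob container)
# and always refers to a linked service that knows how to reach the store.
input_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="AzureStorageLinkedService",
        ),
        folder_path="adfdemo/input",
        file_name="raw_data.csv",
    )
)

adf_client.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "InputDataset", input_dataset
)
```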
Linked Services
As a pre-requisite step, before connecting to a data source (e.g. SQL Server) we create a connection string. Linked services act much like connection strings: they contain the information about data sources and services that Azure Data Factory needs in order to connect to them. The requesting identity then connects to a data source using this linked service/connection string.
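The sketch below registers a linked service for an Azure Storage account with the Python SDK; the connection string value is a placeholder, and the linked service name matches the one assumed in the dataset sketch above.

```python
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

# A linked service holds the connection information -- much like a
# connection string. Replace the value below with your own account details.
storage_linked_service = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)

adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "AzureStorageLinkedService", storage_linked_service
)
```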
Integration Runtimes
Integration Runtimes provide the compute infrastructure on which activities are executed. You can choose from three types (a small sketch of registering a self-hosted integration runtime follows this list):
- Azure IR
This provides fully managed, serverless compute in Azure. It performs the movement and transformation of data between cloud data stores.
- Self-hosted IR
This is needed to run activities between cloud data stores and a data store residing in a private network, such as an on-premises environment.
- Azure-SSIS IR
This is primarily required to natively execute SQL Server Integration Services (SSIS) packages.
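As a rough illustration, here is how a self-hosted integration runtime might be registered with the Python SDK; the runtime name and description are hypothetical, and you would still install the runtime software on a machine inside your private network and register it with the authentication key.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Register a self-hosted integration runtime in the factory.
self_hosted_ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Runtime for data stores in our private network"
    )
)

adf_client.integration_runtimes.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "SelfHostedIR", self_hosted_ir
)

# The key printed here is used when installing the runtime software on a
# machine inside the private network, which completes the registration.
keys = adf_client.integration_runtimes.list_auth_keys(
    RESOURCE_GROUP, FACTORY_NAME, "SelfHostedIR"
)
print(keys.auth_key1)
```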
Triggers
Triggers execute your pipelines at a pre-defined schedule, without any manual intervention. You can set configuration settings such as the start/end date, execution frequency, time of execution, etc.
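Below is a minimal sketch of a schedule trigger with the Python SDK, assuming the "DemoPipeline" from the pipeline sketch above; the trigger name and recurrence are placeholders, and the begin_start call reflects newer versions of the SDK (older versions expose triggers.start instead).

```python
from datetime import datetime

from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

# Run "DemoPipeline" once a day, starting now.
daily_trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime.utcnow(),
            time_zone="UTC",
        ),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="DemoPipeline"
                ),
                parameters={},
            )
        ],
    )
)

adf_client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "DailyTrigger", daily_trigger
)

# Triggers are created in a stopped state; start it so it begins firing.
adf_client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "DailyTrigger").result()
```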
Summary
In this blog, we went over the different components of Azure Data Factory. The diagram below represents the relationship between the different components of the Azure Data Factory at a high level as explained on the official Microsoft website.
In summary, pipelines are created to execute multiple activities. For the activities involved in data transformation or data movement, datasets define the input and output data. Data sources and services are connected with the help of linked services. Integration runtimes provide the infrastructure and determine where the activities are executed. Once the pipeline and the activities inside it are complete, we add triggers to execute the pipeline automatically at a specific time or event.
Now that we have understood the Azure Data Factory components in theory, we can make things HAPPEN!
Stay tuned for the next blog, in which we shall be using these components. If you have any questions or queries, feel free to type them in the comment box below, or do not hesitate to reach out to us!