Azure Data Factory Overview

Azure Data Factory (ADF) is a fully managed, cloud-based data integration service that allows you to create, schedule, and orchestrate data workflows.
It is commonly used for ETL / ELT pipelines, enabling data movement and transformation across different data sources at scale.

Resource Group

A Resource Group is a logical container in Azure that holds related resources for an Azure solution.

For an Azure Data Factory solution, the resource group typically contains:

  • Azure Data Factory instance
  • Storage accounts (e.g. Blob Storage, Data Lake)
  • Azure SQL / Synapse resources
  • Networking and security configurations

Using resource groups helps with lifecycle management, access control, and cost tracking.
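
Because every resource belongs to exactly one resource group, the group name is part of each resource's Azure Resource Manager ID. The sketch below builds that ID for a Data Factory instance. The subscription ID, group name, and factory name are placeholder values chosen for illustration.

```python
def data_factory_resource_id(subscription_id: str, resource_group: str, factory_name: str) -> str:
    """Build the Azure Resource Manager ID of a Data Factory instance."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
    )


# Placeholder values, not taken from a real deployment.
factory_id = data_factory_resource_id(
    "00000000-0000-0000-0000-000000000000", "rg-data-platform", "adf-demo"
)
print(factory_id)
```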

Top-Level Concepts in Azure Data Factory

Azure Data Factory is built around several core components:

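  • Pipelines – Logical groupings of activities that together perform a task
  • Activities – Individual processing steps within a pipeline (e.g. copy, transform)
  • Linked Services – Connection definitions for external data stores and compute
  • Datasets – Named references to the data used as activity inputs and outputs
  • Data Flows – Visually designed transformation logic that runs on Spark clusters managed by ADF
  • Integration Runtimes – Compute infrastructure for data movement and transformation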

Pipelines and Activities

A pipeline is a container for one or more activities that together perform a workflow.

An activity defines a specific action, such as:

  • Copying data
  • Executing a data flow
  • Running a stored procedure
  • Calling an external service

Pipelines support control flow, including:

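  • Conditional logic (If Condition and Switch activities)
  • Loops (ForEach and Until activities)
  • Error handling (activity dependencies on Succeeded, Failed, Skipped, or Completed outcomes)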
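
As a rough illustration, the sketch below assembles a pipeline definition as a Python dictionary shaped like the JSON that ADF stores. It contains a Copy activity and a second activity that runs only if the copy fails. All names (IngestSales, CopySalesFiles, the dataset references, the alert URL) are hypothetical.

```python
import json

# Hypothetical Copy activity; the dataset names are placeholders.
copy_activity = {
    "name": "CopySalesFiles",
    "type": "Copy",
    "inputs": [{"referenceName": "SalesBlobDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SalesSqlDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "AzureSqlSink"},
    },
}

# Error handling: this activity depends on the copy step having failed.
notify_on_failure = {
    "name": "NotifyOnFailure",
    "type": "WebActivity",
    "dependsOn": [
        {"activity": "CopySalesFiles", "dependencyConditions": ["Failed"]}
    ],
    "typeProperties": {
        "method": "POST",
        "url": "https://example.com/alerts",  # placeholder endpoint
        "body": {"message": "CopySalesFiles failed"},
    },
}

# The pipeline is the container that groups both activities.
pipeline = {
    "name": "IngestSales",
    "properties": {"activities": [copy_activity, notify_on_failure]},
}

print(json.dumps(pipeline, indent=2))
```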

Linked Services and Datasets

Linked Services

A Linked Service is similar to a connection string.
It defines the connection information required for Azure Data Factory to connect to external resources such as databases, storage accounts, or SaaS services.

Examples:

  • Azure Blob Storage
  • Azure SQL Database
  • Amazon S3
  • On-premises SQL Server
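
A minimal sketch of what a linked service definition for Azure Blob Storage might look like, written as a Python dictionary and printed as JSON. The name SalesBlobStorage and the connection string placeholders are assumptions. Real deployments usually keep secrets out of the definition itself, for example by storing them in Azure Key Vault.

```python
import json

# Hypothetical linked service; <account-name> and <account-key> are placeholders.
linked_service = {
    "name": "SalesBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<account-name>;"
                "AccountKey=<account-key>;"
                "EndpointSuffix=core.windows.net"
            )
        },
    },
}

print(json.dumps(linked_service, indent=2))
```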

Datasets

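A Dataset is a named view of the data that activities read or write. It points to data inside the store that a linked service connects to; it does not hold the data itself.

A dataset can identify: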

  • Tables
  • Files
  • Folders
  • Documents

For example, an Azure Blob Storage dataset specifies:

  • Container name
  • Folder path
  • File format
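
The sketch below shows how those three pieces could appear in a dataset definition, again as a Python dictionary shaped like ADF's JSON. The file format is expressed through the dataset type (DelimitedText here), and the container and folder sit under the location property. The dataset name, linked service name, container, and folder are hypothetical.

```python
import json

# Hypothetical dataset describing CSV files in a blob container.
dataset = {
    "name": "SalesBlobDataset",
    "properties": {
        "type": "DelimitedText",  # file format
        "linkedServiceName": {
            "referenceName": "SalesBlobStorage",  # placeholder linked service name
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",          # container name
                "folderPath": "sales/2024",  # folder path
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(dataset, indent=2))
```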

Azure Blob Storage

Azure Blob Storage is Microsoft’s object storage solution for the cloud, optimized for storing large amounts of unstructured data.

Unstructured data includes:

  • Text files
  • Images
  • Videos
  • Binary data
  • Logs

Common Use Cases

Azure Blob Storage is designed for:

  1. Serving images or documents directly to browsers
  2. Storing files for distributed access
  3. Streaming video and audio
  4. Writing and storing log files
  5. Backup, restore, and disaster recovery
  6. Storing data for analytics and machine learning workloads
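
Each blob is addressable over HTTPS by storage account, container, and blob name. The sketch below builds such a URL for the public Azure cloud endpoint. The account, container, and blob names are made up, and whether a browser can actually fetch the blob depends on the container's access settings.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the HTTPS URL of a blob in the public Azure cloud."""
    # Keep "/" so virtual folder paths stay readable; percent-encode other unsafe characters.
    encoded_name = quote(blob_name, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{encoded_name}"


# Hypothetical account, container, and blob name.
url = blob_url("examplestorage", "images", "products/red chair.png")
print(url)
```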

Variables in Azure Data Factory

Pipeline variables are values that can be:

  • Defined at the pipeline level, with a type of String, Boolean, or Array
  • Modified during pipeline execution using the Set Variable and Append Variable activities

They are commonly used for:

  • Storing intermediate values
  • Controlling workflow logic
  • Tracking execution states
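
A small sketch of how variables might be declared and then updated inside a pipeline definition. The pipeline, variable, and activity names are hypothetical. The check at the end is a local sanity test written for this example, not something ADF provides.

```python
# Hypothetical pipeline that declares two variables and updates them.
pipeline = {
    "name": "TrackStatus",
    "properties": {
        "variables": {
            "runStatus": {"type": "String", "defaultValue": "pending"},
            "processedFiles": {"type": "Array", "defaultValue": []},
        },
        "activities": [
            {
                "name": "MarkRunning",
                "type": "SetVariable",
                "typeProperties": {"variableName": "runStatus", "value": "running"},
            },
            {
                "name": "RecordFile",
                "type": "AppendVariable",
                "dependsOn": [
                    {"activity": "MarkRunning", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"variableName": "processedFiles", "value": "sales.csv"},
            },
        ],
    },
}

# Local check: every variable an activity writes to must be declared on the pipeline.
declared = set(pipeline["properties"]["variables"])
used = {a["typeProperties"]["variableName"] for a in pipeline["properties"]["activities"]}
undeclared = sorted(used - declared)
print(undeclared)
```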

Parameters in Azure Data Factory

Pipeline parameters are values passed into a pipeline at runtime.

Key characteristics:

  • Defined at the pipeline level
  • Set when the run starts and read-only for the rest of the execution
  • Referenced in expressions as @pipeline().parameters.<name>
  • Used to make pipelines reusable

Common use cases include:

  • Passing file paths
  • Environment-specific values
  • Dataset configuration settings
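
The sketch below declares two parameters on a hypothetical pipeline and references them in an expression. The resolve_run_parameters helper is an assumption made for this example: a local pre-flight check one might run before triggering a pipeline, not ADF's own validation logic.

```python
# Hypothetical pipeline with one required parameter and one with a default.
pipeline = {
    "name": "IngestFile",
    "properties": {
        "parameters": {
            "sourceFolder": {"type": "String"},
            "environment": {"type": "String", "defaultValue": "dev"},
        },
        "variables": {"targetPath": {"type": "String"}},
        "activities": [
            {
                "name": "BuildTargetPath",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "targetPath",
                    "value": {
                        "value": "@concat(pipeline().parameters.environment, '/', "
                                 "pipeline().parameters.sourceFolder)",
                        "type": "Expression",
                    },
                },
            }
        ],
    },
}


def resolve_run_parameters(pipeline_def: dict, supplied: dict) -> dict:
    """Merge supplied values with declared defaults (local sketch only)."""
    declared = pipeline_def["properties"]["parameters"]
    unknown = sorted(set(supplied) - set(declared))
    if unknown:
        raise ValueError(f"undeclared parameters: {unknown}")
    resolved = {}
    for name, spec in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif "defaultValue" in spec:
            resolved[name] = spec["defaultValue"]
        else:
            raise ValueError(f"missing required parameter: {name}")
    return resolved


run_parameters = resolve_run_parameters(pipeline, {"sourceFolder": "sales/2024"})
print(run_parameters)
```

The same pipeline definition can then be reused for another folder or environment by supplying different values, without editing the pipeline itself.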

JSON Structure in ADF

Behind the Azure Data Factory UI, all pipelines, datasets, and linked services are stored as JSON definitions.

These JSON files describe:

  • Activity logic
  • Dependencies
  • Expressions
  • Parameters and variables

This enables:

  • Source control integration (Git)
  • CI/CD pipelines
  • Automated deployments
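
Since the definitions are plain JSON, ordinary tooling can read and check them, which is what makes source control and automated deployment practical. The sketch below serializes a hypothetical two-step pipeline to JSON text, as it might be committed to a repository, and then reads the activity dependencies back out of that text. The pipeline and activity names are made up, and dependency_map is a helper written for this example.

```python
import json

# Hypothetical two-step pipeline; Wait activities keep the example small.
pipeline = {
    "name": "TwoStepDemo",
    "properties": {
        "activities": [
            {
                "name": "Pause",
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 5},
            },
            {
                "name": "PauseAgain",
                "type": "Wait",
                "dependsOn": [
                    {"activity": "Pause", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"waitTimeInSeconds": 5},
            },
        ]
    },
}

# Serialize the definition as text, e.g. for a file tracked in Git.
definition_text = json.dumps(pipeline, indent=4)


def dependency_map(text: str) -> dict:
    """Return {activity name: [names it depends on]} from pipeline JSON text."""
    definition = json.loads(text)
    return {
        activity["name"]: [dep["activity"] for dep in activity.get("dependsOn", [])]
        for activity in definition["properties"]["activities"]
    }


dependencies = dependency_map(definition_text)
print(dependencies)
```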

Integration Runtime (IR)

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities.

It is responsible for:

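  • Data movement between data stores
  • Dispatching transformation activities to external compute services
  • Executing data flows
  • Running SSIS packages (Azure-SSIS IR only)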

Types of Integration Runtime

  • Azure IR – Fully managed, serverless compute that runs in Azure
  • Self-hosted IR – Installed on your own machines to reach on-premises or private-network data
  • Azure-SSIS IR – A managed cluster for running SSIS packages in Azure
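
A linked service chooses its integration runtime through the connectVia property. When the property is absent, the default Azure IR (AutoResolveIntegrationRuntime) is used. The sketch below shows a hypothetical on-premises SQL Server linked service routed through a self-hosted IR. The names OnPremSqlServer and SelfHostedIR, the connection string placeholders, and the runtime_for helper are all assumptions made for this example.

```python
DEFAULT_AZURE_IR = "AutoResolveIntegrationRuntime"

# Hypothetical linked service that reaches an on-premises database
# through a self-hosted integration runtime.
on_prem_sql = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "connectionString": "Server=<server>;Database=<database>;User ID=<user>;Password=<password>"
        },
    },
}

# Hypothetical cloud linked service with no explicit runtime.
cloud_blob = {
    "name": "SalesBlobStorage",
    "properties": {"type": "AzureBlobStorage", "typeProperties": {}},
}


def runtime_for(linked_service: dict) -> str:
    """Return the name of the integration runtime a linked service connects through."""
    reference = linked_service["properties"].get("connectVia")
    return reference["referenceName"] if reference else DEFAULT_AZURE_IR


print(runtime_for(on_prem_sql))
print(runtime_for(cloud_blob))
```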
