Data Factory v2 Primer – p1

Overview
In a recent project we were tasked with a standard scenario of moving data from one place to another in Azure. The specifics are:

  • The source data was in CSV format on Azure Storage (blob).
  • The destination was a Kimball-style data warehouse on Azure SQL Database (PaaS).
  • There was no significant ETL processing required beyond the following:
    • The source data was ‘vertical’ and needed to be pivoted.
    • There was a small set of cleansing rules that amounted to range checking on the data.
    • The dimensions within this warehouse were essentially static and predefined.

While there are about a million ways to accomplish this task, we chose Data Factory as our orchestration tool. In part one of this post we will describe the core concepts of Data Factory and provide an initial understanding of the workload’s purpose. In part two we will put the workload to work by moving some data!

Core Concepts
The Data Factory V2 (ADFv2) documentation resides at https://docs.microsoft.com/en-us/azure/data-factory/. ADFv2 is currently still in preview, and during the life of this project (since 10/2017) we have seen various quirks pop up, but we have seen our share of excellent feature additions as well. Nothing we have encountered during the preview period has given us reason to doubt the viability or stability of the workload. The issues were predominantly the kind of thing you come to expect, mitigate, and work around in an iterative environment. So basically: buyer beware…

Now let’s talk about some of the core concepts in ADFv2.

Activities – An activity is an atomic unit of work: a predefined widget that performs a single job. Activities rely on Linked Services and Datasets, which define the ‘where’ and the ‘what’ for an activity and are described next. Activities are broken into three categories (a sample activity definition follows the list):

  • Data Movement – Move data from point A (source) to point B (sink).
  • Data Transformation – Perform an action on the data. Remember, ADFv2 is an orchestrator, so here we are talking about things like ‘call a stored procedure to do X’.
  • Control – Standard control-of-flow elements such as ‘if’ blocks, ForEach loops, lookups of other data, etc.
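
To make the data movement category concrete, here is a minimal sketch of what a Copy activity looks like in a pipeline’s JSON. The dataset names (BlobExtractCsv, WarehouseStaging) are hypothetical placeholders rather than anything from our actual solution:

```json
{
    "name": "CopyExtractToStaging",
    "type": "Copy",
    "inputs": [
        { "referenceName": "BlobExtractCsv", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "WarehouseStaging", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink", "writeBatchSize": 10000 }
    }
}
```

The source and sink types tell the Copy activity how to read and write on each side; the input and output datasets (described below) tell it where.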

Linked Services – Linked Services are glorified connection strings. They define where data resides.
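
As an illustration, a linked service for the blob account holding our CSV files might look roughly like the sketch below; the name and connection string values are placeholders:

```json
{
    "name": "SourceBlobStorage",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
            }
        }
    }
}
```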

Datasets – Datasets define what data we are looking at, for the most part. You can have activities that don’t reference a dataset at all; we’ll give you an example of that in the description of our solution.
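
A dataset for the source CSV files, pointing back at the linked service sketched above, might look something like this (the folder path and format details are illustrative only):

```json
{
    "name": "BlobExtractCsv",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "SourceBlobStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "extracts/daily",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            }
        }
    }
}
```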

Pipeline – The Pipeline is the parent entity and ties all of the activities together to perform a unit of work.
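
Putting the pieces together, a pipeline is little more than a named list of activities. The sketch below chains the Copy activity from earlier with a stored procedure call; the WarehouseSqlDb linked service and the etl.LoadDailyFacts procedure are hypothetical names used only for illustration:

```json
{
    "name": "LoadWarehousePipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyExtractToStaging",
                "type": "Copy",
                "inputs": [ { "referenceName": "BlobExtractCsv", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "WarehouseStaging", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            },
            {
                "name": "PivotAndLoadFacts",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    { "activity": "CopyExtractToStaging", "dependencyConditions": [ "Succeeded" ] }
                ],
                "linkedServiceName": {
                    "referenceName": "WarehouseSqlDb",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "storedProcedureName": "etl.LoadDailyFacts"
                }
            }
        ]
    }
}
```

The dependsOn element is what ties the two activities together, so the stored procedure only fires once the copy succeeds.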

Triggers – Triggers provide the automation step that actually runs the pipeline. Think: scheduled task.
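
A schedule trigger that runs the pipeline sketched above once a day could be defined roughly like this (the name and start time are placeholders):

```json
{
    "name": "DailyLoadTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-03-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "LoadWarehousePipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```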

Integration Runtime – The integration runtime provides the underlying compute in which certain activities perform their work. Since we are primarily doing a data move within the cloud, we will be leveraging the ‘Azure Integration Runtime’.
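
When no runtime is specified, activities run on the default Azure Integration Runtime. A linked service can be pinned to a particular runtime with a connectVia block; a rough sketch (using placeholder connection string values and explicitly naming the default, auto-resolving Azure runtime) looks like this:

```json
{
    "name": "WarehouseSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        },
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;User ID=<user>;Password=<password>"
            }
        }
    }
}
```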

Portal Support
Historically building a data factory meant:

  • Provisioning the workload in the Azure portal.
  • Hand building some JSON files.
  • Publishing them with PowerShell.
  • Running pipelines via PowerShell.
  • Monitoring and testing the pipelines with more PowerShell commands.

Microsoft has recently released an ‘Author & Monitor’ portal that can be used for these purposes. You can find it by opening your provisioned Data Factory in the Azure portal and clicking ‘Author & Monitor’. The release of this tool has shifted our development paradigm slightly to:

  • Still hand-coding all of the JSON and publishing it with PowerShell.
  • Using the portal to test, trigger, and monitor the pipelines.

At this point it is probably a good idea for you to check out some additional documentation. See the reference links below. In the next post we will talk about how our team used this workload to import millions of rows daily into a data warehouse.

Reference Links

Part two – coming soon.
