By Carmel Eve, Software Engineer I
Exploring Azure Data Factory - Mapping Data Flows

As part of a recent project we did a lot of experimentation with the new Azure Data Factory feature: Mapping Data Flows. The tool is still in preview, and more functionality is sure to be in the pipeline, but I think it opens up a lot of really exciting possibilities for visualising and building up complex sequences of data transformations.

Example Data Flow.

A Data Flow is an activity in an ADF pipeline. You define a data source and can then apply a variety of transformations to that data. Currently the supported data sources are Azure Blob Storage, ADLS Gen1 and Gen2, Azure SQL Data Warehouse and Azure SQL Database, with supported file types of CSV or Parquet. However, support for many other sources and file types is imminent. Once you have chosen a data source, you can filter rows, create derived columns, conditionally split and join sources, perform aggregations over the data and more. You can also extend the transformation options by building your own custom transformation in either Java or Scala.

Example of adding a transform in the Data Flow.
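To give a feel for how these transformations chain together, here is a minimal sketch in plain Python. It is not Data Flow code and does not use any ADF API; the rows, column names and helper functions are all hypothetical, and it only mirrors the idea of a filter followed by a derived column and an aggregation.

```python
from collections import defaultdict

# Hypothetical sample rows; the column names and values are illustrative only.
rows = [
    {"site": "A", "evaluation_type": "SAFP", "score": 10},
    {"site": "A", "evaluation_type": "FULL", "score": 7},
    {"site": "B", "evaluation_type": "FULL", "score": 4},
    {"site": "B", "evaluation_type": "FULL", "score": 9},
]


def filter_rows(rows, predicate):
    """Filter step: keep only the rows for which the predicate is true."""
    return [r for r in rows if predicate(r)]


def derive_column(rows, name, fn):
    """Derived column step: add a new column computed from each row."""
    return [{**r, name: fn(r)} for r in rows]


def aggregate(rows, key, column):
    """Aggregate step: sum `column` for each distinct value of `key`."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[column]
    return dict(totals)


# Chain the steps in the same order you would lay them out on the canvas.
filtered = filter_rows(rows, lambda r: r["evaluation_type"] != "SAFP")
derived = derive_column(filtered, "passed", lambda r: r["score"] >= 5)
totals = aggregate(derived, "site", "score")

print(totals)
```

Each step takes rows in and hands rows on, which is roughly why it is easy to insert, remove or reorder transformations in a flow.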

By default, Data Flow uses partitioning strategies which are optimised for your workload when running the flow as part of an ADF activity. However, you can also choose from a list of partitioning strategies, some of which may increase performance. The partitioning strategy can be edited for each individual step in the data flow, so some experimentation is needed to find the best combination for a specific workload.

Data Flow partitioning strategies.
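As a rough illustration of why the choice of strategy matters, the sketch below contrasts two common approaches, round robin and hash partitioning, in plain Python. That these correspond to the options offered in the tool is an assumption on my part, and the function names and sample rows are hypothetical.

```python
import zlib


def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn, giving evenly sized partitions."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send each row to a partition chosen by hashing the value of `key`.

    Rows sharing a key value always land in the same partition.
    crc32 is used rather than hash() so the result is stable between runs.
    """
    parts = [[] for _ in range(n)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[idx].append(row)
    return parts


# Hypothetical rows: six ids alternating between two sites.
rows = [{"id": i, "site": "AB"[i % 2]} for i in range(6)]

rr = round_robin_partition(rows, 3)
hp = hash_partition(rows, 3, "site")

print([len(p) for p in rr])
print([sorted({r["site"] for r in p}) for p in hp])
```

Round robin balances the partition sizes regardless of content, whereas hashing keeps related rows together (useful before a join or aggregation) at the risk of uneven partitions when a few keys dominate.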

There is also good debugging support within Data Flows. You can run a set of sample data (of variable size) through the data flow and see the effect of your transforms. This means that as you author a data flow, you can quickly gauge the effect of any transformations applied, and catch errors in the processing.

Showing the debug functionality.
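The idea behind this kind of debugging can be sketched in a few lines of Python: take a limited sample from the source, push it through each step, and record how many rows survive. This is only a conceptual stand-in, not how the Data Flow debugger is implemented; `debug_run`, the step names and the data are all hypothetical.

```python
from itertools import islice


def debug_run(source, steps, sample_size):
    """Run the first `sample_size` rows through each step, recording row counts."""
    sample = list(islice(source, sample_size))
    trace = [("source", len(sample))]
    for name, step in steps:
        sample = step(sample)
        trace.append((name, len(sample)))
    return sample, trace


# Hypothetical source of 1000 rows; only the sampled rows are ever read.
source = ({"id": i, "score": i * 3 % 10} for i in range(1000))

steps = [
    ("keep_high_scores", lambda rows: [r for r in rows if r["score"] >= 5]),
    ("add_grade", lambda rows: [
        {**r, "grade": "A" if r["score"] >= 8 else "B"} for r in rows
    ]),
]

preview, trace = debug_run(source, steps, sample_size=10)

for name, count in trace:
    print(name, count)
```

Seeing the row count change after each step is often enough to spot a filter that is too aggressive or a join that is dropping rows.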

You can also use the debugger to see the effect of the individual transformations used when filtering/creating new columns. For example, filtering rows based on:

notEquals(evaluation_type, 'SAFP')

produces the following debug output inside the filter expression editing pane:

Showing output of filter expressions in the editing pane.
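For readers less familiar with the expression language, the filter expression above behaves roughly like the Python predicate below. This is a sketch under assumptions: `not_equals` is a hypothetical stand-in, the rows and the 'FULL' value are made up, and null handling is ignored.

```python
def not_equals(left, right):
    """Python stand-in for the notEquals(...) filter expression."""
    return left != right


# Hypothetical rows; only evaluation_type matters for the filter.
rows = [
    {"id": 1, "evaluation_type": "SAFP"},
    {"id": 2, "evaluation_type": "FULL"},
    {"id": 3, "evaluation_type": "SAFP"},
]

# Per-row result of the expression, similar in spirit to the debug preview.
preview = [(r["id"], not_equals(r["evaluation_type"], "SAFP")) for r in rows]

# Rows that the filter step would pass downstream.
kept = [r for r in rows if not_equals(r["evaluation_type"], "SAFP")]

print(preview)
print(kept)
```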

Finally, once you have finished authoring a Data Flow, you are then able to export the representative code. This means Data Flows can be re-used and deployed as necessary. E.g. the first two transformations in the above Data Flow are represented like this:

Areas for improvement

We encountered a couple of issues during our experimentation. The first is that currently, the Data Flow is run on a Databricks cluster which is spun up for you when you run the activity in the ADF pipeline. This means that a significant portion of the activity duration is spent spinning up a new Databricks cluster, especially at smaller workloads. Hopefully once the feature moves out of preview, the option will be available to run the Data Flows in a BYOC (Bring Your Own Cluster) mode, enabling the use of already-warm interactive clusters, which would significantly reduce the execution time.
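To see why the spin-up cost hurts most at smaller workloads, consider the share of the total activity duration that is spent waiting for the cluster. The numbers below are purely hypothetical and are not measurements from our experiments; they only illustrate the arithmetic.

```python
def startup_share(startup_seconds, processing_seconds):
    """Fraction of the total activity duration spent starting the cluster."""
    total = startup_seconds + processing_seconds
    return startup_seconds / total


# Hypothetical figures: a fixed 5 minute spin-up with two workload sizes.
small = startup_share(startup_seconds=300, processing_seconds=60)
large = startup_share(startup_seconds=300, processing_seconds=3600)

print(f"small workload: {small:.0%} of the run is spin-up")
print(f"large workload: {large:.0%} of the run is spin-up")
```

Because the spin-up time is roughly fixed, its share shrinks as the processing time grows, which is why an already-warm cluster would help short jobs the most.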

Secondly, there is currently not adequate monitoring information around either performance or pricing. The timings and row counts given at the end of a successful data flow activity seem inconsistent, and there is no apparent way of retrieving the cost of each job run. This makes it difficult to identify bottlenecks and the more expensive sections of the Data Flow. However, again, this feature is only in preview, so hopefully there are improvements coming in this part of the tool.

Overall

Despite the room for improvement, I think this is a powerful technology for building up complex data manipulation pipelines out of simple data transformations. I hope to see it improve and grow as it matures, helping to streamline data transformation and ETL processes and, by extension, allowing improved data discovery and insights without the need for writing code!

Doodle of author capturing insight from Data Flows.

Carmel Eve

Carmel has recently graduated from our apprenticeship scheme.

Over the past four years she has been focused on delivering cloud-first solutions to a variety of problems. These have ranged from highly-performant serverless architectures, to web applications, to reporting and insight pipelines and data analytics engines. She has been involved in every aspect of the solutions built, from deployment, to data structures, to analysis, querying and UI, as well as non-functional concerns such as security and performance.

Throughout her apprenticeship, she has written many blogs, covering a huge range of topics. She has also given multiple talks focused on serverless architectures. The talks highlighted the benefits of a serverless approach, and delved into how to optimise the solutions in terms of performance and cost.

She is also passionate about diversity and inclusivity in tech. Last year, she became a STEM ambassador in her local community and is taking part in a local mentorship scheme. Through this work she hopes to be a part of positive change in the industry.

Carmel won "Apprentice Engineer of the Year" at the Computing Rising Star Awards 2019.