

In part 4 of this course Barry Smart, Director of Data and AI, walks through a demo showing how the new Task Flows feature in Microsoft Fabric can be applied. In this case, he retro-fits a Task Flow to an existing project, showing how it allows you to reinforce the specific architecture pattern you want to adopt and provides a means through which artefacts in a workspace can be better organised.

Barry begins the video by explaining the Medallion Architecture that has been adopted in the demo. He shows how a workaround has been applied by naming the artefacts in the workspace with a numerical prefix to help sort and group them according to the stages in the medallion architecture. He then demonstrates how the new Task Flow feature in Microsoft Fabric removes the need for this workaround.

This forms part of an end-to-end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.

Chapters:

  • 00:00 Introduction to the New Task Flow Feature
  • 00:16 Recap of the Titanic Diagnostic Analytics Series
  • 00:37 Understanding the Medallion Architecture
  • 01:53 Exploring the Workspace and Artifacts
  • 03:08 Introducing Task Flows
  • 04:09 Configuring Task Flows for Titanic Analytics
  • 05:10 Organizing Artifacts with Task Flows
  • 06:48 Utilizing Folders for Better Organization
  • 08:45 Final Thoughts and Next Steps

From Descriptive to Predictive Analytics with Microsoft Fabric:

Microsoft Fabric End to End Demo Series:

Microsoft Fabric First Impressions:

Decision Maker's Guide to Microsoft Fabric

and find all the rest of our content here.

Transcript

Hello fellow Fabricators!

In this video, we're going to show you how you can adopt the new Task Flow feature in Fabric to apply reference architectures and better organise the artefacts in your workspace. This is part four of the Titanic Diagnostic Analytics video series.

As a bit of a recap, we are building a data product with the purpose of creating a Power BI report that allows users to interactively explore the passenger data from the Titanic to understand patterns in survival rates. This data product adopts the Medallion Architecture pattern, which has gained traction in the industry. The bronze, silver, gold terminology provides a clear common language that we can all understand and adopt, where:

  • Firstly, bronze is the area of the lake where we land the data in its raw form.
  • Silver is then the area of the lake where we write data in a standardised format such as Delta Lake, having cleaned up and enriched the data through processes such as master data management.
  • Finally, gold is where we write the data in a specific shape and granularity to meet the needs of a specific use case. This enables us to deliver value from data by generating actionable insights.
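
To make those three layers a bit more concrete, here is a minimal PySpark sketch of how data might move from bronze to silver to gold in a Fabric notebook. The file path, table names, column names and cleansing rules are illustrative assumptions, not the actual Titanic solution built in this series.

```python
# Minimal medallion sketch - paths, table and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the data in its raw form.
raw = spark.read.option("header", True).csv("Files/bronze/titanic/passengers.csv")

# Silver: standardise names and types, drop obviously bad rows, persist as Delta.
silver = (
    raw.withColumnRenamed("PassengerId", "passenger_id")
       .withColumn("Age", F.col("Age").cast("double"))
       .withColumn("Survived", F.col("Survived").cast("int"))
       .dropna(subset=["passenger_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_passengers")

# Gold: project the shape a specific use case needs - here a simple summary
# of survival rates by passenger class for reporting.
gold = (
    spark.table("silver_passengers")
         .groupBy("Pclass")
         .agg(F.avg("Survived").alias("survival_rate"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold_survival_by_class")
```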

Now, in our example, we are projecting a simple dimensional model to gold to enable diagnostic analysis of the data in Power BI. But from the same source data in the silver layer you could also project a completely different data set into gold, for example a flattened version of the data that makes it suitable for training a machine learning model. Well, let's jump over and have a look at the workspace and the artefacts we've created to support our Titanic diagnostic analytics use case. Now you can see there's a diverse mix of artefacts that we've visited in detail throughout this video series.
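
To illustrate that point, the sketch below shows two different gold projections from the same silver source: a simple dimensional model for Power BI, and a flattened feature table for machine learning. It reuses the assumed table and column names from the earlier sketch, so again it is illustrative rather than the exact code from the project.

```python
# Two gold projections from one silver table - names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
passengers = spark.table("silver_passengers")

# Projection 1: a simple dimensional model for diagnostic analysis in Power BI.
dim_class = passengers.select("Pclass").distinct()
fact_passenger = passengers.select("passenger_id", "Pclass", "Survived", "Fare")
dim_class.write.format("delta").mode("overwrite").saveAsTable("gold_dim_class")
fact_passenger.write.format("delta").mode("overwrite").saveAsTable("gold_fact_passenger")

# Projection 2: a flattened, numeric feature table suitable for training an ML model.
features = passengers.select(
    F.col("Survived").alias("label"),
    F.col("Pclass").cast("int").alias("pclass"),
    F.col("Age").alias("age"),
    F.col("Fare").cast("double").alias("fare"),
    (F.col("Sex") == "female").cast("int").alias("is_female"),
)
features.write.format("delta").mode("overwrite").saveAsTable("gold_ml_features")
```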

But just as a recap, we've got:

  • Lakehouses that give us persistent storage for both file and tabular data, enabling us to work with any form of data, whether that be structured, semi-structured or unstructured.
  • We've also got data pipelines to orchestrate processing and do some of that heavy lifting, such as copying data from source systems.
  • And then we've also got notebooks, where we have written data wrangling logic to clean, enrich and prepare the data.
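
As a rough sketch of how those artefacts interact, a notebook can read raw semi-structured files from a lakehouse's file area and query the cleaned tabular data alongside them. The path and table name below are assumptions for illustration only.

```python
# Illustrative only - the file path and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A lakehouse stores raw files (structured, semi-structured or unstructured)...
raw_json = spark.read.json("Files/landing/passengers.json")

# ...alongside tabular data held as Delta tables, queryable like any other table.
silver_passengers = spark.table("silver_passengers")

# Notebooks then hold the wrangling logic that bridges the two.
print(raw_json.count(), silver_passengers.count())
```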

You can see that we've tried to reflect the medallion process by using a numeric prefix in naming these artefacts to group them and order them according to the stage of the medallion process that they are related to. But this feels like a bit of a workaround. The good news is that the Microsoft Fabric Product team have listened to end users and have developed a solution to this problem. They call it Task Flows.

You can access task flows from a workspace by either dragging down or clicking to expand this new area. You can build a task flow from first principles, but we can see here there is a set of predefined templates, and there's one specifically for the Medallion Architecture we're adopting in this project. So let's adopt that and see if it works for us. The default template for the Medallion Architecture looks like this. You can see it captures the end-to-end pattern as a directed graph: each node in the graph is called a task, and these tasks can be assigned a type.

For example, here we have a "get data" task, a "store data" task and a "visualise data" task.

I am going to adapt this to better reflect the specific solution we've developed for the Titanic Diagnostic Analytics project. Firstly, we don't have this split of high volume and low volume data, so we can simply rationalise that to one task in the task flow. Secondly, we're only serving the data up for the purposes of data visualisation so we can remove the ML serving task. I'm also going to rename some of the tasks to reflect the terminology we want to adopt in our target architecture. So I'm going to use the terminology "ingest to bronze", "process to silver" and "project to gold" for those ingestion and data preparation tasks. I'll also relabel the storage tasks to simply bronze, silver and gold.

Now, with the task flow configured for the specific solution we're developing here, I can click on the paperclip icon on each task and start to assign the artefacts in my workspace to that task. So first, let's allocate the ingestion pipeline to the "Ingest to Bronze" task. Then let's allocate the bronze lakehouse to the Bronze task. Then we will allocate the pipeline and notebooks associated with processing to silver to the "Process to Silver" task. Then allocate the silver lakehouse and its associated artefacts to the Silver task. Then allocate the pipeline and notebooks associated with projecting to gold to the "Project to Gold" task. Then move on to allocate the gold lakehouse to the Gold task. Finally, we'll allocate the Power BI semantic model and report to the "Data Visualisation" task.

Now, having done that, we're missing a home for our top level pipeline that orchestrates all of this solution. Well, that's no problem. We can add in a final task. We can use this general type of task, rename it, link it to the other ingestion and data preparation stages of the pipeline, and then add that overarching orchestration pipeline to that task. Now we've got a few artefacts left over that we can then organise into another new feature, which is folders.

So firstly, we have a number of utility notebooks that provide functionality that's reused across multiple stages of this medallion process, so we can create a utilities folder and place those notebooks in that folder. We've also got a notebook which is used to demo logging functionality in notebooks, so let's create a demo folder and drop that in there. The remaining artefact is an environment, which we've set up to import the specific packages used by this project. So create an environments folder and drop that in there.
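
As an aside on how those utility notebooks are typically reused, one common approach in Fabric is to keep shared helpers in a utility notebook and pull them into other notebooks with the %run magic. The notebook name and helper function below are hypothetical, purely to illustrate the idea.

```python
# Contents of a hypothetical utility notebook, e.g. "nb_utilities".
import logging

def get_logger(name: str) -> logging.Logger:
    """Return a logger with a consistent format for use across all medallion stages."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A consuming notebook could then call %run nb_utilities in its first cell and use get_logger("process_to_silver"), keeping logging behaviour consistent across the bronze, silver and gold stages.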

You can now see a task column in the workspace. It is now fully populated, and it's providing a nice, colour coded visual way of grouping our artefacts. And you can also see how the Task Flow graph itself can be used to explore the artefacts in your workspace: clicking on a specific task filters the list of artefacts below to show the artefacts assigned to that task. So it acts as a means of summarising the solution and then navigating to the specific tasks that deliver each element of that solution.

Now, at this point, we should probably rename all of our artefacts. We no longer need that numerical prefix to group and sort them; the task flow now performs that job for us. Perhaps we will do that later on. One other useful feature is the "add new item" button, which suggests relevant artefacts to adopt for that specific task type if you need to expand or evolve your solution. So here you can see the suggestions for the prepare data task are a notebook, a Spark job, a dataflow or a data pipeline.

So that's it for this video. We've shown how task flows enable an architecture pattern to be established for your project, then how they can be used to organise the artefacts in your workspace according to that pattern. In the next video, we are going to look at how we can introduce observability into Microsoft Fabric notebooks.

Please don't forget to hit like if you've enjoyed this video and please subscribe to our channel if you want to keep following our content.

Thanks very much for watching. Bye bye.