Microsoft Fabric Machine Learning Tutorial - Part 1 - Overview of the Course
In this video Barry Smart, Director of Data and AI, provides an overview of a new series that gives an end-to-end demo of Microsoft Fabric, focusing on a Predictive Analytics use case.
The demo will use the popular Kaggle Titanic data set to show off features across both the data engineering and data science experiences in Fabric. This will include Notebooks, Pipelines, Semantic Link, MLflow (Experiments and Models), Direct Lake mode for Power BI and of course the Lakehouse.
Barry explains some of the architecture principles that will be adopted during the demo including Medallion Architecture, DataOps practices and the Data Mesh principle of "data as a product". He explains how these patterns have been applied to shape the solution across data products that support Diagnostic Analytics and Predictive Analytics.
Chapters:
- 00:14 Overview of Titanic data set and the features we will demo
- 00:44 The data landscape is evolving at pace
- 01:22 DataOps is a key focus for this demo
- 01:51 Adopting a product mindset
- 02:29 Goal of demo is to create a model that predicts survival on the Titanic
- 03:02 We will use the machine learning lifecycle
- 03:35 We deliver key stages of the machine learning lifecycle using two data products
- 04:55 Medallion architecture for the diagnostic analytics data product
- 06:41 MLflow enabled architecture for the predictive analytics data product
- 07:23 How data products can be chained together to build capability
- 07:59 What's in the next episode?
From Descriptive to Predictive Analytics with Microsoft Fabric:
- Part 1 - Overview
- Part 2 - Data Validation with Great Expectations
- Part 3 - Testing Notebooks
- Part 4 - Task Flows
- Part 5 - Observability
Microsoft Fabric End to End Demo Series:
- Part 1 - Lakehouse & Medallion Architecture
- Part 2 - Plan and Architect a Data Project
- Part 3 - Ingest Data
- Part 4 - Creating a shortcut to ADLS Gen2 in Fabric
- Part 5 - Local OneLake Tools
- Part 6 - Role of the Silver Layer in the Medallion Architecture
- Part 7 - Processing Bronze to Silver using Fabric Notebooks
- Part 8 - Good Notebook Development Practices
Microsoft Fabric First Impressions:
- Decision Maker's Guide to Microsoft Fabric
Transcript
Hello everyone! Hello fellow Fabricators! Welcome to this first episode of a series of videos where we will be providing an end-to-end demo of Microsoft Fabric. For this demo we're going to be using the Titanic data set, which is openly available on platforms such as Kaggle. It's a really simple data set that's quite often used by Data Science students as an introduction to Machine Learning.
So it's perfect for this kind of demo. We're going to use the data to show off as many features as possible across both the Data Engineering and Data Science experiences in Microsoft Fabric. Now, Microsoft Fabric is relatively new. It was launched in May 2023, so it's about a year old, and it represents the latest generation of data platform. The SaaS-ification of the platform, combined with the introduction of intelligent agents into the developer experience, means that the cost of entry is lowering. And this means that organizations that may have lacked the budget or the skills to get into data now have an opportunity to leverage Fabric and become data-driven organizations.
And we're going to show you how this could look in this demo. As a SaaS platform, Fabric is also a solid foundation for DataOps, and by that we simply mean DevOps for data projects. These non-functional concerns enable data teams to release new features rapidly and safely, which lowers the cost of ownership and allows the team to sustain value over the long term. So we're going to demo some of these DataOps practices throughout this series. Fabric also enables you to adopt a product mindset. It means that data teams can operate as an innovation engine, discovering small but high-impact Data Products that they can deliver rapidly into their organization. The term Data Product is also one of the core principles of Data Mesh, a concept around delivering data-driven value at scale. Through this series of demos, we're going to show you how you can weave that Data Product principle into Microsoft Fabric solutions.
The end goal of this demo is to train a Machine Learning model that's capable of predicting whether an individual would have survived or perished on the Titanic. So, we're going to pass personal details about an individual into the model, such as their age, their gender, the class of ticket they purchased, and the size of the group that they travelled with, and in return the model will provide a binary result indicating simply whether that individual would have survived or perished.
So it's a Binary classification problem. To deliver this, we're going to use the Machine Learning Lifecycle, which is really just a specialized version of the Product Lifecycle I showed you earlier: lots of loops, iterations and uncertainty along the way. And actually, that's the reality with many Machine Learning projects; they need to fail fast when the data or the model itself is not capable of meeting the requirements. But in this case, we've got something that can deliver relatively good accuracy, certainly good enough to build a demo with, so we're good to go. And we're going to deliver this project as two Data Products. The first will perform the Data Engineering to support diagnostic analysis of the core passenger data from the Titanic. That's going to allow us to step back in time, to analyze what happened on that fateful night when the Titanic hit an iceberg and sank, and it'll help us to identify some of the factors that governed whether a passenger survived or perished.
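To make the shape of that task concrete, here is a minimal sketch of the kind of binary classifier involved, using scikit-learn on the standard Kaggle Titanic columns. The file name, feature engineering and model choice are illustrative assumptions, not the exact code used later in the series.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a prepared passenger table ("titanic_passengers.csv" is a placeholder).
df = pd.read_csv("titanic_passengers.csv")

# Derive the travelling-group size from the standard Kaggle columns
# (siblings/spouses + parents/children + the passenger themselves).
df["GroupSize"] = df["SibSp"] + df["Parch"] + 1
df = df.dropna(subset=["Age"])  # keep the sketch simple: drop missing ages

# Features mentioned in the video: age, gender, ticket class, group size.
X = pd.get_dummies(df[["Age", "Sex", "Pclass", "GroupSize"]], columns=["Sex"])
y = df["Survived"]  # 1 = survived, 0 = perished

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# The model returns a binary result: survived (1) or perished (0).
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```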
And we're going to use the output from that first Data Product as the input to the second Data Product, where we're going to use that data to train a Machine Learning model that can perform Predictive Analytics. In other words, we're going to pass it some new information about a passenger and ask it: what if this person had been on board the Titanic? Would they have survived? This type of predictive capability delivers significant value to organizations who apply it in areas such as qualifying sales leads, identifying customers who are at risk of leaving, or flagging fraudulent transactions, for example.
Let's drill into these two Data Products in a bit more depth so you can understand what's going on under the hood. The first Data Product is all about diagnostic analytics. Here we're going to use the Medallion Architecture pattern. This is a simple concept that allows us to apply a common language across these types of data pipelines. It describes the journey we're going to take the data on through Bronze, into Silver, into Gold, adding value at each step in the process. And in this case we're going to ingest two data sources. The first is the core passenger data for the Titanic, and we're going to ingest that via SFTP to simulate a common method for ingesting from operational platforms.
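As a rough illustration of that ingestion step, here is a sketch using the paramiko library to land a file in the lakehouse Files area. The host, credentials and paths are all placeholders, and in practice a Fabric pipeline copy activity could do this job instead.

```python
# Sketch of landing a file from SFTP into the lakehouse "bronze" Files area.
# Host, credentials and paths are placeholders; in a real solution the
# secrets would come from a secure store, not be hard-coded.
import paramiko

HOST, PORT = "sftp.example.com", 22
USER, PASSWORD = "demo", "***"

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

# "/lakehouse/default/Files/..." is how an attached lakehouse is mounted
# in a Fabric notebook session.
sftp.get("/outbound/titanic.csv", "/lakehouse/default/Files/bronze/titanic.csv")

sftp.close()
transport.close()
```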
The second data source is a spreadsheet hosted on SharePoint, which contains a small but important set of data about the locations where passengers boarded the Titanic. And this is again a common pattern that we see: you've got large operational data sets, but they're often supplemented by reference data sets like this, often maintained by people in the organization in SharePoint lists, Excel spreadsheets or Dataverse tables. So bringing these two data sets in and showing how we can integrate them makes this kind of demo a bit more real. Then, having ingested that data, we're going to clean it and standardize it so we can promote it into Silver. And then we're going to apply a technique called Dimensional Modelling to the data so we can project some tables into Gold.
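To give a feel for what that Bronze-to-Silver-to-Gold journey might look like in a Fabric notebook, here is a minimal PySpark sketch. The table and column names are illustrative assumptions, and `spark` is the session that Fabric notebooks provide automatically.

```python
from pyspark.sql import functions as F

# Bronze -> Silver: clean and standardize the raw passenger data.
bronze = (
    spark.read.format("csv")
    .option("header", True)
    .load("Files/bronze/titanic.csv")
)

silver = (
    bronze
    .withColumn("Age", F.col("Age").cast("double"))
    .withColumn("Survived", F.col("Survived").cast("int"))
    .dropDuplicates(["PassengerId"])
)
silver.write.mode("overwrite").format("delta").saveAsTable("silver_passenger")

# Silver -> Gold: project a simple dimension table (Dimensional Modelling).
dim_class = (
    silver.select("Pclass").distinct()
    .withColumnRenamed("Pclass", "TicketClass")
)
dim_class.write.mode("overwrite").format("delta").saveAsTable("dim_ticket_class")
```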
Having got those Gold tables, we can then import that data into Power BI, where we'll build our Semantic Model. And finally, we'll layer on top of that the report pages and visualizations to create an interactive experience for users to explore the data. Now looking at the second Data Product, this will use the output from the previous Data Product, the Semantic Model, as its input. We'll show you tools that allow you to connect to the Semantic Model and explore the data in ways that allow you to develop a strategy for Machine Learning. Once we've identified that strategy through the exploratory data analysis, we can then put it into action to train, test and deploy a Machine Learning model. And throughout this Data Product, we're going to be using tools in Fabric that have been put there specifically to allow you to track the end-to-end lifecycle of a Machine Learning experiment.
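As a taste of those tools, here is a small sketch that reads a table from the Semantic Model using Semantic Link (the sempy package available in Fabric notebooks) and records a training run with MLflow. The dataset name, table name and metric value are placeholder assumptions.

```python
import mlflow
import sempy.fabric as fabric

# Pull a table from the Power BI semantic model into a DataFrame
# for exploratory data analysis; names are placeholders.
passengers = fabric.read_table("Titanic Semantic Model", "Passenger")

# Track the training run as an MLflow experiment, which surfaces in
# Fabric as an Experiment item.
mlflow.set_experiment("titanic-survival")
with mlflow.start_run():
    mlflow.log_param("model_type", "RandomForestClassifier")
    # ... train and evaluate the model here ...
    mlflow.log_metric("accuracy", 0.81)  # placeholder value
```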
So, putting it all together, we've got these two Data Products. The first gets us onto the first few steps of the data value chain by providing descriptive and diagnostic analysis. The second Data Product then builds on the first to provide Predictive Analytics. And this is a useful way to think about developing a data strategy: start small and simple, build confidence, and use that as a stepping stone to increasing levels of sophistication and value over time.
So that's all for this video. In the next video we're going to look at the early stages of the Medallion Architecture, specifically how we can apply validation to inbound data so that we can catch data quality issues as early as possible in the lifecycle. So please don't forget to hit like if you've enjoyed this video, and subscribe if you want to keep following our content. Thanks very much for watching. Cheerio!