
In part 3 of this course, Barry Smart, Director of Data and AI, walks through a demo showing how to apply a test-driven development approach to Microsoft Fabric notebooks. This allows you to establish a set of tests that can be automated, whilst also driving code that is clean, extensible, reusable and easy to understand.

He focuses on the notebook which applies the data wrangling steps to "project to gold", zooming in on the logic used to clean and enrich the passenger data for the Titanic.

Barry splits this logic into 3 notebooks:

  • The first defines the functionality as a series of discrete data wrangling functions, wrapped up in a Titanic Wrangler class. He uses the Pandas pipe method to chain these individual functions together to perform all of the tasks necessary to clean and enrich the passenger data.
  • The second notebook tests this functionality using the Arrange, Act, Assert (AAA) pattern.
  • The final notebook puts this functionality to use as part of the wider "project to gold" process, which projects a fact table and a set of dimension tables to the Gold area of the lake in Delta format.

Barry begins the video by explaining the architecture that is being adopted in the demo, including Medallion Architecture and DataOps practices. He explains how these patterns have been applied to create a data product that provides Diagnostic Analytics of the Titanic data set. This forms part of an end-to-end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.

Chapters:

  • 00:00 Introduction and Video Overview
  • 00:46 Project to Gold Pipeline
  • 01:08 Benefits and Pitfalls of Notebooks
  • 04:20 Addressing Notebook Pitfalls with DataOps
  • 05:12 Test-Driven Development in Data Engineering
  • 08:03 Implementing the Titanic Wrangler Class
  • 09:17 Testing the Titanic Wrangler Class
  • 10:52 Running the Code in Production
  • 12:15 Conclusion and Next Steps

Related content:

  • From Descriptive to Predictive Analytics with Microsoft Fabric
  • Microsoft Fabric End to End Demo Series
  • Microsoft Fabric First Impressions
  • Decision Maker's Guide to Microsoft Fabric

Transcript

Hello fellow Fabricators! In this video we're going to explore how we can test the data engineering functionality we develop in Microsoft Fabric notebooks.

This is part three of the Titanic Diagnostic Analytics video series. As a recap, our objective in the series is to build a data product that adopts a medallion architecture to ingest, process and project data by promoting it through the bronze, silver and gold areas of the data lake. The purpose of this data product is to create a Power BI report that allows users to interactively explore the passenger data from the Titanic, to understand patterns in survival rates.

Now, in this specific video we're going to look in more depth at the work we are doing to project data to the gold layer of the lake. This part of the process is orchestrated by a pipeline called Project to Gold. And if we look at that pipeline, we can see it's really simple: it has one notebook task that runs a Project to Gold notebook. Now, we love notebooks in Microsoft Fabric. They have a lot going for them. The interactive experience of the notebook allows you to write code in code cells, to run those cells, and to observe and capture the output from that code. You can also capture documentation alongside the code cells in the form of markdown. And through the use of headings in your markdown, you can help to organise your notebooks, because Fabric will generate a dynamic table of contents based on those headings.

Now, given the interactive nature of these notebooks and your ability to capture your thoughts alongside your code, notebooks are a great place to start to explore and experiment with data. And because notebooks run a series of cells from top to bottom, you can also view them as a means of orchestrating an end-to-end process, with each cell in the notebook defining a step in that process. Notebooks are just super flexible. You can use them for a range of tasks, from early-stage data exploration, to doing your production data wrangling, to training machine learning models. But given this flexibility, there are also some potential pitfalls with notebooks that you need to be aware of and to avoid. Notebooks that have been developed for the purpose of exploration and prototyping have a bit of a habit of being put into production without any refactoring to make them production ready.

Here are a few examples of what happens when you don't put the work in to apply good software engineering principles before you put a notebook into production. Firstly, notebooks can be very sensitive to changes in upstream data sources, such as schema changes or issues with data quality. Notebooks can also appear to the operational teams who look after them over the long term as a bit of a black box: they can do a huge amount of work, but there's very little observability. What we mean here is that there's no logging and no alerts out of the box with notebooks. Then, when problems do occur in a notebook, how do we detect them and apply an appropriate operational strategy? Do we retry the notebook and hope it will work if we try it again, or do we fail the whole process and wait for human intervention?

Another issue is that notebooks can grow organically over time, especially during the exploration phase of a project, and you can end up with a lot of code spread throughout the notebook that's difficult to unpick, understand and maintain. It's going to be difficult to pick up that notebook and make changes to it in the future if something needs to change. And finally, there's no formal testing in place to provide quality assurance. This is common, so we tend to have a low degree of confidence that the code in the notebook will function consistently and reliably over time.

Throughout this series, we're aiming to show how Fabric can deliver important DataOps principles. All of the pitfalls I've just talked about with notebooks can be addressed by applying these principles. And the good news is that there are methods and tools out there to help us with this. We just need to apply them!

In a previous video, we covered data validation, so please go back and have a look at that. We'll have a look at building observability into Fabric notebooks in a future video. So in this video, we're going to focus on how we can develop our code so that it is easier to test, and then how we can automate that testing. We'll also show you how the spin-off benefit of this is that we end up with reusable, clean code that's easier to maintain and evolve. To drive this approach, we will aim to adopt the principle of test-driven development. This is where you write the tests first and then write the code to make those tests pass. Now, in practice, in a data engineering project, the initial logic tends to be developed more informally: you're exploring the data and discovering the steps that are required to wrangle it into shape as you go through that exploration activity.

But this test-driven development philosophy can still be applied, because it will drive the subsequent refactoring of that code into more modular units of data wrangling functionality. That then allows us to apply a suite of tests to each of these units of functionality to prove they're working as intended. Now, in this project, to help us apply this approach, we've developed three separate notebooks: one to capture the data wrangling functionality, one to test this functionality, and a final one to integrate and apply that functionality in production.

Now, doing all of this costs time and effort, but we believe it pays dividends over the long term. You're going to get benefits that pay off time and time again over the lifetime of your data product: code that is cleaner and easier to maintain, code that's largely self-documenting (so when new people join the team they can get up to speed and running very quickly), and a testing process that can be automated.

By automating the tests in this way, you're taking the humans out of the loop, and you're able to bake in that quality assurance by default. Through one click, or indeed by triggering this process automatically as part of your CI/CD pipeline, you can gain confidence that your code is fully functional before putting even the smallest change into production. So, going back to the previous view of the pipeline, we'll see that it doesn't quite tell the complete picture.

This notebook has a dependency through the %run magic command: you can see that it's running another notebook called Titanic Wrangler. And we also know that there's another notebook, in another pipeline, called Titanic Wrangler Tests that also has a dependency on that notebook. This is how we apply those principles I talked about before: one notebook at the top, which defines the functionality; another notebook, on the right here, which tests that functionality and proves it works; then the final notebook, which puts that functionality into production as part of the wider data wrangling processing that we want to do. So let's step through these notebooks one by one to understand what's going on in a bit more depth.

So, starting with the Titanic Wrangler, this serves just one purpose: to define the Titanic Wrangler class. Inside this class, you can see we've broken down the individual steps involved in preparing the data into discrete functions, and each function follows the same pattern: it takes in a pandas DataFrame, performs some wrangling on it, and returns the transformed version of that DataFrame. This allows us to use the pipe method in pandas to chain those functions together. And this approach enhances code readability: we can simply review the chain of functions that are being called to understand what is happening, and in what order, to wrangle the data end to end.
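
To make that pattern concrete, here's a minimal sketch of the shape such a class takes. The method names and column logic are illustrative assumptions, not the exact code from the demo:

```python
import pandas as pd


class TitanicWrangler:
    """Sketch of the wrangling class: each step is a discrete function that
    takes a DataFrame and returns a transformed copy (names are assumed)."""

    @staticmethod
    def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
        # Normalise column names so downstream steps can rely on them.
        return df.rename(columns=str.lower)

    @staticmethod
    def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
        # Impute missing ages with the median so there are no gaps downstream.
        return df.assign(age=df["age"].fillna(df["age"].median()))

    @staticmethod
    def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
        # Engineer a new feature from the sibling/spouse and parent/child counts.
        return df.assign(family_size=df["sibsp"] + df["parch"] + 1)

    def prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        # Chain the discrete steps with pipe; the chain reads top to bottom
        # as documentation of the end-to-end wrangling process.
        return (
            df.pipe(self.clean_column_names)
              .pipe(self.fill_missing_ages)
              .pipe(self.add_family_size)
        )
```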

We have effectively achieved self-documenting code. And by encapsulating each transformation step in its own function, we've also made the code more modular and more maintainable; you can see how it would be easy to add additional steps into this modular structure that we've got. Essentially, it's also made the task of testing the code more straightforward, as we'll see. So let's jump over to that testing notebook to see how those tests are performed.

Here, you can see we've written a set of test methods to test some of the more complex methods in our Titanic Wrangler class. We've adopted the Arrange, Act, Assert pattern from mainstream software engineering, and this involves three distinct phases to each test. The first phase, arrange, sets up the necessary preconditions for the test; generally we're preparing the test data. The act phase then performs the action that you want to test. And finally, the assert phase verifies that the result of the act phase is as expected; it's verifying the outcome of the test. If that assert phase fails, it will fail the notebook, and that allows us to catch when tests are failing.
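
Here's a hedged sketch of what one such test might look like, using the illustrative fill_missing_ages method from the earlier sketch:

```python
import pandas as pd


def test_fill_missing_ages():
    # Arrange: a small DataFrame with a known gap in the age column.
    df = pd.DataFrame({"age": [10.0, None, 30.0]})

    # Act: run the single wrangling step under test.
    result = TitanicWrangler.fill_missing_ages(df)

    # Assert: the gap is filled with the median of the known ages (20.0).
    # A failed assert fails the cell, and therefore the notebook.
    assert result["age"].isna().sum() == 0
    assert result["age"].iloc[1] == 20.0


test_fill_missing_ages()
print("fill_missing_ages: passed")
```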

Now, there's a bit of work to do to set up these tests, but once they're written, we've got them forever, and we can run them any time we need, effectively for free. This could even be automated as part of your development life cycle. This investment in testing can form part of your wider regression testing strategy, protecting you from those small changes inadvertently breaking something significant, and from the embarrassment that follows.

So this is great stuff to have in place, and it's what we call establishing a pit of success: quality is effectively baked in by default. You don't have to think about applying it. Now, if we look at the final notebook, which runs this code in a production setting, we can see we're again using the %run magic command to gain access to that functionality we developed in the Titanic Wrangler class.

And you can see we're doing all of the other steps in this notebook to prepare the various tables we're projecting to gold. So, first of all, we load the passenger data from the silver area of the lake. Then, having loaded that data into a DataFrame, we pass that DataFrame to our prepare data method in the Titanic Wrangler class, which performs all of the steps required to clean and engineer new features in the passenger data.
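
Pulled together, the top of that production notebook might look something like the following sketch. The %run magic is Fabric's way of inlining one notebook into another, the silver table name is an assumption, and `spark` is the session Fabric provides in notebooks:

```python
# The class is pulled in with the %run magic in its own cell, e.g.:
#
#   %run Titanic Wrangler

# Load the passenger data from the silver area of the lake
# (the table name here is illustrative).
passengers_df = spark.read.table("silver_passengers").toPandas()

# One call runs the whole cleaning and feature-engineering chain;
# all of the detail sits behind the prepare_data method shown earlier.
prepared_df = TitanicWrangler().prepare_data(passengers_df)
```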

By moving all of that code into a separate notebook, you can see how we've abstracted away all of that detail. It makes this main notebook much cleaner and easier to understand. We can still go and look at that code to understand what's going on under the hood, but it keeps this main operational notebook really simple. You can see here that we then prepare the dimension tables. A lot of that's really simple; the most complex dimension table is the one where we dynamically build an age dimension spanning all of the ages in the data.
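
As an illustration of that step, here's a sketch of how an age dimension could be built dynamically; the column names and age band edges are assumptions, not the code from the demo:

```python
import pandas as pd

# Build an age dimension that spans every age present in the prepared data.
max_age = int(prepared_df["age"].max())
dim_age = pd.DataFrame({"age": range(0, max_age + 1)})

# Add a coarser age band so the report can group survival rates by it.
dim_age["age_band"] = pd.cut(
    dim_age["age"],
    bins=[-1, 12, 19, 39, 59, float("inf")],
    labels=["Child", "Teenager", "Adult", "Middle Aged", "Senior"],
).astype(str)
```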

And then, finally, we write all of the tables into the gold layer of the lake in Delta format.
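
A minimal sketch of that final step, assuming the pre-provisioned `spark` session in Fabric notebooks and illustrative table names, might be:

```python
# Write each table to the gold area of the lake in Delta format.
tables = {
    "fact_passenger": prepared_df,
    "dim_age": dim_age,
}

for name, pandas_df in tables.items():
    (
        spark.createDataFrame(pandas_df)
             .write.format("delta")
             .mode("overwrite")  # re-project gold from scratch on each run
             .saveAsTable(name)
    )
```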

That's it. Job done.

So we hope you enjoyed that. That's it for this video. In the next video, we're going to apply the new task flows feature in Fabric to help us better organise our workspace artefacts according to that medallion architecture we chose to adopt in this use case.

So that's coming soon. In the meantime, please don't forget to hit like.

And if you've enjoyed this video, please subscribe to our channel to keep following our content.

Thanks very much for watching. Bye bye.