Microsoft Fabric - Processing Bronze to Silver using Fabric Notebooks | endjin

Ed Freeman 14th December 2023

Tutorial

Notebooks in Fabric are a lot like notebook experiences in other tools you're probably already familiar with. They allow us to write code in a variety of languages and create a commentary alongside our code using interactive cells. Fabric notebooks also have built in integration with Lakehouses, and provide a built-in filesystem that can be used to store arbitrary files somehow associated with the notebook.

Notebooks are our go-to tool for wrangling Bronze data into Silver, which tends to follow standard processing patterns. In this video we'll do a whirlwind tour of notebooks in Microsoft Fabric, and take a look at how our "Process to Silver" notebook takes shape. The full transcript is available below.

The talk contains the following chapters:

00:00 Intro
00:10 Creating a blank notebook
01:04 Mounting a Lakehouse in a Fabric Notebook
02:05 Querying files in a Lakehouse from a Fabric Notebook
03:28 Query PySpark DataFrames directly with SQL
04:12 Built-in notebook filesystem
05:14 Walkthrough of "Process to Silver" notebook - reading data and transforming
08:18 Walkthrough of "Process to Silver" notebook - merging and writing data
10:03 Walkthrough of "Process to Silver" notebook - reading back written data
10:15 Roundup & Outro

Useful links:

📖 Microsoft Fabric Notebooks
📖 Connect Lakehouses and Fabric Notebooks
📖 Built-in Notebook filesystem
📖 Delta Lake Merge

Microsoft Fabric End to End Demo Series:

From Descriptive to Predictive Analytics with Microsoft Fabric:

Microsoft Fabric First Impressions:

Decision Maker's Guide to Microsoft Fabric

and find all the rest of our content here.

Transcript

Ed Freeman: In this video, we're going to take a look at how we can populate the Silver Lake house from our bronze data using Fabric Notebooks. Let's get going. So I'm over here in my Fabric end to end demo workspace, which we've been building upon so far. Some of these are existing artifacts that we've already created, some of them are ones that we're going to be covering in future videos.

But this video is all about notebooks and how we can get to silver in notebooks. So let's start by just creating a brand new notebook. What does that look like? we choose it from the vast array of artifact types that Fabric now gives us. It'll give your notebook a default name. First thing that I always do is click on this, the name up here.

And rename it, so let's say Fabric End to End New Notebook Demo. once you click off that, it will save and actually everything in the notebook by default auto saves. So you don't need to worry about constantly hitting save, every time that you make a change. Now, the notebook experience is largely similar to notebook experiences you'll have elsewhere, but you'll see a few things that are different.

And the first thing is this ability to add a lakehouse. So this is the ability to essentially mount a lakehouse into your notebook so that you can, drag and drop files and, and, it will auto generate some code for you. But also you can refer to it without using A, a database prefix for the default lake house that you, add to your notebook.

So, if we click add, we get two options, existing or new. I'm going to click existing here. And I'm going to click our bronze lake house that we've created in previous, videos. And it will automatically, detect what tables and files I have. So I don't have any tables in the, bronze lake house, but I do have files.

So, I can actually navigate to these files, keep on going down here, and once you get to where the files actually are, you'll see them here. Now what I can do is I can drag a file onto the notebook. And now I'm just going to collapse this to give it more space. But my file, my file can now be queried through this auto generated code.

So I'm going to run this and see what happens when we do nothing to the code whatsoever. Now One of the things about Fabric, Spark is that it's generally much, much quicker to spin up a Spark session. Now, if you're, if it's the first time you've run any sort of Spark job in a while, I've seen it take a few minutes still, like it would do in Azure Synapse.

But generally, if you've been running any Spark sessions anywhere else in your workspace or tenant, the, Spark runtime should, should start quite quickly. So this is 30 seconds down the line. The session has begun and the Spark jobs are now running. And we can see that the Spark job ran. okay, that's not great because actually the default generated code here defaults to having a header and we know that in our source data we don't have a header.

So, if I remove that option, that should then just give us some default column names. So that's great. Without having to actually write any code, we've just dragged it on a file onto our onto our notebook, and it's, returning us our data. So we can go one step further here if I just copy some code that I've got on a different screen as of Spark 3.4, I believe, we can now actually write bog standard SQL statements inside a Spark SQL over, um, over our core dataframe. So we just pass in our dataframe variables. Before, what you'd have had to do is pass a, or create a temporary view, for example, from your DataFrame in order to expose the, the DataFrame to the Spark SQL endpoint.

But now we can just use this syntax here to query DataFrames using SQL syntax out of the box, which I think is pretty cool. The other thing that notebooks give you in Fabric that is quite new from compared to elsewhere is this resources tab. So we've got these built in resources and essentially this is a built in file system that you can use which is tied to this specific notebook.

So you can put just temporary files in there if you wanted to write, read and write files. To and from the file system without having to write back to your one lake Then this is a built in file system that allows you to do all of that Now I won't go into too much detail for the rest of the notebook features, but do have a play around and see what you can get on all of these tabs As I said autosave is already on But we have a bunch of kind of helper, operations like adding to a, pipeline or running things on a schedule, for example.

So do take a look through. But let's for now forget about this new notebook and let's go to the ones I've created earlier. So generally when we're processing to silver, as we saw in the previous, the previous In this video, we follow a bunch of principles which mainly boils down to standardizing our data and providing structure for our data.

So what you tend to find is the silver notebooks that we write tend to follow a pretty similar pattern. So I've created a silver notebook here which does a few simple things to our source bronze data. This one is processing the price paid data, but I also have one for the ONS data. Let's just go through the price paid data.

So we start by loading packages. So this is just a PySpark notebook. We can use normal Python import statements to load packages that we want, to use in the notebook. So we're using Delta table here and you'll see why a bit further down. Then we load helper notebooks. So in the next episode we're going to talk about these helper notebooks a little bit more.

But this is where you can really abstract and encapsulate your code into different, different notebooks. As if you were building a normal code repository locally and storing different things, classes, functions, in different files. This is the way you, want to do things if you're working solely in notebooks.

So that you don't end up with a gigantic notebook which has everything in. So we've got a data reading helper, we've got a silver wrangler, and we've got, the silver schema file, table definitions stored in separate notebooks, i. e. separate files elsewhere. We read the data, and we display it.

We just want to see what the, data looks like. And then if we go further down, then we generally apply transformations. So these are not necessarily shaping the data, but they are column level, field level transformations to standardize the column names, their data types, for example, maybe add some helper, columns, and perform some simple cleaning, cleansing operations.

So what we're doing here is we're creating additional postcode columns. We're converting some coded values. into their, readable counterparts and we're adding a year column to enable partitioning. So if we add a year column to our data set, that means when we write the table back out, we can split things by year and organize things nicely in our Delta table.

And we're also trimming the U, UID column because it has those. Curly braces around the outside. So once we do that, we have a slightly different data frame which has, which has all of these transformations applied to it. And the re it's, the reason there's not much code to do that is because it's all buried away in these helper notebooks that we've written.

And that's why it's quite nice to have these helper notebooks, because in the top level notebook, which essentially just orchestrates the transformations that need to happen, everything's quite clean. So then what we want to do is we've got our transformed data set. We want to write that to our silver table.

So our price paid table, we're going to call it price paid. And then we use some, the Delta API to actually create the table if it doesn't exist. and if it does exist, we will retrieve that Delta table. Now we perform a merge. So this is a, something that you can do in Delta Lake. which is similar to what you can do in SQL languages, more traditional relational databases using kind of upsert statements or merge statements.

So this is just the Delta API or PySpark API way of doing that, but you can also write SQL statements that look like your normal T SQL mergers, for example. With some slight differences here and there. But what we're doing is we're having our, because we're upserting our data, because we're getting monthly files on top of the base complete file, we don't want to do a full overwrite every time, we just want to update what's been updated and add any new stuff and delete stuff that isn't there.

So all of these conditions essentially do that. So we're checking where the record status is C for, for updating, so changed, D for deleting, A for inserting or appending. And we're checking that the extract timestamp that we add to the file is newer, for the file that we're running than the one that's already in there, just in case we're running historical data, we don't want to accidentally overwrite it, if it shouldn't be overwritten.

And this is the, syntax to then perform an update using the PySpark Delta API. And then once we've done that, we can see the, the table, the data frame by querying the table directly. And this is what we get. Now that's a whistle stop tour of a, silver notebook in our Fabric end to end demo.

In the next video, we're going to take the opportunity to show some good practices when it comes to working with notebooks in Fabric. Like the abstracting things into separate helper notebooks that I was showing you earlier. Please don't forget to hit if you've enjoyed this video, and hit subscribe to keep following our content.

Thanks for watching.