Microsoft Fabric - Role of the Silver Layer in the Medallion Architecture | endjin

Ed Freeman 14th December 2023

Tutorial

The Silver layer is where we apply standardization to our source datasets. This standardization aligns field names across sources, applies common data cleaning operations and organizes the data into a well known structure. The Silver layer, which is generally stored at full-fidelity (i.e. the granularity of the data is the same in the Silver layer as it is in the Bronze layer) provides the foundation for the Gold layer, where we may have many projections of the Silver data in different forms and at different granularities, depending on the downstream need.

In this video we're going to talk about the role the Silver Lakehouse plays in the Medallion architecture, including what it's used for and whom it benefits. The full transcript is available below.

The talk contains the following chapters:

00:00 Intro
00:13 Architecture diagram recap
01:18 Silver Lakehouse principles
04:10 Why is the Silver Lakehouse useful?
04:38 Who is the Silver Lakehouse for?
05:48 Roundup & Outro

Useful links:

📖 Microsoft Fabric Medallion Architecture official docs

Microsoft Fabric End to End Demo Series:

From Descriptive to Predictive Analytics with Microsoft Fabric:

Microsoft Fabric First Impressions:

Decision Maker's Guide to Microsoft Fabric

and find all the rest of our content here.

Transcript

Ed Freeman: In this video, we're going to talk about the role the Silver Lake house plays in the Medallion architecture, including what it's used for and whom it benefits. So let's jump right in. So you'll recognize this architecture diagram that I've got over here on my screen. And what we've implemented so far is on the left hand side.

So, we've got our two data sources. We're ingesting our price paid data and our ONS postcode directory via a shortcut. And these are just in files in our Bronze Lake house. So we've not got any tables in our Bronze lake house, just the files. And then we, what the next stage is, we want to do is process to Silver.

So, that's going to be using a couple of notebooks, which I'll show in a, in the next video. but the output of those notebooks is populating the Silver lake house. And in the Silver lake house, we have two tables. We have price paid and we have postcode directory. So you'll notice that these directly align with our Bronze, data sources.

And this is quite common for the Silver Zone. We don't want to be doing too much shaping of our data. The main thing is we want to standardize those raw entities, and put them in some sort of structured and queryable format. So if I shift back to the slides right now the principles of the Silver Layer is we want to identify our key entities.

In our case, it was the postcode directory and the price paid entity. so this generally kind of mirrors the source data, like I just mentioned. But what we, other things we're doing in the Silver Zone, we're cleaning the data. So we're removing, kind of error values or replacing values which represent null with actual null values.

We're standardizing the data. for example, a schema, we want to change the column names. We want to apply data types such that the same column from one data source is called the same column, the same thing as the same column, or the logically same column from a different data source. So that downstream, when we eventually project to gold, we are working from a known set of fields and tables.

That we can then pick and choose what we want from each data source and put into our goal projections. So we want some level of standardization that we've applied to the dataset so that we know it can be easy to work with downstream. We might also be enriching data. So for example, if we've got master data values that need to be incorporated across data sources and systems, we can then incorporate at the Silver layer.

So again, we're building the fidelity of our raw data. But we're not doing too much shaping, reshaping of the data itself. So generally, if your data has come in a single denormalized big flat table, you probably don't want to do too much, too much reshaping of that or modelling on that pre-emptively.

That can happen in the gold zone. In the Silver zone, we just want to essentially make that data set that we've got in the input side as high fidelity as we can. that all might also include partitioning the data, and combining datasets. if we do have the same, the same entity that comes from two slightly different source systems, but they are almost identical, we might want to combine those into the same table, but what we'd still want to do is add a column or some sort of indicator of Which rows came from which data source?

So that can be achieved quite easily with partitioning in Delta tables and Spark tables. But the output is we have this more structured and queryable format in a Delta table so that we can switch to our SQL endpoint and start querying it using T SQL, for example. But it's just a much easier to consume table format.

That's the principles of the Silver layer, but you might still be asking why. Why are we doing this, and who are we doing this for? Why can't we just go straight to gold? the reason is, we want a version of the data that's stored at full fidelity in this standardised queryable form, where we've not necessarily modified the granularity of the data, we've not shaped it any way, all we've done is we've applied a level of gold plating to our, to our Silver, datasets, so that they're much nicer to consume downstream. And it might be that no one ever needs direct access to the Silver layer. Sometimes it's useful for ad hoc analysis, but often times the main benefit, the main beneficiary of the Silver Layer is you as a data engineer.

It's putting your source datasets into a structured, well known, standardized format that you can then build on for your goal projections. The goal projections, which may change the granularity, it may do some aggregations and group buys and only select a subset of columns. But the thing is, all of the data that you might need is there in the Silver zone, ready for when you need to, you potentially need to use it in the future.

That's not to say you need to bring everything into your Silver Zone, there's always a compromise. Don't ingest much more data than you know you might need in the future. Just ingest what you can for now. And then with things like Schema Evolution in Delta Lake, you can evolve that over time if you do need to add more columns.

That's really who it's for and, why we need, or, it's, good practice to have the Silver Layer in our medallion architecture. that's really it for this video. In the next video we're going to take a quick look at the notebooks in Fabric. And walkthrough processing Bronze data into the Silver Zone using those notebooks. Please don't forget to hit and subscribe if you enjoyed this video. We'll see you soon.