Ed Freeman

Tutorial

In this video Ed Freeman looks at the Lakehouse and Medallion Architecture concepts, and how they relate to Microsoft Fabric.

Lakehouses have become a common foundational design pattern for modern data platforms, rising in popularity over the last few years as tools like Databricks, and now Microsoft Fabric, have standardized on the design. The Medallion Architecture refers to a three-tiered architectural blueprint in which data moves through different stages, each stage serving a different purpose.

The talk contains the following chapters:

  • 00:00 Intro
  • 00:22 What is a Lakehouse?
  • 02:10 What is a Lakehouse in Microsoft Fabric?
  • 04:43 Table and File structure in OneLake
  • 05:19 Justification for Medallion architecture
  • 06:13 Medallion architecture overview
  • 07:00 Bronze layer
  • 08:15 Silver layer
  • 08:45 Gold layer
  • 09:50 Replayability & idempotency
  • 10:56 Alternative structures
  • 11:52 Outro

Related content:

  • Microsoft Fabric End to End Demo Series
  • From Descriptive to Predictive Analytics with Microsoft Fabric
  • Microsoft Fabric First Impressions
  • Decision Maker's Guide to Microsoft Fabric

You can find all the rest of our content here.

Transcript

Hey everyone. You may have heard the terms Lakehouse and Medallion architecture on your Fabric travels over the last few weeks, but what are they and how do they relate to Microsoft Fabric?

In this video, we'll go through these two fundamental elements of a successful modern data platform in Fabric, which will provide us with some useful context for the demo series going forward. So let's take a look.

What is a Lakehouse?

Lakehouse as a concept is a relatively newly coined term; I think Databricks coined it a few years ago. It's generally a combination of a data warehouse and a data lake, and it actually stemmed from the separation of storage and compute that started back in the noughties with Hadoop file systems and Hive, where we had data stored in a file system but were able to query over the top of it.

And we were able to start creating tables, and even databases, that would store snapshots of our tables, and we were able to overwrite those or append into those.

But one thing that's come slightly more recently is a set of tools that give us even more relational database-like features, such as ACID transactions. So we're able to keep that separation of storage and compute, but actually benefit from ACID transactions over the data stored in our data lake, and represent it as tables similar to those you'd get in a traditional data warehouse.

So we still have the ability to create and maintain tables and databases, and multiple users can query, write into, or read from those concurrently, with consistency across all of these transactions.

But fundamentally, the Lakehouse keeps the data stored in a data lake while providing some notion of relational database features on top of it.
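To make that concrete, here's a minimal PySpark sketch of what those relational features over a lake look like with Delta Lake. It assumes a Spark session with the Delta extensions configured (as you get in a Fabric notebook); the path and columns are purely illustrative:

```python
# A minimal sketch of relational features over a data lake with Delta Lake.
# Assumes a Spark session with the Delta extensions configured (as in a
# Fabric notebook); the path and columns are purely illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "2023-06-01", 42.50), (2, "2023-06-01", 13.99)],
    ["order_id", "order_date", "amount"],
)

# Each write is an atomic transaction recorded in the Delta log.
orders.write.format("delta").mode("overwrite").save("Tables/orders")

# Appends are isolated: concurrent readers keep seeing a consistent snapshot.
spark.createDataFrame(
    [(3, "2023-06-02", 7.25)],
    ["order_id", "order_date", "amount"],
).write.format("delta").mode("append").save("Tables/orders")

# The same transaction log gives us time travel to earlier versions.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/orders")
```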

Lakehouse in Microsoft Fabric

So a Lakehouse in Fabric is also an artifact type. It's a specific artifact type that you'll find in the Data Engineering experience, but it follows the concept of a Lakehouse that we just defined: a combination of files, folders, and tables which, as a whole, represent your Lakehouse, effectively representing a database.

Now, the relational database features come from the open source project Delta Lake, which Microsoft Fabric has embraced as its standard open table format. All of the engines that write data in Microsoft Fabric write it in that format.

Now, usually this requires your compute to be something like Spark, via Spark notebooks or Spark job definitions. But in Microsoft Fabric we can also write to these Lakehouses using Dataflows Gen2.

What we shouldn't confuse this with is an actual Synapse Data Warehouse in Microsoft Fabric. Now, it's true that they both use the same underlying table format, Delta, and it's true that they both store that data in OneLake, so in that sense they're quite similar. But on the Data Warehouse side, T-SQL is the language we use to manage and maintain those tables, and that comes with all the added benefits of T-SQL, given the larger set of functions and queries available in dedicated T-SQL engines. And in the future we'll see things like merges and identity columns being supported, which aren't currently supported in the public preview.

But the Lakehouse is something that's managed from your Spark engines or your Dataflows Gen2.

Now, something to be wary of: if you write a table with Spark in the Lakehouse, you have to manage it through Spark. If you write a table in a Data Warehouse through T-SQL, it can only be updated through T-SQL. But you can mix and match how you read; a table can be read from either side, whether it was created in a Data Warehouse or from Spark in a Lakehouse.

Lakehouse Structure

So, moving on from that, the structure of a Lakehouse gives us a Files section and a Tables section. Tables is where your managed Delta tables sit; that's anything you create from a Spark job, a notebook, or Dataflows Gen2.

Files, on the other hand, is essentially just object storage, where you put, for example, your raw files. You could also put reference files there that you don't necessarily need in tabular Delta format, but that you do want alongside your managed Delta tables in your Lakehouse.
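As a rough illustration, assuming a Fabric notebook attached to a Lakehouse (where the `spark` session is provided and `Files/` and `Tables/` are addressable by relative path), the split might be used like this; the file name and table name are hypothetical:

```python
# Illustrative use of the Files/Tables split from a notebook attached to a
# Lakehouse (where `spark` is provided); file and table names are hypothetical.
raw = spark.read.option("header", True).csv("Files/raw/customers.csv")

# saveAsTable registers a managed Delta table in the Tables section,
# while the original CSV stays as-is on the Files side.
raw.write.format("delta").mode("overwrite").saveAsTable("customers")
```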

Why Multiple Lakehouses?

But that's enough about what a Lakehouse is. In reality, we're not just going to have a single Lakehouse; there are going to be different flavors of Lakehouse, because we want to store different states of our data.

So we want to do things like apply schemas—column types, column names. We want to clean our data. We potentially want to standardize it at a column level or a data level. We might want to add master data to it. We might want to join that data with reference data, for example.

And then eventually for our reporting consumers, we might want to start dimensional modeling our data and put it into a star schema.

Now, that's a lot of steps, and they shouldn't all happen in a single Lakehouse. We shouldn't have all of those tables in one place, because it would be very hard to manage and maintain. That's where the Medallion architecture comes in.

Medallion Architecture

Now, the Medallion architecture, again, was coined by Databricks, I think, a number of years ago. But it's not really a new concept: it's a case of separating out raw data; standardized, cleaned, or enriched data; and then curated data, projections of your silver data.

So, at least in the Microsoft world, we have already thought of these Lakehouse architectures as different tiers, but the Medallion architecture really solidifies the concept that there are, generally, three different layers of your architecture, each serving a slightly different purpose.

Bronze Layer (Raw Zone)

So in your raw zone, you'll just have your raw data, and that might literally be a copy of the data from your data source. But you might want to apply file metadata or a different folder structure to it, in order to make it more easily discoverable going forward.

Now, you might even want a landing zone where you land your data first, so that you can then copy it into your target location while applying that file or folder metadata. But generally the data stays untouched.

What you might want to do, if you've got larger sets of data, is compress those files into a more compact, compressed format like Parquet, so that over time your data lake, your OneLake, doesn't grow huge, and you keep costs to a minimum. But most of the time the file format stays the same.

We don't often put our raw data in Delta tables, because it usually doesn't need to be directly queryable. It's more a case of making it discoverable, which can easily be done with plain files.
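Here's a hedged sketch of what a bronze ingestion step might look like, covering both the folder-structure metadata and the optional Parquet compaction mentioned above; all paths are illustrative:

```python
# Sketch of a bronze ingestion step: pick up a landed file and rewrite it
# under a date-partitioned folder layout, compacting to Parquet for larger
# sources. Paths and the source file are illustrative.
from datetime import date

ingest_date = date.today().isoformat()

landed = spark.read.option("header", True).csv("Files/landing/sales.csv")

# The data itself stays untouched; only the folder structure (and
# optionally the file format) changes, to aid discoverability.
landed.write.mode("overwrite").parquet(
    f"Files/bronze/sales/ingest_date={ingest_date}"
)
```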

Silver Layer

Now, if we move to the right, we go to our silver zone. In this zone we're talking about cleaning, standardizing, and enriching our data.

We might have multiple data sources that we want to combine into the same business entity that we've defined. And that's the main thing: defining up front the business entities you want to collect your data into, and coming up with a standardized schema that multiple sources can map into.
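For illustration, here's a sketch of a silver-layer step that maps two hypothetical bronze sources onto one standardized customer entity; every table and column name here is an assumption:

```python
# Sketch of a silver step: map two hypothetical bronze sources onto one
# standardized customer entity. All table and column names are assumptions.
from pyspark.sql import functions as F

crm = spark.read.parquet("Files/bronze/crm_customers/ingest_date=2023-06-01")
web = spark.read.parquet("Files/bronze/web_signups/ingest_date=2023-06-01")

# Apply the agreed schema: typed, consistently named columns per source.
crm_std = crm.select(
    F.col("CustID").cast("long").alias("customer_id"),
    F.trim(F.col("FullName")).alias("customer_name"),
    F.to_date("SignupDate").alias("signup_date"),
)
web_std = web.select(
    F.col("user_id").cast("long").alias("customer_id"),
    F.trim(F.col("name")).alias("customer_name"),
    F.to_date("created_at").alias("signup_date"),
)

# One business entity, fed by multiple sources.
crm_std.unionByName(web_std).write.format("delta").mode(
    "overwrite"
).saveAsTable("silver_customers")
```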

Gold Layer

Now, if we shift one further to the right, we're in our gold layer. This is generally our serving layer, and it's essentially a foundation for our semantic layer, optimized for reporting and analytics.

So this usually means we're creating some sort of dimensional model or some star schema following Kimball principles.
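As a sketch, a gold-layer job following that Kimball-style approach might derive a dimension and a fact table from silver entities like so; all the table and column names are hypothetical:

```python
# Sketch of a Kimball-style gold step: derive a dimension and a fact table
# from silver entities. Every table and column name here is hypothetical.
from pyspark.sql import functions as F

orders = spark.read.table("silver_orders")
customers = spark.read.table("silver_customers")

dim_customer = customers.select("customer_id", "customer_name", "signup_date")

fact_sales = orders.select(
    "order_id",
    "customer_id",  # foreign key into dim_customer
    F.to_date("order_date").alias("order_date"),
    F.col("amount").cast("decimal(18,2)").alias("amount"),
)

dim_customer.write.format("delta").mode("overwrite").saveAsTable("gold_dim_customer")
fact_sales.write.format("delta").mode("overwrite").saveAsTable("gold_fact_sales")
```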

But it's not just that. You might also want your gold layer to hold projections of the silver data in different forms. For example, you might project your data for a web API to consume, and that API might want the data in JSON format.

The whole idea of the gold layer is that it's your serving layer for any downstream consumers, be that a Power BI report that wants a star schema, or a downstream web API that wants the data in a slightly different format.
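That kind of projection could be as simple as the following sketch, which writes a silver entity out as JSON files under an illustrative path for an API to pick up:

```python
# Sketch of a non-BI gold projection: the same silver entity written out
# as JSON files for a web API to pick up. The output path is illustrative.
(
    spark.read.table("silver_customers")
    .coalesce(1)  # a single small output file is easier for an API to read
    .write.mode("overwrite")
    .json("Files/gold/api/customers")
)
```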

But the idea is that the gold layer is where people go for the curated views of the silver data that they want for their purpose.

Architecture Flexibility

Now, standing back and looking at this as a whole, we have this pipeline of bronze, silver, and gold, and generally we'll have artifacts in a Microsoft Fabric workspace that correspond to that: a separate Lakehouse per zone, so we can keep these things organized.

And in the raw zone in particular, we want that zone to be append-only. We don't want people to be able to go back into the raw zone and delete or modify things; it is the record of the data in exactly the state it arrived.

And keeping it append-only means we have a historical view of the data. If we want to replay the data further down the line, for whatever reason, maybe for auditing, or maybe to see how something looked in the past, we know that all of our data is still in the raw zone, exactly how it landed, and we can replay it back through the bronze, silver, and gold layers and recreate our projections as if we were running that data at that time.
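One way to make that replay idempotent, sketched below under the assumption of a date-partitioned bronze folder and a Delta silver table, is to overwrite only the slice you're reprocessing, so running the job twice gives the same result:

```python
# Sketch of an idempotent replay: reprocess one ingest date from the
# append-only bronze zone and overwrite only that slice of the silver
# table, so running the job twice yields the same result. Names are
# illustrative, and replaceWhere assumes a Delta target table.
from pyspark.sql import functions as F

ingest_date = "2023-06-01"

bronze = spark.read.parquet(f"Files/bronze/sales/ingest_date={ingest_date}")
cleaned = bronze.dropDuplicates(["order_id"])

(
    cleaned.withColumn("ingest_date", F.lit(ingest_date))
    .write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"ingest_date = '{ingest_date}'")
    .saveAsTable("silver_sales")
)
```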

So we have this three-layer structure for a reason. But that doesn't mean it's always a three-layer structure. You might have a fourth layer; you might want, for example, a separate landing Lakehouse before you even get into your bronze.

You may even find that you don't need a gold zone, for example. If you're happy with your standardized, flat entities in the silver zone, and the silver variations of the tables are all you need to consume downstream, you might not need a third zone. So this isn't a hard and fast rule; it will differ from one project to the next.

This is generally the terminology that Microsoft Fabric is standardizing on, and it's one Databricks was already using, so it should be familiar to those of you who already know Databricks.

Wrap Up

Anyway, that's all for today's introductory video. We spoke about Lakehouses, both conceptually and within Microsoft Fabric, and then we spoke about the Medallion architecture and how it relates to Microsoft Fabric as well.

In the next video, we'll introduce the data set that we're going to be using for the rest of the end-to-end demo and the insights that we're after. And we'll also see an architecture diagram of the whole thing.

So please hit like if you found this video useful, and hit subscribe if you want to get notified about all of our upcoming videos. See you soon.