Microsoft Fabric - Inspecting a 28 million row dataset
Tutorial
In this video Ed Freeman continues the Microsoft Fabric End-to-End demo series by looking at the dataset we'll be using, and the problem statement we're trying to solve.
For a data platform, we need some data! In this series we're going to be using Land Registry data, published by HM Land Registry, the UK government body that registers the ownership of land and property in England and Wales. The dataset is almost 5GB in size, and provides different types of files for complete or incremental processing. This will allow us to benefit from UPSERT-like functionality enabled by #DeltaLake without having to load all the data every time we receive new information.
In this video we'll take a quick look at the data, where it comes from and what format it's in, and we'll also frame up the insights we're aiming to achieve from the analysis taking place in this series. We'll finish by stepping through a sample architecture diagram - a powerful way to visualize a complex data platform at a high level.
The talk contains the following chapters:
- 00:00 Introduction
- 00:17 Sample data introduction
- 00:58 Sample data inspection
- 03:54 Insight Discovery and Defining Goals
- 06:46 Fabric architecture walkthrough
- 11:32 Outro
Useful links:
Microsoft Fabric End to End Demo Series:
- Part 1 - Lakehouse & Medallion Architecture
- Part 2 - Plan and Architect a Data Project
- Part 3 - Ingest Data
- Part 4 - Creating a shortcut to ADLS Gen2 in Fabric
- Part 5 - Local OneLake Tools
- Part 6 - Role of the Silver Layer in the Medallion Architecture
- Part 7 - Processing Bronze to Silver using Fabric Notebooks
- Part 8 - Good Notebook Development Practices
From Descriptive to Predictive Analytics with Microsoft Fabric:
- Part 1 - Overview
- Part 2 - Data Validation with Great Expectations
- Part 3 - Testing Notebooks
- Part 4 - Task Flows
- Part 5 - Observability
Microsoft Fabric First Impressions:
Decision Maker's Guide to Microsoft Fabric
and find all the rest of our content here.
Transcript
In this video, we're going to take a peek at the four gigabyte dataset that forms the foundation of the data pipeline that we're going to be building in this demo series. Then we're going to outline the goals and insights we actually want to achieve off the back of this data. So let's get started.
Sample Data Introduction
So the data we're going to be using is called Land Registry data, and essentially it registers the ownership of land and property in England and Wales. So it gives information like the price that's been paid, the location of the property, what style of property it is, whether it's a commercial or residential property, et cetera, et cetera. And it actually has quite a good amount of data, as you can imagine. So I think it's roughly 28 million rows. It goes back to 1995, all the way until the current month.
Sample Data Inspection
So if we actually look at what this data looks like—well, here I am on the British Government's website, and we have this price paid data, this dataset, and they provide kind of a whole portal for this that we can go in and find out more about the data. I won't go through every part of this, but what we'll see is we get monthly data. So we've got April 2023, which is the most recent month at the time of recording.
And then we also have a single file, which actually has the whole historical dataset, which as you can see is over four gigabytes—actually a little bit more than what it says here, but it's between four and five gigabytes. So not a huge amount of data by today's standards, but still a good amount. And you generally need some more heavyweight cloud processes to transfer and wrangle this type of data.
Now this dataset also comes with kind of a "how to use it" guide, and that gives us information about the schema. So actually this dataset itself doesn't have a header row, so we need to know which columns relate to what, and we have some information about each of the columns and the explanation. Particularly interesting is this record status column, which actually is only for the monthly file, but that gives us an indication of what's been added, what's been changed, what's been deleted. And as you can imagine, with regards to data warehousing in our lakehouse, we want to be able to perform upserts or merges, if you will, where we can add the data that's new, change the data that exists but has been updated, and delete the data that needs to be deleted. So this will help us with that control flow, that logic of deleting those rows or updating those rows.
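In a Fabric notebook this control flow would typically be expressed as a Delta Lake MERGE, but the logic the record status column drives can be sketched in plain Python. This is a minimal illustration, assuming the status codes described in the guide ("A" = added, "C" = changed, "D" = deleted); the table shape and field names here are invented for the example:

```python
# Sketch of the upsert/delete logic driven by the record status column.
# In Fabric this would normally be a Delta Lake MERGE in a Spark notebook;
# plain Python is used here so the control flow is easy to follow.
# Status codes per the dataset guide: A = add, C = change, D = delete.

def apply_monthly_update(table: dict, monthly_rows: list) -> dict:
    """table maps transaction id -> row; monthly_rows each carry a 'status' key."""
    for row in monthly_rows:
        tx_id, status = row["id"], row["status"]
        if status in ("A", "C"):
            # Insert new rows and overwrite changed ones (the "upsert" half).
            table[tx_id] = {k: v for k, v in row.items() if k != "status"}
        elif status == "D":
            # Remove transactions flagged as deleted.
            table.pop(tx_id, None)
    return table

# Hypothetical existing data plus one monthly file with an add, a change and a delete.
existing = {
    "T1": {"id": "T1", "price": 243000},
    "T3": {"id": "T3", "price": 99000},
}
update = [
    {"id": "T2", "price": 180000, "status": "A"},
    {"id": "T1", "price": 250000, "status": "C"},
    {"id": "T3", "price": 0, "status": "D"},
]
result = apply_monthly_update(existing, update)
```

The Delta Lake equivalent would match on the transaction identifier and use `whenMatchedUpdate`, `whenMatchedDelete` and `whenNotMatchedInsert` clauses to the same effect.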
If I quickly actually open up one of these files—I'm not going to open up the 4.5 gigabyte file because I'll make this video very long—but if I open up this text file, for example, and I zoom in a couple of times, you'll be able to see that we have varying styles of data here. So we have just strings. This is an interesting unique identifier that's actually wrapped in curly braces. So we might want to do something with that further downstream. We've got numerical data, so that's just our price paid—so 243,000. Everything will be in pounds because we're talking about England and Wales here. And we've got some dates, and then we've got some normal string fields. So we've got a good amount of data here, different varying data types.
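Because the file has no header row, any processing has to supply column names from the "how to use" guide. A minimal sketch of reading one line, using an abbreviated, illustrative subset of the documented schema and an invented sample row:

```python
import csv
import io

# No header row in the file, so names come from the dataset guide.
# This is a shortened, illustrative subset of the documented columns.
COLUMNS = ["transaction_id", "price", "date_of_transfer", "postcode",
           "property_type", "old_new", "duration"]

# A made-up line in the same shape as the real data
# (note the unique identifier wrapped in curly braces).
sample = ('"{0A1B2C3D-1111-2222-3333-444455556666}","243000",'
          '"2023-04-21 00:00","AB1 2CD","D","N","F"')

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(COLUMNS, row))
record["transaction_id"] = record["transaction_id"].strip("{}")  # drop the braces
record["price"] = int(record["price"])  # numeric column arrives as a string
```

Stripping the curly braces and casting the price are exactly the kind of light cleanup we'd push further downstream into the silver layer.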
So this will all be really interesting and a common use case for a sample data pipeline that we're going to be building in this series.
Insight Discovery and Defining Goals
So let's take a step back though. What problems are we wanting to solve? Because in a modern data project, you want to start right to left. So you want to start with discovering the insights that you want to glean off of your data, and then work backwards to really understand what data it is you need to be able to get to those insights.
Now, in this instance, the dataset is not huge. It's not like we're having to trawl through loads and loads of datasets to try and find what we need and the columns that we need. It's only a relatively small dataset. But oftentimes, we've found this with our clients: they have a huge data estate, a huge data landscape, and starting from the data and then working towards trying to figure out what the insights are is the wrong way around. You need to start by understanding what's going to benefit the business.
So in our case here—again, this is just a sample—but we want to be looking for things like spikes and anomalies. So can we attribute certain strangeness, so to speak, in our data to actual events that may have happened? And I'm thinking things like Covid in the last few years, or the recession back in 2008, other national or international events that might have had an impact on people's propensity to buy, the prices that might have been paid at certain points in time. So that's one thing that we'd like to find out from this data.
Now there's also the geographical element. So this data comes with addresses for all of the transactions that have occurred over the last 28 years or so. So we want to figure out what the regional differences are, even on a city by city basis. You can imagine that London, being the capital city, tends to have inflated prices compared to smaller towns, or potentially even the north of England versus the south of England. So we want to see where's cheaper to buy, or see if there's any indication of which region has had the fastest-growing property market over the last however many years. So insights like that will be useful.
And another thing that we can do pretty easily with this type of data is trend analysis. So what can we expect next year from region to region or from property type to property type? That's going to be some interesting analysis.
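The city-by-city comparison described above boils down to a group-and-aggregate over the price paid rows. A tiny sketch with invented values (in practice this would be a Spark or SQL aggregation over the full 28 million rows):

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows in the shape of the price paid data (values invented).
sales = [
    {"town": "LONDON", "year": 2022, "price": 650000},
    {"town": "LONDON", "year": 2022, "price": 710000},
    {"town": "LEEDS",  "year": 2022, "price": 230000},
    {"town": "LEEDS",  "year": 2022, "price": 210000},
]

# Average price paid per town -- the regional comparison described above.
by_town = defaultdict(list)
for sale in sales:
    by_town[sale["town"]].append(sale["price"])
avg_by_town = {town: mean(prices) for town, prices in by_town.items()}
```

Grouping by year as well as town would give the time series needed for the spike/anomaly and trend analyses.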
So that's really the overview of the problem statement, the goals that we're trying to achieve with this.
Fabric Architecture Walkthrough
But let's take a look at an architecture diagram. So how will this fit into an actual Fabric architecture?
So if I go over to this tab here—now, you can't see much of this at the moment, but this is the higher level view, and I'm going to zoom in to various elements of this.
So we're starting off with a Fabric workspace. That's going to be orchestrated by an end-to-end orchestration pipeline, which is a Data Factory pipeline that's probably going to have a few notebooks, potentially even updates of datasets, for example. We will see. But also ingesting the data itself.
So our data sources are going to be twofold. We're going to have the HM Land Registry data that I just showed you, but we're also going to be creating a shortcut—a OneLake shortcut—to some reference data that we've got stored in ADLS Gen 2. And we'll see how we can retrieve that data in a couple of episodes' time.
But the pipeline is going to ingest that data from the web, and it's going to ingest this data from the shortcut—i.e., not actually do any ingesting of data. The shortcut is essentially virtualisation over an area of a data lake, so we don't really need to do anything there.
But then we have this bronze zone, and hopefully you've watched the previous episode where we talked about the medallion architecture and we talked about lakehouses. So essentially the bronze zone is where we want to store the raw copy of the data, or in the case of the shortcut, we just want to store the pointer to that data in a directory structure that we specified for the bronze lakehouse.
So actually the bronze lakehouse, if we remember back to the previous episode—we have our lakehouse split into two areas: tables and files. The tables are going to store our managed Delta tables that are generally created through Spark or Dataflows Gen2, for example. But we're not going to have any managed Delta tables at this level. Instead, we only care about storing those CSV files in our data lake so that we can query them in a subsequent artefact.
And that subsequent artefact is going to be a handful of notebooks, and these notebooks are going to take that data and actually transform them into the lakehouse where they are going to be turned into managed Delta tables. So we're going to have a couple of tables: price paid data and postcode directory. So the price paid data is generally coming from the Land Registry data that I just showed you, and then the postcode directory is going to come from that shortcut. But these are now queryable Delta tables that we can perform queries on in our lakehouse.
And the final step is to project to gold. So we want to create the house purchase projections, which means doing some dimensional modelling—so creating a star schema, so to speak. So in our gold lakehouse, we have our house prices, dates, locations, and property types. So that's where we've optimised for the reporting, the serving layer, so to speak.
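The dimensional-modelling step can be sketched as splitting the flat silver rows into a fact table plus dimension tables with surrogate keys. This is a minimal illustration; the table and column names are invented, not the exact gold-layer schema from the series:

```python
# Sketch of the silver-to-gold star-schema step: flat rows become a fact
# table referencing a location dimension via surrogate keys.
# Row shape and names are illustrative, not the series' actual schema.

silver_rows = [
    {"price": 243000, "town": "LONDON", "property_type": "D"},
    {"price": 180000, "town": "LEEDS",  "property_type": "S"},
    {"price": 310000, "town": "LONDON", "property_type": "F"},
]

dim_location = {}        # town -> surrogate key
fact_house_prices = []
for row in silver_rows:
    # Assign each distinct town a surrogate key the first time it appears.
    key = dim_location.setdefault(row["town"], len(dim_location) + 1)
    fact_house_prices.append({"location_key": key, "price": row["price"]})
```

The same pattern, repeated for dates and property types, yields the star schema the reporting layer queries against.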
And then actually, with the beauty of Direct Lake, this arrow here that goes to the semantic model—this is our Power BI dataset. There's no ingestion going on there; the tables in the Power BI dataset point directly at those files in the lake.
So we don't really need to do anything there other than the things that we care about in our semantic model. And that's the relationships between those tables, the building of measures or calculated columns, for example, and all the semantics around naming. This is meant to be an end-consumer artefact. We need to make sure that the naming is as we would like it to be for people who are trying to navigate this dataset, or reports built off this dataset, and find their insights.
And obviously what builds on those datasets are Power BI reports and all the associated artefacts with that—like apps, dashboards, just individual pages and visualisations.
So that's our architecture. It's a pretty standard medallion architecture. If I zoom back out so we see the higher level colours and shapes rather than the text itself, we have our bronze area, our silver area, and our gold area. All serving different purposes, but it's helping us separate and segregate that data in different states, so that our data pipeline can be manageable, maintainable, and modular, so that we can develop it over time.
Outro
So that's it for this video. This has just been an introductory video to the actual problem statement, the goals that we're trying to achieve in this demo series. But stay tuned for the next episode. We're actually going to be diving into Fabric and ingesting that data from the web source. So we've got 4.3 gigabytes of data that we want to ingest into our OneLake lakehouse. How is that going to perform? Well, come to the next episode and figure that one out. See you then.