Microsoft Fabric - Ingesting 5GB into a Bronze Lakehouse using Data Factory
Ed Freeman demonstrates data ingestion from an HTTP source to OneLake using Data Factory in Microsoft Fabric, exploring Lakehouse tables, files, and pipelines.
About this talk
Tutorial
In this video Ed Freeman continues the Microsoft Fabric End-to-End demo series by seeing how we can quickly ingest ~5GB of data from an unauthenticated HTTP data source into OneLake using Data Factory in Microsoft Fabric. We'll see the distinction between Tables and Files in a Fabric Lakehouse, and look at how we can preview data in the Lakehouse explorer.
The talk contains the following chapters:
- 00:00 Introduction
- 00:15 Dataset recap
- 01:25 Workspace and pipeline artifacts
- 01:57 Pipeline UI layout
- 02:21 Copy data activity options
- 03:07 Configure copy data activity source
- 05:00 Configure copy data activity destination
- 06:21 Add dynamic content for destination Lakehouse filepath
- 08:21 Copy Data activity additional settings
- 09:00 Manually trigger pipeline
- 09:21 Alternative parameterized pipeline
- 11:38 Reviewing pipeline run details
- 12:10 Default workspace artifacts
- 13:04 Viewing Lakehouse Files
- 13:46 Roundup and outro
Useful links:
Microsoft Fabric End to End Demo Series:
- Part 1 - Lakehouse & Medallion Architecture
- Part 2 - Plan and Architect a Data Project
- Part 3 - Ingest Data
- Part 4 - Creating a shortcut to ADLS Gen2 in Fabric
- Part 5 - Local OneLake Tools
- Part 6 - Role of the Silver Layer in the Medallion Architecture
- Part 7 - Processing Bronze to Silver using Fabric Notebooks
- Part 8 - Good Notebook Development Practices
From Descriptive to Predictive Analytics with Microsoft Fabric:
- Part 1 - Overview
- Part 2 - Data Validation with Great Expectations
- Part 3 - Testing Notebooks
- Part 4 - Task Flows
- Part 5 - Observability
Microsoft Fabric First Impressions:
Decision Maker's Guide to Microsoft Fabric
and find all the rest of our content here.
Read the transcript
In this video, we're going to take a look at how we can ingest over four gigabytes of data from a web data source into a Bronze Lakehouse in Fabric using a very familiar Data Factory experience. Let's crack on.
Dataset Recap
So in the previous episode, I introduced this statistical data set: the Price Paid Data that's available on the UK Government's website. This gives us information about HM Land Registry records of land and property purchases over the course of the last 28 years in England and Wales, so there's a lot of data related to that.
And we saw yesterday that there are two different files it gives us here. It gives us a monthly file, which updates every month with revisions to previous sets of data as well as the new additions for that month. There's also a full single file of all the historical and current data, which is essentially a snapshot of their database behind the scenes, so to speak. But it's all provided as easy-to-ingest CSV and text files.
Now, the data itself, as we saw yesterday, is a mixture of strings, numerical fields and dates, so we've got a good variety of data types going on.
Workspace and Pipeline Artifacts
Now, without further ado, let's go into Fabric. I've created a Fabric end-to-end demo workspace and pre-prepared it with a blank, empty pipeline. So if I go into that pipeline, you'll see that multitasking opens up a tab down here underneath the workspace, so any artifacts that you've got open within that workspace will show underneath here.
There is currently a limitation where you can't open pipelines from different workspaces in the multitasking view. So for now, you're just going to have to create different tabs.
Pipeline UI Layout
So anyway, I'm on a blank canvas here, and you'll see that it's pretty much a normal Data Factory experience, but everything's been pivoted. In Azure Synapse and Azure Data Factory, most of your activities would be on the left-hand side, vertically; now they're along the top, horizontally. But pretty much everything that you'd find in Synapse pipelines and Azure Data Factory is here.
Copy Data Activity Options
So in the activities, the only one we really care about for this data ingestion today is the copy data activity, which actually gives us two options. We can either add the activity directly to the canvas and configure it there, or we can use the copy assistant, which, if you're familiar with it from Synapse Analytics and Azure Data Factory, takes you through configuring a copy data activity step by step. But I'm actually going to use the activity just as it is.
So if I drag that up, I can get all of the bits of configuration for this copy data activity. So the first thing I'm going to do is just give it a better name. So "Copy Land Registry data from website", for example.
Configure Copy Data Activity Source
And the source that I'm going to use is an external source; it's not living within my Fabric tenant just yet. It will be after I copy it, but first I need to connect to that website. If I were doing this from scratch, I'd click New, choose HTTP, click Continue and put in the URL, which I've found here. But Fabric has actually realised that I've configured this type of connection before, so instead of going through that again and re-authenticating, I'm just going to reuse that connection. I've already made a connection to that endpoint, so everything's ready for me to go.
Now the connection type I'm going to use is just an HTTP request, and that's because I want to use this binary file format. So if all you care about is getting data from A to B, not putting any structure or schema around it, you just want to do a straight copy, a binary copy will always be the fastest way of doing that.
Now, that means I have to use HTTP rather than REST, because REST implies some structure in how you're able to configure it in Data Factory.
Now, the relative URL, which gets appended to the connection's base URL, is actually just the name of the file, which in this instance is pp-complete.csv. That's the name of the file on the website, so you'll have to take my word for that.
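In other words, the connection holds the base URL and the copy activity supplies the relative URL. As a sketch of how the two compose (the host below is a placeholder, not the exact endpoint used in the video):

```
Base URL (connection):   https://<land-registry-host>/
Relative URL (source):   pp-complete.csv
Resulting request:       GET https://<land-registry-host>/pp-complete.csv
```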
Now in the advanced, we don't actually care about any of these bits of configuration. We don't care about the request method. That's just going to default to GET, which is what we want here. We don't need any additional headers or body.
Now, in fact, let me test that connection and hopefully everything's connected correctly. Yep, that's fine.
Configure Copy Data Activity Destination
So then we go on to the destination. For the destination, we want a lakehouse: one that represents our bronze layer or bronze zone, which you'll hopefully remember from the first episode. This is where we just want to store our raw files of data.
Now, I haven't created a lakehouse yet, so you get this familiar kind of creation experience that you're used to in Data Factory land, but now we're creating a lakehouse. So if I click new, I can type in the name. I'm going to say "bronze_demo_land_registry". So create that.
Now we need to decide whether we're going to put it in a tables folder or the files folder. So again, remember from the first episode, the lakehouses are split into two sections. Tables, which is where your managed tables will end up. And files, which is just an object store for whatever you want to put in there, really.
Now, I don't want this to be a table. I don't care about it being a table. I just want to land the data into our data lake for now.
Add Dynamic Content for Destination Lakehouse Filepath
So, the file path. Now, this is slightly more interesting, because in the raw zone you tend to want things to be relatively discoverable. If you're in there troubleshooting or diagnosing an issue, you want to be able to find things based on when they arrived, for example, and have bits of metadata incorporated into that path.
So I'm just going to copy over a couple of expressions that I've got over here. And if you're putting in expressions, i.e. dynamic content, you're going to want to open up this dynamic content editor here.
And if I just press paste: what I'm actually doing here is creating a raw directory within my OneLake, within my file system. Then I'm coming up with a convention. Everything's going to go under land_registry_price_paid_data, and then I'm going to split by upload year, where the upload year is generated dynamically from this utcnow function in Data Factory.
So if I go to utcnow, we can see that it returns the current timestamp as a string. And one thing you can do is pass in a format so that the string is returned in that format.
So that's the folder path that I want. Then the file name: again, if I just copy this over. Oops. I want to do that in the dynamic content editor.
So I'm just calling the file pp_complete, and then, again using that utcnow function, I'm adding a slightly longer timestamp. When it comes to file names, I usually like to put a timestamp on them so that if you have incremental loads in the future, you know exactly which files arrived after or before other files.
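The exact expressions pasted in the video aren't reproduced character for character here, but they take roughly this shape, using Data Factory's utcnow() function with a format string (the naming convention follows what's described above; the format strings are illustrative):

```
Folder path:
raw/land_registry_price_paid_data/upload_year=@{utcnow('yyyy')}

File name:
pp_complete_@{utcnow('yyyyMMddHHmmss')}.csv
```

The @{...} syntax embeds an expression inside a string: utcnow('yyyy') returns just the four-digit year, while the longer format on the file name gives a sortable timestamp, which is what makes it easy to order incrementally loaded files later.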
Copy Data Activity Additional Settings
So if I hit OK on that, that's configured. Now, in the settings there are a bunch of options you can configure here, but I'm just going to leave them as they are. Maybe once the pricing and performance become clearer as the public preview progresses, you might want to be more careful about whether you select Auto or not, because it might have implications for how much compute is used.
And if you don't need your pipelines to run that quickly, or to be hyper-optimised, you can edit this to specify a specific number, which starts from two. But I'm just going to keep it on Auto.
Manually Trigger Pipeline
So really, that should be it. And if I now go to Run, then Save and run, it has started that pipeline, and I can see in the Output tab that it's been queued.
Alternative Parameterised Pipeline
So in the meantime, let me just go and show you one I made earlier, which is equivalent to this but with slightly more logic in it.
So if I open this "Ingest Price Paid data from Land Registry", we see things are slightly different. I've got this switch at the start, and the idea here is that our orchestrator pipeline is going to tell us what type of ingestion it is, as in whether it's the monthly or the complete file that we saw on the website earlier, and it's also going to give us some information about the latest month.
Now on the switch statement, the expression that we're switching on is actually that ingest type. So I want to be able to choose whether we ingest the single file or the monthly file.
Generally, the single file will only be used the first time, unless we need to rectify some issues we've had with the monthly files and want to just overwrite everything. Going forward, it's generally the monthly file that we want to default to.
And you can imagine that these have slightly different endpoints that we need to hit. So we need to parameterise that.
So once we've parameterised that, the expression can either be complete or monthly, and in each case we've got a slightly differently configured copy activity.
The complete one will look quite like what I've just created on the other tab. Let's look at the monthly one. If we go to the monthly case, it's configured as such: we've got a source with a slightly different name, "PP monthly update new version". It's still a binary, and everything else is the same, apart from the fact that we call the file name monthly instead of complete.
But otherwise everything is identical. That parameterisation that you can also do in Synapse pipelines and Azure Data Factory has come over to Fabric Data Factory.
And that's really useful for metadata-driven pipelines, where you can change the behaviour based on some parameters that you pass in.
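As a sketch, the parameterised pipeline boils down to a pipeline parameter feeding a Switch activity. The parameter name and the monthly file name below are illustrative, inferred from what's described above rather than taken verbatim from the video:

```
Pipeline parameter:  ingestType  (String: 'complete' or 'monthly')

Switch expression:   @pipeline().parameters.ingestType

Case 'complete' -> copy activity with relative URL pp-complete.csv
Case 'monthly'  -> copy activity with relative URL pp-monthly-update-new-version.csv
```

An orchestrating pipeline can then invoke this one with the appropriate parameter value, which is what makes the pipeline metadata-driven.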
Reviewing Pipeline Run Details
So if we go over here now, that run has finished. It took two minutes to copy over four gigabytes of data; well, actually it's closer to five gigabytes, so those labels on the website aren't quite accurate. We can actually inspect the copy activity here.
And we get a bunch of information here: we see it was queuing for eight seconds, and then it started the transfer, reading and writing concurrently.
Default Workspace Artifacts
So the last thing we want to do is just view our workspace again. The lakehouse that I created on the fly in the Data Factory destination has now appeared, but we've also got this SQL endpoint and this dataset. We won't worry too much about those for now.
There's actually ongoing feedback to the product group to maybe hide these or make them optional, because currently they just get provisioned whenever you create a lakehouse.
But the SQL endpoint is essentially a way of using T-SQL to query, to read from your data that's in your tables in your lakehouse.
At the moment, we've not got any tables in our lakehouse. We've only got files, so we don't really care about that.
And the dataset is an automatic dataset that's created on top of the tables in your lakehouse.
Viewing Lakehouse Files
But if we just go to the lakehouse now and browse that, let's see what we've got in there.
So under Tables we have nothing showing, but under Files we have this directory structure that we created on the fly earlier: raw/land_registry_price_paid_data/upload_year. And then we've got our data here. And actually, if we click on it, we get a nice speedy preview, and it'll only preview 10 megabytes. So it's not loading almost five gigabytes to the screen right now; it's just 10 megabytes, so that you can have a quick glance at your data.
Roundup and Outro
So that's really it. We've seen how we can ingest data from a web data source into a Bronze Lakehouse. In the next episode, we're going to see how we can utilise the new shortcut functionality in Fabric to access existing data that's already stored within an Azure Data Lake Storage Gen2 account in our Azure tenant.
So that's it for this video. Please don't forget to like and subscribe to stay on top of all our future content. See you next time.