Microsoft Fabric - Creating a OneLake Shortcut to ADLS Gen2 | endjin

Ed Freeman 8th August 2023

Tutorial

Shortcuts in Microsoft Fabric enable zero-copy referencing of data across OneLake and clouds. With Shortcuts, you can get rid of your pipelines that are used to copy data into your storage account - instead, the data that's required for your Fabric operations is just retrieved as and when needed.

In this video we use an Azure Data Lake Storage Gen2 (ADLS Gen2) account as the source, and configure a Shortcut in our Bronze Lakehouse so that the referenced data sits alongside our other raw data. The full transcript is available below.

The talk contains the following chapters:

00:00 Intro
00:19 What are Shortcuts in Microsoft Fabric?
02:08 Demo - Browsing source data in ADLS Gen2
03:32 Where should I create my Shortcut?
04:46 Creating a Shortcut in Microsoft Fabric - Configuring connection
06:07 Configuring Shortcut settings
07:00 Browsing virtualized Shortcut data in Lakehouse View
08:04 Outro

Useful links:

📖 OneLake shortcuts

Decision Maker's Guide to Microsoft Fabric

Transcript

Ed Freeman: Hi everyone and welcome back for part four of this Fabric End-to-End Demo series. In this video we're going to see how we can reference existing cloud data directly from within Fabric with no data copy required. This is going to use the Shortcut feature. So let's get going.

So Shortcuts, what are they? Essentially they are data virtualization across your OneLake and across clouds. So what's data virtualization? It's essentially the ability to reference data at source without having to copy it into whatever system that you're referencing it from. So it's this "one copy" principle that you've probably seen in some of the Microsoft Fabric marketing.

Shortcuts support data stores like Azure Data Lake Storage, which is still within Azure but outside of Microsoft Fabric as a SaaS product. We've also got AWS S3 buckets that we can reference our data directly from. The final option currently is referencing within your own OneLake, so that's files or tables within other workspaces or other lakehouses within your Fabric tenant.

Now, the benefit of this is we no longer have to maintain a copy of the data and we no longer have to maintain the process by which that copy gets updated and refreshed. So essentially we're benefiting massively from not having that synchronization issue, not having to manage that. That doesn't mean that this feature is free, there will still be the egress cost of getting the data from your source to Fabric, but depending on how much you're doing that, it may or may not actually cost you, but that's for a later video. But essentially Shortcuts allow us to Shortcut to our data without having to copy it over. So what do they look like? Let's take a look.

So I've actually got an Azure Data Lake Storage account here that I want to reference some data from. And the data I want to reference is actually in this raw zone. This is from a previous project that we've done, and we don't want to have to go through the process of grabbing the data from source again.

So I'm in this default container, and within this raw subdirectory, I have a bunch of sources. You'll see the familiar Land Registry; we've also used this ADLS account in the past to land that Land Registry data. But the one I want to create a Shortcut for is this ONS data, so the Office for National Statistics.

And this is giving us information about postcodes, so more detailed geographical information that we can potentially use in reporting going forward to generate some nice map visuals and do some great analysis over some maps. So within this postcode directory, it's just a bunch of CSV files, and none of them is a huge file in itself. But we've got a significant amount of data here that we don't want to have to manually copy over by creating a Data Factory pipeline, for example. We just want to be able to reference this from source. So let's do that.

So back in Fabric, I'm actually in my Bronze demo Lakehouse here. You'll remember from the previous episodes that we uploaded this raw Land Registry data via a Data Factory pipeline.

And we have the complete file, and we've actually had a monthly file arrive since then. Now to create the Shortcut, first you need to figure out where you want to put the Shortcut. I'm thinking of my data lake in terms of this taxonomy, essentially, and the Files section of my Lakehouse follows it.

We've got the raw zone here, and usually under that you put the data source that the data is coming from. So we've got our Land Registry data, and we want the ONS data, the Office for National Statistics postcode directory data, to go at the same level as this, because we've got this taxonomized path, and we're following it to keep that consistency and make sure we apply that structure to the files section of our lakehouse.

So to create the Shortcut here, it's important that you click on the parent folder that you want the Shortcut to come underneath. So I've clicked on raw because I want ONS to come underneath and sit at the same level as Land Registry. So now I right-click and click New Shortcut.

And we've got these three options that I mentioned: OneLake, Azure Data Lake Storage Gen2 and S3. If I click Azure Data Lake Storage Gen2, it comes up with these options in the dialog box. After it loads, there we go.

And I'm just going to copy a URL that I've got on a different screen, which is the URL to my storage account. So it's this ddo, data lake storage, and it's a DFS endpoint because it's an ADLS Gen2 account. And this is the path all the way to the actual directory. And actually, I've created the connection before, so it remembers that I've done this in the past.
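As a rough illustration of the DFS endpoint mentioned here, the following Python sketch builds and checks an ADLS Gen2 endpoint URL. The account name `examplelake` and both helper functions are hypothetical, made up for this example; they are not the account or any tooling used in the video.

```python
from urllib.parse import urlparse


def dfs_endpoint(account_name: str) -> str:
    """Build the DFS endpoint URL for an ADLS Gen2 storage account."""
    return f"https://{account_name}.dfs.core.windows.net"


def is_dfs_endpoint(url: str) -> bool:
    """Return True if the URL uses the DFS endpoint rather than the blob one."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return parsed.scheme == "https" and host.endswith(".dfs.core.windows.net")


# "examplelake" is a hypothetical account name used only for illustration.
url = dfs_endpoint("examplelake")
print(url)
print(is_dfs_endpoint(url))
print(is_dfs_endpoint("https://examplelake.blob.core.windows.net"))
```

A check like this can catch the easy mistake of pasting the blob endpoint (`blob.core.windows.net`) into the connection dialog instead of the DFS one.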

But if I just delete that again temporarily: if you are creating a new connection, it's going to ask you to give the connection a name. And then it's going to ask you for some authentication details, and it's going to offer you organizational account, access key, SAS token, or service principal.

Now, when I set this up previously, I did it via an organizational account. I'm just going to click next because it's all authorized already. And then on the next screen, we get information that it wants around the Shortcut name and the sub path. Now the sub path is the path within this storage account over in Azure.

So it's the path within that, and within the file system, that points to the root of the data that we want to reference. The Shortcut name is important as well because that's what will show up in the Lakehouse view. So I'm going to put ONS here. And then for this sub path, I'm actually going to point to the root ONS folder.

Slash default, that's my file system in Azure Data Lake Storage. Raw is the first subdirectory, and ONS is the next subdirectory. And I want everything under ONS to now come in via my Shortcut into my account. So I'll click create. And as quickly as that, because no data copy is actually happening, it's there and it's available.
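The same settings can be written down as data. The sketch below splits the sub path into its file system and directory parts, then assembles a request body shaped like the one the Fabric REST API's Create Shortcut operation accepts, as far as I understand it. The video only uses the UI, so treat the field names (`path`, `name`, `target.adlsGen2.location`, `subpath`, `connectionId`) as assumptions to verify against the current API reference. The account URL and connection ID are hypothetical placeholders, and nothing here makes a network call.

```python
import json


def split_subpath(subpath: str) -> tuple[str, str]:
    """Split '/default/raw/ONS' into (file system, directory path)."""
    parts = [p for p in subpath.split("/") if p]
    if not parts:
        raise ValueError("sub path must include at least the file system name")
    return parts[0], "/".join(parts[1:])


def build_shortcut_request(parent_path, name, account_url, subpath, connection_id):
    """Assemble a Create Shortcut request body (field names are assumptions)."""
    return {
        "path": parent_path,      # parent folder inside the Lakehouse
        "name": name,             # what shows up in the Lakehouse view
        "target": {
            "adlsGen2": {
                "location": account_url,
                "subpath": subpath,
                "connectionId": connection_id,
            }
        },
    }


file_system, directory = split_subpath("/default/raw/ONS")

# The account URL and connection ID below are hypothetical placeholders.
request = build_shortcut_request(
    parent_path="Files/raw",
    name="ONS",
    account_url="https://examplelake.dfs.core.windows.net",
    subpath="/default/raw/ONS",
    connection_id="00000000-0000-0000-0000-000000000000",
)
print(file_system, directory)
print(json.dumps(request, indent=2))
```

Keeping the parent path (`Files/raw`) separate from the Shortcut name (`ONS`) mirrors the UI step of clicking on the parent folder before choosing New Shortcut.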

It's authenticated. It's proven that I can access this data. So now it's just essentially referencing it from my OneLake. So it's got the postcode directory and it's replicated the folder hierarchy that was in my storage account over in ADLS. And if I had more subfolders under ONS, it would just show that as well and it would have created the same hierarchy.
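To make the mirrored hierarchy concrete, here is a minimal sketch of how a path under the Shortcut's target in ADLS corresponds to a path under the Shortcut in the Lakehouse. The `lakehouse_path` helper and the folder and file names below ONS are hypothetical; only `/default/raw/ONS` and the Shortcut's location under raw come from the demo.

```python
from pathlib import PurePosixPath


def lakehouse_path(source_path: str, shortcut_target: str, shortcut_location: str) -> str:
    """Map a source path under the Shortcut target to its Lakehouse path."""
    source = PurePosixPath(source_path)
    target = PurePosixPath(shortcut_target)
    # relative_to raises ValueError if the source is not under the target.
    relative = source.relative_to(target)
    return str(PurePosixPath(shortcut_location) / relative)


# The folder and file names below "ONS" are made up for illustration.
mapped = lakehouse_path(
    "/default/raw/ONS/postcode_directory/part_01.csv",
    shortcut_target="/default/raw/ONS",
    shortcut_location="Files/raw/ONS",
)
print(mapped)
```

Only the prefix changes; everything below the Shortcut's target keeps its relative position, which is why extra subfolders under ONS would simply appear in the same places.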

And just like we could with the actual data that we landed, if I want to preview this data, that's absolutely fine. And that was essentially referencing that data, sending a query over to the data without having to create a full copy of it over in OneLake. And look, we've got our data there.

Now, obviously there are things that need to be done to this data, because everything is turning up as a string, but this is our raw data. We will parse it, and we will create data types in the silver zone. But that's it for this video. We've shown how we can create a Shortcut to some ADLS Gen2 data in Azure.
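The point about everything arriving as a string can be illustrated with a small standard-library sketch. This is not how the series processes the data in the silver zone; it only shows the idea of applying a schema to raw CSV text. The column names and the sample row are made up for this example and are not taken from the ONS files.

```python
import csv
import io

# Made-up sample standing in for a raw CSV file; not real ONS data.
RAW_CSV = "postcode,latitude,longitude\nZZ1 1ZZ,51.5,-0.1\n"

# Hypothetical schema: column name -> function that casts the raw string.
SCHEMA = {"postcode": str, "latitude": float, "longitude": float}


def parse_rows(text: str, schema: dict) -> list[dict]:
    """Read CSV text and cast each column using the supplied schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


# Without a schema, every value comes back as a string.
raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
typed_rows = parse_rows(RAW_CSV, SCHEMA)
print(type(raw_rows[0]["latitude"]).__name__)
print(type(typed_rows[0]["latitude"]).__name__)
```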

In the next video, we'll take a look at a couple of tools that we can use to navigate our data in OneLake on our local machines, and some of the things that you might need to be wary of when doing that. Now, if you found this video useful, please hit subscribe to stay up to date with the rest of the series going forward.

See you next time.