Data Toboggan 2022
Database templates are a recently added feature in Azure Synapse Analytics. They allow you to put a layer of abstraction over your data lake to make it appear more like a relational system than a file system.
This is an example of how Microsoft continue to invest in Synapse Analytics, adding yet another tool to the Synapse "Swiss Army Knife".
In this short talk, Barry provides an overview of Database Templates and illustrates how they can be put to use within a modern, cloud native, data & analytics architecture.
He uses a practical example to show how Database Templates can be leveraged by putting them at the heart of the process to design, build, test and deploy actionable insights on Azure Synapse Analytics.
Barry Smart: Hello and welcome. This is a recording of the talk I gave last week at the Data Toboggan Alpine Coaster event. The challenge I was given was to talk about Azure Synapse Database Templates in less than 10 minutes. So, start your stopwatch! Let's see how far we get in 10 minutes.
My name is Barry Smart. I work for endjin. We like to say that we help small teams achieve big things. We love using cloud technology to solve big problems. And we tend to do that in two ways with clients: through data and analytics, and .NET application development.
We're huge fans of Azure Synapse Analytics. That tends to be our go-to platform for delivering data and analytics projects. And this is very much how we think about it: as a Swiss Army knife. There are lots of tools that you get; some of them you may never use, but you tend to select a subset of them to solve any particular data and analytics challenge that you might be facing. And database templates, as you can see here, are yet another tool that Microsoft have recently added to the platform.
It's a great example of how Microsoft are continuing to invest in Synapse Analytics to extend its capabilities and deliver more and more value to clients.
So, what are database templates? They're metadata that you can layer over your data lake to make it appear more like a relational system, like a database, than a file system.
Now that sounds fine in theory, but how do you put that into practice? One of the ways of applying database templates is to put them right at the heart of the process that you use to deliver cloud native modern data analytics solutions using Synapse.
When we engage with clients, they tend to want to start on the left-hand side of this process with the data and to try and figure out how to extract value from it. We tend to challenge that approach and suggest that they should actually forget about the data initially, and instead focus on the goals of the organisation.
Once you understand those goals, you can identify the people within the organisation that have a role to play in delivering the goal, and you can interview them to identify the data-driven, actionable insights that would empower them to deliver that goal more effectively.
You can then determine the best method of delivery for that actionable insight. So this could be a data API or an alert, or in this case, a Power BI report. And once you've established that, you can then set about designing the semantic model that will support that report. And this is where the database templates come in.
So for the first time, you're able to do that design phase inside Synapse rather than in a spreadsheet or a Visio diagram. With that in place, you can then work left to right identifying suitable data sources and the ingest / prepare / project stages that you'll need to put it through to populate that semantic model.
Now taking this approach has a number of benefits, one of which is it decouples the teams that may be working on the left and right side of the semantic model. For instance, once the semantic model has been designed, you could populate it with test data and that would then free up the Power BI team to start building the model, layering in the measures and developing the visuals.
Whilst at the same time it provides a clear target for the data team to focus on when building the pipeline that will take the raw data through that ingest / prepare / project life cycle.
Taking this approach then leads to this type of architecture being implemented. We've got the raw data on the left. It's going through the ingest, prepare, and project stages of the life cycle, all orchestrated through a Synapse pipeline.
We're leveraging different tools in the Swiss Army knife that we talked about to put the data through this life cycle, and at the end of the pipeline we've defined the target schema using a database template. It's the mini star schema that we want to present to Power BI.
Power BI can import the data and generate that model that powers the report, delivering the actionable insight to the persona that we identified in our earlier analysis.
So, let's delve into this in a bit more detail and show you what some of these things look like in Synapse.
So, the first thing is the database template where we've defined this house prices mini star schema. If you flip over to the Synapse Workspace and look in the Data menu option on the left, we can see a number of lake databases that we've defined simply by clicking + and Add New Lake Database.
You can see here we've got our house price trends database with the four tables that comprise our star schema. And if we open that database, we see the new graphical user interface that allows you to build and maintain database templates. Database templates comprise tables with columns and relationships between those tables.
To create a table, you can reverse engineer a schema: if you have a file lying in the lake, you can point to that file or folder and it will pull the schema from it and generate the table definition.
You can also choose from a wide array of industry templates that Microsoft have provided.
In this case, we have actually built this mini star schema from first principles using the custom option. You just basically get an empty table, you can give it a name and a description, and then you can start to add in the columns.
Each column has a name. You can identify primary keys. You can give each column a description. So, all of this metadata that you're adding will flow through into things like Microsoft Purview. You can identify columns which can be nulled. You can assign data types and then, where relevant, assign the format and length.
Once you've got a number of tables, you can then define the relationships between them. So here you can see I'm defining the many to one relationship from this fact table out to the individual dimension tables to which it's related.
Okay, so the next stage is then to populate that schema. So if you flip back to our architecture diagram, you can see that we've got a PySpark Notebook here that takes the cleansed data and injects that into our mini star schema.
So, let's have a look at how that's done in the code. We've encapsulated this functionality inside a functions notebook just to abstract away some of the complexity, but if you open the appropriate notebook, here's an example of how we then inject data into our database template.
You can see that this function takes in a dataframe containing the cleansed data. And then it's just a case of picking out the appropriate columns that we want to take from that cleansed data, aliasing where necessary to the new column names, and then using the insertInto syntax, where we identify the database and the table name.
Here we can choose whether to append the data or overwrite it when we're doing that operation. So, it's really straightforward to write the data into the database template once you've created it. There's a bit of schema enforcement that goes on: if you don't write the right number of columns, you get an error, for example, which is good.
But also, what's quite nice is this layer of abstraction that you're already starting to harness over the database template. You're not referencing a long path to the data; you're referencing database and table name semantics, which is a lot more logical to work with at this stage in the process.
So, if we return back to our architecture diagram now, the next stage is then (having populated the database template): how do we consume that into Power BI? This is really straightforward.
There's a new Power BI connector. So, from within Power BI, the typical approach is to "get new data". In the Azure subsection, you'll see this new "Azure Synapse Analytics workspace" connector, which is in beta, so there's a bit of a health warning that goes with it.
But when you choose to connect, you authenticate using your own credentials. And at this point, the navigator displays all of the Synapse Analytics Workspaces that you have access to on your various Azure accounts.
So, you can then navigate to the relevant Synapse Workspace. In this case it's our database templates demo environment. And when you expand that, it shows you the relevant lake databases that you've set up, and from within there you can expand further. In this case it's our house price trends mini star schema that we want to expand. We can choose all of the tables, we can preview them, and we can load those into Power BI.
So, when we load into Power BI, this is what it looks like. You can see that we've brought over the four tables that we've chosen, but importantly, we've also brought over the relationships. So, these relationships are defined by the database template. They're not discovered by Power BI and they certainly haven't been set up manually.
So, this is a great feature because it means that the design of that star schema covers not just the tables, but also the relationships between them, and both come across into Power BI together. It means you don't have to rely on people connecting up the tables in the right way, because that has already been done for them.
So, you can then layer in the measures on top of the tables to create your model, and then build your interactive Power BI report on top of that to deliver the experience that you are looking to achieve: delivering that actionable insight to that persona to achieve that business goal.
In summary, what we've demonstrated is that database templates are another tool in the Swiss Army knife that is Synapse Analytics. It's an integrated tool that sits alongside all of the other tools that you're used to, and it provides another capability that you can leverage when you're building cloud native data and analytics solutions.
We think a big feature of the database templates is that they facilitate collaboration between the domain experts and the technology experts working on a data project. It's a point where you can come together and agree the foundation for that semantic model, and then, as I described earlier, it decouples those two teams potentially to work in parallel.
We've seen the new database template user interface in Synapse, where you can define the schema. That schema is source controlled, so if you choose to sync your Synapse workspace with Git you can then apply versioning, pull requests and the like to take your database templates through that life cycle with all of the other artifacts that you've got in your Synapse Workspace.
We've shown how there's a level of abstraction provided by these lake databases, which is really useful particularly when you get to that sort of semantic layer, and you want to start using business terms to organize and present that data.
Database templates are also a great layer for discoverability. Tools like Microsoft Purview will discover them and project the metadata that you set up within the database template. So, it helps to encourage discoverability and reuse of the data that you're projecting into that "curated" layer in the data lake.
Another big feature of database templates is that Power BI integration. You pull over the tables and the relationships into Power BI. So, it helps to sustain that semantic model that you've started to build inside Synapse when you pull it through into Power BI. So that's a nice feature.
And then finally, as I discussed earlier, these database templates are now a first-class citizen within Synapse. You can take them through that DevOps life cycle as you promote your pipelines from development into test, into acceptance, and then finally into production.
So that's been a very quick whistle stop tour. I hope you found it useful. If you want to delve into this in a bit more depth, we've got a series of blogs that we're working on at the moment. We've got two out of a series of five published, and hopefully we'll get the third one out relatively soon. But here's the shortened URL or the QR code that will take you to the first blog in the series. Hopefully that'll provide you with even more information about database templates and what we've been discovering when we've been putting them into use.
All that remains to be said is thanks very much for listening. I hope you found that useful. If you've got any comments or feedback, we'd love to hear from you. Thanks very much. Bye-bye.