
SQLBits 2024

In this comprehensive session, discover how to ensure accuracy and reliability in your data projects by implementing testing methodologies for notebooks, pipelines, and Power BI reports.

James Broome, Director of Engineering, explores real-world examples from dozens of endjin's data & analytics projects; teaching techniques like unit testing in Fabric Notebooks, snapshot testing for ETL processes in Synapse Pipelines, and user acceptance testing for Power BI models. Learn to avoid common pitfalls and enhance your data practices with effective testing strategies.

Chapters

  • 00:00 Introduction and Context
  • 00:16 The Importance of Accurate Data Reporting
  • 01:53 Challenges in Data Testing
  • 03:40 Common Excuses for Lack of Testing
  • 05:29 Levels of Software Testing
  • 08:23 Unit Testing with Fabric Notebooks
  • 11:28 Refactoring for Testability
  • 19:09 Integration Testing with Pipelines
  • 22:45 Testing Pipeline Behaviour with Valid and Invalid Data
  • 23:40 Running and Monitoring Pipeline Tests
  • 25:21 Snapshot Testing for Data Validation
  • 31:05 User Acceptance Testing in Power BI
  • 33:51 Creating Executable Specifications with Gherkin
  • 34:40 Building Dynamic DAX Queries for Testing
  • 37:18 Testing Power BI Reports Locally and in the Cloud
  • 43:17 Conclusion and Final Thoughts

Transcript

So most of you will have seen this graph or a variation of it when these numbers were published by Public Health England back in October 2020. Government policy was being set on a near-daily basis by this data, and your own actions and behaviours, maybe your own hopes and fears, were influenced by what this graph was telling us. But guess what? It was wrong. Now, I'm not trying to make a political statement, and this isn't going to be a session about the correct interpretation of statistics, but fundamentally what I'm going to try and do is help you avoid being in the same situation.

So what we're talking about here, obviously, was positive COVID test cases, and for a period of eight days, Public Health England consistently under-reported data, to the tune of about 2,000 cases a day. Now, if you cast your minds back to that time, this was test and trace, this was close contact follow-up. You can imagine the impact that this error in reporting and data collection might have had. Now, I've written about what happened here. I've explained, pretty simply I think, what the problem was, and given my opinions about how it could have been avoided. If you want to read more about that, then this QR code will take you to the blog post.

Despite what you might have read in the media at the time, which seemed to blame Excel, I think that was quite unfair. Excel is guilty of a lot of things, but I don't think it was really Excel's fault in this case. To put it simply, the root cause was this: it wasn't tested. Now, I'm guessing most of you are in the room because you do one of the following. You might write Python code to generate data insights, you might write T-SQL scripts to generate pre-canned reports, you might write DAX queries for Power BI measures, you might design long-running ETL processes. And if you're doing that, I'm guessing the work you're doing has some meaning and some impact in some way.

Now, you might not have been collecting figures for a global pandemic, but you might have been. You might be working in other forms of healthcare. You might be working in medical drug trials or AI for healthcare. You might be working in financial services and processing credit card applications, mortgage applications, or people's credit ratings. You might be having an impact at a more general business level: sales reports or customer satisfaction metrics that might be influencing business strategy or performance. And really I want you to hold in mind what it is you're working on at the minute, the data you're processing, the insights you're generating, the projects you're working on, and ask yourself this question.

Does it matter if it's wrong? Because there are really only two answers to that question. And if the answer is no, then as a data professional, I would argue, why are you even involved? And if the answer is yes, then I would argue you need to prove it's right. Now I'm not going to ask for a show of hands, but I'm guessing if I did and asked how many of you have comprehensive testing solutions in place for your data projects, not many hands would go up, although there's one down here, very good.

So why is that? Now, at the risk of grossly oversimplifying things, I think there are three common answers I've encountered working with customers, projects and teams over a couple of decades of doing this. And the first reason is this: we don't have the time. Public Health England is a case in point here, right?

They were under intense pressure to get that data and reporting up and running. Corners were cut. But this argument falls down when you consider the time, impact and effort that went into cleaning up the mess when they got it wrong. The second answer might be this: we don't know the answer. This is hard when we talk about data projects, especially if we're dealing with large amounts of data. And it's even harder when the people that should know the answer, your stakeholders, your customers, don't know what the answer should be themselves.

Now, we once worked with an organization that had 15 different definitions just for gross margin, and that was one of about 50 KPIs we were reporting on. But if this is where you are, I would say forget about the technology, stop what you're doing, you need to go and do the hard work of understanding your requirements. If you want to know how to do that for a data project, this blog post points to a series called Insight Discovery that will hopefully help. That's a whole separate talk that we're not going to dig into in more detail today.

The third and final reason is this: it's hard. And I would say, yes, it is. How do you even test a Power BI report, for example? Spoiler alert, I'm going to show you that later on. But software engineering is hard. And I deliberately use the term software engineering, because if you are doing any of the things we just talked about, if you're writing Python scripts, if you're writing DAX queries, even if you're developing formulas in Excel, you're developing software. There are two things I'll say to that. One is, to be blunt, that's what we're paid to do, right? This is our job. So yes, it's hard, but we should be able to deal with this. And the second thing is, that's why I've created this talk, right? That's what we're going to dig into for the next 45 minutes.

So hopefully I will help with this, and by the time you leave, it may not seem as hard as you thought. So when we talk about software development and testing, what do we really mean? At the risk of another gross oversimplification, we might break testing down into these three levels. The first level is unit testing.

And what we're doing there is trying to isolate specific pieces of business functionality or logic so we can test it and prove it works under certain simulated conditions. So in a data space, we might not be talking to real databases or APIs. We might have simulated or fake data sets that we can curate in a certain way to make sure our system behaves as expected. What we're looking for at this point is fast feedback loops, so that as we're developing our insights we can quickly do so with confidence that they're going to be right. At the next level we're talking about integration testing, so now we're going to start stitching things together, subprocesses or subsystems, and making sure things work end to end.

So we're going to be hitting real data sources now, real databases, real APIs. And obviously the natural conclusion of this is that things are going to be slower to test, so you might not be doing this quite as often as your unit tests. And the final level is user acceptance testing: once we're happy things are done, how do we make sure our business stakeholders, our customers, our end users are happy, so they can sign it off and we can deploy things into production? Typically at this point, they're going to be doing this through some form of user interface, maybe a web or mobile or desktop application. And they're probably going to be doing it manually, because they don't have the kind of tools and technologies that we as data professionals and software developers may have access to.

Now, full disclosure, I wanted to do this whole session and all the demos that are about to come in Microsoft Fabric. You'll see in a second why that wasn't quite possible, but if we stay on the Fabric theme for a while and think about how that translates into a data project: we might think about our unit tests running in Fabric notebooks, where we're writing PySpark or Python code, small units of functionality that we should be able to test in isolation. Moving on to integration testing, that naturally fits into Fabric pipelines, where we're starting to stitch together end-to-end ETL processes, maybe reading from a data source like a data lake and writing out to Delta tables. And then when it comes to user acceptance testing, the natural user interface that people are going to use is probably Power BI reports.

So how do we test the measures and the calculations and the schema that we're designing, to make sure the insights that our end users are interacting with are correct? That's what the rest of the talk is going to be: three technical demos, one for each of these different scenarios. Now I'm going to move quite quickly. There's a lot to cram in and 50 minutes isn't a huge amount of time. So I want you to focus less on the specific lines of code you might see. It's certainly not going to be an end-to-end code walkthrough, but try and think more about the higher-level tactics and concepts that I'm showing.

Because what I'm trying to do is explain that whilst it might be hard in certain cases, it's definitely possible. And my aim is to try and leave you inspired to feel like you could take some of these tools and techniques back to your own projects as well to avoid being in the situation we saw at the beginning.

So without further ado, let's move straight into demo number one. This is unit testing using Fabric Notebooks. Actually, no, I'm going to do one more slide first and talk about notebooks. So why do we like notebooks? Notebooks are great for ad hoc, experimental data processes. We can quickly get things running. We can explore data. We can document what we're doing, our steps along the way, which means they're a very collaborative tool, and they're very good in an agile, let's-just-get-something-done-quickly-and-add-some-value way. The flip side is that the friction to go from ad hoc and experimental in a notebook to something real, something that's repeatable and used and that we can actually base business decisions on, is really low.

That's great, but the cost is that we often ignore a lot of solid engineering best practices, like testing. So let's dig in. First of all, I'm going to introduce our use case. It's very simple. We're a company that sells stuff. We have a data extract here from our sales system that has order ID, customer ID, some dates, quantity, unit price, total price, right?

And let's imagine we are a data professional who's been tasked with exploring that data to produce a monthly report on how our sales performance is doing, and maybe to understand some customer insights from what we can see in that data. So we use Fabric Notebooks, and all we're doing here is pulling our data in from our data lake. We start to explore it using a summary statement and realize straight away, oh, this looks a bit odd. We've got 76 rows of data here, but our maximum order ID is 75. Maybe we've got some duplicates in that data, so we do a bit of digging and realize we do. There's a bug in our source system, so what we actually need to do is drop any duplicates based on the order ID column.

And now we've got 75 rows as we would expect, so we're doing a bit of preliminary data cleansing. We know we're going to be reporting on our sales data by month, so we add a sales month column in, show that data here, and start to have a look at it. We might visualize it, have a poke about, and think: this is interesting, our customers have different sales habits.

Customer five in particular is interesting: they only placed one order, and that was over a year ago. Maybe we need to introduce the notion of whether a customer is active or not, so we talk to our business and define some rules around what we consider an active customer to be. And that might be someone who's placed multiple orders.

And for a customer who's inactive, let's say they only placed one order and it was over a month ago, we might want to follow up and offer them a discount or entice them back to place more orders again. So we create a new data projection based around our customers with a notion of customer active status.

And then we go through the motions of aggregating our data by month and basically saving the output for our reporting. So we've got total sales by month here. And this is great. Now the value is, in Fabric, and Synapse as well for that matter, once we've got something experimental and useful, we can just do add to pipeline, and immediately that notebook can be run, if we add a trigger, on a nightly basis as our new data extracts land in the lake. Immediately we've got a repeatable process, we can do our reporting on that, and everything's great.

So the time to value from ad hoc and experimental to something real is really tiny, but what if it's wrong? How do we bake some confidence into that process? What you'll notice is that the example I just showed was really a set of logical steps. It was one long end-to-end process, but really it was a set of discrete things that we did. We dropped duplicates, we looked at our customer status, we aggregated by month, we wrote things out. It was basically an ETL process inside a notebook. And the first thing that we can do is actually be clear about that and restructure our notebook into a series of logical steps, and define each of those steps as functions.

Now what's great about notebooks is they have a logical execution boundary built in, which is cells. So the natural thing to do is to split those steps out into separate cells, each defined in a separate function, and name that function with the intent of what it's doing. So we're loading our data, we're removing our duplicate orders, we're adding our order month column, we're calculating customer status, we're counting our sales by month.

Here's the series of steps in our end-to-end workflow. If we were to run this notebook, something interesting would happen, and that is: nothing would happen. Because all we've done is define our functions; we need to actually call them to get the same results as the previous notebook, where everything was end to end. So in order to make that work, we'd have to add an extra cell at the bottom, our workflow cell, which calls all the functions in order, and now we get the same outputs at the bottom. Now this simple process of refactoring might not seem like much, but it's the first step in making things testable.

What we've done now, because we've got separate functions, it means we can call those functions independently, which means we can test them in isolation. And the key thing here is each of those functions takes a data frame, which it's going to act on. So those functions are no longer tied to the original data frame we've got at the top, which was our data from our data lake.
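
To make that concrete, here's a rough sketch of what those per-cell functions might look like in PySpark. The function and column names here are illustrative rather than the exact code from the demo; the point is the shape: each function takes a DataFrame and returns a new one, with no knowledge of where the data came from, and nothing runs until the workflow cell calls them in order.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# Cell: load the raw extract from the lake
# ("spark" is the SparkSession a Fabric/Synapse notebook provides; the path is a placeholder)
def load_orders(path: str) -> DataFrame:
    return spark.read.csv(path, header=True, inferSchema=True)

# Cell: drop duplicate orders based on the order ID column
def remove_duplicate_orders(orders: DataFrame) -> DataFrame:
    return orders.dropDuplicates(["OrderId"])

# Cell: derive a sales month column from the order date
def add_order_month(orders: DataFrame) -> DataFrame:
    return orders.withColumn("OrderMonth", F.date_format(F.to_date("OrderDate"), "yyyy-MM"))

# Cell: flag customers as active or inactive using the agreed business rules
# (here: active = more than one order, or a single order placed within the last month)
def calculate_customer_status(orders: DataFrame) -> DataFrame:
    per_customer = orders.groupBy("CustomerId").agg(
        F.count("OrderId").alias("OrderCount"),
        F.max("OrderDate").alias("LastOrderDate"),
    )
    is_active = (F.col("OrderCount") > 1) | (
        F.to_date("LastOrderDate") >= F.add_months(F.current_date(), -1)
    )
    return per_customer.withColumn("IsActive", is_active)

# Cell: aggregate total sales by month for the report
def count_sales_by_month(orders: DataFrame) -> DataFrame:
    return orders.groupBy("OrderMonth").agg(F.sum("TotalPrice").alias("TotalSales"))

# Workflow cell: nothing happens until we call the functions, in order, over the real data
orders = remove_duplicate_orders(load_orders("Files/orders.csv"))
orders = add_order_month(orders)
monthly_sales = count_sales_by_month(orders)
```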

And if we can pass any data frame into any of those functions, we can get that function to execute over that data scenario. Now, if you come from the world of software development, this might seem familiar to concepts such as inversion of control and dependency injection. Fundamentally, we've broken the dependency between where the data came from and what we're going to do with it.

So the next notebook I'm going to show you is a second step further. It's got exactly the same functions that we just had before, but what we've now done is add a test case. And our test case is just another function, but we've described what we want to test. In this case, we want to test that orders with a duplicate order ID are removed. That was that first piece of business logic that we added in: we had duplicate orders, they had duplicate order IDs, we wanted to drop them. This may seem like overkill, why on earth would we test one line of code, right? All we did was a drop based on order ID. But you'll see in a second why it's important.

I'm just going to hit run on this to get that started whilst I'm talking. Now there are a few things that are interesting about this function. I'm not using any test frameworks or anything, I'm just using the built-in assert statement in Python to be able to match one value to another. But if you look at the structure of this function, I've got these comments in here that say Arrange, Act, Assert.

And that's a very common pattern in testing. What we're going to try and do is arrange our scenario context, arrange the conditions in which we want the test to perform. In this case, we're setting up the data frame that we want to pass into our function. The act is, we're now going to call the thing we want to test, which is our removeDuplicateOrders function. And the assert is: what happened? Is it as we expect it to be? All right, so the second thing I want to point out is, in the middle of that, we're creating a Spark data frame in memory. So this has nothing to do with a data lake, nothing to do with a database. Inline, in code, we've created a Spark DataFrame to set up our test condition.

And in this case, because we are testing that orders with a duplicate order ID are removed, we've created a DataFrame of two records. They both have the same order ID of number 10. If we pass that into the function, what would we expect to happen? We'd expect one of them to be dropped. And we are asserting that the number of orders coming back is one.
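
A minimal version of that test, written against the sketch above (same assumed column names, plain built-in assert, no test framework), might look like this; the Arrange, Act, Assert comments mirror the structure described in the demo.

```python
from datetime import date

def test_orders_with_duplicate_order_id_are_removed():
    # Arrange: an in-memory DataFrame with two rows sharing order ID 10;
    # nothing here touches the data lake or any real data source
    test_orders = spark.createDataFrame(
        [
            (10, 1, date(2023, 1, 5), 2, 10.0, 20.0),
            (10, 1, date(2023, 1, 6), 2, 10.0, 20.0),
        ],
        ["OrderId", "CustomerId", "OrderDate", "Quantity", "UnitPrice", "TotalPrice"],
    )

    # Act: call the function under test
    result = remove_duplicate_orders(test_orders)

    # Assert: one of the two duplicate rows should have been dropped
    actual = result.count()
    assert actual == 1, f"Expected 1 order after removing duplicates, but {actual} returned"

test_orders_with_duplicate_order_id_are_removed()
```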

So I ran that notebook in the background and you can see it has succeeded. We've got a tick, the test passed. Not particularly exciting, but what happens if I now change our duplicate order logic? At the minute I'm dropping duplicate orders based on the order ID; if I comment that line out, let's change the behaviour so that our duplicate order logic is to do a distinct statement instead.

Now that's a logically separate piece of logic, because instead of just checking that order ID column, it's going to check every column in the data frame. If I was to run that again, our test is now going to fail, because our test case doesn't match the business logic in that line of code. What's going to happen is we're going to get two rows back, because although the order ID is the same, you can see the order date is different. So they're not distinct rows, and you can see now we've got an error up in Fabric. Our test has failed: we expected one order after removing duplicates, but two were returned.

So we've tested one piece of logic there. Now you can imagine that this is obviously going to get quite unwieldy quite quickly. We've added our test logic in with our business logic, and that's one test case for one line of code. With a big process, suddenly things are going to get unmanageable.

So before we start adding more tests, what we're going to do is a bit more refactoring. The first thing we're going to do is create a separate notebook altogether with just our functions in it, just our business logic. You can see here, that's all it is. The steps that we are working on, all split out separately, and nothing is happening.

It's just our business logic code in that notebook. And then we're going to create another notebook, which is just our test cases. And how we link those together is using this run command, which allows us to execute one notebook from another. So the first thing this notebook is going to do is load all the functions defined in the functions notebook into memory.

And we can act on them in this notebook just as if they were part of this notebook. So it's a really great way to compartmentalize and modularize that logic. You can see now in this test notebook, we've got a lot more tests. We've got our original test: duplicate order IDs are removed. The logical next thing to do is test the opposite case: orders with different order IDs are not removed. So 10, 11, 12, we're going to end up with three. We can then go on and test our customer status. So customers with multiple orders have an active status.

A customer with only one order, if it was in the last month, has an active status as well. So we're going to get one active customer. But if that was over a year ago, that's inactive, so we get no active customers. And at the end, we're just executing all our tests. So now we've got a comprehensive testing notebook that's separate from our business logic.
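
Pulling those pieces together, the tests notebook might look roughly like this. It assumes the functions live in a notebook called orders_functions (a made-up name) and uses the %run magic, which in Fabric and Synapse notebooks loads another notebook's definitions into the current session.

```python
# Cell 1: load the business-logic functions from the separate notebook into this session
# (%run is notebook magic and needs to be the only statement in its cell)
%run orders_functions

# Cell 2: test cases, each arranging its own small in-memory DataFrame
from datetime import date

def test_duplicate_order_ids_are_removed():
    df = spark.createDataFrame([(10, 1), (10, 2)], ["OrderId", "CustomerId"])
    assert remove_duplicate_orders(df).count() == 1

def test_different_order_ids_are_not_removed():
    df = spark.createDataFrame([(10, 1), (11, 1), (12, 2)], ["OrderId", "CustomerId"])
    assert remove_duplicate_orders(df).count() == 3

def test_customer_with_multiple_orders_is_active():
    df = spark.createDataFrame(
        [(1, 5, date(2022, 1, 1)), (2, 5, date(2022, 6, 1))],
        ["OrderId", "CustomerId", "OrderDate"],
    )
    assert calculate_customer_status(df).filter("IsActive").count() == 1

# Final cell: execute every test; the notebook (and any pipeline that runs it) fails on the first broken assert
test_duplicate_order_ids_are_removed()
test_different_order_ids_are_not_removed()
test_customer_with_multiple_orders_is_active()
```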

And the natural conclusion to that is we create the inverse, which is our end-to-end ETL workflow notebook, that now does everything else. And you can see what I'm doing here is actually making that connection back to the data lake to load the real data frame in and save the outputs, and then calling our workflow here, which calls into the same functions notebook.

So we've separated our test code from our real workflow code, isolating our functions in the middle. The natural conclusion to that is we can add all of that into a pipeline and actually say: here's our data pipeline. The first thing it does is run the test notebook, and only if that succeeds are we going to run the actual business logic notebook. And you can see that if for whatever reason our test notebook fails, like this one did when I ran it yesterday, you will start to see that we're getting errors, and if we look at the errors, it'll tell us what went wrong, and our error messages are in here: expected 1, but got 2. Just by a simple process of refactoring our notebooks, once we've got things the way we want them, we can make them testable.

And the key point I want you to remember is that notebooks contain code. Code should always be tested. By defining functions for logical steps, it makes them testable. Using in memory data frames and passing those into our functions, we can simulate test data scenarios without needing to actually have that in the data lake.

And it means we can test all the different cases and code paths that we might want to. Now the key thing here is you want to try and find the right balance between being agile and adding value on the one hand, and quality assurance on the other. That's always going to be a fine balance to strike. But the other thing I want to say is that this approach is absolutely applicable to other notebook technologies.

It's not Fabric specific. You can do the same thing in Databricks, the same thing in Synapse notebooks, the same thing in Jupyter notebooks, et cetera. Demo number two: integration testing. Now this was supposed to be in Fabric Pipelines, but I couldn't do that because the API for Fabric Pipelines doesn't exist yet.

So we've fallen back to Synapse Pipelines. And the thing about pipelines is that they're going to take a long time to run, right? If you're doing this in Synapse, if you're running anything in Spark notebooks, there's a minimum of probably a two or three minute spin-up time before the cluster's active and ready to go. The other thing is, since we're talking about integration testing now, you're probably going to be dealing with real data from a real data lake, and writing back to a sink. So you've probably got lots of data to validate, and that's hard. How do we deal with that? So there are a few things we're going to see.

The first one is a simple technique called asynchronous polling. So we're no longer going to do request-response testing like we just did in notebooks, where we call something and expect an immediate answer. We're going to trigger a pipeline, poll it for completion, and see what the results are.

Once you've got that pattern in place, it doesn't matter if it takes 10 seconds or 10 hours, right? It's just going to work. The other two things we're going to talk about are behavioural scenario testing, where we care less about the specific data points themselves, like "the order count is two" in the last example, and just want to see that our pipeline behaves as we expect under certain conditions; and a technique called snapshot testing, which allows us to test large data sets by serializing them to disk and just making sure they match every time, whether there are 10 data points or a thousand.
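
The demo that follows is written in TypeScript against the Node.js SDKs, but the polling pattern itself is language-agnostic. Purely as an illustration, here's roughly what triggering a Synapse pipeline and polling it for completion could look like in Python with the azure-synapse-artifacts and azure-identity packages; the workspace name, pipeline name and parameters are placeholders, and this is a sketch of the general pattern rather than the demo's own code.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

def run_pipeline_and_wait(workspace: str, pipeline_name: str, parameters: dict,
                          timeout_s: int = 3600) -> str:
    """Trigger a pipeline run, then poll until it reaches a terminal state; return the final status."""
    client = ArtifactsClient(
        credential=DefaultAzureCredential(),
        endpoint=f"https://{workspace}.dev.azuresynapse.net",
    )

    # Trigger the run: this returns immediately with a run ID, not a result
    run = client.pipeline.create_pipeline_run(pipeline_name, parameters=parameters)

    # Poll for completion: the test doesn't care whether this takes 10 seconds or 10 hours
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = client.pipeline_run.get_pipeline_run(run.run_id).status
        if status in ("Succeeded", "Failed", "Cancelled"):
            return status
        time.sleep(15)
    raise TimeoutError(f"Pipeline run {run.run_id} did not complete in time")

# A behavioural assertion then only cares about the terminal status, not the data, e.g.:
# assert run_pipeline_and_wait("my-workspace", "orders-pipeline",
#                              {"inputPath": "tests/orders-2023.csv"}) == "Succeeded"
```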

So, back to the demos. What we're going to do first of all is go into Synapse and look at a pipeline. This isn't the pipeline that we would naturally have assumed from the last demo, because Spark would take too long to run in this kind of scenario. What I've got is a very simple pipeline; it just has a copy data activity.

All it's doing is copying data from source to sink. It's reading our orders data in from the data lake and mapping it to a Parquet format back in the data lake, doing that schema mapping along the way. But what's interesting here, if you look at the pipeline, is that it's got a whole bunch of parameters. And the parameters are: where's it going to read the data from, and where's it going to write the data to. And the reason for that is exactly what we did in the last demo when we were passing that data frame into the function. What we've done is we've parameterized the sticky bits, if you like, of where the pipeline is getting the data from.

And if we can do that, it means we can pass in different bits of data and get the pipeline to act on different data scenarios, just like we did in our unit testing example. So what I'm going to do now is switch into Visual Studio Code, and what we're looking at now is a Gherkin specification.

Now, if you've done any software development before, this might be familiar. If you haven't, Gherkin is a widely used language for writing test cases. It's plain English. It's supposed to be there so you can talk to people who aren't developers and aren't software engineers and describe the behaviour of a system. What we're looking at here is a JavaScript-based project; we're using TypeScript and Node.js, but that doesn't matter at all. There are Gherkin test runners for every programming language under the sun, and you'll see that in a second as we carry on.

But what's interesting about Gherkin is it's very similar to that Arrange, Act, Assert pattern that we had in the last demo. Instead of Arrange, Act, Assert, what we're using in Gherkin is Given, When, Then. Given a certain set of conditions is in place, when the thing that we want to test happens, then the outputs should be as follows. And we can use that to describe any type of system. In this case, we are describing the execution of a pipeline, so the language is around pipelines: given the data to be processed is this orders CSV file, when the pipeline is triggered, then it should run successfully, and we should have five rows of output.

Now you see we don't really care what that data is. I don't care about specific data points. I just care that the behaviour is: should run successfully, should have five rows. The next scenario is slightly different, which is: if we give it bad data, the pipeline should fail. So we've baked the behaviour of our pipeline into this test without really caring what the data is. So how are we doing this? I said that the pipeline was parameterized and we could pass in different data scenarios. So actually the data files themselves are the scenarios. If you look here, orders 2023 CSV, we have that file over here. It's a simple CSV file, same schema as we saw before, only five rows of data.

It's very easy to manage. They're all valid and should map successfully. If we run the pipeline and process this, it should work. Bad orders, on the other hand, looks like this. We've got one bad row of data in this file, and that should break our pipeline, because we're trying to map to a Parquet schema, that row isn't going to map correctly, and everything should fall over.

So what I'm going to do is run this test. I'm just going to make this bigger so you can see it. And I am going to do that. And whilst that's running, I'm going to switch back to the pipeline, go to the monitor tab and hit refresh. And hopefully, once you can see the little dots there with the test starting to run, at some point we should see the pipeline kick in.

There you go. We've got one pipeline run started and another one has started too. So they're running. Now, what we'd expect to happen in this case is the following. Both of the tests should pass, because I know that the pipeline is working as expected. But one of the pipeline runs is actually going to fail.

And that's a valid test case. We expect it to fail, right? So that's great. The behaviour is valid even though that pipeline run failed. The behaviour of the pipeline is correct, and the two tests pass. Okay, now what if we were to change that behaviour? So let's go back into the pipeline and say, actually, we don't care about bad rows.

So we're going to go in here and set fault tolerance to skip incompatible rows, we don't care about logging, all right, and I'm going to commit that change and I'm going to have to publish it as well. So now we've changed the behaviour of our pipeline. Now when it encounters bad rows, it'll just gloss over them and carry on, and it won't break the pipeline. So let's publish that, and now we can run the test again. In this case the opposite is going to happen: both pipeline runs are going to succeed, because we've changed the behaviour to allow that, but the second test is going to fail, because the behaviour is now different. We expected it to fail.

It didn't fail, it succeeded. So that's how we can lock in our pipeline integration testing behaviour in those simple BDD specs. Now, the next thing I talked about was snapshot testing, and that's interesting. What if we did actually care about the data points that were coming back? I'll just wait for this to run through before we move on. Hopefully it won't take too long. Let's just make sure this is running.

So one has succeeded, the second one has succeeded, there you go. Both pipeline triggers have succeeded and, as you can see, our test has failed: it says the expected result was "Failed" and actually it succeeded. So the behaviour is now wrong, and the test fails. But what if we did care about the data points themselves?

If I uncomment this, and let's comment this one out just to speed things up a bit, I'm going to run this test again. And what this last step does is actually query the data that was written to disk by that pipeline, and serialize it to my local file system.

We've got this folder up here called snapshots. Nothing is actually going on at the minute, there's nothing in there, but when I run this test, it'll actually write something to disk. And whilst that's running, I'll show you what's happening behind the scenes. Behind each of these steps is actually just some code, alright? And it's pretty simple, but really all we're doing is making use of two things. One is the Azure Data Lake SDK. The other is the Synapse Artifacts SDK. The Data Lake SDK is uploading these files into the data lake so they're ready to be processed. The Synapse Artifacts SDK is just triggering a pipeline run and polling for completion, and I'll show you what that looks like.

Here we go. So when we say run pipeline, all the config that I've got is around which Synapse workspace we're pointing to and which pipeline name we're pointing to, plus all the parameters around where the file I've uploaded is, which we pass into the pipeline. We trigger a pipeline run, get the pipeline run, and poll for completion. Once it's finished, we get the status. And then in my actual steps themselves, we're just checking the result, making sure it matches what we expected, et cetera. The snapshot bit is interesting, because what we're doing now is reading the file that was written out back from the data lake, reading it back from Parquet, and in this case I'm using a particular framework called Chai Snapshots, but snapshot testing is a technique built into lots of different testing frameworks; it's not specific to this one.

You'll probably find it if you do some research in whatever testing framework you use. And what's happened now is that the pipeline has run, the test has succeeded, and you can see over here we've got a serialized dataset in JSON of the output that was in the data lake. Now, the important point about this is that the first time we run it, the test succeeds and we get the file.

And it's on us to validate this is correct. Is this the output we'd expect? It looks like it is. There are five rows of data, they've all mapped correctly, all the column names are correct, and that's great. Now that we've got this in our test framework, we can commit it to version control, along with our test scenarios, along with our code.

And if it was ever to change when we run that test again, things would break, and I can show you what that looks like. So let's go back into our pipeline and simulate someone changing our processing logic. We're going to go in here and change our mapping logic, and let's say order ID is now going to map to a column called just ID. Let's commit that and publish it, and as soon as that's finished, we can kick off the test again.

And in this case, the pipeline is going to succeed. The data is going to be mapped, although now the order ID column is going to map to a column called ID. So when our code gets to the step that says the output data matches the expected results, which I think is what we said in our feature file, it's going to get the data back from the data lake again, it's going to compare it to the snapshot that we persisted last time, and it's going to ask: do these two things match?

And if they do, the test succeeds. If they don't, the test fails. So there you go, the test has failed. We can have a look here at why it's failed, and it will tell us it's failed because what it was expecting was a property called orderId and actually it's id in every case. But that technique would have worked if it was the calculations that changed, or there was extra data, or things were missing, whatever it might be.

Whatever logic had changed, the fact is that the snapshot doesn't match the snapshot we already had. So that's a great way to test any size of data set: whether that's five rows of data or a million rows of data, you don't have to check every individual data point in turn, you can just say, under these conditions, given this input set of data, the output should look like this. Is it the same every time? Yes or no? And you're baking regression coverage into that process.
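
The demo does the snapshot comparison with a Chai-based helper in the Node.js test project, but the underlying idea is simple enough to sketch by hand. Assuming the pipeline's output has been downloaded from the lake as a Parquet file (the paths and helper name here are made up for illustration), a bare-bones snapshot check in Python might look like this.

```python
import json
from pathlib import Path

import pandas as pd  # reading Parquet needs pyarrow or fastparquet installed

def assert_matches_snapshot(parquet_path: str, snapshot_path: str) -> None:
    """Serialize the pipeline output to JSON and compare it to the committed snapshot.

    The first run writes the snapshot; it's then on us to eyeball it, commit it to
    version control, and let every later run fail if the output ever drifts.
    """
    # Read whatever the pipeline wrote out and serialize it deterministically
    actual = pd.read_parquet(parquet_path).to_dict(orient="records")
    actual_json = json.dumps(actual, indent=2, sort_keys=True, default=str)

    snapshot = Path(snapshot_path)
    if not snapshot.exists():
        # First run: persist the snapshot alongside the test scenarios
        snapshot.write_text(actual_json)
        return

    expected_json = snapshot.read_text()
    assert actual_json == expected_json, (
        f"Output no longer matches the snapshot in {snapshot_path}; "
        "either the pipeline logic changed or the snapshot needs reviewing"
    )

# assert_matches_snapshot("downloads/orders-output.parquet",
#                         "snapshots/orders-2023.snapshot.json")
```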

What we've seen with pipelines is that long-running ETL processes can be tested using polling: kicking off asynchronous jobs and polling for completion doesn't have to be request-response, which means you can do end-to-end pipeline testing. Behaviours of pipelines can be tested as well as data values.

How does it operate under this set of conditions? Whether we're dealing with bad data or good data, whether it's overwrites or appends, whatever it might be, we want to test those different scenarios. And snapshot testing allows us to add regression coverage to large datasets without caring about every individual data point, like we did in our unit testing coverage.

And of course, as before, this approach is applicable to any tooling or technology. We saw it here in Synapse Pipelines, it works in Data Factory, and it will eventually work in Fabric Pipelines once those APIs are there. Third and final demo: user acceptance testing in Power BI and/or Fabric. So there are a few caveats around this, right?

There are no real official or readily available testing frameworks or tools, I would say, around Power BI testing. We've been using this kind of technique at endjin for about seven years, and I have yet to see anyone else do the same thing, so hopefully you will find this valuable. But the important point is you don't have any control over the user interface, right?

We're testing a product. Power BI renders its own reports. We don't really control how that looks. We don't really have control over the DAX expressions those visualizations are generating, but you can see what they are if you use the Performance Analyzer in Power BI Desktop. But when you understand that behind every Power BI report is a tabular model, then you realize quite quickly that we can actually test that tabular model. We can add a lot of confidence at that layer. So all we're doing here is testing the tabular model only. But even with those caveats, there's huge value in this, and I know that because we've been using these frameworks for years. It allows us to bake confidence into our schema, the measures that we're generating, the relationships in our model, and if we've got that, then everything else is going to be a lot more plain sailing.

So before we dig into Power BI, I want to introduce the concept of UAT. What does UAT mean in this scenario? How do we work with our stakeholders to actually get some notion of what the things should look like? So what we typically do in data solutions is we will try and find a data set that's small enough that a stakeholder can manipulate in a tool that they're comfortable with.

Which is always going to be Excel, but big enough that it can contain as many of the different scenarios as we care about, to simulate the different conditions that we want to test. So what we've got here is an Excel spreadsheet with the same order data we saw before, and the first step might be to sit down for five minutes or five weeks, however long it might take, to understand what people want to report on.

And we might say, okay, what do you want to see? And they say, okay, first of all, the total sales amount of all time is going to be 8,905. And we say, okay, but what about the fact that you've actually got duplicate orders in your system? And they say, yeah, okay, we don't want those, let's delete that. Actually, the total sales amount in this dataset should be 8,740. We say, yeah, great. Okay. What else do you care about? And they say, we want to break our sales data down by time. We say, okay, so if we looked at sales data in 2022, it'd be 3,300. And they say, yeah, and we use a pivot table to do that.

And then we say, what else? We want to filter between active customers and inactive customers. We talk about that and make our rules around that status. And we say, okay, so customer five in this data set is inactive. What would that look like? So we go through and get rid of customer five, refresh our pivot table again, and say, okay, so the sales total for active customers will be 8,680. And they say, yeah, great, okay, if we've got a report on that, we're laughing. So we translate all of that into a Gherkin spec. And what we've got now is Visual Studio. So now we're in .NET. This is a C# testing framework.

But as we've just proved, Gherkin is not specific to any programming language or technology. We saw it last time in JavaScript and Node.js; this is now .NET and C#, using a test framework called SpecFlow. You'll see why I'm using .NET in a second as we go further down. But we have translated those requirements into an executable specification. What's interesting now is that, even though the given/when/then is still exactly the same, the language of the specs is different, because now we are testing tabular models. Okay? So we are talking about measures. We are talking about filters. We are talking about row counts. The language of the specification is in tune with the thing that we are testing.

So what do we do with this? Just like the other example, behind each of these steps is some code, and really all we're doing as we execute these steps is building up a dynamic DAX expression. Now, if you've ever done anything in code to programmatically create a SQL statement, for example, it's exactly the same principle, except it's not T-SQL, it's DAX. So every time we add a step, say a query with a total sales measure, if I go to the definition of that, I've got this DAX helper class, and it's going to add a measure into basically a big string that I'm building up.

Ultimately, when I run this, you'll see in a second, it's going to create a DAX query that we're going to execute. So let me run this test. I'm going to debug it and you'll see it running. Let me wait a second. So adding a measure to the query is one of the steps that we just saw. The measure name was total sales. I'm adding it to this property bag of measures, and ultimately we're going to dynamically build up a DAX query that looks like this: evaluate, summarize columns, total sales measure. So very simple, and we can execute that against the tabular model.

And how we can do that is because in the Power BI service there is a REST API called ExecuteQueries. Microsoft have exposed an API that allows us to fire queries at the tabular model. It looks just like this: we pass in a JSON payload with a DAX expression, and it gives us a set of tabular data in response. So all this test framework is doing is dynamically building up a DAX query, and then if I run it through, we'll get to the point where it's going to make that call to the REST API, interpret the results, and check them against our test cases. This bit just takes a second because it's going to authenticate against Power BI.
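
The talk mentions later that the same approach has also been built in Python against this REST API. Purely as an illustrative sketch (the dataset ID and the choice of credential are placeholders, not the demo's code), firing a dynamically built DAX query at the executeQueries endpoint looks roughly like this.

```python
import requests
from azure.identity import InteractiveBrowserCredential

def execute_dax(dataset_id: str, dax_query: str) -> list[dict]:
    """Run a DAX query against a published Power BI dataset via the executeQueries REST API."""
    # Acquire an Azure AD token scoped to the Power BI service
    token = InteractiveBrowserCredential().get_token(
        "https://analysis.windows.net/powerbi/api/.default"
    ).token

    response = requests.post(
        f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries",
        headers={"Authorization": f"Bearer {token}"},
        json={"queries": [{"query": dax_query}],
              "serializerSettings": {"includeNulls": True}},
        timeout=60,
    )
    response.raise_for_status()
    # Tabular results come back as a list of row dictionaries per query
    return response.json()["results"][0]["tables"][0]["rows"]

# The test steps build the query up dynamically; the end result is just a string like this:
query = 'EVALUATE SUMMARIZECOLUMNS("Total Sales", [Total Sales])'
rows = execute_dax("<dataset-id>", query)
# A "then" step would now assert on the returned values, e.g. the total sales figure
```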

It'll stop in a minute. Here we go. So we're using the Power BI REST API client. It's in the datasets namespace, it's calling the execute queries async endpoint, and in the body of this request message is our DAX query that I dug into; here you go, you can see it again. So it's firing that DAX query off to Power BI, it's coming back, and it's taking the results as a data table, looking through them and making sure they match our scenario, and you'll see that test will have passed. Simple stuff, right? But pretty awesome. However, it's limited, and the reason it's limited is we can only run this test when our reports are published up in the Fabric or Power BI service.

We don't get that in a dev loop. We don't get the ability to test our measures as we are designing them in Power BI Desktop. Except we can, right? And that's why I'm doing this in .NET. Because actually behind every Power BI Desktop report is a local instance of Analysis Services. And over the years, what's been really interesting is watching Power BI and Analysis Services merge together as one tool set.

To begin with, this test framework was born out of an Analysis Services project that we developed about seven years ago, and we were firing these tests at Azure Analysis Services. What then happened was the endpoints that were available in Analysis Services became available in Power BI, and we were able to shift the whole thing to Power BI. Then a couple of years ago that ExecuteQueries endpoint came about, which made everything a lot easier. But in Power BI Desktop, if you use anything like DAX Studio, you'll see something interesting when you open it up.

What you'll see down here, very small, is localhost:65044. That is my Analysis Services server instance running on this laptop. If I was to close this Power BI report and open it up again, that would be a different port number. Every time you open Power BI Desktop, it's actually running Analysis Services on your machine, which is why it doesn't work on a Mac. If we can get at that, we can do what DAX Studio is doing. All DAX Studio is doing is connecting to that tabular model. That's how it can see the schema, and it's how it can execute queries. And if we know that DAX Studio can do it, then why can't we? So I'm going to change this test framework, and I'm going to flick this switch to true.

And what's going to happen now is that, rather than firing that DAX query at the Power BI service's execute queries endpoint, it is going to run the same set of tests over that local Power BI report, and it'll be a lot faster. Oh, I didn't mean to debug it, sorry. So let's debug. And you'll see it's exactly the same code base, the feature file is exactly the same, the steps are exactly the same. But when it goes to make the connection, it falls down a different code path. Instead of going to my Power BI API helper and querying through the REST API, it goes in here, and what we've got is some low-level process interrogation, which is just what DAX Studio does, to find that port number 65044 for our report name, which is sales report, which is open.

If I close that report, this wouldn't work. And it is using ADOMD, which will look very familiar if you've done anything in .NET with SQL; it's the tabular model equivalent for BI cubes. So we create a connection, we use an ADOMD data adapter to fill a data table of results, and we're passing that query into it; that's how it's working.

So I run that, everything else is the same, the test case is the same, and the tests pass. Now, to prove that's not smoke and mirrors, because I know some people find this quite unusual, I'm just going to change this measure and say, okay, let's change total sales to not be that, which changes the amount. You can see immediately my value changes up there in the actual report. And if we were to run those tests again, they should now largely start failing.

There you go. Because now we're pointing at that report. And if we change the config back to point to the service, they should all pass again, because we haven't changed the report in our service. So the tabular model up in the Power BI cloud is different to the one we're running locally. So now we've got a developer loop for validating and testing our reports as we're developing in Power BI Desktop, before we push them up into the Power BI service and then maybe run them again with a different set of data, maybe some kind of load test data or different scenarios.

So you start to factor that into some of the other things you might have heard about this week around Power BI version control and Power BI deployment pipelines, and you can start to see how you can build quite comprehensive engineering practices into your Power BI development and deployment processes.

What we've just seen is that behind every Power BI report is a tabular model. And a tabular model is just a database. It's not a relational database, it's a columnar database, but it's a database nonetheless. And if we can execute queries against it, then we can test it. The real value in that, even though we're not talking about the UI and we haven't tested any report visuals, is the confidence that we can test the schema of our tabular model, we can test the transformations, we can test our DAX expressions, and if we know that's all right, we're 90 percent of the way there.

Now there are different options for how we do this. You saw me use the .NET SDK, but clearly the first thing I did was use the REST API, and that means we could do this in any programming language. We could have rebuilt this in other languages, and in fact we have: we've also built it in Python, and using that REST API you can do exactly the same process.

If you want to do things locally in Power BI Desktop, you need to use .NET, you need C#, because those ADOMD.NET SDKs are only available in .NET. So the different options have different implications. If you want to know more about that, there's a blog here that talks about the ExecuteQueries REST API in particular, and if you explore a bit more around those blog posts, you'll find a whole bunch of stuff around Analysis Services using that ADOMD.NET connector, which is hard to say.

And all of that is equally applicable to Power BI reports, as we just saw. So that's the end of the demos. What I hope to have proved to you is that whilst a lot of this stuff is hard, especially if you consider yourself a data person rather than a software developer, it is at least possible.

All right: if you are writing Python code in notebooks, and you can create in-memory data frames and restructure logic into well-defined functions, you can test them. If you're building long-running processes, you can find a way to test them asynchronously. If you want to test behaviours and care less about individual data points, you can use snapshot testing to serialize data to disk. You can find a way to test all of those.

And clearly, we've just seen a great way to test Power BI reports, both locally in Power BI Desktop and up in the Power BI service in the cloud. So the final thought to leave you on is this: if it does matter whether it's wrong, then you need to prove it's right. Thank you very much for listening.