James Broome

Data Platform Virtual Summit

How to ensure quality and avoid inaccuracies in your data insights.

The need to validate business rules and security boundaries within a data solution is important, as is the need to ensure that quality doesn't regress over time as the solution evolves. Data insights are useless or even dangerous if they can't be trusted, and in this session, James will explain how and why they should be treated just like any other software project with respect to testing - by building automated quality gates into the end-to-end development process.

During the session, he'll walk through some practical examples and proven techniques for testing data solutions - including Power BI reports, Synapse Pipelines and interactive Spark notebooks.

Transcript

Hello. My name is James Broome and I'm the Director of Engineering at a UK-based technology consultancy called endjin. And today I'm gonna talk about ensuring quality and avoiding inaccuracies in your data insights. Now, the majority of this talk is gonna be a hands-on technical walkthrough. I'm gonna show you some proven patterns, practices and techniques for adding testing to your data solutions.

I'm gonna demonstrate how you can add automated quality gates in areas that you might not have even considered. Specifically, we're gonna look at testing Power BI reports and Azure Analysis Services models, long-running ETL processes such as Azure Data Factory and Synapse Pipelines, and interactive notebooks such as Databricks notebooks and Azure Synapse notebooks. But before we do any of that, I want to start with a story, because I want to illustrate why any of this is important. So let's cast our minds back to the autumn of 2020. In relative terms, this was still early days in terms of the COVID pandemic in the UK - at least, we were still very much in the middle of waves of rising cases. We had daily briefings by the UK government on TV, and we had a test and trace scheme in place, where the idea was to identify positive cases quickly and then follow up with close contacts to stop the spread as quickly as possible. In October, we were reporting between 7,000 and 8,000 cases a day. But in early October, Public Health England reported that during the course of one week they'd been quite significantly under-reporting cases - they'd reported around 50,000 cases, and it turned out there were a further 15,000 that went unreported due to an IT error.

So about a quarter of cases were going unreported. It took them eight days to realise the mistake and then correct the figures, and during that time there were around 2,000 cases a day being missed. So clearly this had a big impact. Thousands of people were totally unaware they'd been potentially exposed, and were potentially unknowingly spreading the virus even further. So something had gone wrong somewhere. And what had gone wrong was this: due to the immense time pressure to get this process up and running, the testing results were being provided in CSV format from the various testing bodies and then collated in Excel in order to report the aggregated results. And for some unexplained reason, the legacy file format of .XLS was being used rather than .XLSX. And this meant that the Excel worksheets had a limit of 65,000 rows of data, so the aggregated results from the CSV data sets were being truncated at this limit. And what was interesting was how this was being reported in the UK media. It was essentially being described as a technical issue, and even more than that, Excel was being blamed for the error. But there was no bug in Excel. It wasn't Excel's fault. The 65,000 row limit had been present in Excel's file formats since the 1980s, and it was well documented if you went and looked it up in the Microsoft documentation. What actually went wrong was that nobody tested this process.

Nobody had validated that the actual output matched what was expected, given the known set of inputs. So it wasn't a technical issue - it was a people and process issue. Nobody had stopped to check if the numbers looked right, and by the time they did, the impact had already happened. Now, I'm guessing most people listening weren't, or aren't, responsible for collating figures for a global pandemic - but some of you might be. And if you're not, there's every chance you're working with other data sources that could easily have an impact on human health - for example, algorithms for self-driving cars, or results of clinical drug trials. And if not potentially dangerous in a life-threatening way, I'm guessing at least some of you do work with data that has an impact on a personal level: mortgages or loans or credit card scoring, maybe in a criminal justice system, or approval of visas or residency applications, or a billing process for utilities or subscription services. And if not on a personal level, then definitely on a commercial level in the organizations you're working for - sales forecasts, weekly reporting, KPIs or customer satisfaction metrics. So fundamentally, the question I want to ask yourselves, for any data solution, for any data insight that you're working on, developing, or responsible for providing to your stakeholders, your organization or your customers, is this: does it matter if it's wrong?

Because there's really only two answers to this question. If the answer is no, then why are you, as a data professional, even involved? And if the answer is yes, then you need to be able to prove that it's right. That sounds obvious, but if we were here in person I could have asked for a show of hands: how many of you actually have comprehensive testing over your data solutions? I guarantee there wouldn't be that many hands up, and I'm prepared to stand by that statement because I've been a consultant for over 20 years. I've had the privilege of working with many great teams in many successful organizations, across many industries, even across three continents - and yet that statement's still true.

So why don't we do it? At the risk of massively oversimplifying things, I think there are three main excuses that I've heard over the years that come up time and time again. The first one is: we don't have the time. A case in point is the Public Health England example I just talked about. They were under a lot of pressure to get this process up and running from nothing, so corners were clearly cut in order to get a process up from scratch very quickly. But the argument that they didn't have time to build proper testing into the process falls down a bit when you consider how much time was then spent cleaning up the mess and dealing with the impact of getting it wrong in the first place. Now, there's tons of research out there about how much quicker it is to solve problems during development than after they make it into production - in the software industry, at least. Think about product recalls in the car industry: how much time and money is spent on testing their processes to try and avoid that situation?

So if you say you don't have time to test something, I would argue that you probably don't have time to get it wrong. Now, the second reason is: we don't know what the numbers should be. And this is another very common situation, especially when dealing with large amounts of data. And by large, I really mean bigger than can fit on a single screen in an Excel worksheet, for example. Beyond that, it's hard for humans to hold that amount of data in their heads, which then makes it harder for someone else to tell you what the right answer should be. And that's assuming that people even agree on how you calculate the answer in the first place. I worked with one organization who had nearly 20 different definitions of gross.

And that was just one of about 40 KPIs that were being reported on in a standard set of management dashboards. But if you don't know what the numbers should be, then the answer is simple: you need to stop. You need to go back and do the hard work of defining your requirements properly. Forget about the data and technology for a second and start with what the expected outputs are supposed to be. I was actually here talking at the DPS conference last year on exactly that subject. The third reason is this: it's hard. How on earth do you even test a Power BI report, for example? And to this I say: yes, software development is hard, and to be blunt, that's why you're being paid to do it. I deliberately use the term software development because that's what all of this is. Data engineering, data science, data wrangling, data modelling - they're just different terms for software development. If you're adding a formula to an Excel worksheet, you're developing software. If you're writing queries to create measures for a Power BI report, you are developing software. If you're creating Python scripts to generate statistics, you are developing software. If you're designing ETL processes to populate a data warehouse, you are developing software. And that's exactly what I am - a software developer. I've been a software developer for over 20 years, and I've done, and still do, all of the things I've just talked about. I might have been designing and delivering big and complex data projects, data platforms and data solutions for the last decade or so of that.

But I'm still doing that as a software developer, and that means I'm always looking for a way to test something. And for the rest of this talk, I'm gonna help you do the same. I'm gonna walk through three practical examples that cover a large portion of the types of technologies and processes in the data platform projects that I've been involved in recently, and I'm gonna show some approaches that you can take away and apply to your data solutions - so not only do your numbers look right, but you can prove it. So in the first demo, I'm gonna walk through an approach that we have taken to test Power BI reports, as well as Analysis Services databases, by validating the calculations and the data in the underlying tabular models. I'm gonna use Power BI for the demo, but everything I'm gonna show is equally applicable to Analysis Services.

Now, as a reminder, there's no official support or recommended approach from Microsoft for any of this, so the approach is based on endjin's experience and recommendations. There's no available testing framework for Power BI reports. We have no control over how the reports are rendered in the user interface, and at first glance, the report's business logic is encapsulated within the underlying Power BI service. However, Power BI is built on top of Analysis Services, and the general direction of the team and the product over recent years is that Power BI is eventually becoming a superset of Azure Analysis Services. And whilst an Analysis Services model is an in-memory, compressed BI data cube, it's essentially just a database - not a relational database, but a database nonetheless. So we can connect to the tabular model using client SDKs or APIs, and then we can execute queries over it to test what's inside - and everything I'm gonna demo is based on that approach. But before we go any further, there's a couple of big caveats to point out. Number one: we are testing the tabular model only, and not the visualization layer of Power BI. So by validating that the measures and the calculated columns and the data modelling are working as we expect, we're effectively testing the business logic in our data application, but not the presentation layer.

And number two: we don't have control over the DAX queries that are actually getting executed by the Power BI visualizations themselves when the report is loaded. So any tests that we do execute are merely simulations, or replicas if you like, of what the report is actually doing. However, it is possible to intercept the DAX that the report is generating, so those queries can be used in the tests themselves if you need to, so the simulation is as close as possible to the real thing. However, even with those caveats, we've found there's huge value in writing and running tests over the tabular model to test the business logic that's been added into the data model. Whilst the visualizations themselves are also important - the right visualization, in the right position, showing the right data points - it's having confidence in the calculations behind those data points that really adds confidence to a data solution.

For the most part, it's a reasonable assumption to make that the visualization layer is gonna be working as you expect, generating the right DAX queries, so we can focus on testing that the schema and the data in the model are correct. So let's open Power BI, and I'm gonna use a sample report that I've just taken from the Microsoft website. And while that's loading up, I'm gonna remind you that behind every Power BI report is a tabular model, and that, in effect, is an Analysis Services database. When you open a Power BI report in Power BI Desktop, what's actually happening behind the scenes is that a local instance of Analysis Services is being started automatically.

You can find that out by using a tool called DAX Studio, which is an external tool that's freely available to download. Once the Power BI report has opened, if you have DAX Studio installed, you will see it under the external tools menu here, and you can open it very simply, like this. DAX Studio will detect that we have a Power BI report open - if I click the connect button, you can see it has the name of the Power BI report already pre-populated in the connect wizard. And if I click connect, we connect to the model in that report. But the interesting thing is down in the bottom right-hand corner here: you can see it says localhost, port 53562. And that is the server information of the Analysis Services instance that is currently running behind Power BI Desktop with the report we've opened. And that's the key bit, because if we can get that, we can create a connection string using the .NET SDKs and open a connection to the Analysis Services database.

And that means that, using third-party tooling like DAX Studio - but more significantly in this case, our own code - we can create a connection to the tabular model, and then we can start to write some tests against it. Now, once we've developed a model and developed a report and we want to publish it up to a Power BI workspace, what's even better is that we can do the same thing in the Power BI service. A couple of years ago, Microsoft exposed an endpoint, if you like, in the Power BI service called the XMLA endpoint, and we can use the same tools and the same SDKs to connect to a Power BI report in the Power BI service, by changing the connection string to point to the XMLA endpoint. That can be found in the workspace settings panel for any Premium-enabled workspace. So there is a restriction around this: the workspace needs to have a capacity assigned to it - either Power BI Premium or Power BI Embedded, or nowadays a Premium Per User workspace - to have the XMLA endpoint enabled. It means we can use the same code that runs our tests, with the same connection process, just changing the connection string to point either to our local Analysis Services database or to the XMLA endpoint available in the Power BI service to make the connection. And although I've just mentioned the .NET SDKs, and I'm gonna start by going down that route, a bit later on I'm also gonna show you a different way that you can query the tabular model without the need for any SDK, just by using REST APIs - meaning you can do this in any programming language or tech stack. So for the first demo, we're going to use C# and .NET. This is just one way to achieve this result, and as you'll see, C# isn't a prerequisite. As you'll see through the rest of this talk, I've deliberately picked a different technology and tool set for each walkthrough, to show you that the approaches I'm demonstrating aren't tied to a specific programming language or testing framework.

First of all, I'm gonna switch to Visual Studio to show you how we can write some tests against our Power BI Desktop file that is running. I'm gonna use a framework called SpecFlow, which is a .NET-based BDD framework that allows us to write tests as executable specifications in a human-readable language, using the Gherkin syntax - an industry-standard, very widely used structure, if you like, for structuring BDD specifications. And the whole point of BDD, which stands for behaviour-driven development, is to describe the behaviour of your system in a very easy, human-readable, plain English (or other language of your choice) format, to aid easy collaboration between the technical users who are developing the system - in this case, the report - and the business users.

So in this case, maybe the business analysts or the end users - ultimately, the people who understand the domain. And because we're testing Power BI reports in this case, our language talks about the things that we would find in our data model: measures, results, filters and that kind of thing. So you can see I've described here quite a simple set of scenarios about the data model behind our sample report. If we go back to Power BI, this is the Retail Analysis sample that contains sales by store over geographic areas. We've adopted an approach where we have a feature for each report, so this is our example report feature. And our first scenario is around total sales - the total sales measure is probably one thing we want to add into this data model.

Imagine we haven't created our report yet, and we start to do some data modelling. When we query the total sales measure, we expect the result to be a specific value—in this case, rounded to two decimal places. This is absolutely something we could discuss with the business. We could talk about how to calculate total sales and determine the expression needed behind that measure to validate the particular KPI. We've made an assumption that the report is loaded with a known set of data. In our experience, that's a fairly good way to work in this kind of model. Clearly, we could write some code that would automatically load data into the data model, reprocess the dataset, refresh everything, and repeat that every time. However, we've found that to be quite a heavyweight process.

A much better approach is to work from a known static set of test data. This dataset should be simplified, smaller, and more manageable than what you might have in production. It allows you to easily verify calculations and measures, both outside and inside of Power BI, and facilitates collaboration between someone who understands the data and the person developing the report. So, we've assumed there's a known set of data in the report. Once we understand how our total sales measure is calculated based on that dataset, the result should be a specific value—in this case, about 45 million pounds or dollars, depending on the currency. That's a very simple test: total sales across the entire dataset. Given that this is an interactive report, the next logical step is to filter the data. Power BI is designed to slice and dice data, so how do we do that? We apply a filter.

We've seen in our report that our data model is retail-based, and we have some stores. What would happen if we wanted to filter the stores down to a specific store—for example, the store called "Lenses"? Let's expand our scenario language a bit further. In this case, let's say we have the following filter applied: we've defined a dimension table called "Store" that has a property called "Chain." In our known dataset, we have a store chain called "Lenses." Given that we've filtered stores down to "Lenses," when we query the total sales measure, we would expect the value to be different because now we're calculating total sales only for "Lenses."

We can prove this by going back to the report. I've added an extra page into the sample that shows total sales, and we've got a small table that allows us to filter by "Lenses." When we apply the "Lenses" filter, the total sales value goes down to about 13 million. However, Power BI shows a simplified, aggregated, and truncated view of the actual value. If we want to see the exact value, we could use a tool like DAX Studio to execute a query directly.

First, let's run the test and see what happens. Behind the scenes, SpecFlow allows us to generate C# code to execute these statements. If we open up a C# file, we can see methods behind each of those definitions we call in our feature file. These methods actually perform actions. For example, when we say we're going to add a named measure to our query, we have code that does exactly that. Behind the scenes, we're dynamically building up a DAX expression. When we add a measure, we insert the measure's name using the syntax you would use if executing the same query in DAX. When we want to add a filter, we do the same thing—we add extra syntax to our DAX query expression, specifying the filter table name and the filter statement. By the time we execute the query, we've built up a dynamic expression that evaluates into something like this:
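The actual step definitions in the demo are C# and SpecFlow, but the idea of dynamically building up the DAX expression is easy to sketch in any language. Here's a rough Python equivalent - the helper, table and measure names are hypothetical, not the demo's code - with the resulting DAX shown in the comments:

```python
# Minimal sketch of dynamically building a DAX query from measures and filters,
# mirroring the step definitions described above (names are hypothetical).

def build_dax_query(measures, filters=None):
    """Build an EVALUATE SUMMARIZECOLUMNS(...) expression.

    measures: list of measure names, e.g. ["Total Sales"]
    filters:  list of (table, dax_filter_expression) tuples
    """
    parts = []

    # Filter table expressions go first, e.g. FILTER('Store', 'Store'[Chain] = "Lenses")
    for table, expression in (filters or []):
        parts.append(f"FILTER('{table}', {expression})")

    # Measures go in as "name", [measure] pairs so they come back as named columns
    for measure in measures:
        parts.append(f'"{measure}", [{measure}]')

    return "EVALUATE SUMMARIZECOLUMNS(" + ", ".join(parts) + ")"


# Unfiltered total sales:
print(build_dax_query(["Total Sales"]))
# EVALUATE SUMMARIZECOLUMNS("Total Sales", [Total Sales])

# Total sales filtered to a single store chain:
print(build_dax_query(["Total Sales"], [("Store", "'Store'[Chain] = \"Lenses\"")]))
# EVALUATE SUMMARIZECOLUMNS(FILTER('Store', 'Store'[Chain] = "Lenses"), "Total Sales", [Total Sales])
```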

It's a simple EVALUATE statement with a SUMMARIZECOLUMNS statement. In the body of that expression, we have the columns we want to retrieve, the filters we want to apply, and the measures we want to calculate. Let's start by running the test to prove it passes. I will need to update my connection string to point to my local Power BI Desktop environment. If I go back to DAX Studio, we can see I'm running on port 53562, and I have a configuration file here. Now, if I open the Test Explorer window, it will automatically find all the tests in the solution. Let's run the total sales test. This should run relatively quickly, and in theory, it should pass—and it does.

If we were to debug that test, let's break on the execute query method. We'll be able to see the actual DAX query code that gets executed through the test. Let's debug again, and when the breakpoint hits, we can inspect the query. We can see the DAX expression that's actually being generated. If we go into DAX Studio, we can paste that directly in. It looks like this: EVALUATE SUMMARIZECOLUMNS, total sales. When we run it, we can see the exact value returned from the data model, which is displayed on the left-hand side—approximately 45 million, with many decimal places.

If we do the same thing again with the filter applied, we'd see a much more complicated statement, because we would have the filter expressions in it as well. Now, so far we're using Visual Studio, we're using .NET, and we're using what you can see here on the screen - the ADOMD.NET SDK. So this solution here, this testing framework I've designed, is limited to running in .NET.

It requires the Analysis Services SDKs that are only available in .NET. But we have another option, because whilst Power BI is a Microsoft tool, it's clearly used widely enough to be prevalent in environments where .NET and C# aren't used at all - either due to team skill set or wider technology choices, for example - and so we need a different way of doing this. At the end of last year, Microsoft enabled a new capability in the Power BI REST API called the Execute Queries endpoint. And if we look at the documentation: for any dataset that's been published into the Power BI service, we have an Execute Queries API that allows us to execute a DAX expression. If you look, we have a JSON payload, and in the body of that payload is just a DAX statement, just like the ones we were running locally over our Power BI Desktop file. And what comes back is a JSON payload representing a tabular set of data. So it's very easy to replicate what I was doing in .NET and C# and SpecFlow in another testing framework, in another language, that doesn't require .NET or the ADOMD.NET SDKs, and that instead executes the queries over HTTP using the REST API and the Execute Queries endpoint. Now, the Execute Queries API has some pros and cons compared to the .NET SDKs, and this can impact what you can and can't do from a testing perspective. The REST API only works over the Power BI service - a report has to have been published into a Power BI workspace for it to work. Clearly you can't run the REST API over a local Power BI Desktop file, so that will impact your local developer workflow in terms of at what point you can actually start testing things.

You need to push things up into the service before you can do that. However, the Execute Queries API is available to all workspaces, so there's no restriction that means you need a Premium capacity like we did for the XMLA endpoint - and that means there are no cost implications in going down this route. So the .NET-based approach, as a reminder, requires Premium capacity, Power BI Embedded or Premium Per User; the REST API is available for free inside any Power BI workspace. And finally, whilst the .NET SDK approach clearly requires C# or .NET in order to connect to the tabular model - whether it's running locally in Desktop or via the XMLA endpoint - using the Execute Queries REST API means we can make that connection using any programming language. So, for example, we could do exactly the same thing in Python or JavaScript, or any other testing framework and programming language of your choice. So to summarize this demo: behind Power BI and Analysis Services is a tabular model, and that can be connected to just like any other database technology. This means that your schema, your data transformations, your measures and your calculated columns can all be tested, to give confidence to reports and Power BI models. There are different ways to connect to the tabular model - using the SDKs or using the REST API - and depending on which you choose, this means you can connect and run tests over a local PBIX file, or a report published into the Power BI service. I've shown you how to do this in C# and talked about how to do it in JavaScript, using well-known testing frameworks in each domain, and hopefully you can see how you could equally do this in Python or any other language or technology of your choice.
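To make that concrete, here's a rough sketch of what a test against the Execute Queries REST endpoint might look like in Python with the requests library. The dataset ID, access token and expected value are placeholders, and the assertion mirrors the total sales scenario from the SpecFlow feature:

```python
import requests

# Placeholders - in a real test you'd acquire an Azure AD token (e.g. via MSAL or
# azure-identity) for a dataset that's been published to the Power BI service.
DATASET_ID = "<your-dataset-id>"
ACCESS_TOKEN = "<your-aad-access-token>"

def execute_dax(dax_query):
    """POST a DAX query to the Execute Queries endpoint and return the first table's rows."""
    url = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/executeQueries"
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={"queries": [{"query": dax_query}]},
    )
    response.raise_for_status()
    return response.json()["results"][0]["tables"][0]["rows"]

def test_total_sales():
    rows = execute_dax('EVALUATE SUMMARIZECOLUMNS("Total Sales", [Total Sales])')
    # Each row comes back as a dict of column name to value, e.g. {"[Total Sales]": ...}
    total = list(rows[0].values())[0]
    expected = 45000000.00  # placeholder - the manually validated value for your known test dataset
    assert round(total, 2) == expected
```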

So whilst testing Power BI reports isn't something that many people think about, it's absolutely possible - and if it matters whether your reports are wrong, then it's absolutely something you should be doing. In the second demo, I'm gonna walk through an approach we've taken to test long-running cloud ETL processes, specifically using Azure Synapse Pipelines, but equally applicable to Azure Data Factory and conceptually applicable to any other cloud data processing platform or technology. And there's a couple of very common challenges when testing cloud ETL processes: one, they might take a long time to run - by which I mean they're not instant - and two, they might process and output lots of data, making the results very hard to validate.

Now, there are solutions and patterns to both of these challenges which can be applied to make testing these long-running processes possible. The first one's pretty straightforward, and that's just to apply an asynchronous polling pattern, allowing the test to kick off a pipeline and then periodically poll for completion - as sketched below. The fact it takes a long time might mean this type of test isn't something you want to run on every single source code change, but rather something that gets run on every release, or even every night, for example. The second challenge can be overcome with a simple mindset shift in how you test the outputs. We're typically used to testing or validating specific outputs or values, and that becomes hard when you're processing a large amount of data - and actually, I say large, but even 10 rows of data with 10 fields each would mean testing a hundred different specific values. So there are two different methods we can use here, and both are equally useful in different ways.
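The demo that follows implements the polling pattern in JavaScript, but the pattern itself is language-agnostic. As a rough illustration, here's a minimal trigger-and-poll sketch in Python using the azure-synapse-artifacts and azure-identity packages - the workspace endpoint, pipeline name and parameters are placeholders, not the demo's actual values:

```python
import time
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

# Connect to the Synapse workspace (the endpoint is a placeholder).
client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<your-workspace>.dev.azuresynapse.net",
)

def run_pipeline_and_wait(pipeline_name, parameters, poll_seconds=15):
    """Trigger a pipeline run, then poll its status until it reaches a terminal state."""
    run = client.pipeline.create_pipeline_run(pipeline_name, parameters=parameters)
    while True:
        status = client.pipeline_run.get_pipeline_run(run.run_id).status
        if status in ("Succeeded", "Failed", "Cancelled"):
            return status
        time.sleep(poll_seconds)  # still queued or in progress - keep waiting

# Hypothetical usage in a test: kick off the pipeline and assert it completed successfully.
status = run_pipeline_and_wait("demo pipeline", {"inputFile": "sample_sales_data.csv"})
assert status == "Succeeded"
```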

The first one is called data snapshot testing. It may be that we absolutely have to test every specific value in our data set, and in that case a technique called snapshot testing is very useful. Rather than test every value individually, we can test that the entire output matches what we expect, based on a previously known state. For example, we run our pipeline for a known set of inputs or parameters, it produces an output that we manually validate to confirm that things are working as expected, and we then snapshot that output and store it with our tests. So we can repeat this test under the same set of conditions and always expect the same output every time, and this protects us from further changes down the line introducing bugs or regression issues in scenarios that we've already tested. The downside is clearly that we need to perform the initial validation ourselves, but it provides a repeatable safety net as the platform or the solution evolves over time.
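The demo uses a Chai snapshot extension in JavaScript, but the core idea is small enough to sketch in a few lines of Python - on the first run the serialized output is written to disk (then validated by hand and committed to source control), and every subsequent run is compared against it. The file layout here is illustrative, not the demo's:

```python
import json
from pathlib import Path

def match_snapshot(name, rows, snapshot_dir="__snapshots__"):
    """Compare a result set against a stored snapshot, creating the snapshot on first run.

    rows: the pipeline output as a list of dicts, e.g. read back from the data lake.
    """
    snapshot_path = Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(rows, indent=2, sort_keys=True, default=str)

    if not snapshot_path.exists():
        # First run: persist the snapshot; a person then validates it's correct
        # and commits it to source control alongside the tests.
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(serialized)
        return

    # Subsequent runs: the output must match the previously validated snapshot.
    assert serialized == snapshot_path.read_text(), f"Output no longer matches snapshot '{name}'"
```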

Now, the second method is behaviour-based testing, because it may also be, of course, that we can't test every specific value. We may be relying on external systems or dynamic data sources that are temporal, or subject to configuration changes, and outside of our control. But we do care about how our system handles certain situations - for example, dealing with data quality issues. So rather than testing the data points themselves, we could validate that, for example, duplicate records are removed, by relying on an expected count. This makes the test less brittle to the specific data, but validates that the system is behaving as expected for the expected scenarios. Let's have a look at how this works in practice.
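Before jumping into the demo, it's worth noting how small a behaviour-based check can be - often just a row-count assertion over the pipeline's output. A hypothetical sketch in Python (the demo itself does this in JavaScript), assuming the Parquet output has been pulled back locally:

```python
import pandas as pd

# Behaviour-based check: we don't assert specific values, only that the pipeline
# behaved as expected for this scenario - every input row was a duplicate,
# so exactly one row should survive. The path is a hypothetical local copy of the output.
output = pd.read_parquet("output/demo-pipeline/output.parquet")
assert len(output) == 1, "expected duplicates to be removed, leaving a single row"
```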

Okay. So I've opened Synapse Studio, and inside Synapse we have a Synapse pipeline that looks a little bit like this. It's quite simple for this example: it basically has one activity, called copy data, and it's copying data from a source to a sink, and in that process converting it to a Parquet file. So the input dataset is basically a CSV file and the output dataset is a Parquet file - it's just a conversion of data format. Now, if we look at the pipeline level, we can see we've got a bunch of parameters expected for the pipeline, in terms of the input container, the input directory, the input file, and then where the outputs are gonna be. So the whole process is: take a specific file as an input, copy the data, converting it to Parquet, and then dump it in the output directory. So we want to test this pipeline. Now, this time we're not gonna use Visual Studio - instead we're gonna use VS Code - and we're not gonna use .NET this time, we're gonna use JavaScript. But what you can see on the screen here is very similar to what we saw in the last demo: this is a feature file. Again, we're using a BDD framework, just like SpecFlow in .NET, but this time in JavaScript we're using Cucumber JS.

I mentioned that the Gherkin language was ubiquitous and a standard syntax across BDD frameworks, and this is proof of that. Cucumber is a very widely used BDD framework across many different languages, and this is the JavaScript implementation. Last time we were testing Power BI reports, so the language of the spec was around measures, filters and data tables. In this case, we're testing a pipeline, so the language is slightly different. You can see here we have a feature around a pipeline, where we're saying that in order to build our reports based on the output of the pipeline, we need to automatically process the data from the data source. So in this case, we have a background, which is our setup, saying that we expect there to be a pipeline called demo pipeline. Behind the scenes, as you'll see in our code, this is basically configuring a configuration value for what our pipeline's gonna be called when we make a connection into Synapse.

And we have one scenario - our golden path, or happy path, scenario - that the pipeline runs successfully. Our setup action is that the data to be processed is gonna be in a CSV file called sample sales data. Our act action, if you like, is gonna be triggering the pipeline, and then our assertion is that the pipeline has run successfully - we get a successful response, and the output data matches the expected result. So before we do anything, let's have a look at how this all works in practice. Given the data to be processed is the sample sales data CSV: in this test solution we have a data folder, with a demo pipeline folder for our demo pipeline, and in there we have a CSV file called sample sales data. And you can see it looks like this - it's just some sample data I've taken from the internet, an open source data source, with some sales domain data around purchases in different locations across the US. I mentioned earlier that what's difficult about testing ETL processes is that often the input data and the output data are quite large, and what we're doing here is, in effect, defining our scenarios in the data input files themselves. So in this case, the sample sales data CSV file is the scenario. We're not manually keying in every different value that's gonna go in - we're basically pointing it at this file. So we wanna name this file in a way that represents our scenario; in this case it's just the happy path, so it's just sample sales data and it's gonna get processed successfully. Now, we talked about snapshot testing, and the key thing here is that the very first time we run the test, we don't know what the output's gonna be. There's gonna be a manual step to begin with, to validate that the pipeline is working as we expect the first time we run it.

Once we've validated that output, it gets kind of locked in, if you like, to the test framework - it actually gets committed to source control; in this case it's part of the code that we've got in this demo. And then every time we run it again, it will automatically check against the same results, and if nothing's changed, then the test will pass. So before we do anything else, let's run this test and see what happens. Down here - this is a Node JS solution in JavaScript - I'm gonna run npm test, and it's gonna execute this one scenario. Now, once this kicks off, we'll see the test running and I can talk through what's going on. The very first step behind this feature is that we're queueing up the sample file that we want to run the test on, and behind the scenes what the test is doing is actually uploading this CSV file into our data lake account that's backed by our Synapse workspace. So if we go over to Synapse Studio, we can see in our data lake, in our test data container, that at the minute there's nothing in there, but as the test starts to run, we'll start to see that CSV file being uploaded into the container. If I refresh the container now, you can see we have a folder called demo pipeline - because we're testing our demo pipeline - and if I double click on that, we can see inside we have a sample sales data CSV that's been uploaded, and that's what the pipeline's gonna reference. Now, we talked about long-running processes. The actual pipeline itself isn't gonna be instant - it's gonna kick off, and it's gonna trigger and run. So if we go back over here and go to monitor, we should see that our pipeline has started running. In fact, it's already finished - it took 13 seconds - but it could take a long time. This pipeline could be very complicated.

It could be pulling in lots of data, it could be using Spark pools that take a few minutes to spin up, for example. What we can see here is that we triggered the pipeline and the pipeline has run successfully. Behind the scenes, what's going on is that our test framework is actually polling for results. If we look at our pipeline helper: using the Azure SDK for JavaScript - in this case, the artifacts client from the Synapse artifacts npm package - we can trigger a new pipeline run, and then we can just use a loop and poll whilst that pipeline's running, checking the response until it's finished. So this test could run for a few minutes, or this test could run for a few hours, depending on your scenario and how long the pipeline actually takes to run.

So our pipeline has run successfully, and that means our test has also passed. And what you can see in that process is that we've now got a file in our snapshots output folder. What's happened is that we're using a Chai snapshot extension - because we're using the Chai assertion framework with Cucumber JS, this is an extension that deals with snapshots. If I go into my step definitions, you can see that when we say the output data matches the expected result, we're using this snapshot extension here, based on our input file name, and the Chai snapshot extension will look - based on a convention - in the snapshots folder for a file with the same name as the input file. And what you can see here is the serialized data set that we've got back from our response. So once the pipeline's run - and we've cleared the data lake - we can see in our output folder, if we go back to develop, that we have our results, and remember, our results are being converted into Parquet files, because that's all our pipeline, our copy activity, is doing. So we have a demo pipeline output folder, and we'll have a Parquet file with a GUID that's basically been automatically generated for the test. The test framework is querying that data back out, and when we call the snapshot method, it's basically serializing that data set, if you like, to disk - and that's our initial snapshot. Now, at this point, it's on us to prove that's correct. So there would be a process with a QA team or a business stakeholder, validating that, based on the input data that we passed in, this is the data that we expect coming out the other side. And every time we run this test again, our test is always gonna pass, because nothing's changed. But if something was to change in the future, we now have a safety net.

A safety net, if you like, in that the output is always gonna be checked against what we've committed into our version control repository as the expected result. So at first this test will pass, because nothing's changed. But if we were to change something - and in this case we can just change the input file - we know that the output is not gonna match what we expect. So let's change this value and run the test again, and in this case we'll see a test failure. Now, in reality, it's probably not gonna be the input file that changes. What's more likely to change is the pipeline processing itself, as new features are added, as new business logic is added, and as the pipeline evolves over time. The point here is that we're proving we're not breaking anything that's already been validated and QA tested. And as you can see, our test framework has now reported that one of the tests is failing, and if we dig into the details here, you can see it's calling out the expected value versus the actual value.

Now, this is all well and good when we have control over the input data set, where it's a static set of values. Whilst we're not actually handwriting out every single value in this data set that we want to test and validate, we are actually testing them all - it's just that we're relying on the snapshot process to test the whole data set as one thing, rather than each individual value. But clearly there are cases where we can't do that, and that's where the behaviour-based testing method comes in, which I'm gonna show you now. Going back to our feature, we might want to, for example, care more about how our pipeline deals with a certain scenario - for example, duplicate rows being deleted. So we care less about the specifics of the data that's coming in and out, and more about the behaviour of how it handles a certain scenario.

So we've added a new scenario: that duplicates are gonna be excluded. In this case, the data to be processed is a new CSV file called all duplicate rows - we've named the scenario in the data file itself. And if we look at the all duplicate rows CSV, it's a much smaller file, but you can see every row is the same. That's the whole point: all of the rows are duplicates. What we expect in this scenario is that, if we want all our duplicates to be excluded, when the pipeline is triggered and the pipeline runs successfully, then our output data should only contain one row, because all the duplicates are excluded and we end up with one row in the output. Now, in fact, I haven't implemented this feature in our demo pipeline, so if we ran this test, it would fail. What's interesting is that we now have a test-first approach to defining the functionality that we care about in our pipeline. So the next step would be to go back into the pipeline and change our copy activity's behaviour, such that it can deal with duplicate rows, and we get the output results that we expect.

So in summary, in this demo I've proved that long-running data processing in cloud ETL tools like Data Factory or Synapse Pipelines can absolutely be tested, by adopting asynchronous polling patterns to wait for processing to complete. If data sets are large enough to make testing every individual value problematic, then snapshot testing is a good and simple solution to add regression-style test coverage to these types of tests.

If you can't test specific output values, you can still add quality assurance by testing behaviours in a more general sense - like numbers of rows or files, or error conditions. And whilst the demo I've just shown used JavaScript and Node JS, there's nothing in there that couldn't be achieved using another technology, testing framework or language, because the various client SDKs I've been using - specifically the Azure ones - are just interacting with REST APIs under the covers. So in the third and final demo, I'm gonna walk through an approach that we've used to test Azure Synapse notebooks - though, as you've probably come to expect by now, the approach is equally applicable to Jupyter or Databricks notebooks. Now, notebooks are really interesting. We think they've been a bit of a game changer in recent years in the data and analytics space, as they allow for quick, ad hoc, collaborative and experimental working patterns to be applied to data problems, which means they can be used to great effect in data science and data modelling.

But they're also really useful in standard ETL processes. They act as living, breathing documentation of a process, allowing you to mix markdown and executable code side by side, meaning that data processing isn't locked away in a stored procedure or an SSIS package, but is in a human-readable format that allows for collaboration and understanding across a team. But that agility comes with a trade-off, because very quickly we've seen notebooks start to form the backbone of systems and processes that businesses start to depend on. And if business insights are being surfaced, then there's a need for quality assurance, because without it, inaccuracies in the data are gonna end up going unnoticed.

So let's open Azure Synapse and work through an example. In our first demo, we used Visual Studio, C# and .NET, and SpecFlow, to add tests to Power BI reports. In the second demo, we used VS Code, JavaScript and Cucumber JS to add tests to our Synapse pipeline. In this third and final demo, because we're testing notebooks, we're gonna use Synapse Studio as our IDE, and the code and the tests we're gonna write are gonna be in Python - or specifically PySpark, which is the SDK and API for interacting with the Spark runtime in Synapse. So let's imagine we have a fairly simple use case: we work for some kind of sales organization that has some data around sales orders.

If we look, we've got a simple set of data here in a CSV file that has some orders by city and region within the US. We're tasked with producing some aggregated sales metrics and reports, and we thought, okay, a good way to start would be to use notebooks, so we can explore the data and start to work on our ETL logic. So we might start with something like this: we might wanna read the CSV data into a Spark data frame from the data lake, and we might think, okay, there's probably some simple data prep we need to do - for example, let's drop duplicates based on the duplicate order IDs we've noticed in the data.

And then we might want to start thinking about what kind of metrics we're calculating. We want to group the data by region and start to sum up the total price, so we can produce total sales by region. And once we've got that, we then want to save the output back into our data lake, so we can start to build some reports or visualizations or dashboards or alerts based on the sales metrics we're calculating. Once a Spark session has started up, the notebook runs fairly quickly, because it's fairly simple and the data set isn't particularly large, and as you can see, the output here is our aggregated total sales by region. Now, I've mentioned a few times one of the great advantages of notebooks: the ability to go from ad hoc and experimental data processing into something repeatable and automated. In Synapse, it can be as simple as clicking 'add to pipeline' - when we do this, our notebook is added to a new Synapse pipeline, and we can start to add a trigger to run this notebook on a nightly schedule.

We can start to parameterize it so it looks in the right folder in the data lake, for example, to pick up the latest sales figures, and suddenly we've gone from something that was a proof of concept into a line-of-business application that's now automated and gonna generate those results every night. However, that agility comes with a downside, and that is that, in the spirit of exploring and ad hoc data prep, in some ways we've lost the rigour and best practice around solid engineering - in this case, proving what we're doing is correct by adding tests. So let's go back to our notebook and have a look at what we might want to test in the first place.

So even though this is a fairly simple use case, there's still some logic in here - we're dropping duplicate orders in one of our cells. Now, it's a one-line piece of code, but it could be implemented in different ways. Our logic is based on dropping duplicates with the same order ID, but equally it could be implemented based on every field in a row being the same - so how you even define a duplicate row is open to ambiguity. Now, if we wanted to add tests to actually lock in this logic and prove that the behaviour is exactly as we expect, we need to do some restructuring of the notebook. And this is the key thing that everything we're gonna talk about is based around: what's great is that notebooks have cells.

They already have the ability to separate out logical steps in a process, and we're gonna expand on that by turning each logical step into a defined function. So in this next notebook, we've got exactly the same process, but what we've done is wrapped each piece of code in a cell in a function, and we've named each function with the intention of what it's gonna do: we load the data, we remove duplicate orders, we calculate the sales by region, and we save the output back to the data lake.
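Roughly, those wrapped-up cells might look something like this in PySpark. The column names (OrderId, Region, Price), function names and paths are assumptions for illustration rather than the exact code from the demo, and spark is the session that a Synapse notebook provides automatically:

```python
from pyspark.sql import DataFrame, functions as F

# Each notebook cell becomes a named function describing one logical step.
# Column names and paths are illustrative assumptions; 'spark' is the ambient
# Spark session that Synapse (or Databricks/Jupyter with PySpark) provides.

def load_sales_data(path: str) -> DataFrame:
    # Read the raw orders CSV from the data lake into a Spark data frame
    return spark.read.csv(path, header=True, inferSchema=True)

def remove_duplicate_orders(df: DataFrame) -> DataFrame:
    # A duplicate is defined as a repeated OrderId - exactly the kind of
    # ambiguous logic we want tests to lock in
    return df.dropDuplicates(["OrderId"])

def calculate_sales_by_region(df: DataFrame) -> DataFrame:
    # Aggregate total sales per region
    return df.groupBy("Region").agg(F.sum("Price").alias("TotalSales"))

def save_output(df: DataFrame, path: str) -> None:
    # Write the aggregated results back to the data lake as Parquet
    df.write.mode("overwrite").parquet(path)
```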

Now, what's interesting is that if we were to run this notebook end to end, nothing would happen, because all we've done is define the functions themselves - we're not actually calling them. So in this notebook we have one additional cell, which, if you like, is the orchestration cell or the workflow cell, that specifically calls into each of those functions in the order we want them to run, which runs everything in the notebook end to end - and we get exactly the same output as before. Now that we've structured our notebook so that every separate piece of logical processing is defined in a separate function, we can start to add some tests.

So in this next version of the notebook, we've done exactly that. Now, because we're writing Python code, we can actually just import the very widely used and well-known test framework called pytest, and that's pre-installed on the Synapse Spark clusters, ready to go. So further down the notebook here, I've got some test logic, and there are a couple of things to note. The first is that it's absolutely possible to create Spark data frames in memory, and that allows us to queue up specific data scenarios, execute the logic we want to execute, and assert that the result is as expected. So we can follow a very standard unit testing pattern of arrange, act, assert. The first thing we're testing here is that orders with a duplicated order ID are removed, as per our remove duplicate orders function. So we queue up a data frame that has two rows of data, and they both have the same order ID, even though the other fields inside the data frame are different.

If we were to run this cell, we would only get one row back, because the second, duplicated row has been removed. Now, what's interesting about writing tests is that by writing a test for a certain scenario, it naturally leads you on to what else could happen. So in this case: if orders with duplicate order IDs are removed, what about orders that have a different order ID but where every other field is the same?

What do we expect in that case? According to our current logic, we'd expect all three rows to be returned, so we can test that as well. We can also add tests for our aggregations and calculations - in this case, we create four rows of data, to add sales for different regions, and confirm that they've been aggregated and summed correctly in our total sales calculation. And what's interesting now is that in this notebook we have the code for executing our ETL process, and we also have functions that are our tests. So our orchestration cell now needs to deal with both of these cases. We might wanna run all the tests first, and then, if they all pass, we run our workflow. And if one of the tests was to fail, we wouldn't get that far - the notebook would fall over and we wouldn't run our ETL processing. So in a way, we've added a very simple quality gate to our end-to-end notebook running process. So let's run this notebook now and see what happens.
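For reference, a minimal sketch of what those test cells and the updated orchestration cell might look like, reusing the hypothetical column and function names from the earlier sketch (the demo imports pytest, but plain assertions and direct calls are enough to show the idea):

```python
def test_orders_with_duplicate_order_ids_are_removed():
    # Arrange: two rows sharing an OrderId but differing in every other field
    df = spark.createDataFrame(
        [(1, "North", 10.0), (1, "South", 99.0)],
        ["OrderId", "Region", "Price"],
    )
    # Act + Assert: only one row should survive
    assert remove_duplicate_orders(df).count() == 1

def test_orders_with_distinct_order_ids_are_kept():
    # Same values in every other field, but distinct OrderIds - all rows are kept
    df = spark.createDataFrame(
        [(1, "North", 10.0), (2, "North", 10.0), (3, "North", 10.0)],
        ["OrderId", "Region", "Price"],
    )
    assert remove_duplicate_orders(df).count() == 3

def test_sales_are_summed_by_region():
    df = spark.createDataFrame(
        [(1, "North", 10.0), (2, "North", 15.0), (3, "South", 5.0), (4, "South", 5.0)],
        ["OrderId", "Region", "Price"],
    )
    totals = {row["Region"]: row["TotalSales"] for row in calculate_sales_by_region(df).collect()}
    assert totals == {"North": 25.0, "South": 10.0}

# Orchestration cell: run the tests first, and only run the ETL workflow if they all pass.
test_orders_with_duplicate_order_ids_are_removed()
test_orders_with_distinct_order_ids_are_kept()
test_sales_are_summed_by_region()

raw = load_sales_data("abfss://data@<account>.dfs.core.windows.net/sales/orders.csv")  # hypothetical path
save_output(calculate_sales_by_region(remove_duplicate_orders(raw)),
            "abfss://data@<account>.dfs.core.windows.net/sales/total-sales-by-region")
```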

And the notebook has succeeded, because our tests have all passed. But if we were to change our logic in the notebook - let's say, for example, we change how the duplicate order processing happens, and we actually want to care about distinct rows - and we run the notebook again (this time it should run very quickly because the session is already started), our test will actually fail, meaning our ETL workflow will never actually run. And here we go: our test has kicked in, and we can see the assertion error explaining what's gone on.

This is very useful. However, even in this very simple use case, where we've only got a couple of fairly simple functions defining some processing in a very small ETL process, the notebook starts to become quite unwieldy. We're mixing and matching ETL processing and test logic, and having to orchestrate everything in a cell at the end, relying on the fact that we've done everything in the right order so that we don't accidentally run our ETL process when our tests are actually showing that things are wrong.

So how do we deal with that? Synapse and other notebook technologies have something called magic commands, and one of these magic commands is a run command that allows you to reference one notebook from another. So we can actually employ fairly standard software engineering practices to start to modularize and pull apart various bits of logic into separate notebooks.

What we might do is create a separate sales data functions notebook that just defines the functions that we care about: loading the data, removing the duplicate orders, calculating the sales, and saving the output. There's no orchestration in this notebook - it's purely the business logic of what goes on. Once this is in place, we can then create two separate notebooks. One is for our tests, and it references the functions notebook using the magic run command - this cell basically pulls all those defined functions into the executing context of this notebook, allowing us to run our tests without any fear of actually running the ETL process. At the same time, we can create our ETL notebook, which does exactly the same thing: it references the sales data functions notebook using the magic run command, so all the functions defined there are pulled into memory when we're running this notebook, and then we have our ETL orchestration cell here that actually runs the process. What's great now is that we've got two entry points into this process: we've got an ETL process and we've got a test process, and we can actually start to add both of those things into an end-to-end pipeline. And that's exactly what this pipeline is doing. We've got two notebook activities: the first one calls into the sales data tests notebook, and the second one calls into the process sales data notebook, running the ETL process. But the second notebook activity is only gonna run if the first one succeeds.
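Sketching that structure out, the tests notebook might be as simple as this - the notebook names are hypothetical, and in Synapse the magic run command has to sit alone in its own cell:

```python
# sales_data_tests notebook (hypothetical name)

# Cell 1: pull the function definitions from the shared functions notebook
# into this notebook's executing context.
%run sales_data_functions

# Cell 2: run the tests - there's no risk of accidentally running the ETL process,
# because this notebook contains no orchestration at all.
test_orders_with_duplicate_order_ids_are_removed()
test_orders_with_distinct_order_ids_are_kept()
test_sales_are_summed_by_region()
```

The ETL notebook follows the same shape: the same magic run cell at the top, followed by just the orchestration cell.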

So our tests are acting as a quality gate for our ETL process to run. Now, I know that inside our sales data functions notebook, the logic still contains the change we made to use distinct rows, meaning our tests are gonna fail. So if we were to run this pipeline, we would expect the whole pipeline to fail - and it has failed, as expected. We can even see in the outputs the error message explaining why it failed, so we can go and debug and see what's going on. So with a very simple mindset shift - restructuring our notebooks such that each logical piece of processing is defined in its own function - everything becomes very testable. Once we've done that, we can use the magic run command to modularize and split out notebook logic, separating core functionality from test functionality and from orchestration.

And that allows us to line things up in a pipeline, as we've seen here, to add quality gates to our end-to-end process. So, in summary: notebooks are code, and code should be tested. The key to testing notebooks is to treat each cell as a logical step in an end-to-end process, wrapping the code in each cell in a function so it can be tested. The nature of using notebooks for ad hoc and experimental data work means you're probably not gonna work test-first, so this is probably most useful in a regression testing scenario, or for adding quality gates and dealing with edge cases once you've got the core processes up and running. And whilst the demo I've just shown you used Synapse notebooks, the concept is absolutely applicable to Databricks or Jupyter notebooks.

And that brings us to the end of this talk. Hopefully I've shown you that it's absolutely possible to add quality gates and testing to your data solutions - including tabular models, pipelines, notebooks, long-running ETL processes and big data sets. And it's entirely possible to do that in a variety of languages, testing frameworks and tools, such as .NET, Python, TypeScript, JavaScript, Node JS, etc. Now, to end, I'll restate what I said back at the beginning of the talk, which was that fundamentally the question I want to ask yourselves is this: for any data solution, any data insight that you're working on, developing, or responsible for providing to your stakeholders or customers or organizations - does it matter if it's wrong? Because if it does, you need to be able to prove that it's right. If you want to hear more about how we could help you, then feel free to drop us a line at hello@endjin.com. If you've got questions or want to connect, I'm available in the live chat Q&A, or you can find me afterwards in the virtual networking lounge.

Thanks for listening.