How do I know if my data solutions are accurate?
My last post argued that the recent events at Public Health England weren't "caused by Excel", but by human error, and a basic lack of understanding of the tools and processes that had been put in place to gather, prepare and present the data. I described a simple process for risk and mitigation analysis that would have highlighted the risks involved, and could have prevented the data loss if the appropriate protection barriers were put in place.
One of these barriers was around the quality assurance process - the absence of which meant that inaccuracies in the data ended up going unnoticed. Like any application, the need to validate business rules and security boundaries within a data solution is important, as well as the need for ensuring that quality doesn't regress over time as the solution evolves. Especially so if the data is being used for critical decision making. Data insights are useless and, as we've seen, even dangerous, if they can't be trusted, so should be treated as any other software project with respect to testing - by building automated quality gates into the end to end development process.
But, it isn't easy. As business applications have given more and more power to the "super user" - the distinctions between software developed by software engineers and business users get increasingly blurred. James Kwak wrote about the importance of Excel back 2013, and not a huge amount has changed since, other than business users have now got even more powerful tools.
But whilst the tools generally focus on enabling and empowering the business to get insights and answers more quickly, the best practice approaches to testing and automation are still often overlooked. Is an Excel workbook "an application?". Should it be version controlled, tested and updated using repeatable and reliable processes, that have built in quality gates? If people's lives are at stake, then arguably, yes. And if this can't be done, then maybe it's just not the right tool for the job.
So, what is? And how can you avoid the same mistakes in your own data solutions? Well, there's no simple answer, but this post attempts to highlight some of the key points - ultimately acknowledging that if you're doing anything serious with data, then you should be taking all of this seriously.
Representative data is key
One of the first questions we ask with any new project or customer is "do you have representative test data?". Often the answer is no - QA might be done over a copy/back up of the live system, or a using a "dummy" set of data that is drastically different from the live production system.
We ask for "representative" data, rather than actual data, as we don't want personally identifiable information, or commercially sensitive information, or anything else that exposes us to the risks involved in a data breach. GDPR, and equivalent regulations in other regions, have significant implications and should be taken seriously.
But if that's what we don't want, what do we want in a representative data set? "Representative" can mean many things, for example:
Size - do we expect 1,000 records or 1,000,000,000? This could have implications on all sorts of things - how a query would perform, how/where we can store the data, how users will interact with the data etc. The Track and Trace issue was caused by an old version of Excel truncating raw CSV files to around 65,000 rows. The "Excel template" that was developed was clearly never tested with anything over this limit (or at least the wrong version of Excel was introduced after this testing ocurred) despite being a fairly predictable scenario given the trends in the data. If we expect the application to grow to billions of rows of data, then we should be testing against a similar sized data set.
Quality - do we expect all the data to be present and correct all of the time? Or do we need to account for common error cases, and unexpected edge cases? Missing values, truncated values, duplicate values, malformed data etc etc. All of this can affect formulas and calculations, cause processing to fail, and the wrong information to be presented to the end user, or fed into the next system. We should be testing for these scenarios and ensuring that they're each handled appropriately.
Content - is the data evenly spread, or heavily skewed in one direction? This could affect how the data is presented, queried or processed. What are the limits in the data - the maximum and minimum values? Do we have data within these expectations, and do we handle the outliers appropriately? Does the test data we have "look like" real data - e.g. have we accounted for multi-lingual, multi-currency, multi-timezone? This is sometimes the hardest part, as we need a data set that's as close to "real" as possible, without actually being real. Automatically generated test data might not cut it. It may not be possible to reliably anonymise a real data set for testing purposes.
It's often the case that you need multiple test data sets - a smaller, targeted, "functional" data set that includes all the scenarios that you need to cater to, alongside a larger "big" data set that proves you can still process, store and query at the right scale, performance and cost that you need. But, the point is that to test a data solution, you need test data.
Formulas are code and should be tested
In the world of software development, the concept of automated testing has been around a long time. The idea that code that contains logic that powers business processes needs to be validated is fairly well established, and there's a huge variety of methodologies and frameworks for doing this. And data solutions shouldn't be treated differently - formulas, query logic, expressions, macros, calculated columns, stored procedures - they're all code, and therefore need to be validated.
Of course, this isn't any easy thing to do - there's no "built in" testing framework for a lot of these tools, it's on the end user to do find a way to do this. But that requires some knowledge and awareness that it should be done in the first place. If you're going to power your business from Excel, then the people working in the spreadsheets need to recognise the responsibility that they have in this.
Testing in Excel?
Excel isn't the only "super user" data tool, but it's certainly the most widely used and widely understood, so it's worth highlighting what testing might look like. Excel has a whole bunch of features built in that help with data validation. Conditional formatting is a great way to visually layer on things that aren't quite right, and there's built in rules, for example finding duplicate values.
So, a simple approach to validating a worksheet could be to add a "checks and balances" worksheet that applies conditional formatting to expected indicators - for example missing or duplicate values, max and min values etc. A simple check on the expected row count v.s. the actual row count in the data sheet might have prevented the truncated data issue from causing as much impact as it did - a simple "there should be 500 rows in this file" is all that was needed to add a cross check for when the new rows were appended to the master sheet.
Interestingly, a few years ago, we designed an entire application architecture around hosting an Excel spreadsheet "model" and using it to dynamically generate a web-based user interface, with form validation driven entirely from the rules defined in the cells. It may sound crazy now (to be fair, it still sounded a bit crazy at the time), but the first step in going from an Excel file on someone's laptop to a full-blown web app, was designing the testing and validation process. A new version of the file couldn't be applied unless all the checks and balances inside the worksheet had validated successfully.
There's always a way to test something
The point I'm trying to make is that there's always a way to test something, even if it's not immediately obvious, super elegant or even driven by technology (accountants have various cross-checks that they've been using since long before computerized automation became the norm).
Fast forward to 2020, and we see Excel being used less and less as the primary data store for business applications - but as business tools continue to evolve the problem doesn't go away, it just moves on. Power BI is one such tool that is a natural progression for those wanting to get more analytical capabilities than they can with Excel. It's also worth calling out as it's what Public Health England, and the NHS were using for some of the publicly available COVID reporting.
As a reporting tool, Power BI is inherently difficult to test due to the lack of available testing frameworks, the lack of control over the rendering of visualisations, and the encapsulation of business logic within the underlying service. But, this shouldn't stop us and, to prove that point, I've recently written about a couple of approaches for automated testing at various layers of a Power BI solution, firstly testing the Power BI Dataflows (the centralised and reusable data preparation steps that feed into the reports), and then testing the reports themselves, by validating the calculations and data in the underlying tabular models.
Neither of these approaches are immediately obvious, and definitely out of reach to those without programming and software development expertise, but they were a necessity in validating that the reporting functionality we were developing for a customer was accurate.
Nearly all data storage technologies support some sort of programmatic connectivity, meaning tests can be executed over the state of the data. But what you really want to test is the business logic that you're applying - did all the data end up in the right place, has it been processed correctly, are the algorithms/calculations/transformations accurate, are the security rules in place? etc.
Use the right tool for the job
Finally, it's worth highlighting again that blaming limitations in Excel for the issues that arose isn't really an acceptable response. Yes, Excel is easy to use, but that doesn't always mean it's the right tool for the job. If data quality, trust, reliability and security are important then it most likely isn't the right tool for the job.
Kwak sums up the problem pretty succinctly:
"Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way.”
So what is the right tool? Well, as a career consultant, I'm very used to giving the answer "it depends"! It depends on the context, requirements (functional and non-functional), and constraints. Ian Griffiths provided a more detailed response to this in his recent talk at SQL Bits - "Navigating the Bewildering Array of Data Services in Azure" aimed at helping you understand the many options to quickly narrow down a small set of candidate technologies.
But it's clear that, regardless of the tool, a grounding in quality assurance practices and software development processes is important.
Firstly - automation is king. Being able to repeatably and reliably run the same set of tests each time is paramount in ensuring that quality or behaviour doesn't regress over time as things change. Change might be caused by system enhancements, bug fixes, data volumes, security rules, or simply the passing of time (anyone remember the Y2K bug?)
Describing expected behaviour in an easily understandable way, using realistic examples ensures that business stakeholders and development teams are talking the same language. Techniques like specification by example and behaviour driven development are equally as applicable to data solutions as they are to application development - just as with my approach to testing Power BI reports with executable specifications.
Benchmarking is another technique that can be used to great effect with data solutions - especially when it comes to measuring performance/load/scale characteristics. We used it to great effect when evaluating the capabilities of the new, hugely impressive SQL on Demand service in Azure Synapse analytics. But, it's equally applicable to testing machine learning algorithms, ensuring that the quality of your predictive modelling doesn't degrade as models are retrained over newly captured data.
All of the above only highlights further that the Excel-based, manual processes in place at Public Health England weren't adequate for the task in hand.
Whilst testing Excel spreadsheets, Power BI reports, or other business-user focused tools isn't something that many people think about, it's critical that business rules and associated data modelling steps are validated to ensure the right data insights are available to the right people across the organisation.
Whilst there's no "obvious" way to do some of this testing, this post argues that that's not an excuse for ignoring it. In fact, in order to collectively raise the game in some of the tooling that's running our lives, we have a duty to deal with it properly.
Want some help with your data strategy?
If you'd like to know more about data services & analytics on Azure, we offer a free 1 hour, 1-2-1 Azure Data Strategy Briefing aimed at business and technical decision makers. Please book a call and then we'll confirm the time and send you a meeting invite.