Fake it 'til you make it - generating production quality test data at scale. | endjin

Barry Smart 13th September 2022

Data Scotland 2022

Many organisations provide digital products or services that need to handle personally identifiable information. The challenge is providing product and engineering teams with a sufficient volume of realistic looking synthetic data to enable them to design, develop and test their solutions.

Barry presents open source tools and open data sources that can be used to tackle this challenge, and then demos this in action to generate thousands of synthetic customers.

He describes how this approach can be used to build better products, to test products using production quality data at production scale, and embed data quality and best practice information security practices in your engineering processes.

Transcript

Good afternoon, everyone. I hope you're having a good conference. My name is Barry. And today I'm going to talk to you about generating production quality test data at scale. But before I do that, I just wanted to call out all of the wonderful sponsors who've made today's event possible. Data Scotland is very much a community driven event, but without the backing from these sponsors, it wouldn't have been possible to host the event today and bring the community together.

In particular, I wanted to call out the Data Lab. They inspired me back in 2019 to do a career pivot. I was a CTO at a financial services company, but having attended some of the Data Lab events and with the support through their MSc programme, I completed a Masters in artificial intelligence right here at the University of Strathclyde.

Then having graduated in 2019, I joined endjin. We're a Microsoft gold partner. We like to say that "we help small teams achieve big things". And we do that generally in two ways, through data analytics and .NET modernization.

Agenda

So what do I want to talk to you about today?

I'm gonna start with a story about how a change to one line of code cost a client I was working with 30 million pounds. I think that sets the scene really well about why production grade test data is really important.

Then I'm gonna talk about a project that we ran last summer in this space, where we set about generating synthetic customer data. I'll cover the vision, the people who are involved and the tools that were used, which will allow me to demo some of those tools as well.

Finally, I'm gonna wrap up and reflect on the project. Did we achieve what we set out to achieve and what did we learn along the way? And then hopefully at the end, we'll have time for some Q&A.

The £30M Bug

So what went wrong? What resulted in a change to a single line of code costing the client I was working for 30 million pounds?

This dates back to 2001 and the launch of the new electricity trading market in March of that year. The client I was working for had been running a two year program to get ready for the new market, and the company I was working for was one of the vendors.

We had a product team based in London building the software and a number of clients around the UK. I was working with a client based in Glasgow. I was on client site responsible for taking the software from the product team, configuring it and integrating it into the client's technology environment.

Now the bug was introduced about a week before go live. The lead developer on the product team was making some final small changes to the product as part of preparing the final release to go out to our clients. While he was in the code, he noticed an area which to him was confusing: there was a range of conditional statements, and he decided to tidy those up to make it clearer what was going on. He meant well, but unbeknownst to him he introduced a bug, which meant that under certain conditions, when a trader booked to buy, we were notifying the central market agent that they were selling, and vice versa. If the trader booked to sell under these specific conditions, we notified the market that they were buying. And this obviously put the client out of balance, so that what they were generating or consuming didn't match what they'd captured in the trading system.

And that resulted in punitive imbalance charges of about a million pounds a day, for a month. At the end of the month, when the back office settlement process kicked in, that's when they noticed the issue and it was rapidly resolved at that point. Now these were two mature organizations, both were FTSE 100 companies at the time. They'd recognized the risk of bugs in the software, and they'd put a number of barriers in place.

But those barriers failed. So what went wrong? First, the product testing failed to stop the problem. The product team had a set of unit tests that were run over the software. None of those found the bug. The software then arrived with our clients.

Now they'd recognized the need to test the software, so they had a dedicated test team for the program. But the problem was that test team was staffed with independent contractors and they were motivated more by the volume of testing that they did and the number of defects that they found, but were they doing the right type of testing? They certainly weren't finding the right type of bugs!

Then the test approach itself was focused more on testing individual systems rather than testing the end to end business process.

And finally, perhaps the biggest failing was that the test data itself failed to reproduce the conditions under which the bug occurred. Now in the production environment, about 5% of the trades booked hit these conditions. In the test environment, in retrospect, none of the trades they booked replicated those conditions.

And together, all of those things resulted in the bug making its way through into production and costing the client 30 million pounds.

Now this diagram that I've used to illustrate the problem is called a Swiss cheese model and we use it a lot. We tend to use it more proactively with clients, to help them map out the risks that they face in their technology environments, to understand the barriers that they have in place and to assess whether those barriers are sufficient to prevent incidents just like this. And we've got a number of blogs on our website, if you're interested, that talk about Swiss cheese in more detail.

DTAP in Data & Analytics

So fast forward 20 years. And this picture here is generally how we think about testing in data and analytics. On the right we've got the production environment. That's where the end users live generally using the latest production versions of the software over the production data.

On the left, we've got the development team. They work in their own environment where they can make changes to the software, whether that be new features, refactoring the code, re-platforming or bug fixing. All of that goes on in the development environment. And then it's subject to a quality gate. Generally, we want to do scripted, automated testing as part of that quality gate.

And then it goes into a testing and an acceptance phase where the development team and the end users come together to put the software through its paces. There's a dedicated environment for that. And then finally, once that's all gone well and been approved, the software's subject to a final quality gate before it goes into production.

Now over the last five years, companies like Microsoft have put significant investment into the data analytics platforms that they provide to the market to enable this kind of capability to be put in place.

It's been quite interesting because the tooling has historically been quite immature. But it's now catching up with mainstream software engineering. And certainly we feel today with the Microsoft stack there's no excuses for not putting this kind of capability in place from day one with any modern data and analytics project. But there still remains one big challenge and that's around generating realistic test data because it's domain specific.

And a lot of organizations fall into the temptation of just using production data in development and test. But we don't advocate that. As an organization, we don't want to expose ourselves to personal information or commercially sensitive information. We've gotta think about our reputation and even our insurance premiums around this kind of stuff. It's something that we feel passionate about and there's a whole bunch of other reasons. It's not just about the risk of data leakage. It's the ability to generate those scenarios and those data volumes that the organization might not have encountered yet, but that could come as the product becomes more widespread and the number of users increases.

Vision - Synthetic Customer

The need to generate realistic test data is something that we tackle over and over again with clients. But we thought, is there better tooling? Is there a better way of approaching this? Can we generate even higher fidelity data? Can we generate even bigger volumes of this data? And that is what inspired us to run our project last summer with our summer intern students around generating synthetic customer data. Because what we were seeing was a common challenge across a range of clients that we worked with in these industry areas who all have a need to deal with customer data.

So what we mean there is, information about individual people with things that are quite sensitive, their date of birth, their name, their post code, their national insurance number, these kind of things present challenges when it comes to generating realistic data.

And whilst we tackled that challenge individually with each of the clients historically, we wondered whether there was a better way of generating this data. Could we take some of the friction out of the process? Could we generate information that was of even higher fidelity, to really address the core purpose of this, which was to push realistic data through development and test? Could we generate data at ever increasing volumes to simulate the scale at which these platforms need to operate?

So the vision was really to do just that. It was to generate a synthetic customer who lived in a house with all of the attributes that flowed from that.

And by starting with a household, we were able to use demographics for that region to then drive creation of customers and ultimately assign them to products and services in a way that was realistic. We could reflect on the age demographic, the employment demographic, the educational demographic, the wealth demographic. All of those things could be played through into making this data as realistic as possible.

Once we had the household, we could create the customer. We also modelled out the wider members of the household as well, because we recognized that a lot of products these days, things like car insurance now and movie streaming products like Netflix and mobile products, they're not just selling to an individual they're selling to the family or the household members.

So we wanted to generate that data as well, to really enrich it and allow for those kind of scenarios. And then finally with the customer in place and the demographics in place, we could then assign them to the relevant product or service. And as you'll see later on with the demo, for the purposes of the project, we designed a fictitious broadband and TV streaming company and modelled out how we would generate subscriptions for their products.

The overall objective of this was really to generate data at volume that was so realistic that when someone looked at it they would think they were looking at real data, and that would stand the test of analysis and exploration of the data. And hopefully you can see how close we got to that later on when I demo the data in Power BI.

People - Summer Interns

Now, as I said earlier, we'd run this project with our summer interns last summer. We had four interns, and we'd recruited those alongside four new graduates as well, by engaging with over 20 universities across the UK through their graduate milk round.

A lot of these universities are now using electronic platforms, so it allowed us to reach out to prospective candidates through their platforms.

As you can see here, we had 295 applications in total. Around half of those applied for the internship, and then we put them through a process of reviewing their applications, short listing them, phone interviews, and code pairing sessions where we really tested their ability around problem solving and being dropped into new situations and how they dealt with that. And then finally, a meet the team session, after which the four interns were selected.

So I'm really pleased to say that the four successful candidates were all women. Today we still have issues around gender equality within the technology sector. So to have four women on board for the summer working on this project was fantastic. And they worked really well together as a team.

Two Pizza Team

So we had Amy, Charlotte, Claudia, Thea on board and myself. I acted as product owner and I was there to support them and address any blockers that they came across along the way. And it felt like a really empowered team. We were working with agile practices. We had a lot of multidisciplinary input because we had one computer scientist, two engineers, and a biology student on board. That diversity really helped when we were thinking about the problems we were trying to solve and how to address them. So we effectively had that two pizza team that you hear people talking about.

Not too small that we couldn't get anything done, we had a lot of capacity, but not too big that it meant decision making was difficult. It was a perfect size of team to tackle this kind of project.

So this is the vision for what we wanted to build. Over on the far right we've got the end product that we wanted to produce, which was analytics and visualizations over the published synthetic customer data.

Tooling - Synapse Analytics

Over on the far left was the information that we identified that we wanted to ingest, clean up and prepare so that we could use it at the core of the process to generate synthetic customer data.

So with the process sketched out, we then wanted to decide which tools we'd use to deliver it.

And we immediately turned to Azure Synapse Analytics. This is a platform we think is fantastic. We use it in most of our data and analytics engagements with clients these days. And we think about it very much as a Swiss army knife. There's a lot of tools and capability in the box, and just like a Swiss army knife, you might not necessarily use all of the tools. In fact, it's unlikely you'll need to use all the tools for a particular use case, but it's nevertheless reassuring to have that capability in place, should you need it in the future.

So with that in mind, we chose the relevant tools from the box that we thought we needed to leverage for this project. So we used Azure Data Lake Storage to persist the data. We used SQL serverless to build some views over the data lake so that we could import the data into Power BI. We used pipelines, which are effectively Azure Data Factory, to orchestrate the end to end process.

We wrote the core data wrangling logic in PySpark, using Apache Spark notebooks to execute that code over Spark pools. And because we were working as a team, we were working on different aspects of the project in parallel. We needed to leverage Git integration, so we could use things like branching and pull requests to manage the code base. And then finally, as you'll see in the demo, we visualized the data in Power BI.

Right at the heart of this tool is Synapse Studio. And that worked really well for us. It brings all of the components together. It's a web based integrated development environment and in particular, it worked really well with our interns. This was the first time they'd been exposed to enterprise grade cloud platforms like this, and the fact that they had Synapse Studio to bring everything together helped them to get up to speed quickly.

So this diagram just shows you the kind of core process and how those tools mapped onto it. At the top there, you can see how important that kind of developer experience was and how it played into enabling us to deliver all of the functionality we did over that two and a half month period of the summer.

Open Data

So next I wanted to turn my attention to the open data sources on the far left of the process, where we sourced the data that seeded the synthetic customer generator.

We identified these three sources. First of all, the land registry data: that's every house sale since 1995 in England and Wales. A really rich data set. There's 30 million records there, four gigabytes of data. That provided invaluable information about the household addresses and locations.

Then we used two data sources, one from the 2011 census and another one called the indices of deprivation to get postcode district level demographics. So the idea was to marry up the house sales data with the demographics in that area so that we could generate the right customer, given the prevailing demographics in that particular area.

Demo: Pipeline

So what I want to do now is demonstrate how we went about doing that. And I'm going to show you how we ingested the price paid data from the land registry.

This is the website where we sourced the price paid data from the land registry. And if we scroll down here, we can see how that data's provided.

We get a file for the current year, which is appended every month as new data becomes available. And then for every historic year going back to 1995, there's a single CSV file provided that we wanted to download.

What we noticed, if you look in the bottom left hand corner, is that the URL for each of those files is predictable. There's a base URL that points to AWS S3 storage, and then the file name on the end, which can be constructed based on the year relevant to the file.

So we wanted to exploit that pattern to build a pipeline that would automate the process of downloading those files en masse, and then also provide the flexibility so that we could refresh the current year, maybe every month, to load the latest data onto our data lake so that we could run the end to end pipeline with new data.

So if we turn over now to Synapse Studio, which is the web based integrated development environment for Synapse Analytics, we can see the pipeline we've built to do just that.
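As an aside for readers, the predictable URL pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not the Synapse pipeline itself: the base URL below is a placeholder, and the `pp-<year>.csv` naming is an assumption made for the example.

```python
# Sketch of building one download URL per year from a predictable pattern.
# BASE_URL is a placeholder and the file naming scheme is an assumption.
BASE_URL = "https://example.invalid/price-paid-data"


def year_list(min_year, max_year):
    """Return every year from min_year to max_year inclusive."""
    if min_year > max_year:
        raise ValueError("min_year must not be greater than max_year")
    return list(range(min_year, max_year + 1))


def file_url(year, base_url=BASE_URL):
    """Construct the URL of the yearly CSV file (hypothetical naming)."""
    return f"{base_url}/pp-{year}.csv"


# One URL per year; narrowing the range refreshes only the current year.
urls = [file_url(year) for year in year_list(1995, 2022)]
```

Passing the same year as both minimum and maximum gives a single-element list, which is how a monthly refresh of just the current year could be expressed.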

Now I'm gonna kick it off. You can trigger these pipelines to run on a schedule or based on an event such as a file arriving on storage or a message arriving in a queue. But I'm gonna trigger it manually.

So if I kick it off manually, you can see it's prompting me for the input parameters needed by the pipeline, which is the minimum and the maximum year that it should go and load the data for.

I'm gonna use the defaults, because I want all of the data to be downloaded onto the data lake. So I'll say, okay. And then you can see the processes being kicked off.

And the way that pipelines work is it's a set of activities that you can drag onto the canvas to build up an end to end process. You can see here it's pretty simple. We create the year list. Then, for each year that we've created in this list, we step through and we download the file. And the download process is executed via a copy task. And you can see here that it's dynamically constructing the file name from the year that's being processed.

We can see that pipeline's just run. So let's go and have a look at that. So this is the rich logging information that you get for any pipeline. If I scroll to the bottom it shows you that first of all it created the year list. So let's have a look at what that looked like in this case. So it's 1995 all the way through to 2022.

It then triggered the "for each" process, which then in turn triggered all of these copy price paid data tasks. And if we look at one of those, we can see that in this case it's downloaded one of the files, which was 215 megabytes from the HTTP source that's provided by the land registry onto our own data lake storage.

If we flip now over to data lake storage, where that data should be hosted. Looking at the data tab, I've actually already got it open, you can see that data's just been downloaded now onto the data lake.

So that's the outcome of that process running. So that's great. We now have the raw data landed on our own data lake.

Demo: Notebook

So the next stage is now to analyze that data: to understand the data in more depth and to identify any tasks we need to perform on the data to cleanse it and prepare it for use in our synthetic customer generator.

So we do that using notebooks, which are available inside Synapse Studio. As you can see, we're using the development area. Inside the development area, there are different development tools available. So we're looking at notebooks, and we've opened this particular notebook, which we've written to explore the land registry price paid data.

We've written the notebook in PySpark, which is a package for Python that allows you to run jobs on a Spark cluster, but there are other development languages available that you can use in notebooks. And we've attached this to a particular Spark pool, which is another feature of Synapse. It will dynamically spin up Spark clusters for you to run resources like this notebook on.

So we really love notebooks. They combine code, the output from your code and documentation all into one document. And in particular for this stage of the process, where we want to load the data up, explore it and capture our findings and our thoughts as we go through the data, this is a perfect medium to do that.

I'm not going to run the notebook because it takes about 10 minutes to run it end to end, I'll just talk you through what it does. First we're loading in the various packages that we need to work with spark and to plot data. You'll see that we've got some charts in the body of the notebook to visualize the data.

You can see here's an example of where we can capture documentation, which is in markdown format. It allows you to write rich documentation, including linking out to other sources of information that are useful reference points for the particular data that you're exploring.

The first thing we do is define the schema for the data. This is defined by the land registry, it reflects the structure of the CSV file that we're loading. And then through this single line of code here, you can see that we're asking it to read the data straight from CSV into a data frame.

You can see we're basically pointing it to the area of the data lake where our pipeline wrote the data. We're using a wildcard file name. So it's not just loading one CSV, it's loading all matching CSVs. So this is all 28 files that it's going to load. And you can see that process takes about 10 seconds to complete.

It's loaded 27 million rows of data. Across those 28 files, there's about a million house sales per year on average. Then what we do is we add some new features to the data frame. Think of a data frame like a database table, or a sheet in an Excel spreadsheet on steroids. It's basically an area where you can organize tabular data.
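For readers without a Spark pool to hand, the same "define the schema, then read every file matching a wildcard" idea can be sketched with the Python standard library. This is a plain-Python stand-in rather than the PySpark code from the notebook, and the column names are assumptions chosen for this example to mirror the headerless layout of the price paid CSV files.

```python
import csv
import glob

# Assumed column names for the headerless CSV files (illustrative only).
COLUMNS = [
    "transaction_id", "price", "date_of_transfer", "postcode",
    "property_type", "old_new", "duration", "paon", "saon", "street",
    "locality", "town_city", "district", "county", "ppd_category",
    "record_status",
]


def read_price_paid(pattern):
    """Yield one dict per row from every CSV file matching the pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            # fieldnames supplies the schema because the files have no header
            for row in csv.DictReader(handle, fieldnames=COLUMNS):
                row["price"] = int(row["price"])  # enforce a numeric price
                yield row
```

Calling `read_price_paid("some/folder/pp-*.csv")` would stream rows from all matching files in turn. Spark does the equivalent in parallel across a cluster, which is what makes loading tens of millions of rows in seconds feasible.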

Once we've got that data loaded, we can extend it. So we're creating some new features here. We're translating the codes that are used for the property type into full descriptions. So it's easier to understand what's going on here.
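A minimal sketch of that kind of code-to-description translation is shown below. The single-letter codes and their descriptions are my understanding of the price paid data layout and should be treated as assumptions; the function name is made up for this example.

```python
# Assumed mapping of property type codes to readable descriptions.
PROPERTY_TYPES = {
    "D": "Detached",
    "S": "Semi-Detached",
    "T": "Terraced",
    "F": "Flat/Maisonette",
    "O": "Other",
}


def describe_property_type(code):
    """Translate a property type code, tolerating case and whitespace."""
    return PROPERTY_TYPES.get((code or "").strip().upper(), "Unknown")
```

Falling back to "Unknown" rather than raising an error keeps unexpected codes visible in later analysis instead of stopping the load.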

We're adding new columns that record the year and the month and the first of the month based on the date that property was purchased. We've got a little function here that uses regular expressions to extract the postcode area from the postcode, because we want to be able to group the data up by postcode. We run a test on that and you can see it does its job perfectly.

We then run that over the data frame and generate a new column called postcode area. And just to check that's worked, we count the number of distinct postcode areas that it's extracted, which is 110. This is in line with the number of postcode areas that we know are in England and Wales, which gives us reassurance that it seems to have worked.
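The derived columns described above can be sketched in plain Python as follows. The regular expression and function names are assumptions made for this illustration, not the code from the notebook: the postcode area is taken to be the one or two leading letters before the first digit, and the sale date is assumed to start with an ISO `YYYY-MM-DD` date.

```python
import re
from datetime import datetime

# One or two leading letters followed by a digit, e.g. "NE" in "NE4 9DN".
POSTCODE_AREA = re.compile(r"^([A-Z]{1,2})\d")


def postcode_area(postcode):
    """Extract the postcode area, or None if the postcode is missing or malformed."""
    if not postcode:
        return None
    match = POSTCODE_AREA.match(postcode.strip().upper())
    return match.group(1) if match else None


def date_features(date_of_transfer):
    """Derive year, month and first-of-month from a date string."""
    sold = datetime.strptime(date_of_transfer[:10], "%Y-%m-%d").date()
    return {
        "year": sold.year,
        "month": sold.month,
        "first_of_month": sold.replace(day=1),
    }


def distinct_areas(postcodes):
    """Count distinct postcode areas as a sanity check on the extraction."""
    return len({postcode_area(p) for p in postcodes} - {None})
```

Counting the distinct areas and comparing the total with a known figure is a cheap sanity check that the extraction has not silently split or merged areas.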

And here we display the data frame to see the result of all of that. So you can see we've got the price column here with the purchase price of the property, the date it was sold and the full post code for the property. This is one of the columns we transformed from a code into a full description of the property type. There's information whether it was a new or an old property. And then we've got the address components: the street number, the locality, town or city, district, and so on. Which is all really useful information.

And then finally, you can see the features that we built on top of the data. So we've extracted the year of sale, the month of sale, the first of the month at the time of sale, and then the postcode area.

So we're all set up to explore the data. So what we're doing here is setting up some helper functions to plot the data. We want to style all our plots in a similar way. So that's what that's doing. And then we've got a helper function to plot lines and a helper function to plot bars.

So the first thing we do is then analyze the volume of house sales year on year. So we call our helper function, passing in the data that we've grouped up by year. And here you can see we're just counting the house sales data and you can see the data from 1995 all the way through to the current portion of 2022 that's available.

You can see the number of houses that have been sold in that year. And it's quite distinctive here where you can see the drop off as a result of the 2008 global financial crisis and then potentially something going on here that might have been triggered around COVID.

Let's have a look at the prices over the same period. So this is annual average prices by property type. You can definitely see a plateau as a result of the financial crisis. But then things seeming to pick up again.

We can then drill into a bit more detail. So we're just looking at January 2019 onwards here at monthly granularity. We can very clearly see in terms of volume of sales, the impact of COVID. So that seems to have been reflected in the data. But then also three spikes in the data in terms of the volume of house sales. And we were worried about this, actually, we thought, is there a problem with the way we've loaded the data, is this actually right?

And it is. These are the three months in which the stamp duty holidays ended during lockdown. They caused a localized surge in house sales. And then you can see the aftermath of that and then another peak and then the aftermath and then another peak.

And that had an impact on house prices as well. You can see it created quite a lot of volatility in house prices. So there's the first peak, the second peak and the third peak there. So it creates that volatility in the market.

You can also see from this monthly data, there's potentially a bit of a step down in the flat prices as well. Despite all of this volatility, they do not seem to be on the same upward trend that we've seen in recent months. So that was interesting. The data seems to reflect what we know in terms of the macroeconomic effects that have been at play. So that's good.

We then identified a bunch of tasks that we wanted to perform on the data. So we identified that some of the addresses don't have a post code and we wanted to rely on post code to look up our demographics. So we dropped any rows where there's no post code.

We also found that in the data there's commercial properties. So that's things like offices and factories and the like, and they tend to be one or two orders of magnitude more expensive than residential properties. So we decided to drop them from the data set.

We also discovered that there's a whole bunch of properties that have been sold more than one time over the last 25 years. It's quite surprising, actually. There's a bunch of properties here that have been sold 20 or more times in that period. And what we wanted to do was make sure that when we were generating synthetic customers, we didn't have more than one family generated at the same address, which would be a bit awkward. So we de-duplicated that.

Then we finally pushed that data out, back onto the data lake. So we treat this as a specific projection over the data, because we've done some steps on this data that you might not want to do in other use cases. So we write this out to the data lake for the purposes of meeting the requirements for our synthetic customer generator. So you can see here we select the specific columns that we're interested in. And then we write that out to the data lake as a table. So that allows us to address that as a SQL endpoint that we can use to query into downstream processes.
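The three cleansing steps can be sketched together in plain Python. Several details here are assumptions for illustration only: the talk does not say how commercial properties were identified (the default predicate below simply treats property type "O" as a stand-in), which sale was kept for a repeatedly sold address (this sketch keeps the most recent), or which fields defined an address.

```python
def address_key(row):
    """Build a normalised key identifying a property (assumed fields)."""
    return tuple(
        (row.get(field) or "").strip().upper()
        for field in ("postcode", "paon", "saon", "street")
    )


def clean_sales(rows, is_commercial=lambda row: row.get("property_type") == "O"):
    """Drop rows without a postcode, drop commercial sales, keep one sale per address."""
    latest = {}
    for row in rows:
        if not (row.get("postcode") or "").strip():
            continue  # the postcode is needed to look up demographics
        if is_commercial(row):
            continue  # hypothetical rule, see the note above
        key = address_key(row)
        current = latest.get(key)
        # ISO formatted date strings sort chronologically as plain text
        if current is None or row["date_of_transfer"] > current["date_of_transfer"]:
            latest[key] = row
    return list(latest.values())
```

Keeping exactly one row per address is what guarantees that the generator never places two households in the same property.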

Demo: Faker

So that takes us through the data prep, data cleansing stage of the process. We've now got data to work with. Let's have a look at how we use that data to generate synthetic customers. This is a part of the process where we turned to open source. And in this case it was a Python package that we used called Faker.

There are other packages out there that are worth evaluating as well, like Mimesis, but in this case we chose Faker. And I'll demonstrate that in a second, but you'll notice here there's a common thread across the notebook that I showed you earlier and the core logic that we were developing for the synthetic customer generator, we're using Python as the base language.

And we did that consciously because all four of the interns had been exposed already to Python during their University degree. And indeed we see that as a common theme, most modern universities are now running at least one module with STEM graduates to expose them to Python, usually around data and analytics type modules, but also in other areas.

So that's why we chose Python in this case. Python's obviously got that rich ecosystem of open source packages. And therefore we had choice when it came down to choosing the particular package that we wanted to apply.

So let's show you the Faker package in action. And to do that, I'm using another notebook. This time it's running inside visual studio code locally on my laptop. There is fantastic documentation for faker. So please refer to that about how you install it and get up and running with it. But once you've got Faker installed and your Python environment is pretty straightforward to import it into your Python project.

And then the first step is to create an instance of Faker where you provide the locale that you want faker to generate fake data for that comes in two parts. The first part is language. So in this case, it's English. And the second part is the geography. So in this case, it's Great Britain.

The next stage is then to set a seed for the random generator used at the heart of Faker. And that's important if you want to reliably reproduce the results that Faker generates. And that was important to us. And once you've done that, you've got you can start using Faker. So here I'm using a "provider" as faker calls them that will generate a name.

So I basically call that provider and it outputs a name and I can keep calling that provider and it will generate more names on demand. There are a massive range of providers inside faker. They'll do things like credit card numbers, dates of birth, email addresses as we've got here and street addresses. Here's an example.

There's a range of different providers there that you can use to meet your different needs, but we tended to use Faker in a slightly more sophisticated way. We wanted to control the probability of it, choosing certain categories in this case, economic status, as an example.

Remember what we had, we were ingesting open source data in this case, data from the 2011 census, where we had data down to post postcode district level about the distribution of economic status in the population of that area. So in this case, for example, the distribution may suggest that 35% of people would be in employment, 10% would be unemployed, 5% would be full-time students and so on. So by setting that structure up and passing it into Faker to choose one of these categories, based on that probability we could govern how Faker was behaving. In doing so, when it was choosing things like economic status, it was doing that in a way that would reflect the local demographics for that area.

And we could do that en masse. So for example, we could generate 10 examples from that particular set of probabilities. And you can see it generates a distribution of employment status categories here that reflects what you're seeing above in terms of the probabilities.

And that's how we've maintained the realism in the data. We used this approach not just for economic status, but also for marital status, age profiles, and household composition, where we were choosing whether to create a family or a retired couple or a house full of students. That was all driven via this kind of approach.

We plugged all of this together. It's quite sophisticated in terms of all the logic, but at the end of the process, we could call this code to generate a household and the people who lived in that household and we could call it as well to assign them to a particular subscription for the fictional broadband and media streaming company that we imagined this being used for. We could call that over and over again in a loop to generate as many fake customers or synthetic customers as we wanted to generate.
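As a rough sketch of that generation loop: the talk does not show the real code, so every name, weight and age band below is hypothetical, and the subscription step is left out. To stay dependency-free the sketch uses Python's `random` module for the weighted choices rather than Faker.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Person:
    age: int
    economic_status: str

@dataclass
class Household:
    postcode_district: str
    composition: str
    people: list = field(default_factory=list)

# Hypothetical compositions: name -> (weight, age band for each member).
COMPOSITIONS = {
    "One family": (0.5, [(30, 55), (30, 55), (0, 17)]),
    "Retired couple": (0.3, [(66, 90), (66, 90)]),
    "Students": (0.2, [(18, 24), (18, 24), (18, 24)]),
}

def pick_status(rng, age, composition):
    """Choose an economic status consistent with age and household type."""
    if age < 18:
        return "Dependent child"
    if composition == "Students":
        return "Full-time student"
    if composition == "Retired couple":
        return "Retired"
    # Hypothetical 90/10 split for working-age adults.
    return rng.choices(["Employed", "Unemployed"], weights=[0.9, 0.1], k=1)[0]

def generate_household(rng, postcode_district):
    """Pick a weighted composition, then create the people who live there."""
    names = list(COMPOSITIONS)
    weights = [COMPOSITIONS[name][0] for name in names]
    composition = rng.choices(names, weights=weights, k=1)[0]
    household = Household(postcode_district, composition)
    for youngest, oldest in COMPOSITIONS[composition][1]:
        age = rng.randint(youngest, oldest)
        household.people.append(Person(age, pick_status(rng, age, composition)))
    return household

def generate_households(count, seed, postcode_district="CB1"):
    """Call the generator in a loop; the same seed gives the same households."""
    rng = random.Random(seed)
    return [generate_household(rng, postcode_district) for _ in range(count)]

households = generate_households(count=1000, seed=42)
people = sum(len(h.people) for h in households)
print(f"{len(households)} households, {people} people")
```

Passing a seeded `random.Random` instance into every function keeps the whole run reproducible, which mirrors the reason for seeding Faker earlier.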

Demo: Power BI

So the next stage is to move over and show you the output. And we brought this to life using Power BI. This is the Power BI report that we put together to analyze the output from the synthetic customer generator. We imported the output into Power BI, constructed the model, layered on some DAX measures to add some higher order analysis over the data, and then put together some pages to bring that data to life.

In this data set, we've generated 20,000 households. So if you drill into that in the first case, you can see we've generated a range of different house prices which we can explore, a range of different property types, and also the year of purchase of these different properties. You can see it reflects the trends we were seeing in the data earlier, in terms of things like the financial crash in 2008.

The map visualization here groups the data up to the major towns, cities and districts because 20,000 data points is too many to plot on this kind of visualization.

But as we zoom into the data, it's quite interesting to see areas like London, where you can see there's a higher proportion of flats than in other areas, as you'd expect because it's a city center. If we look at areas like Cambridge, for example, you can see a low proportion of flats, so we can drill into the data.

Let's have a look at Cambridge in a bit more detail. So we can see here how it's created households by postcode. It's generated two houses there at the same postcode. They're probably different street numbers, but generally you can see that it's distributed the individual houses around the Cambridge area.

And if we drill into one of these examples, let's pick this terraced house here. We can see that if we now look at the details, this is a one family household. So that's driven by the demographics. It's got two adults and three dependents or children. You can see here that we've created a married couple from their marital status. So again, that's driven by the demographics. They're both employed, again consistent with the demographics of the area.

We've given them an occupation, and we've also created the three children. So you can see here, we've got quite rich information about not just the household itself, but also the people that live in it. Repeat this process 20,000 times and we've now generated 43,000 people, because for every household we're generating at least one person, but generally more than one person.

So if you now filter this data set to look at those who are customers of our fictitious broadband and media streaming company, you can see that this all makes sense. We don't have any children who are customers, just older teenagers and people through all the other age bands. You can see here the distribution of people who are customers: the majority of them are employed.

The marital status of those customers, again, is reflected there as well. So that all seems to make a lot of sense in terms of the demographics.

And then finally we then modelled the subscription to the fictitious product. We created three fictitious products and the interns actually had a lot of fun doing that. They modelled when a customer might start their subscription and also, based on an attrition rate, when they might have left the subscription.

So you can see here that the most popular product is the Cheeta product. There are 11,000 customers using that product, but you can see that over time there were 12 and a half thousand customers, and a chunk of those have subsequently left, which you can see modelled in the data here.

So you see this growth curve for the product. Healthy growth, but also we're modelling the attrition rate. So we're trying to create that next layer around the synthetic customers to simulate their engagement with the product or the service for the particular domain that we're generating these customers for.
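One simple way to model subscription start dates and attrition is sketched below. This is an assumption about how such a model could work, not the interns' actual implementation: the date window, the 3% monthly attrition rate, the 30-day month and the function name are all hypothetical.

```python
import random
from datetime import date, timedelta

def simulate_subscription(rng, window_start, window_end, monthly_attrition):
    """Return (start, end); end is None if still subscribed at window_end."""
    span_days = (window_end - window_start).days
    start = window_start + timedelta(days=rng.randrange(span_days + 1))
    current = start
    while True:
        current += timedelta(days=30)  # approximate a month as 30 days
        if current > window_end:
            return start, None  # survived to the end of the window
        if rng.random() < monthly_attrition:
            return start, current  # customer left in this month

# Hypothetical modelling window and attrition rate.
WINDOW_START = date(2019, 1, 1)
WINDOW_END = date(2022, 6, 30)

rng = random.Random(42)  # seeded so the simulation is reproducible
subscriptions = [
    simulate_subscription(rng, WINDOW_START, WINDOW_END, monthly_attrition=0.03)
    for _ in range(1000)
]
active = sum(1 for _, end in subscriptions if end is None)
left = len(subscriptions) - active
print(f"{active} active customers, {left} who have left")
```

Counting subscriptions by start date gives the growth curve, and subtracting those with an end date gives the number of customers still active at any point in time.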

Retrospective

So in conclusion, first of all, did we achieve what we set out to achieve? I think we did, we were able to generate thousands of synthetic customers on demand, with a rich spectrum of demographics, statistically accurate down to the postcode level. And we could regenerate those results reliably. We also wanted the synthetic customer generator to be configurable so we could adapt it for different domains.

And actually we proved that quite quickly after the project finished because we put it into use. The first example was with a client in the wealth management space. Their production data was highly sensitive. It was personal data and it included information about each client such as their bank account details and their wealth in terms of funds under management.

We were able to use synthetic customer generator in that case to generate a rich, realistic data set for development and test, which meant that we, as a consultant, working with a client, never needed to touch their production data, which is always a good thing.

Another example was a FinTech we were working with. They were moving from a successful startup phase into a scale up phase. They wanted to prove that their platform could scale to meet the anticipated demand. They were growing rapidly and they wanted to make sure that there were no technical blockers to their continued successful growth.

The problem was that they didn't have a sufficient volume of data in production to use in any kind of performance testing. So synthetic customer generator came to the rescue. We generated hundreds of thousands of realistic customer records. And we used that to put the test system through its paces, and in doing so we highlighted a few performance bottlenecks, which we were able to unblock before they became an issue in production.

So two great examples where the capability was leveraged for different clients.

Value

And in terms of the value, with a product like this, we've invested about a man year in the product if you think about those five of us working on it for the best part of two and a half months. So it's a significant investment.

So you've got to be sure that it's gonna generate sufficient value for the product that it's supporting over the lifetime of that product. So clearly you want to develop this kind of capability as early as possible in the product life cycle. And ideally to support the grow phase, but also it's gonna help with the sustain phase of that product.

But what's also quite interesting is once you have this kind of capability, it starts to be a tool that you can use earlier in the life cycle for new products and services. You can use this data set to fuel hackathons or as a tool for bringing product demos or product concepts to life because it gives you personas and rich data to do that.

It's not just a one dimensional capability. There are other use cases that could emerge from it in the future, once you have that capability in your armoury.

Lessons Learned

So in conclusion what we learned is that synthetic data of this nature is incredibly useful. We've got a lot of benefit from it.

Over the 12 months since we created it, we've used it in a lot of different use cases. Hopefully I've shown that, through the open data and the open source packages out there, you can do this yourself.

Also what surprised us is that this product was built by four interns. Undergrads that were on board with us for two and a half months during the summer. I think it helps to demonstrate that you don't need to be hiring specialists in this age where specialists are rare and difficult to recruit. You can grow your own talent by bringing in the right people earlier in their career, providing them with the right tools and the right level of support. And it's amazing what they can achieve.

And finally, we've also demonstrated that Synapse, as that Swiss army knife, is a great environment in which to build solutions like this, because it brings all of the different capabilities you need together, cutting a lot of the friction that you would otherwise encounter and getting you productive as quickly as possible.

So thanks very much for listening. If you've got any further questions, please reach out. I'm on LinkedIn. You'll also find me on Twitter. I'd love to hear any questions and feedback about the talk.

Thanks very much and enjoy the rest of your conference.