5 min watch
By Barry Smart, Director of Data & AI

Auditing UK energy policy with twenty years of government data, strict data contracts, and a DuckDB medallion model running on a laptop instead of a cluster.

About this talk

DuckCon

Barry Smart, Director of Data and AI at endjin, sets out to audit UK energy policy with his dad, a fellow energy-industry veteran, on a single laptop. Their question: is the dash to Net Zero quietly compromising the security, reliability, and affordability of energy? They answer it not with opinion but with twenty years of fragmented government data.

Twenty-five years ago, work like this meant a fortune and a monolithic data warehouse. This time it is simple Python ingestion guarded by strict data contracts, a DuckDB medallion model, and requirements written in Gherkin and implemented as composable, fully tested functions, an approach that did not scale when tried with PySpark. The result: man-months of cost and "data fear" collapsed into a short laptop project, and a glimpse of how data teams freed from infrastructure can finally work as innovation teams.

Chapters

  • 00:00 Introduction: endjin, DuckDB, and auditing UK energy policy
  • 00:50 The data challenge and lessons from a 25-year-old data warehouse
  • 01:40 Architecture: data contracts, Parquet, and a DuckDB medallion model
  • 02:40 A data-driven approach with Gherkin and composable relations
  • 03:40 Generating insights and why DuckDB scales where PySpark did not
  • 04:30 New ways of working for data teams
Read the transcript

Thank you very much. Thanks for having me. My name is Barry Smart. I live in Scotland and I'm director of data and AI at endjin. We are a small, fully remote technology company based in the UK, but we've got clients all around the world. We're huge fans of DuckDB. We've written a series of blog posts that share our experiences adopting DuckDB in the field. So if you're interested, please go and have a look at our blogs.

All I want to talk to you about today is a personal project that I'm working on with my dad. We've both spent a lot of time over our careers in the energy industry, and we're keen to use that experience to audit energy policy in the UK. Our concern is that the dash to Net Zero is perhaps compromising the security, reliability, and affordability of energy. Now, there's a lot of opinion out there in the general press, and we're determined to form our own opinion, taking an evidence-based approach.

The data that we need to do that is readily available. It's published by a range of government agencies, but, as is typical of this kind of data, it's very siloed and very difficult to work with. The good news is that I've worked with this data before. It was 25 years ago at Scottish Power. We spent a fortune and a lot of man-hours building a traditional monolithic data warehouse. Some of you might be old enough to remember this kind of thing.

And whilst I could reuse the domain knowledge from that experience, I didn't want to reuse that architecture. But thankfully, in the last 25 years, thanks to Moore's Law and, more recently, DuckDB, the most powerful analytics engine I've got access to is now actually on my desktop. It's my laptop, and that's what I've used for this project.

So I've got some simple Python modules to ingest the data and apply strict data contracts to what ends up landing in Parquet format as trusted data. Then in DuckDB I've got medallion layers. Each layer is a schema in the database. Bronze is zero-copy views over the Parquet data. Silver is where all the hard work gets done to aggregate and work with the data. I materialize data in tables and then I project views into gold to generate the insights that I'm looking for. So all of this is running on my laptop, but I've got confidence at some point I can push this to the cloud.

So one small technical insight I've chosen is that I've used a data-driven approach to building this solution. I've written my requirements in Gherkin syntax, and this has helped to deliver clean, self-documenting code and an extensible architecture. So what it leads to is functions that look like this. We take in a DuckDB relation type, and we return a DuckDB relation type. So it makes it a unit of functionality that's easy to test, but it also then allows me to compose those units of functionality together to build more sophisticated, complex transformations. And then when it comes to materializing these relation types, I've still got all the benefits of DuckDB with query optimization, predicate pushdown, and the like.

So what that's allowed me to do, because I've got a suite of tests, I've built the functionality incrementally all along. Every new release, every new insight I'm generating, I've got confidence that what I'm showing is right because I've got that suite of tests. We've tried to adopt this approach with PySpark in the past, and it just doesn't scale. But DuckDB just eats this for breakfast. It's a really nice way of working.

So we can generate really detailed insights like this, looking at particular points in time when there's been a major event on the network. And now that we can detect these events, we've also been able to zoom out and look at the macro scale across big chunks of the data set. So we've gone from many man-months, a whole lot of money, a whole lot of pain, definitely data fear, to a project that's taken me a very short space of time to deliver, with a bit of help from Claude Code along the way, running on my laptop using DuckDB.

But I think what's more impressive about this technology is not necessarily these numbers. It's the new ways of working that it's enabling. It's allowing data teams to not worry about complex infrastructure. They can deal straight with the data, get on with delivering insights and respond to the feedback from users. And it's really transforming the way that we work, the way that our clients work, because they can work as innovation teams. That socio-technical system that we've been bound to for all of these years is fundamentally changing, and we can really align ourselves to the priorities and value streams within the businesses that we work with.

Thank you very much.

About the presenter

Barry Smart

Director of Data & AI

Barry has spent over 25 years in the tech industry, from developer to solution architect, business transformation manager to IT Director, and CTO of a £100m FinTech company. In 2020 Barry's passion for data and analytics led him to gain an MSc in Artificial Intelligence and Applications.