Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Introduction)
The cloud offers many advantages when considering the environmental impact of your application and data workloads. Fundamentally, compute power needs electricity, and generating that electricity consumes resources and emits CO2. But cloud platform services (PaaS) and serverless compute offerings provide shared, on-demand processing power, which reduces resource use compared with dedicated machines that are constantly up and running (and probably sitting mostly idle). Autoscale functionality (or "elastic compute") takes it a step further by increasing or decreasing the resources provisioned according to the size and scale of the workload - no need to over-provision "just in case". Whilst a lot of the focus and marketing around these benefits is on financial cost - you're only paying for what you need, when you need it - the statement is equally true when you consider the carbon footprint.
The major cloud providers have also heavily invested in running their data centres in a sustainable and efficient way - most are trying to become carbon neutral or carbon negative. A 2018 study found that using the Microsoft Azure cloud platform can be up to 93 percent more energy efficient and up to 98 percent more carbon efficient than on-premises solutions. So, by moving your workloads into the cloud, you're already benefitting from this investment, and reducing your environmental impact.
But there's always more that could be done. Especially when you start to consider the specific characteristics of your workloads, and have control over how, where and when they're being run.
Optimising Cloud Data Pipelines
Data processing pipelines are a great example of this - the cloud native version of classic ETL processes. As data and analytics needs increase, the flexibility that cloud data pipelines provide through polyglot processing, autoscale compute, and parallel processing (e.g. with Spark) means that data workloads can be delivered more efficiently and optimised for the specifics of the use case.
However, one common pattern still remains. We see it in every data integration/cloud data platform project that we work on, regardless of the industry, domain, or technology:
The overnight batch process.
The requirement for delivering up-to-date reporting or insights on a daily basis is at the core of the vast majority of line of business applications and reporting workloads. Typically the requirement is based around data being refreshed before the start of the business day, so there's generally going to be a window of time in which the processing can run. Maybe it needs to be after 12:00am to allow for all the previous day's data to be captured. And maybe you know that the early risers are going to start logging on to view the latest dashboards at 8:00am. So, once you have a rough idea of how long the processing is going to take, unless it consumes your entire time window, you have flexibility in when you schedule your big data workload.
The pattern also applies when you zoom out - month-end, quarterly and year-end reporting share the same characteristics. The processes may be bigger and longer running, but the time window in which they need to run is usually bigger too.
So, rather than pick an arbitrary point in time to schedule our workloads every night, month or year, what if we considered other factors such as the carbon footprint of that processing? Is there an optimal time to schedule our processing for the lowest impact, and if there is, how would we calculate it?
Calculating Carbon Intensity
Some people might be aware of peak and off-peak electricity tariffs - off-peak tariffs are generally offered at night, when demand is at its lowest. Common sense might suggest that things are more efficient when not running at peak demand, and therefore have a lower carbon footprint, but there's actually a lot more nuance to it than that once you start looking into it.
But, the good news is that others have already done this analysis. The National Grid ESO, in partnership with Environmental Defense Fund Europe, University of Oxford Department of Computer Science and WWF, have developed the world's first Carbon Intensity forecast with a regional breakdown for the UK. They explain that the carbon intensity of electricity is sensitive to small changes in carbon-intensive generation - carbon intensity varies by hour, day, and season due to changes in electricity demand, low carbon generation (wind, solar, hydro, nuclear, biomass) and conventional generation.
You can read more about the methodology they've used to model the Carbon Intensity data forecast here.
The even better news is that they've developed APIs (as well as providing other integration options) for others to use this data.
For example - this HTTP GET request returns the current carbon intensity for London (Region Id 13):
https://api.carbonintensity.org.uk/regional/regionid/13
Here's an example response, which shows a generation mix made up largely of gas and imports, resulting in an overall forecast of high carbon intensity.
{
  "data": [
    {
      "regionid": 13,
      "dnoregion": "UKPN London",
      "shortname": "London",
      "data": [
        {
          "from": "2024-03-11T15:00Z",
          "to": "2024-03-11T15:30Z",
          "intensity": {
            "forecast": 190,
            "index": "high"
          },
          "generationmix": [
            { "fuel": "biomass", "perc": 0 },
            { "fuel": "coal", "perc": 0 },
            { "fuel": "imports", "perc": 39.9 },
            { "fuel": "gas", "perc": 39.2 },
            { "fuel": "nuclear", "perc": 6.6 },
            { "fuel": "other", "perc": 0 },
            { "fuel": "hydro", "perc": 0.8 },
            { "fuel": "solar", "perc": 2.1 },
            { "fuel": "wind", "perc": 11.4 }
          ]
        }
      ]
    }
  ]
}
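As a minimal illustration of consuming this endpoint, the snippet below fetches the current intensity for London and prints the forecast and generation mix. It uses Python with the requests library and assumes the response shape shown above; treat it as a sketch rather than production code.

    import requests

    # Current half-hourly carbon intensity for London (region id 13),
    # using the regional endpoint shown above.
    REGION_ID = 13
    url = f"https://api.carbonintensity.org.uk/regional/regionid/{REGION_ID}"

    response = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
    response.raise_for_status()

    # The payload nests a list of half-hour periods under data[0]["data"].
    region = response.json()["data"][0]
    period = region["data"][0]

    print(f"{region['shortname']} ({period['from']} - {period['to']})")
    print(f"Forecast intensity: {period['intensity']['forecast']} gCO2/kWh "
          f"({period['intensity']['index']})")

    # Show the generation mix, highest percentage first.
    for fuel in sorted(period["generationmix"], key=lambda f: f["perc"], reverse=True):
        print(f"  {fuel['fuel']:<8} {fuel['perc']}%")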
To use their own words:
"The goal of this API service is to allow developers to produce applications that will enable consumers and/or smart devices to optimise their behaviour to minimise CO2 emissions. Our OpenAPI allows consumers and smart devices to schedule and minimise CO2 emissions at a local level."
Of course, there are limitations - the data/modelling is for the UK only, and the forecast includes CO2 emissions related to electricity generation only. But, as the saying goes: perfect is the enemy of good. There's already huge value in this API for the purposes of optimising our data processing pipelines.
The Concept
So, given the above, we know that:
- We have long-running data pipelines that consume on-demand, auto-scale compute in cloud data centres
- A lot of these data pipelines have some flexibility in when they need to run, within a defined time window
- We can predict the carbon intensity associated with electricity generation at a regional level within the UK
Adding to that, we also know:
- The location of the cloud data centres serving the UK, including Azure (London and Cardiff) and AWS (London and Ireland)
Which means that we should be able to design intelligent data pipeline scheduling that uses the carbon intensity forecast to choose, within the available time window, when our processing should run, according to the geographical region in which it is running.
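To make the idea concrete, here's a rough sketch of the scheduling logic, assuming we've already retrieved a list of half-hourly forecast slots in the same shape as the "data" array shown earlier (the API also documents forward-forecast endpoints for exactly this). Given an allowed window and an estimated pipeline duration, it picks the start time with the lowest average forecast intensity. The function name and window handling are illustrative assumptions, not part of the API or of any particular orchestration tool.

    import math
    from datetime import datetime, timedelta

    def parse_ts(ts: str) -> datetime:
        # Timestamps in the API responses look like "2024-03-11T15:00Z".
        return datetime.strptime(ts, "%Y-%m-%dT%H:%MZ")

    def best_start_time(slots, window_start, window_end, duration):
        """Pick the start time with the lowest average forecast intensity.

        slots: half-hourly forecast periods shaped like the 'data' array above,
               i.e. dicts with 'from', 'to' and 'intensity': {'forecast': ...}.
        window_start / window_end: datetimes bounding when the pipeline may run.
        duration: estimated pipeline runtime as a timedelta.
        """
        # Keep only slots fully inside the allowed window, in chronological order.
        usable = sorted(
            (s for s in slots
             if parse_ts(s["from"]) >= window_start and parse_ts(s["to"]) <= window_end),
            key=lambda s: parse_ts(s["from"]),
        )
        slots_needed = max(1, math.ceil(duration / timedelta(minutes=30)))

        best = None
        for i in range(len(usable) - slots_needed + 1):
            run = usable[i:i + slots_needed]
            avg = sum(s["intensity"]["forecast"] for s in run) / slots_needed
            if best is None or avg < best[1]:
                best = (parse_ts(run[0]["from"]), avg)
        return best  # (start datetime, average gCO2/kWh) or None if nothing fits

    # Example: a 2-hour pipeline that must run between midnight and 08:00.
    # start, avg = best_start_time(forecast_slots,
    #                              window_start=datetime(2024, 3, 12, 0, 0),
    #                              window_end=datetime(2024, 3, 12, 8, 0),
    #                              duration=timedelta(hours=2))

Half-hour slots are used simply because that's the resolution the forecast is published at, as the example response above shows.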
And the rest of this series of posts demonstrates how it can be done!
In the next post, I'll describe how this conceptual approach translates into a modern data pipeline architecture using a cloud native analytics platform like Microsoft Fabric or Azure Synapse.