Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Architecture Overview)
In my previous post I described how the carbon footprint of long-running data pipelines could be reduced using open source data APIs like the Carbon Intensity forecast from the National Grid ESO. I outlined a conceptual approach that takes metadata about a pipeline workload (where it's being run, and how long it takes) and combines it with carbon intensity data for that geographical location to find the optimal time window in which to schedule the workload.
In this post I'll explore that conceptual approach a bit further, looking at how it would translate into a modern data pipeline architecture using a cloud-native analytics platform like Microsoft Fabric or Azure Synapse.
Carbon-optimised data pipelines
Since what we're talking about here is optimising the scheduling of a long-running data pipeline, we know we're already going to be operating within a cloud-native data platform. So it makes sense that the "optimisation" piece should use the same pipeline architecture. At a high level, we need the following constituent parts:
- The workload pipeline to run
- Metadata about the workload (e.g. in the form of configuration or parameters) that includes the time window available and where it's going to be run (an example configuration is sketched after this list)
- A pipeline that can use carbon intensity data to calculate the optimum schedule to run the workload based on the metadata
- A trigger that kicks off the end-to-end process
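To make the metadata piece a little more concrete, here's a minimal sketch of what that workload configuration could look like, expressed as a Python dictionary. The field names are purely illustrative assumptions, not a Data Factory schema.

```python
# Illustrative workload metadata (field names are assumptions, not a Data Factory schema).
workload_metadata = {
    "workload_pipeline": "nightly-etl",  # the pipeline we want to schedule
    "azure_region": "uksouth",           # where the workload will run
    "earliest_start": "00:30",           # earliest time the workload may start
    "latest_start": "04:30",             # latest time it can start and still finish on time
    "duration_hours": 3,                 # expected run time, including re-run allowance
}
```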
Putting that all together, the end-to-end process looks something like this: a trigger fires at the earliest possible start time, an optimisation pipeline uses the carbon intensity forecast to calculate the best start time within the available window, and the workload pipeline is then invoked at that time.
Translating into cloud native
The constituent parts above translate very easily into a modern cloud analytics platform like Microsoft Fabric or Azure Synapse. Central to both of these platforms is the notion of data orchestration through pipelines, something that Azure Data Factory has also provided as a standalone tool for a long time. (You could potentially use something like Azure Logic Apps to orchestrate the process in a similar way, but as we're focussed here on data processing, a cloud data platform seems like the logical fit.)
And what's great is that whether it's Azure Data Factory, Azure Synapse (Pipelines), or Microsoft Fabric Data Factory, the pipeline architecture will be exactly the same.
The workload pipeline
This is your long-running ETL process, the one that we're trying to schedule at the optimum time. It will take the form of a pipeline definition in Data Factory that has the necessary activities to perform the workload.
Metadata about the workload
These are the parameters needed to optimise the scheduling. At a minimum this would be the earliest and latest start times at which you want to trigger the pipeline, factoring in how long it takes to run (including an allowance for re-running after failures) and when it needs to be finished by.
So, if you need to wait until after midnight for a process to start, it takes 3 hours to run, and you need it to be finished by 8am, then your time window might be between 00:30 and 04:30.
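As a quick sanity check on that example, the window can be derived with a few lines of Python. The two half-hour buffers are assumptions, standing in for trigger/optimisation overhead at the front and a re-run allowance at the back:

```python
from datetime import datetime, timedelta

# Worked example from above: can't start before midnight, runs for ~3 hours,
# and must be finished by 08:00. The 30-minute buffers are assumed values.
not_before = datetime(2024, 1, 1, 0, 0)   # earliest the process may begin
deadline = datetime(2024, 1, 1, 8, 0)     # hard finish time
run_time = timedelta(hours=3)
startup_buffer = timedelta(minutes=30)    # time for trigger + optimisation step
rerun_buffer = timedelta(minutes=30)      # allowance for re-running after a failure

earliest_start = not_before + startup_buffer        # 00:30
latest_start = deadline - run_time - rerun_buffer   # 04:30

print(f"Window: {earliest_start:%H:%M} to {latest_start:%H:%M}")  # Window: 00:30 to 04:30
```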
The other variable is the location where your workload is running. In the UK we have two Azure regions - UK South (London) and UK West (Wales). Knowing the region that we're going to run the pipeline in allows us to get region-specific carbon data so that we can optimise the scheduling accordingly. Most likely this is something that's set at deployment time rather than a dynamic run-time parameter, but there's nothing stopping you from deploying your workloads into multiple locations (you might be doing that already from a DR perspective) and deciding at runtime which location is optimal (more on this later...)
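The Carbon Intensity API exposes regional forecasts keyed by a region ID, so the Azure region can be mapped to the corresponding API region. The IDs below are my reading of the API's regional list and should be treated as assumptions to verify against the documentation:

```python
# Hypothetical mapping from Azure region to Carbon Intensity API region ID.
# Verify the IDs against the Carbon Intensity API's regional documentation.
AZURE_REGION_TO_CARBON_REGION_ID = {
    "uksouth": 13,  # UK South (London) -> API "London" region
    "ukwest": 7,    # UK West (Wales) -> API "South Wales" region
}
```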
Optimisation pipeline
This is another pipeline, one that contains activities to call the Carbon Intensity API and process the response. The output variable from this pipeline will be the optimum start time at which to trigger the workload pipeline.
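To illustrate the logic this pipeline needs to implement, here's a minimal Python sketch (as you might run from a notebook activity) rather than a Data Factory definition. It slides the workload's duration across the half-hourly forecast blocks and picks the start time with the lowest average forecast intensity. The regional endpoint path and response shape reflect my understanding of the public Carbon Intensity API, so check them against the current documentation before relying on them.

```python
import math
from datetime import datetime, timedelta

import requests

CARBON_API = "https://api.carbonintensity.org.uk"

def optimal_start_time(region_id: int, earliest: datetime, latest: datetime,
                       duration: timedelta) -> datetime:
    """Return the start time in [earliest, latest] with the lowest average
    forecast carbon intensity over the workload's duration."""
    # The forecast needs to cover the latest possible finish time.
    url = (f"{CARBON_API}/regional/intensity/"
           f"{earliest:%Y-%m-%dT%H:%MZ}/{latest + duration:%Y-%m-%dT%H:%MZ}/"
           f"regionid/{region_id}")
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    blocks = response.json()["data"]["data"]  # half-hourly forecast blocks

    slots_needed = math.ceil(duration / timedelta(minutes=30))
    best_start, best_avg = None, float("inf")

    # Slide a window of the required length across the forecast blocks.
    for i in range(len(blocks) - slots_needed + 1):
        start = datetime.strptime(blocks[i]["from"], "%Y-%m-%dT%H:%MZ")
        if not (earliest <= start <= latest):
            continue
        window = blocks[i:i + slots_needed]
        avg = sum(b["intensity"]["forecast"] for b in window) / slots_needed
        if avg < best_avg:
            best_start, best_avg = start, avg

    return best_start
```

In Data Factory terms, the same logic could map onto a Web activity to call the API followed by activities (or a single notebook/script activity) that pick the lowest-carbon slot and write it to the pipeline's output variable.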
Pipeline trigger
This is the thing that kicks the whole process off, and in Data Factory terms this means a pipeline trigger, set to run at the earliest start time you've defined in your workload metadata. It will pass all the necessary metadata parameters into the child pipelines to run the end-to-end process.
Pipeline definition
So now we have a conceptual architecture that can be delivered on a pipeline-based analytics platform. In the next post, I'll walk through the pipeline definition in detail, and provide a full code sample for you to use in your own workloads!