Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Architecture Overview)
In my previous post I described how the carbon footprint of long-running data pipelines could be reduced using open source data APIs like the Carbon Intensity forecast from the National Grid ESO. I outlined a conceptual approach that takes metadata about a pipeline workload (where it's being run, and how long it takes) and combines it with carbon intensity data for that geographical location to find the optimal time window in which to schedule the workload.
In this post I'll explore that conceptual approach a bit further, looking at how it would translate into a modern data pipeline architecture using a cloud-native analytics platform like Microsoft Fabric or Azure Synapse.
Carbon-optimised data pipelines
Since what we're talking about here is optimising the scheduling of a long-running data pipeline, we know we're already going to be operating within a cloud-native data platform. So it makes sense that the "optimisation" piece should use the same pipeline architecture. At a high level, we need the following constituent parts:
- The workload pipeline to run
- Metadata about the workload (e.g. in the form of configuration or parameters) that includes the time window available and where it's going to be run (an example configuration is sketched after this list)
- A pipeline that can use carbon intensity data to calculate the optimum schedule to run the workload based on the metadata
- A trigger that kicks off the end-to-end process
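To make the metadata piece a little more concrete, here's a minimal sketch of what that workload configuration could look like, expressed as a Python dictionary. The field names are purely illustrative assumptions, not a Data Factory schema.

```python
# Illustrative workload metadata (field names are assumptions, not a Data Factory schema).
workload_metadata = {
    "workload_pipeline": "nightly-etl",  # the pipeline we want to schedule
    "azure_region": "uksouth",           # where the workload will run
    "earliest_start": "00:30",           # earliest time the workload may start
    "latest_start": "04:30",             # latest time it can start and still finish on time
    "duration_hours": 3,                 # expected run time, including re-run allowance
}
```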
Putting that all together, the end-to-end process looks something like this: a trigger fires at the earliest possible start time, an optimisation pipeline uses the carbon intensity forecast to calculate the best start time within the available window, and the workload pipeline is then invoked at that time.
Translating into cloud native
The constituent parts above translate very easily into a modern cloud analytics platform like Microsoft Fabric or Azure Synapse. Central to both of these platforms is the notion of data orchestration through pipelines, something that Azure Data Factory has also provided as a standalone tool for a long time. (You could potentially use something like Azure Logic Apps to orchestrate the process in a similar way, but as we're focussed here on data processing, a cloud data platform seems like the logical fit.)
And what's great is that whether it's Azure Data Factory, Azure Synapse (Pipelines), or Microsoft Fabric Data Factory, the pipeline architecture will be exactly the same.
The workload pipeline
This is your long-running ETL process, the one that we're trying to schedule at the optimum time. It will take the form of a pipeline definition in Data Factory that has the necessary activities to perform the workload.
Metadata about the workload
These are the parameters needed to optimise the scheduling. At a minimum this would be the earliest and latest start times at which you want to trigger the pipeline, factoring in how long it takes to run (including an allowance for re-running after failures) and when it needs to be finished by.
So, if you need to wait until after midnight for a process to start, it takes 3 hours to run, and you need it to be finished by 8am, then your time window might be between 00:30 and 04:30.
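As a quick sanity check on that example, the window can be derived with a few lines of Python. The two half-hour buffers are assumptions, standing in for trigger/optimisation overhead at the front and a re-run allowance at the back:

```python
from datetime import datetime, timedelta

# Worked example from above: can't start before midnight, runs for ~3 hours,
# and must be finished by 08:00. The 30-minute buffers are assumed values.
not_before = datetime(2024, 1, 1, 0, 0)   # earliest the process may begin
deadline = datetime(2024, 1, 1, 8, 0)     # hard finish time
run_time = timedelta(hours=3)
startup_buffer = timedelta(minutes=30)    # time for trigger + optimisation step
rerun_buffer = timedelta(minutes=30)      # allowance for re-running after a failure

earliest_start = not_before + startup_buffer        # 00:30
latest_start = deadline - run_time - rerun_buffer   # 04:30

print(f"Window: {earliest_start:%H:%M} to {latest_start:%H:%M}")  # Window: 00:30 to 04:30
```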
The other variable is the location where your workload is running. In the UK we have two Azure regions - UK South (London) and UK West (Wales). Knowing the region that we're going to run the pipeline in allows us to get region-specific carbon data so that we can optimise the scheduling accordingly. Most likely this is something that's set at deployment time rather than a dynamic run-time parameter, but there's nothing stopping you from deploying your workloads into multiple locations (you might be doing that already from a DR perspective) and deciding at runtime which location is optimal (more on this later...)
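The Carbon Intensity API exposes regional forecasts keyed by a region ID, so the Azure region can be mapped to the corresponding API region. The IDs below are my reading of the API's regional list and should be treated as assumptions to verify against the documentation:

```python
# Hypothetical mapping from Azure region to Carbon Intensity API region ID.
# Verify the IDs against the Carbon Intensity API's regional documentation.
AZURE_REGION_TO_CARBON_REGION_ID = {
    "uksouth": 13,  # UK South (London) -> API "London" region
    "ukwest": 7,    # UK West (Wales) -> API "South Wales" region
}
```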
Optimisation pipeline
This is another pipeline, one that contains activities to call the Carbon Intensity API and process the response. The output variable from this pipeline will be the optimum start time at which to trigger the workload pipeline.
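To illustrate the logic this pipeline needs to implement, here's a minimal Python sketch (as you might run from a notebook activity) rather than a Data Factory definition. It slides the workload's duration across the half-hourly forecast blocks and picks the start time with the lowest average forecast intensity. The regional endpoint path and response shape reflect my understanding of the public Carbon Intensity API, so check them against the current documentation before relying on them.

```python
import math
from datetime import datetime, timedelta

import requests

CARBON_API = "https://api.carbonintensity.org.uk"

def optimal_start_time(region_id: int, earliest: datetime, latest: datetime,
                       duration: timedelta) -> datetime:
    """Return the start time in [earliest, latest] with the lowest average
    forecast carbon intensity over the workload's duration."""
    # The forecast needs to cover the latest possible finish time.
    url = (f"{CARBON_API}/regional/intensity/"
           f"{earliest:%Y-%m-%dT%H:%MZ}/{latest + duration:%Y-%m-%dT%H:%MZ}/"
           f"regionid/{region_id}")
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    blocks = response.json()["data"]["data"]  # half-hourly forecast blocks

    slots_needed = math.ceil(duration / timedelta(minutes=30))
    best_start, best_avg = None, float("inf")

    # Slide a window of the required length across the forecast blocks.
    for i in range(len(blocks) - slots_needed + 1):
        start = datetime.strptime(blocks[i]["from"], "%Y-%m-%dT%H:%MZ")
        if not (earliest <= start <= latest):
            continue
        window = blocks[i:i + slots_needed]
        avg = sum(b["intensity"]["forecast"] for b in window) / slots_needed
        if avg < best_avg:
            best_start, best_avg = start, avg

    return best_start
```

In Data Factory terms, the same logic could map onto a Web activity to call the API followed by activities (or a single notebook/script activity) that pick the lowest-carbon slot and write it to the pipeline's output variable.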
Pipeline trigger
This is the thing that kicks the whole process off, and in Data Factory terms this means a pipeline trigger, set to run at the earliest start time you've defined in your workload metadata. It will pass all the necessary metadata parameters into the child pipelines to run the end-to-end process.
Pipeline definition
So now we have a conceptual architecture that can be delivered on a pipeline-based analytics platform. In the next post, I'll walk through the pipeline definition in detail, and provide a full code sample for you to use in your own workloads!