By James Broome, Director of Engineering
Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Next Steps)

Over this series of posts, we've gone from introducing the concept of optimising data pipelines based on carbon impact, to describing how it would fit within a modern cloud architecture, to making it real within Azure Data Factory-based pipelines (including Synapse or Fabric).

In the previous post I shared a sample codebase that illustrated a starting point that could be lifted and shifted into Data Factory to develop further or modify according to the specifics of your own requirements.

I highlighted that it was a deliberate decision to model the end-to-end process in a set of pipeline definitions, using out-of-the-box activities for the processing logic for ease of re-use and portability.

But I also recognised that some of the design choices I made with simplicity in mind introduce limitations.

In this final post in the series, I'll explore those decisions in more detail and provide some thoughts about how they could be implemented differently to add even more value through flexibility and extensibility.

Single deployment region

Knowing the location of the data centre in which your workload is going to run allows us to query the Carbon Intensity forecast API for the specific UK region. In Azure, this means either UK South (London) or UK West (Cardiff).

In my example, I assumed that we'd deployed into a single Azure region, so we know what the region ID parameter for the API will be at deployment time and can pass it into the Data Factory pipeline definition via tokenised templates, for example.
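To make that concrete, here's a rough sketch (in Python, purely for illustration - the real solution does this with pipeline activities) of querying the Carbon Intensity regional forecast endpoint for a known region. The mapping of Azure regions to Carbon Intensity region IDs and the exact response shape shown here are my assumptions, so check the API documentation before relying on them.

```python
# Illustrative sketch: fetch the 48-hour regional carbon intensity forecast for
# a known Azure region. Region IDs (13 = London for UK South, 7 = South Wales
# for UK West) and the nested response shape are assumptions - verify against
# the Carbon Intensity API documentation.
from datetime import datetime, timezone

import requests

CARBON_INTENSITY_API = "https://api.carbonintensity.org.uk"

# Assumed mapping of Azure region to Carbon Intensity API region ID
AZURE_REGION_TO_CI_REGION = {
    "uksouth": 13,  # London
    "ukwest": 7,    # South Wales
}


def get_forecast(azure_region: str) -> list[dict]:
    """Return the forecast windows (each with 'from', 'to' and 'intensity') for a region."""
    region_id = AZURE_REGION_TO_CI_REGION[azure_region]
    from_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    url = f"{CARBON_INTENSITY_API}/regional/intensity/{from_time}/fw48h/regionid/{region_id}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()["data"]["data"]
```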

However, to take things a step further, we could deploy a Data Factory instance into both regions (that won't incur any costs unless we start using them) and decide which is the right one to use dynamically at runtime.

This would involve modifying the first step in the process to query the Carbon Intensity API for both regions (the Carbon Optimiser Pipeline would still need to run in a known region each time) and find the optimum carbon intensity window across both sets of data.

Once we know the lowest-impact time and region, we can use that to schedule and trigger accordingly.
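Sticking with the illustrative Python sketch above, finding the lowest-intensity window across both regions is then just a comparison over the two sets of forecast data (a fuller implementation may also need to account for the expected duration of the workload, but this shows the shape of the logic):

```python
# Illustrative sketch: pick the region and forecast window with the lowest
# forecast carbon intensity, given forecasts in the shape returned by the
# get_forecast sketch above.
def find_optimum_window(forecasts: dict[str, list[dict]]) -> tuple[str, dict]:
    """Return (azure_region, window) with the lowest forecast carbon intensity."""
    best_region, best_window = None, None
    for region, windows in forecasts.items():
        for window in windows:
            if (best_window is None
                    or window["intensity"]["forecast"] < best_window["intensity"]["forecast"]):
                best_region, best_window = region, window
    return best_region, best_window


# Example usage:
# region, window = find_optimum_window({
#     "uksouth": get_forecast("uksouth"),
#     "ukwest": get_forecast("ukwest"),
# })
```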

This would also require a loose coupling between the optimiser pipeline and the workload pipeline, and that's fairly easily achieved by using the Data Factory API. If the Data Factory managed identity has access to both workspaces, then it can trigger a pipeline run using the API, and also poll for the status of that pipeline run until it completes.
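As a sketch of what that loose coupling could look like (again in Python for illustration - in the pipelines themselves this would be Web activities using managed identity authentication), the Data Factory REST API exposes a createRun operation for pipelines and a pipeline runs endpoint for checking status. The subscription, resource group, factory and pipeline names below are placeholders, and Synapse and Fabric have their own flavours of these endpoints.

```python
# Illustrative sketch: trigger a pipeline run in another factory via the Data
# Factory REST API and poll until it reaches a terminal state. All resource
# names are placeholders.
import time

import requests
from azure.identity import DefaultAzureCredential

MANAGEMENT = "https://management.azure.com"
API_VERSION = "2018-06-01"


def run_pipeline(subscription_id: str, resource_group: str, factory: str,
                 pipeline: str, parameters: dict) -> str:
    """Trigger a pipeline run and block until it completes, returning its final status."""
    token = DefaultAzureCredential().get_token(f"{MANAGEMENT}/.default").token
    headers = {"Authorization": f"Bearer {token}"}
    base = (f"{MANAGEMENT}/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.DataFactory/factories/{factory}")

    # Create the run, passing any pipeline parameters in the request body
    run = requests.post(f"{base}/pipelines/{pipeline}/createRun?api-version={API_VERSION}",
                        headers=headers, json=parameters, timeout=30)
    run.raise_for_status()
    run_id = run.json()["runId"]

    # Poll the run status until it completes
    while True:
        status = requests.get(f"{base}/pipelineruns/{run_id}?api-version={API_VERSION}",
                              headers=headers, timeout=30).json()["status"]
        if status in ("Succeeded", "Failed", "Cancelled"):
            return status
        time.sleep(60)
```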


In this scenario, we'd probably need to know the IDs of the workspaces and names/IDs of the pipeline definitions at deployment time so that we can query the Data Factory API accordingly. But once this pattern is in place, it's also very extensible - e.g. if there was an equivalent API for analysing carbon intensity in other geographical locations (say, the US) then more deployment regions could easily be added into the mix.

Workload-specific trigger pipelines

In my example, I created a Trigger Workload Pipeline definition that "wrapped" the end-to-end process of calling the optimisation pipeline and the workload pipeline. To keep things simple, this wrapper pipeline was workload-specific - I'd hard-coded it to the workload pipeline by using the ExecutePipeline activity.

Unfortunately, there's no way to do cross-workspace pipeline execution using the ExecutePipeline activity, but using the same principle as described above - making things loosely coupled with the Data Factory REST APIs - this is very much achievable.

At the minute, I'd need to add a new Trigger Workload Pipeline for every workload definition, which seems unnecessary. Refactoring this to use the Data Factory REST API rather than the ExecutePipeline activity means we could have a single instance of this wrapper pipeline that's more dynamic and parameter-driven.

In reality, we'd probably still end up with workload-specific Triggers (that pass the specific parameter values for the workload into the process), but we've reduced the number of boilerplate pipeline definitions needed, which makes for easier maintenance.
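Purely as an illustration of what that might look like, a workload-specific trigger could pass a small set of parameter values into that single generic wrapper - all of the names and parameter keys below are hypothetical, not part of the sample codebase:

```python
# Hypothetical parameter values that a workload-specific trigger might pass
# into a single, generic Trigger Workload Pipeline, which would then use the
# REST API call sketched above rather than a hard-coded ExecutePipeline activity.
sales_ingest_run = {
    "workloadFactoryName": "df-workloads-uksouth",    # placeholder factory name
    "workloadPipelineName": "Ingest Sales Data",      # placeholder pipeline name
    "workloadParameters": {"loadDate": "2024-06-01"}  # passed through to the workload
}
```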

Activity-based processing logic

As previously mentioned, I took a deliberate decision to model the end-to-end process in a set of pipeline definitions, using out-of-the-box activities for ease of re-use and portability. However, some of the XPath expressions are quite complicated and a lot of interim variables are needed, which means the solution can be hard to understand, as well as hard to debug.


There are clearly other ways to implement this functionality - for example, calling into Azure Functions, or using Spark notebooks for programmatic logic. This would make the solution easier to test, document and manage via version control; however, it also adds architectural complexity that might not be justified. Going down the Spark route also adds a performance penalty (one that's arguably minimised in Fabric), but in Synapse this could add minutes to the end-to-end processing time.
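For example, the same window-finding logic could sit behind an Azure Function and be called from a Web activity - sketched below using the Python v2 programming model, with the route and payload shape being my assumptions rather than anything from the sample codebase:

```python
# Illustrative sketch: host the window-finding logic in an Azure Function
# (Python v2 programming model) so it can be unit tested and version controlled
# like any other code. Route name and payload shape are assumptions.
import json

import azure.functions as func

app = func.FunctionApp()


@app.route(route="optimum-window", auth_level=func.AuthLevel.FUNCTION)
def optimum_window(req: func.HttpRequest) -> func.HttpResponse:
    # Expects a body like {"<region>": [{"from": ..., "intensity": {"forecast": ...}}, ...]}
    forecasts = req.get_json()
    best_region, best_window = None, None
    for region, windows in forecasts.items():
        for window in windows:
            if best_window is None or window["intensity"]["forecast"] < best_window["intensity"]["forecast"]:
                best_region, best_window = region, window
    return func.HttpResponse(
        json.dumps({"region": best_region, "window": best_window}),
        mimetype="application/json",
    )
```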

The advantage of doing this all in pipelines is that it's quick to run and requires no additional infrastructure. The whole purpose of this solution is to minimise the carbon impact of long-running cloud data pipelines, so it would seem counterintuitive to rely on heavyweight processes and additional infrastructure to perform this optimisation. It's certainly a hard one to quantify, but despite the complications with the activity-based processing, it feels like this lightweight approach is in line with the spirit of what we're trying to achieve.

However, if you're taking this approach into an environment that's already dependent on other architectural building blocks then there may be advantages to re-engineering some of the core processes to use them. As ever, context is everything.

Wait-based scheduling

Finally, in my example, I used a simple Wait activity to provide the scheduling mechanism for our workload pipeline. If the process starts at 5am, and the optimum trigger time is 7am, then the pipeline will remain on the Wait activity for 2 hours.
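The arithmetic behind that is trivial - the wait duration is just the gap between the current time and the optimum start time. In the pipeline it's expressed with the built-in expression language, but as a Python illustration:

```python
# Illustrative sketch of the Wait duration calculation: seconds between "now"
# and the start of the optimum window (never negative).
from datetime import datetime, timezone


def wait_seconds(optimum_start_utc: str) -> int:
    """Seconds to wait from now until the optimum window starts (0 if already past)."""
    start = datetime.fromisoformat(optimum_start_utc.replace("Z", "+00:00"))
    return max(0, int((start - datetime.now(timezone.utc)).total_seconds()))


# e.g. at 05:00 UTC, wait_seconds("2024-06-01T07:00Z") is roughly 7200 (2 hours)
```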

In the spirit of optimisation, this is a lightweight solution, requiring no additional infrastructure and capitalising on the fact that Data Factory already has built-in polling mechanisms to allow it to work.

However, it isn't ideal, especially from a monitoring perspective, as it gives misleading information about how long things are actually taking to run. It might also cause problems with timeouts, depending on the timescales involved. And there's some compute overhead needed for Data Factory to run this Wait activity. My hunch is that, with all the shared infrastructure and the inherent polling mechanism at the core of Data Factory, relying on this approach still results in an optimised end-to-end process in terms of carbon impact, but the truth is that it's hard to quantify.

A "better" solution might be to make this loosely coupled too - once we have the ideal trigger time, store that somewhere/somehow that allows for an asynchronous scheduling mechanism. It's hard to think of a solution to this without introducing additional architecture/processes and, ultimately, whatever solution is designed, there's going to be some form of compute needed and polling under the covers.

So, whether this improves things or results in a less optimal solution largely depends on the context.

Conclusion

In this series of posts, I've described how we can design intelligent data pipeline scheduling - using carbon forecasting to optimise the time window in which our processing should run, according to the geographical region in which it is running. I've shared code samples for pipeline definitions that implement the conceptual approaches I've described, and that can be lifted and shifted into your cloud data pipeline environments.

And finally I've explored limitations and potential future improvements to the design - it's clearly not perfect, but shows what's possible given the currently available carbon data.

As this landscape evolves, I'll revisit this approach and share any updates that improve the solution!

James Broome, Director of Engineering

James has spent 20+ years delivering high quality software solutions addressing global business problems, with teams and clients across 3 continents. As Director of Engineering at endjin, he leads the team in providing technology strategy, data insights and engineering support to organisations of all sizes - from disruptive B2C start-ups, to global financial institutions. He's responsible for the success of our customer-facing project delivery, as well as the capability and growth of our delivery team.