By James Dawson, Principal I
Using Cloud CI/CD in Zero Trust Environments

If you've been involved with agile software development for ~20 years then you can probably remember what was then considered the holy grail - Continuous Integration aka 'CI'. Back then, achieving CI relied on setups that now look very Heath Robinson. However, regardless of how polished (or otherwise) your CI process was, it generally ran on your 'build server', sometimes also known as 'the spare PC in the corner'!

Whether you had a dedicated server-grade machine or not, it was still a machine that you had to manage and ensure that it was configured appropriately to build your software - perhaps you used tools like CruiseControl or Draco.NET as the basis for running your build?

Nowadays with the plethora of cloud-based CI/CD services, including free tiers and consumption-based pricing, unless you have very specialised requirements most organisations would think twice about the costs and maintenance overhead of running their own farm of CI/CD servers. With a cloud-based service your provider takes care of managing the fleet of servers required to run all its customers' pipelines - from your infrastructure's perspective these CI/CD servers are just random computers on the Internet.

DevOps environments are increasingly becoming an attack surface utilised by threat actors, whether for gaining unauthorised access (as per the recent LastPass incidents) or for crypto mining attacks.

Such is the scale of these cloud-based services that a given customer has no way of knowing which server is going to run its pipeline, other than perhaps which region of the globe it is located in.

For many scenarios this is not an issue. However, it can become a problem when your pipelines need to connect to infrastructure where networking restrictions limit traffic from the Internet. In this situation, the set of all possible CI/CD servers that your provider could use to run your pipeline is essentially indistinguishable from any arbitrary Internet-connected server. Whilst the service provider may publish details of all the potential IP address ranges that their servers can use, adding those ranges to some form of permanent allow-list is... ill-advised! Such a list would be:

  • Unmaintainable - typically these IP address ranges are vast (accounting for all possible egress IP addresses from your provider's datacentres) and they change over time, so you would need to ensure that your mammoth allow-list was always up-to-date
  • Insecure - by adding all your provider's servers to your allow-list, you are not only allowing your pipelines to connect to your infrastructure but also the pipelines of anyone else who uses the same provider. Admittedly, such malevolent actors would still need to authenticate to those infrastructure resources; but arguably if you were happy with just identity-based safeguards, then you probably wouldn't be employing the network-based restrictions in the first place.

Such network restrictions can be implemented using one or both of the following approaches:

  1. Disable open access to a service via its public IP address, defining an explicit list of allowed public IP addresses
  2. Connect the service to a private network (e.g. Azure Virtual Network) so it can be accessed via an internal IP address
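
As a rough illustration of the first approach, the sketch below locks down the public endpoint of an Azure Storage Account using the Az.Storage PowerShell module. The resource names and the allowed address (taken from a documentation-only range) are placeholders, and it assumes the session has already been authenticated with Connect-AzAccount.

```powershell
# Assumes the Az.Storage module is installed and Connect-AzAccount has been run.
# 'myapp-rg', 'mystorageaccount' and the IP address are illustrative placeholders.
$storage = @{
    ResourceGroupName = 'myapp-rg'
    Name              = 'mystorageaccount'
}

# Allow a known public IP address first, so the deny-by-default change below doesn't lock it out
Add-AzStorageAccountNetworkRule @storage -IPAddressOrRange '203.0.113.10'

# Deny traffic to the public endpoint unless it matches an explicit rule
Update-AzStorageAccountNetworkRuleSet @storage -DefaultAction Deny
```

Other resource types expose the same idea through different cmdlets and rule formats, which becomes relevant later on.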

The general advice from cloud-based CI/CD services for the above scenarios is to use your own private CI/CD agents, that you can ensure have either a static public IP address or can be connected to the private network. However, this is less than ideal in a number of ways:

  • Increased infrastructure costs
  • Increased management overhead to maintain the above, in terms of both uptime and keeping the various software versions up-to-date
  • Scalability can be less flexible as you need to decide how many agents you require - although this can be less of an issue if the service provider has some elasticity features (e.g. using Azure Virtual Machine Scale Sets)

If you do go the private agent route, ideally there would be a way to retain the on-demand benefits of the native cloud agents, combined with consumption-based pricing so you only pay for what you use, as opposed to carrying the standing cost of the virtual machine(s) whether or not they are actively running CI/CD workloads. There are, however, a couple of counter-points to this:

  • You already run a Kubernetes cluster, in which case running your private agents from there could be an option - taking Azure Pipelines as an example
  • Your service provider supports container-based elastic agents - non-cloud-based tools have supported this for a long time (e.g. TeamCity, Jenkins), but taking Azure Pipelines as an example, this is the closest I've come to finding a first-party solution.

OK, so we've established the following:

  • It can be desirable to restrict public network access to cloud resources
  • This can make CI/CD tricky when running on cloud-hosted agents
  • The typical recommended solution can be costly in terms of time and money

So what's another option? Often only a small part of the CI/CD process is impacted by these network restrictions. Continuing with the Azure example, this is dependent upon whether the action the CI/CD process needs to take is performed using the Control Plane or the Data Plane. When you talk to the control plane you are connecting to the Azure platform API services that Microsoft host on the public internet, whereas when you perform data plane operations you are connecting to the API services run by your provisioned Azure resources.

Control plane examples:

  • Provisioning an Azure Storage Account
  • Provisioning an Azure App Service
  • Configuring network restrictions on a Storage Account

Data plane examples:

  • Reading or writing data from a storage account
  • Deploying the latest web application code to an App Service

Also, blurring this boundary somewhat, the control plane can also offer mechanisms for making certain data plane changes. For example:

  • Creating a new storage account container
  • Updating a key vault access policy
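
To make the distinction concrete, here is a hedged sketch that creates a storage container in two ways using the Az.Storage module. The resource and container names are placeholders, and it assumes an authenticated session whose identity holds both the management permissions and a suitable data role on the account.

```powershell
# Assumes the Az.Storage module is installed and Connect-AzAccount has been run.
# All resource and container names below are illustrative placeholders.

# Control plane: the request goes to the Azure platform API (Azure Resource Manager),
# so the storage account's own network restrictions are not in the path
New-AzRmStorageContainer -ResourceGroupName 'myapp-rg' `
                         -StorageAccountName 'mystorageaccount' `
                         -Name 'container-via-control-plane'

# Data plane: the request goes to the storage account's own endpoint,
# so it is subject to whatever network restrictions that account enforces
$ctx = New-AzStorageContext -StorageAccountName 'mystorageaccount' -UseConnectedAccount
New-AzStorageContainer -Name 'container-via-data-plane' -Context $ctx
```

From a cloud-hosted agent, the first call would be expected to work against a locked-down account, whereas the second would be rejected until the agent's address has been allowed.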

The solution I'm going to propose here is a Just-in-Time (JIT) approach, whereby your CI/CD workloads are able to dynamically grant themselves temporary network access prior to performing the data plane operations.

The third control plane example above is a bit of an "odd one out" compared to the other examples. However, I included it because it is vital to this approach. Configuring the network access controls for Azure resources is performed via the control plane, which means our cloud-hosted CI/CD agents will be able to do this (assuming they are running under an identity with the relevant permissions).

The high-level process is illustrated below:

flowchart TD
    pre(Run Unrestricted Steps) --> ip(Lookup Agent Public IP)
    ip --> jit(Add Allow Rules)
    jit --> do(Run Restricted Steps)
    do --> error{Errors?}
    error -- yes --> handle(Handle Errors)
    error -- no --> unjit(Remove Rules)
    handle --> unjit
    unjit --> post(Run Remaining Steps)

Let's look at some of those steps in more detail.

Lookup CI/CD Agent Public IP

This is the first pre-requisite for our JIT approach. Whilst we can look up the local IP of the CI/CD agent we're running on, it's very likely that this will be an internal IP address as the machine is behind a network address translation (NAT) device.

There are many web sites out there that offer a service of telling you the public IP address that you are seen as originating from - I find ifconfig.io a reliable option that is easy to work with from a script:

Invoke-RestMethod -Uri https://ifconfig.io
90.255.167.45

I prefer Invoke-RestMethod over some of the other options in PowerShell as it gives simpler output for this scenario - see what happens when using Invoke-WebRequest:

Invoke-WebRequest -Uri https://ifconfig.io

StatusCode        : 200
StatusDescription : OK
Content           : 90.255.167.45

RawContent        : HTTP/1.1 200 OK
                    Date: Tue, 14 Feb 2023 13:02:44 GMT
                    Connection: keep-alive
                    CF-Cache-Status: DYNAMIC
                    Server-Timing: cf-q-config;dur=7.0000005507609e-06
                    Report-To: {"endpoints":[{"url":"https:\/\/a…
Headers           : {[Date, System.String[]], [Connection, System.String[]], [CF-Cache-Status, System.String[]], [Server-Timing, System.String[]]…}
Images            : {}
InputFields       : {}
Links             : {}
RawContentLength  : 14
RelationLink      : {}
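
Since the value returned by the lookup ends up in a firewall rule, it may be worth validating it before use. The sketch below is one way to do that; the function name ConvertTo-ValidatedIPv4Address is hypothetical, and it assumes the agent egresses over IPv4.

```powershell
# Hypothetical helper: checks that a lookup response is a plain dotted-quad IPv4 address
function ConvertTo-ValidatedIPv4Address {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [string] $Value
    )

    # The lookup response typically carries a trailing newline
    $candidate = $Value.Trim()

    $parsed = $null
    $isIp = [System.Net.IPAddress]::TryParse($candidate, [ref] $parsed)

    # TryParse also accepts shorthand such as '1', so require an exact round-trip
    if (-not $isIp -or
        $parsed.AddressFamily -ne [System.Net.Sockets.AddressFamily]::InterNetwork -or
        $parsed.ToString() -ne $candidate) {
        throw "Expected an IPv4 address but received '$candidate'"
    }

    return $candidate
}

# Usage (requires Internet access, so shown as a comment):
#   $agentIp = ConvertTo-ValidatedIPv4Address -Value (Invoke-RestMethod -Uri https://ifconfig.io)
```

Failing early here means an unexpected response (an HTML error page, say) stops the pipeline before any rule is created.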

Add Allow Rules

This is where the temporary network access is granted to the public IP address retrieved in the previous step. Unfortunately, from an automation perspective, how this step is achieved is entirely dependent on the cloud resource that you need access to.

Not only are the underlying API calls different, but also the representation of such access rules can be different between resource types, for example:

  • Can you specify a single IP address?
  • Are IP address ranges represented in CIDR notation or using a start and end address?
  • Can rules be named? If so, what are the naming constraints?
  • Can rules have additional descriptive text?
  • How are rules ordered relative to others?

This means we will require multiple implementations depending on which resource types we need to support, based on how their network restrictions functionality has been implemented.

Where possible we should name these temporary rules such that we can easily identify them for when they need to be removed.
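
The differences are easier to see side by side. The following is a minimal sketch using the Az PowerShell cmdlets for three resource types; the resource names, the rule name and the priority value are all illustrative assumptions rather than recommendations.

```powershell
# Assumes the Az.Storage, Az.Websites and Az.Sql modules are installed and that
# Connect-AzAccount has been run under an identity with suitable permissions.
# All resource names, the rule name and the priority are illustrative placeholders.
$agentIp  = (Invoke-RestMethod -Uri https://ifconfig.io).Trim()
$ruleName = 'temp-cicd-agent'   # hypothetical naming convention

# Storage Account: rules are unnamed and take a bare IP address (or a CIDR range)
Add-AzStorageAccountNetworkRule -ResourceGroupName 'myapp-rg' `
                                -Name 'mystorageaccount' `
                                -IPAddressOrRange $agentIp

# App Service (SCM site): rules are named, prioritised and use CIDR notation
Add-AzWebAppAccessRestrictionRule -ResourceGroupName 'myapp-rg' `
                                  -WebAppName 'mywebapp' `
                                  -Name $ruleName `
                                  -Priority 100 `
                                  -Action Allow `
                                  -IpAddress "$agentIp/32" `
                                  -TargetScmSite

# SQL Server: rules are named and use a start and end address
New-AzSqlServerFirewallRule -ResourceGroupName 'myapp-rg' `
                            -ServerName 'mysqlserver' `
                            -FirewallRuleName $ruleName `
                            -StartIpAddress $agentIp `
                            -EndIpAddress $agentIp
```

Three resource types give three different answers to the questions listed above, which is why a per-resource implementation is hard to avoid.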

Handling Errors

Given the nature of the security changes we are making, it is critical that this process takes as much care as possible to fail safely. The last thing we want is for an unexpected error to result in this temporary network access remaining active after the CI/CD process has finished.

Whilst we still need to handle the error in the same way as usual, so we can troubleshoot what went wrong, this must not prevent the next step from running. We can think of this as adding a try/catch/finally semantic to our CI/CD process.

"A common usage of catch and finally together is to obtain and use resources in a try block, deal with exceptional circumstances in a catch block, and release the resources in the finally block."

How you achieve this will largely depend on whether you are orchestrating the CI/CD process via a script or via the workflow engine of your CI/CD platform. In the former case, assuming your scripting language supports it, you will be able to use the native try/catch/finally language features.
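
For the script-orchestrated case, that semantic could be wrapped up along the following lines. This is only a sketch: Invoke-WithTemporaryAccess is a hypothetical helper, and the three script blocks stand in for whatever grant, deploy and revoke logic your process needs.

```powershell
# Hypothetical helper demonstrating try/catch/finally around a temporary grant
function Invoke-WithTemporaryAccess {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)] [scriptblock] $Grant,
        [Parameter(Mandatory)] [scriptblock] $Action,
        [Parameter(Mandatory)] [scriptblock] $Revoke
    )

    # Treat non-terminating errors as terminating so they reach the catch block
    $ErrorActionPreference = 'Stop'

    try {
        # The grant sits inside the try so that a partially-applied grant is still revoked
        & $Grant
        & $Action
    }
    catch {
        # Record the failure for troubleshooting, then re-throw so the run still fails
        Write-Warning "Restricted step failed: $($_.Exception.Message)"
        throw
    }
    finally {
        # Always runs, whether the steps above succeeded or failed
        & $Revoke
    }
}

# Usage with stand-in script blocks:
#   Invoke-WithTemporaryAccess -Grant { 'grant access' } -Action { 'deploy' } -Revoke { 'revoke access' }
```

One consequence of this shape is that the revoke logic may run when nothing was granted, so it should tolerate the rule being absent.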

Where you are orchestrating the process in the CI/CD pipeline definition, you will need to rely on the conditional support provided by your CI/CD platform to ensure that the next step is always run, even if a previous step has failed. For example:

Azure DevOps

- task: AzurePowerShell@5
  condition: always()
  displayName: 'Remove temporary network access'
  ...

GitHub Actions

- name: 'Remove temporary network access'
  if: always()
  run: |
    ...

Remove Rules

Finally we have the particularly security-sensitive step of removing the network access that was granted above. This is to ensure that the next CI/CD job that uses the same agent doesn't inherit the network access to your resources.

As previously mentioned this step must always run regardless of whether the pipeline has succeeded or failed - this step is what the finally block discussed above needs to implement.

Creating the temporary network access rules with a distinctive name can simplify this step, by making it easy to identify which rule(s) need to be deleted. It also has the advantage of making it possible to detect orphaned rules from previous runs that were, despite our best efforts, not deleted for some reason.

Where a particular cloud resource doesn't support naming rules, we will have to fall back to finding the rule by cross-referencing the IP address. However, this approach won't help us catch those orphaned rules that may have slipped through the net on previous runs.
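
The two removal strategies might look something like the sketch below, which uses the Az PowerShell cmdlets. The 'temp-cicd-' prefix is a hypothetical naming convention and the resource names are placeholders.

```powershell
# Assumes the Az.Sql and Az.Storage modules are installed and Connect-AzAccount has been run.
# The rule-name prefix and all resource names are illustrative placeholders.
$rulePrefix = 'temp-cicd-'

# Named rules (SQL Server): remove everything matching the naming convention,
# which also sweeps up orphaned rules left behind by earlier runs
Get-AzSqlServerFirewallRule -ResourceGroupName 'myapp-rg' -ServerName 'mysqlserver' |
    Where-Object { $_.FirewallRuleName -like "$rulePrefix*" } |
    ForEach-Object {
        Remove-AzSqlServerFirewallRule -ResourceGroupName 'myapp-rg' `
                                       -ServerName 'mysqlserver' `
                                       -FirewallRuleName $_.FirewallRuleName `
                                       -Force
    }

# Unnamed rules (Storage Account): all we can match on is the address itself,
# so only the rule for the current agent's IP is removed
$agentIp = (Invoke-RestMethod -Uri https://ifconfig.io).Trim()
Remove-AzStorageAccountNetworkRule -ResourceGroupName 'myapp-rg' `
                                   -Name 'mystorageaccount' `
                                   -IPAddressOrRange $agentIp
```

Note that a prefix-based sweep would also remove rules belonging to other pipelines that are mid-run against the same resource, so including something run-specific in the name may be wiser where pipelines can overlap.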

Caveats & Warnings

It's fair to say that the above approach is not foolproof and whilst it offers a solution for how to reduce your infrastructure's attack surface without the overhead of running your own private CI/CD agents, it may not provide the level of security assurance that you need.

  • The public IP address may not be unique to 'your' CI/CD agent, so you could potentially be granting access from multiple agents that share the same public IP address (i.e. behind a NAT device).
  • It assumes that the CI/CD workload (or at least the 'Add Rule' and 'Remove Rules' steps) is run in the context of an identity that has the necessary control plane permissions to modify the cloud resources in question.
  • As previously mentioned, be aware that a CI/CD pipeline can itself be an attack vector. If a bad actor can modify the pipeline definition (e.g. in an open pull request) then they may be able to exploit the elevated network access and/or infrastructure permissions for their own ends. That said, such issues could equally apply when using private agents.

Considering all the above points, you will have to decide whether the time-limited nature of the network access and the identity-based safeguards enforced by the data plane are adequate mitigations for your scenario.

Managing Temporary Network Access to Azure Resources with PowerShell

Endjin maintain over 50 open source projects, most of which are .NET libraries; however, we also have some PowerShell modules that we make available to the community.

One such module is called Corvus.Deployment and acts as a bit of a Swiss Army knife for various Azure deployment scenarios. We recently added some new functionality which provides a pluggable mechanism for implementing the 'Add Rule' and 'Remove Rules' steps discussed above.

Currently it supports a few resource types, with what we call 'handler' implementations for the following:

  • Azure Storage Account
  • Azure App Service
  • Azure SQL Database

Check the module's repo to see whether more have been added since this post was written. If a resource you need is not yet supported, the repo also contains some notes on how to write your own resource type implementations; and we would happily accept a PR!

This functionality is exposed via a single cmdlet, Set-CorvusTemporaryAzureResourceNetworkAccess.

Rules can be added by calling the cmdlet as shown by the examples below.

  • Azure App Service (main web site)
    Set-CorvusTemporaryAzureResourceNetworkAccess -ResourceType WebApp -ResourceGroupName myapp-rg -ResourceName mywebapp
    
  • Azure App Service deployment backend (i.e. the 'SCM' site, aka the Kudu service)
    Set-CorvusTemporaryAzureResourceNetworkAccess -ResourceType WebAppScm -ResourceGroupName myapp-rg -ResourceName mywebapp
    
  • Azure Storage Account
    Set-CorvusTemporaryAzureResourceNetworkAccess -ResourceType StorageAccount -ResourceGroupName myapp-rg -ResourceName mystorageaccount
    
  • Azure SQL Database
    Set-CorvusTemporaryAzureResourceNetworkAccess -ResourceType SqlServer -ResourceGroupName myapp-rg -ResourceName mysqlserver
    

The rules created by the above commands can be removed simply by adding the -Revoke switch, for example:

Set-CorvusTemporaryAzureResourceNetworkAccess -ResourceType WebAppScm -ResourceGroupName myapp-rg -ResourceName mywebapp -Revoke

The above implementations use the Azure PowerShell modules to interact with the cloud resources, but that decision is entirely down to how the handler is written.

Before we wrap-up this post, let's look at a couple of more realistic, if simple, use cases based on the orchestration options mentioned above.

We won't cover it in detail here, but the Corvus.Deployment module incorporates an approach to managing Azure authentication (Azure PowerShell and/or Azure CLI) that aims to validate that the current session is both authenticated and connected to the correct subscription and tenant. This is handled by the Connect-CorvusAzure cmdlet you see below.

Script Orchestration

Below is a script fragment for an imagined deployment process that focusses on demonstrating how this command can be used to facilitate deploying an application to Azure App Service that has locked-down network access.

Install-Module Corvus.Deployment -Scope CurrentUser
Import-Module Corvus.Deployment
Connect-CorvusAzure -SubscriptionId <mySubGuid> -AadTenantId <myTenantGuid>

# Run the steps that do not require additional network access
doUnrestrictedSteps

# Setup temporary network access to enable deploying the latest web app code
$tempNetAccessSplat = @{
  ResourceType = "WebAppScm"
  ResourceGroupName = "myapp-rg"
  ResourceName = "mywebapp"
}
Set-CorvusTemporaryAzureResourceNetworkAccess @tempNetAccessSplat

try {
  # This requires access to the App Service data plane (i.e. Kudu)
  Publish-AzWebApp -ResourceGroupName $tempNetAccessSplat.ResourceGroupName `
                   -Name $tempNetAccessSplat.ResourceName `
                   -ArchivePath "mywebapp.zip"
}
finally {
  # Ensure the temporary network access is removed
  Set-CorvusTemporaryAzureResourceNetworkAccess @tempNetAccessSplat -Revoke
}

# Run any remaining steps that do not require additional network access
doFinalDeploySteps

Pipeline Orchestration

Below is an equivalent implementation, where the process is being orchestrated within an Azure DevOps pipeline.

jobs:
- job: deploy
  steps:

  #
  # Add your unrestricted tasks here
  # 

  - task: AzurePowerShell@5
    name: GrantAgentNetworkAccess
    displayName: 'Grant agent temporary network access'
    inputs:
      azureSubscription: 'MyAzureRmServiceConnection'
      scriptType: InlineScript
      pwsh: true
      inline: |
        Install-Module Corvus.Deployment -Scope CurrentUser
        Import-Module Corvus.Deployment
        Connect-CorvusAzure -SubscriptionId <mySubGuid> -AadTenantId <myTenantGuid>
        Set-CorvusTemporaryAzureResourceNetworkAccess `
          -ResourceType WebAppScm `
          -ResourceGroupName myapp-rg `
          -ResourceName mywebapp
  
  - task: AzureRmWebAppDeployment@4
    displayName: 'Deploy web app to Azure App Service'
    inputs:
      ConnectionType: 'AzureRM'
      azureSubscription: 'MyAzureRmServiceConnection'
      appType: 'webApp'
      WebAppName: '$(webAppName)'
      packageForLinux: '$(System.ArtifactsDirectory)/webapp'
  
  - task: AzurePowerShell@5
    condition: always()
    name: RevokeAgentNetworkAccess
    displayName: 'Remove agent temporary network access'
    inputs:
      azureSubscription: 'MyAzureRmServiceConnection'
      scriptType: InlineScript
      pwsh: true
      inline: |
        Import-Module Corvus.Deployment
        Connect-CorvusAzure -SubscriptionId <mySubGuid> -AadTenantId <myTenantGuid>
        Set-CorvusTemporaryAzureResourceNetworkAccess `
          -ResourceType WebAppScm `
          -ResourceGroupName myapp-rg `
          -ResourceName mywebapp `
          -Revoke

  #
  # Add your final unrestricted tasks here
  # 

In summary, whilst using dedicated private CI/CD agents can be considered the 'gold standard' for securing how CI/CD processes access network restricted cloud resources, it is not without its drawbacks; including cost, time/effort and the potential security risks if they are not kept up-to-date. The alternative solution described in this post offers a more lightweight, just-in-time approach that you may decide strikes a better balance for your scenario.

Perhaps you have an alternative solution you're already using, or suggestions for new resource types to add to the tooling shown here? Let me know if you do! @James_Dawson

FAQs

If I need to access virtual network connected Azure resources during a deployment, do I have to use a private build agent?

Not necessarily. Whilst it is impractical and insecure to add the IP ranges used by your cloud-based CI/CD agents to a permanent allow list, you can extend your deployment pipeline to include temporarily adding the IP address of the agent running the pipeline to an allow list for each of the resources it needs access to.

James Dawson

James is an experienced consultant with a 20+ year history of working across such wide-ranging fields as infrastructure platform design, internet security, application lifecycle management and DevOps consulting - both technical and in a coaching capacity. He enjoys solving problems, particularly those that reduce friction for others or otherwise make them more effective.