Azure Batch - Time is Money in Big Compute
Earlier in the year, endjin worked with the Azure Batch Product Team to run a series of experiments against the Azure Batch service using a framework we developed for performing scale, soak and performance tests.
We've had conversations with a number of organisations over the last 5 years who have scaled their compute intensive workloads (SAS, SQL Server, financial models, image processing) to the biggest box, with the highest number of cores, and the largest amount of memory that their hardware vendor can supply, only to find that their application still creaks and slowly creeps towards exceeding their SLA.
Moving to the cloud in the hope that Microsoft can provide "a bigger box" doesn't work, because Microsoft only has access to the same vendor hardware as the customer. So, we simulated a number of real-world scenarios, including spinning up 16,000 A1 virtual machines, to determine whether Batch would be a good fit for the financial services organisations we work with.
We ended up consuming over 1,000,000 core compute hours (at a cost of $100,000), capturing as much telemetry as we could to gain real insight into how the platform behaves when you start to use it at large scale. As well as exploring the various supported features, we wanted to understand the most efficient ways to use the service.
The reason for the existence of Big Compute platforms in the first place is to improve the performance of (and in some cases simply enable) large scale distributed computational workloads - as such, any inefficiencies in this process are only amplified as things get bigger and bigger.
The old saying "time is money" is never truer than with a Pay as You Go service, and as Microsoft Azure is exactly that, Azure Batch is the equivalent of enabling data roaming and inviting the whole world to tether at once. Like for like, the costs for Batch are the same as for normal compute hours, but it's difficult to really start playing around with it unless you're prepared to spend some real money - simply because of the nature of the beast.
You could absolutely use it to run a single task on a small node but where's the fun in that? It wants to be turned up to 11 - or in our case 16,000 - which can get very expensive very quickly.
The good news is that our tests showed there were indeed some optimisations that could be applied to the out-of-the-box platform to improve efficiency in how tasks are scheduled across a pool, how a pool of nodes is provisioned, and how an individual node is configured.
For example, some versions of the Guest OS applied to the node image take longer to apply than others, which can have a big impact on how long it takes to provision the nodes. Also, the many configuration options available mean there are many ways to distribute tasks across the nodes in a pool.
Adjusting the configuration to match what your job is actually doing - the resources it needs and the dependencies between its tasks - will directly affect the overall performance. Finally, the data centres used have peaks and troughs of usage, so with some careful planning, the same job with the same configuration can be scheduled to run more quickly at different times of the day.
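To make that a little more concrete, here's a minimal sketch of the kind of pool settings involved, using the Azure Batch Python SDK: the Guest OS family applied to the nodes, how many tasks each node runs concurrently, and whether tasks are packed onto a few nodes or spread across the pool. The account name, URL, pool size and values shown are illustrative placeholders, not our test configuration, and exact property names can vary between SDK versions.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Illustrative placeholders - substitute your own Batch account details.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.northeurope.batch.azure.com"
)

pool = batchmodels.PoolAddParameter(
    id="big-compute-pool",
    vm_size="small",  # an A1-class node
    # The Guest OS family applied to the node image - some versions take
    # noticeably longer to apply, which affects provisioning time.
    cloud_service_configuration=batchmodels.CloudServiceConfiguration(os_family="4"),
    target_dedicated_nodes=100,
    # How many tasks a single node will run concurrently...
    max_tasks_per_node=4,
    # ...and whether the scheduler packs tasks onto as few nodes as possible
    # or spreads them evenly across the pool.
    task_scheduling_policy=batchmodels.TaskSchedulingPolicy(
        node_fill_type=batchmodels.ComputeNodeFillType.spread
    ),
)
client.pool.add(pool)
```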
We really believe in the Azure Batch platform and would recommend it to organisations in a number of scenarios - including those with a need for Big Compute and those who are processing data as part of an Azure Data Factory pipeline.
However, with the additional insight we've gained from conducting exhaustive testing, we've been able to optimise the use of Azure Batch whilst applying it to some of our own business problems, for example building an image matching solution that can perform 4.25 billion comparisons an hour with a relatively small pool of nodes.
Unless you're building your own version of Azure Media Services, Azure Batch probably isn't something you'll be running 24/7. A more common use case is on-demand scale for a high-intensity job that you need to run quickly.
Azure Batch provides a prescriptive, easy-to-follow architecture that allows you to distribute and schedule your workload in order to do this. But, as every workload is different, the platform should be configured optimally to minimise the cost associated with each job run. Unfortunately, the reality is that figuring all of this out by testing different configurations takes time, which once again means money.
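As a rough illustration of that pool/job/task workflow (again using the Azure Batch Python SDK, with placeholder names rather than anything from our tests): you create a job that targets an existing pool, then submit a collection of tasks, and the service takes care of distributing and scheduling them across the nodes.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder credentials - substitute your own Batch account details.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.northeurope.batch.azure.com"
)

# A job targets an existing pool (e.g. the one configured earlier).
job = batchmodels.JobAddParameter(
    id="nightly-run",
    pool_info=batchmodels.PoolInformation(pool_id="big-compute-pool"),
)
client.job.add(job)

# Each task is just a command line the service schedules onto a node;
# in a real job this would be your own executable plus its resource files.
tasks = [
    batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"cmd /c echo running work item {i}",
    )
    for i in range(1000)
]

# Tasks can be submitted in batches of up to 100 per call.
for start in range(0, len(tasks), 100):
    client.task.add_collection(job_id="nightly-run", value=tasks[start:start + 100])
```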
The data and recommendations we have captured can save you a lot of this time and money, allowing you to pass the performance gains and reduced running costs on to your own customers sooner. The output of our experiments - our Azure Batch Evaluation Whitepaper - describes the risks and benefits of adopting Azure Batch, as well as the detailed insights from our exhaustive testing.
If you're interested in reading the whitepaper when it is released, please let us know, so we can keep you updated.
Keep up to date with developments in the Azure ecosystem by signing up for our free Azure Weekly newsletter: you'll receive a summary of the week's Azure-related news direct to your inbox every Sunday, or you can follow @azureweekly on Twitter for updates throughout the week.