Optimising software engineering at scale
ASOS are one of the world's largest online retailers, with over £2.5bn in revenues. As the team grew to over 4,000 employees, their CTO, Bob Strudwick, turned to endjin to help optimise their product delivery.
It's a real challenge to understand the reasons why a development process might not be as productive as the business expects.
The problem is rarely the technology (however often developers complain that "the build is slow!").
So we always start by talking to the people - from senior stakeholders to the most junior developers.
We used our machine learning tools to analyse the product development lifecycle and it swiftly became apparent that the real problem was that the team was doing too much!
Dependencies will get you
Dependencies between teams in different time zones caused knock-on effects: deployment windows were missed, and whole release cycles of critical functionality could be delayed by an issue in a minor feature owned by another team.
We helped identify the process bottlenecks and developed a scheduling methodology that could deal with dozens of teams delivering concurrently.
Automate the automatable
Once the methodology was right, we could then look at the detail. What was time-consuming, manual, and error-prone in the DevOps process? Those things are prime candidates for automation.
ASOS has a huge Azure estate across multiple technologies and environments. Continuous deployment through various release tiers, regions and versions means that something is always in flux. And in any sufficiently large, dynamic system, you have to take transient failure as a given, and engineer for resilience.
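To make "engineer for resilience" concrete, one common pattern for coping with transient failure is to retry an operation with exponential backoff and jitter. What follows is a minimal sketch of that general pattern, not the solution built for ASOS; the names `TransientError`, `retry_with_backoff` and `flaky_deploy_step` are hypothetical and exist only for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The wait ceiling doubles after each failed attempt, capped at max_delay.
    The actual wait is a random value below that ceiling ("full jitter"),
    so that many callers do not all retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_deploy_step():
        # Illustrative stand-in for a deployment call: fails twice, then works.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("service temporarily unavailable")
        return "deployed"

    print(retry_with_backoff(flaky_deploy_step))
```

Only errors explicitly marked as transient are retried; anything else propagates immediately, which keeps genuine faults visible rather than hiding them behind retries.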
We designed and developed a solution that enabled the team to track deployment failures through their estate, visualise the "tracer bullet" through their logging, and trace problems to their root cause in the failed system, with just a couple of clicks.
Always be learning
More importantly, it integrated a feedback loop into their helpdesk system, so common errors could be identified, recorded, and solutions built in to both manual and automated runbooks. The system could learn from its past mistakes.
We left ASOS with process and tooling improvements that reduced common diagnostic times from hours to seconds, and more than quadrupled the number of releases to production they could perform each year. That was a great result in itself.
But we also left them with a "feedback loops" mentality that they could apply to new technology and processes as they evolved, readying them for the Data Science and Machine Learning projects that were part of their evolving data strategy.