Troubleshooting Twilio with New Relic
Recruitment Genius asked endjin to develop an automated, telephony-based candidate interviewing system, using Twilio and Microsoft Azure.
After it went live, Recruitment Genius started receiving a number of support calls from users who were having trouble logging into the system. Troubleshooting involved a series of laborious steps to manually cross-reference Twilio logs with logs from the rest of the application stack. This information-gathering exercise could take up to 30 minutes per support request.
We decided to integrate New Relic into the solution to capture pertinent information about any errors at the time they happen, and surface it all in one easily accessible place: a New Relic dashboard.
New Relic offers real-time application monitoring. It provides dashboards of graphs and metrics covering an array of data sources, and can be deeply integrated into applications using its agents and API.
The rest of this article explains in more detail how Twilio and New Relic fit into the wider picture, and provides sample code demonstrating how to achieve the end result.
The solution – Twilio and New Relic integration
Integrating New Relic took two days, with a further two days of testing.
The graphic below shows the full application lifecycle for all parties involved. The relevant part of the story has been highlighted with numbered steps.
The steps can be described as follows:
- A user makes use of the automated interviewing app, which interacts with Twilio.
- Twilio calls into the backend service which is hosted in Microsoft Azure.
- If there are any errors in the backend process, they are logged to New Relic.
- The backend service returns an appropriate success or error response to Twilio.
- Twilio transforms that response into an outcome for the user, e.g. a spoken message.
- If there's an issue, the user contacts Recruitment Genius support to report it.
- The support team use the information available in New Relic to diagnose the root cause of the issue. The dashboard can also be monitored in real time, so appropriate action can be taken pre-emptively.
- The support team communicate the outcome to the user, e.g. that an incorrect pin code was entered, so they should try again or reset their pin code.
Once errors, along with all the relevant data about the request and the error itself, were being logged to New Relic, that information was easy to consume via the New Relic dashboard. This cut the information-gathering exercise for first-line support down to around 1 minute per request. Known errors, such as an invalid pin code, could be responded to immediately.
This screenshot shows a New Relic dashboard that is graphing and listing errors for the last 3 days. Notice there have been 89 occurrences of InvalidPinException. You can then click on the error to drill into all of the logged data for that error. If all of these users called the support team for help, having this dashboard would have saved the team around two man days.
Once the Twilio logs were visible to customer support, patterns emerged: 90% of issues were related to candidates mistyping their interview pin code. The interview invite email was changed to make pin codes more readable, and additional instructional text was added. These user experience changes resulted in a considerable reduction in support calls.
From a technical perspective, several performance bottlenecks became apparent, and were easily remedied once all the end-to-end diagnostic information about the interactions between the two platforms was unified.
The technical bits!
For a good understanding of Twilio basics, you could read this step-by-step guide to building a Twilio voice app with Web API.
A sample application demonstrating how to use Twilio and integrate it with New Relic is available in full on GitHub. It was written by endjineers Pascal (@TheBooleanFrog) and Mike (@MikeLarah).
We'll run through the main parts of this demo app now, to show at a high level how it integrates with Twilio and ties into New Relic for logging errors.
When someone calls the phone number for this demo app, Twilio is configured to make a request to an ASP.NET Web API controller called WelcomeController.
This returns a response telling Twilio to ask the caller for their 3-digit pin code, collect 3 digits, and then call the AuthenticateController with the result.
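To illustrate the kind of TwiML such a response contains, here is a minimal Python sketch (the demo itself is C#). It assumes the AuthenticateController is reachable at a hypothetical `/api/authenticate` route and uses a made-up prompt. `<Gather>`, `<Say>` and the `numDigits`, `action` and `method` attributes are standard TwiML.

```python
import xml.etree.ElementTree as ET


def welcome_twiml(authenticate_url: str = "/api/authenticate") -> str:
    """Build TwiML asking the caller for a 3 digit pin (route is hypothetical)."""
    response = ET.Element("Response")
    # <Gather> collects keypad input. Once 3 digits are entered, Twilio
    # requests `action`, sending the keys pressed in the "Digits" parameter.
    gather = ET.SubElement(
        response, "Gather", numDigits="3", action=authenticate_url, method="POST"
    )
    # Nesting <Say> inside <Gather> lets the caller type during the prompt.
    prompt = ET.SubElement(gather, "Say")
    prompt.text = "Welcome. Please enter your 3 digit pin code."
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(welcome_twiml())
```

Generating the XML with `ElementTree` rather than by hand keeps the markup well-formed as prompts change.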
All the controllers in this demo app inherit from a base controller called TwilioApiController, which contains a helper method to simplify returning an HTTP response.
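The helper's job can be sketched as follows. This is a hypothetical Python analogue, not the demo's code. The `twiml` method name, the `HttpResponse` type and the `HangupController` example are invented for illustration, and `text/xml` is assumed as the content type because Twilio reads TwiML as XML.

```python
from dataclasses import dataclass


@dataclass
class HttpResponse:
    """Minimal hypothetical response type for this sketch."""
    status: int
    content_type: str
    body: str


class TwilioApiController:
    """Hypothetical base class: shared helpers for every Twilio-facing controller."""

    def twiml(self, body: str, status: int = 200) -> HttpResponse:
        # Centralising this means every controller returns TwiML with the same
        # status and XML content type, instead of repeating the plumbing.
        return HttpResponse(status=status, content_type="text/xml", body=body)


class HangupController(TwilioApiController):
    """Example subclass: ends the call."""

    def get(self) -> HttpResponse:
        return self.twiml("<Response><Hangup/></Response>")
```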
The AuthenticateController makes a call to an authentication service.
In this demo, it simply looks up some in-memory user data to determine whether the pin code is a known and valid one.
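A minimal Python sketch of such a service, assuming a dictionary keyed by pin code. The class and method names are hypothetical, chosen for illustration, and the 3-digit format check is an assumption based on the pin length the app asks for.

```python
from typing import Dict


class InvalidPinException(Exception):
    """Raised when the pin entered does not match a known user."""


class InMemoryAuthenticationService:
    """Hypothetical in-memory stand-in for the demo's authentication service."""

    def __init__(self, users_by_pin: Dict[str, str]) -> None:
        # Copy so later changes to the caller's dict don't affect lookups.
        self._users_by_pin = dict(users_by_pin)

    def authenticate(self, pin: str) -> str:
        """Return the user's name for a valid pin, else raise InvalidPinException."""
        # A well-formed pin is exactly 3 digits; anything else cannot match.
        if len(pin) != 3 or not pin.isdigit():
            raise InvalidPinException(f"Malformed pin code: {pin!r}")
        name = self._users_by_pin.get(pin)
        if name is None:
            raise InvalidPinException(f"Unknown pin code: {pin!r}")
        return name
```

Raising an exception rather than returning a flag keeps the controller on the happy path and leaves failure handling to the error filter described below.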
If the user is authenticated successfully, the controller responds to Twilio, asking for a message to be spoken to the user, and for another request to be made to the UserController, passing the pin code along.
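The successful-authentication response might look something like this Python sketch. It assumes a hypothetical `/api/user` route and uses TwiML's `<Redirect>` verb to make Twilio request the next URL, with the pin carried as a query-string parameter. The demo may pass the pin differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def authenticated_twiml(pin: str, user_url: str = "/api/user") -> str:
    """TwiML for a successful login; `user_url` is a hypothetical route."""
    response = ET.Element("Response")
    message = ET.SubElement(response, "Say")
    message.text = "Thank you. Your pin code has been accepted."
    # <Redirect> makes Twilio fetch its next instructions from another URL.
    # The pin is URL-encoded into the query string so the next controller
    # knows which user is on the line.
    redirect = ET.SubElement(response, "Redirect", method="POST")
    redirect.text = f"{user_url}?{urlencode({'pin': pin})}"
    return ET.tostring(response, encoding="unicode")
```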
The UserController asks Twilio to say 'hello', along with the user's name, then end the call.
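A matching Python sketch for the final step. It assumes the controller has already resolved the caller's name from the pin, and `user_twiml` is a hypothetical helper name. Building the XML with `ElementTree` rather than string concatenation also escapes characters such as `&` in names.

```python
import xml.etree.ElementTree as ET


def user_twiml(name: str) -> str:
    """Greet the authenticated caller by name, then end the call."""
    response = ET.Element("Response")
    greeting = ET.SubElement(response, "Say")
    greeting.text = f"Hello {name}"  # ElementTree escapes characters like '&'
    # <Hangup> ends the call after the greeting has been spoken.
    ET.SubElement(response, "Hangup")
    return ET.tostring(response, encoding="unicode")
```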
If, however, authentication is not successful, an InvalidPinException is thrown.
When the demo app starts, an error filter called TwilioRequestErrorFilter is registered. Exception filters let you customise what happens to unhandled exceptions in ASP.NET Web API. The simplest way to write one is to derive from the System.Web.Http.Filters.ExceptionFilterAttribute class and override the OnException method.
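The filter pattern can be sketched in Python as a wrapper around a request handler. This is a loose analogue, not ASP.NET itself. `ExceptionFilter.on_exception` stands in for `OnException`, returning `None` stands for leaving the exception unhandled so the framework's default error response applies, and `ApologiseFilter` and `authenticate` are hypothetical examples.

```python
from typing import Callable, Dict, Optional

Params = Dict[str, str]
Handler = Callable[[Params], str]


class ExceptionFilter:
    """Loose analogue of ExceptionFilterAttribute: subclasses override on_exception."""

    def on_exception(self, exc: Exception, params: Params) -> Optional[str]:
        # Returning None means "not handled here".
        return None


def apply_filter(handler: Handler, exception_filter: ExceptionFilter) -> Handler:
    """Wrap a handler so unhandled exceptions are routed through the filter."""

    def wrapped(params: Params) -> str:
        try:
            return handler(params)
        except Exception as exc:
            response = exception_filter.on_exception(exc, params)
            if response is None:
                raise  # let the default error handling take over
            return response

    return wrapped


class ApologiseFilter(ExceptionFilter):
    """Example filter that turns any error into a spoken apology."""

    def on_exception(self, exc: Exception, params: Params) -> Optional[str]:
        return "<Response><Say>Sorry, something went wrong.</Say></Response>"


def authenticate(params: Params) -> str:
    raise RuntimeError("authentication backend unavailable")
```

Registering the filter once, rather than wrapping each action in try/except, keeps error handling consistent across every controller.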
This error filter collects some pertinent information about the error, such as the exception itself, the action to take for this error (escalate, ignore, retry), and the parameters of the request that caused the error.
If the error is a known custom exception, the error action is defined on the exception itself. In the case of an InvalidPinException, the action is set to ignore. If the exception isn't known, the action is set to escalate. This allows filtering of known and uninteresting errors for monitoring purposes.
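The classification rule above can be sketched in Python like this. `ErrorAction`, `KnownError`, `ErrorDetails` and `collect_error_details` are hypothetical names. The point is that each custom exception declares its own action, and anything unrecognised falls back to escalate.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class ErrorAction(Enum):
    ESCALATE = "escalate"
    IGNORE = "ignore"
    RETRY = "retry"


class KnownError(Exception):
    """Hypothetical base class for custom exceptions that declare an action."""

    action = ErrorAction.ESCALATE


class InvalidPinException(KnownError):
    # A mistyped pin is expected user behaviour, not a fault in the system.
    action = ErrorAction.IGNORE


@dataclass
class ErrorDetails:
    exception: Exception
    action: ErrorAction
    request_params: Dict[str, str] = field(default_factory=dict)


def collect_error_details(exc: Exception, params: Dict[str, str]) -> ErrorDetails:
    """Gather what the filter logs: the exception, its action and the request."""
    # Known exceptions carry their own action; anything else is escalated.
    action = exc.action if isinstance(exc, KnownError) else ErrorAction.ESCALATE
    return ErrorDetails(exception=exc, action=action, request_params=dict(params))
```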
The error details are then logged to New Relic, using the New Relic .NET agent API, where they can be monitored and alerted on, and used to troubleshoot issues effectively, as described previously in this article.
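The logging step can be sketched in Python with an in-memory stand-in for the agent. In the real .NET app this would be a call to the agent API's `NoticeError` method, which accepts an exception plus a dictionary of custom attributes (check the agent documentation for the exact signature in your version). Here `InMemoryErrorRecorder` is a hypothetical substitute so the example runs without an agent, and the `errorAction` and `request.` attribute names are assumptions. Recording the error action as an attribute is what lets dashboards and alerts exclude ignored errors. Recording Twilio's `CallSid` request parameter makes it easy to match an error to the corresponding Twilio call log.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ErrorAction(Enum):
    ESCALATE = "escalate"
    IGNORE = "ignore"
    RETRY = "retry"


class InMemoryErrorRecorder:
    """Hypothetical stand-in for the New Relic agent so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.recorded: List[Tuple[Exception, Dict[str, str]]] = []

    def notice_error(self, exc: Exception, attributes: Dict[str, str]) -> None:
        self.recorded.append((exc, dict(attributes)))


def log_error(recorder: InMemoryErrorRecorder, exc: Exception,
              action: ErrorAction, params: Dict[str, str]) -> None:
    """Record an error with its action and the Twilio request parameters."""
    attributes = {"errorAction": action.value}
    # Prefix request parameters (e.g. CallSid, Digits) to keep them grouped.
    attributes.update({f"request.{key}": value for key, value in params.items()})
    recorder.notice_error(exc, attributes)


def actionable_errors(recorder: InMemoryErrorRecorder) -> List[Exception]:
    """What a monitoring view might show: errors not marked as ignorable."""
    return [exc for exc, attrs in recorder.recorded
            if attrs["errorAction"] != ErrorAction.IGNORE.value]
```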
Don't forget the full sample application is available on GitHub! Hopefully this proves useful to people trying to achieve something similar.