Latency, from a general point of view, is the time delay between the cause and the effect of some physical change in the system being observed.
Request rate, error rate and request duration or latencies (RED) are some of the key metrics that are commonly monitored in any online system or service.
Latencies are generally plotted and monitored at percentiles such as p90, p95 and p99. As you move towards the higher percentiles, the latencies keep increasing; these are the tail latencies. The Tail at Scale paper by Google talks about this in detail and also describes various techniques to counter it. One of the techniques discussed is Request Hedging.
This post is on tail latencies, and how request hedging was used to curtail them in one of the high scale services at Razorpay.
The service in question here is the notifications platform. Let’s talk a little bit about it before jumping into the problem and its solution.
The Notifications Service
The notifications service in Razorpay is a high throughput service, which receives requests to send out various types of notifications like webhooks, SMS, emails etc. at a peak rate of approximately 2000 requests/sec.
The complete architecture of the service is not in the scope of this post, but as a high level overview: the API layer of this service receives the requests and pushes the messages to SQS. The workers then consume messages from SQS and hit the webhook or provider endpoints, after which the status is updated in the database. There is also a job which runs periodically to retry failures with exponential backoff.
The Problem Statement
The clients of this notifications service had a strict timeout of 350ms on every API call, but we would often notice client timeouts, which was not desirable. On further debugging, the tail latencies in pushing to SQS turned out to be the culprit. The p99.9 latencies would sometimes go up to 600ms!
PS: We do have fallbacks in place in the clients so as not to lose any event.
Before talking about the solution we employed, let’s look at how the Google paper defines Hedged Requests.
A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. We term such requests “hedged requests” because a client first sends one request to the replica believed to be the most appropriate, but then falls back on sending a secondary request after some brief delay. The client cancels remaining outstanding requests once the first result is received.
While the Google paper talks about Hedged Requests primarily in the context of read requests, we used the technique in the write flow. We piggybacked on the database and the cron job setup, which were already in place, and write the request to the database if the SQS push doesn’t succeed within the defined timeout period. One drawback of this approach is that it can lead to duplicate deliveries, but that was acceptable as we only promise at-least-once delivery semantics anyway.
Please do note that hedging writes is not a good idea if you don’t have at-least-once delivery semantics or if your writes are not idempotent.
The implementation involved enqueueing the message in a separate goroutine with a strict timeout, and writing the request to the database whenever the SQS push did not finish within that timeout.
Why not use http timeouts?
Now, this is an interesting question. An alternative approach to solving this problem could have been to use http transport timeouts like Dialer Timeout, TLS Handshake Timeout and ResponseHeaderTimeout.
Here is a good primer on net/http timeouts in golang.
Using these transport timeouts would have meant:
- The sum of Dialer Timeout, TLS Handshake Timeout and ResponseHeaderTimeout would need to be less than the desired value of 350ms.
- Initial connection establishment, which includes fetching IAM credentials, DNS resolution and the SSL handshake, would have to complete within 350ms before the payload could even be sent and acknowledged, which would have been a close call.
- A minor degradation in any of the aforementioned phases could even lead to connections never getting established.
Keeping relaxed transport timeouts along with a strict timeout in the application helps to mitigate this issue.
Is there a better approach to achieve the same? Please do provide your suggestions in the comments :)