Building A Highly Cached Fault Tolerant System using Redis, Hystrix and Spring Boot
A bit brief about how it started —
The challenge presented before us was to migrate the core ordering application from Ruby to Java (Spring Boot). To begin with this huge task, we picked our main booking show aggregate api, which was hitting around 30 downstream api’s and serving at average throughput of 1280rpm with average latency of 1100ms
The tasks we had in-front of us were -
- Build a system that is fault tolerant and isolated from downstream failures
- Build a fault tolerant and refreshable caching mechanism
Technologies we shortlisted for this task (Prerequisites for this article)
1. Redis (https://redis.io/)
2. Netflix Hystrix (v2.0.2) (https://www.baeldung.com/spring-cloud-netflix-hystrix)
3. Spring Boot (v2.1.0)
Architecture Low Level Design
- TTL1 — Short lived catching time
- TTL2 — Main time to live set in redis against the key
How the flow works -
The journey of making the external api call begins at our custom ThreadPoolTaskExecutor. Its various properties are modified to fulfill our requirement. Since we use java’s inheritable thread locals to make some fields globally accessible throughout the thread context, we have added a custom decorator to the threadpool to avoid the issues of data leak associated with a threadpool.
Part 1 — Hystrix
How hystrix fit’s the picture?
Hystrix in our current configuration creates a wrapper over the existing thread and serves as a circuit breaker. Now when the downstream service is hit and if it fails, either due to a timeout or due to an error response code. Hystrix considers it as a fail and triggers the fallback method which is associated with that api call. More details regarding how we handle this fallback in part 2 of this article. Now if the number of failures breaks a threshold i.e., x number of requests in say 10 second window, hystrix breaks the circuit (open circuit). Subsequent requests are directly sent to the corresponding fallback method.
By setting timeout property for each downstream endpoint, we can control how long to wait for response before hystrix fallback event triggers.
@HystrixCommand(commandKey = "GET_CITY_BY_ID", fallbackMethod = "getCityByIdFallback")
Part 2 — Redis
Redis at it’s core saves a key, value pair along with their expiration time (TTL i.e time to live). It provides methods to fetch value and time to live against any key. For our use-case we needed a fail proof, refreshable caching mechanism. To achieve this we created another dummy time to live. So our implementation involved 2 TTL’s, TTL1 and TTL2.
TTL2 is the main time to live for redis. This is the time after which the key is removed from cache.
TTL1 on the other hand is a name-say time to live. Any key with it’s time to live more than TTL1 and less than TTL2 is considered as stale, i.e., it’s value may or may not have changed in the underlining system. This time is not saved at redis level.
How redis with TTL1 and TTL2 fit the picture?
Caching is enabled for any api call by adding our custom annotation called @CustomCache.
@CustomCache(key = "'CityForCityId_'.concat(#id)", ttl1 = 1440, ttl2 = 2000)
This annotation works on the principles of Spring Aspect Oriented Programming. It takes 3 parameters, the key name, TTL1 and TTL2. The key name is a string java SpEL expression. How annotations and interceptor work is beyond the scope of this article. Following articles can be read for complete understanding -
Picking up from part 1,
After passing through hystrix, before landing on downstream api call function, the request is intercepted by our cache interceptor.
This request is only intercepted if the method is marked with our CustomCache annotation. The Interceptor obtains the key name from the SpEL expression and stores it, along with the TTL1 and TTL2 in the thread’s context (ThreadLocal)
The request now executes the api call function and reaches the common adapter layer.
Part 3 — Adapter Layer
Here the following steps of execution take place.
Assumption: Caching is enabled
- If data exists in cache, check the time to live for the key. If time to live < TTL1 then fetch from cache and return.
- If data exists in cache but the time to live > TTL1, then the data is considered not completely reliable (stale) and downstream service is hit to fetch the updated data. This fetched data is then saved back in redis. However, if now the downstream is down, then api fails and hystrix calls the fallback method of that particular api. In this fallback method we basically return the stale data, on the condition that sending this particular stale data in response is affordable.
As an example, say we are caching hotel data(name, location etc) and time to live has passed TTL1 but less than TTL2. Now as per the above explanation, we will be hitting the downstream service. In case of downstream failure, for our use case we can afford to display the old cached hotel data without causing any business impact. By this mechanism we ensure that our caching system is fault tolerant and in a way refreshable.
- If data does not exist then call the downstream service and save the data in redis.
- If the hystrix circuit is open, then by default the fallback method is executed, till circuit closes. By returning data from redis in these fallback methods we ensure that incases of major disruptions in the downstream service, our api doesn’t breaks and the latencies remain within check.
In the end we were able to reduce the average latencies by ~60% to 450ms. Our system was able to handle downstream disruptions with slight increases in latencies.
- In future, we plan to make cache update non blocking by incorporating redis pub-sub model
- Hystrix, provides caching mechanism too. However since hystrix is in maintenance mode we decided not to rely too heavily on it.
- Java8 CompletableFuture is being used to make async parallel calls to various downstream services
Folk’s that will be all from my end. Please feel free to contact in case of any doubt. Happy Coding :)