One of our customers, British Telecom, once faced an issue with a very high-traffic, analytics-heavy web application. When an end user logged into their portal, they would see a lot of analytical data on the dashboard. Because the data is customized per user, it has to be evaluated at run time. This worked well during normal hours, but during busy hours a login could take more than a minute and sometimes timed out. We solved this problem with a message queue.
Now you might be thinking: what is a message queue?
Well, message queuing allows applications to communicate by sending messages to each other. A message queue provides temporary message storage when the destination program is busy or not connected.
The basic architecture of a message queue is simple: client applications called producers create messages and deliver them to the message queue. Another application, called the consumer, connects to the queue and retrieves the messages for processing. Messages placed onto the queue are stored until the consumer retrieves them.
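To make the producer/consumer roles concrete, here is a minimal in-process sketch using Python's standard library. A real deployment would use a broker such as RabbitMQ or Redis, but the roles and the flow are the same; the message names and sentinel convention here are illustrative, not from our production system.

```python
import queue
import threading

# The queue sits between producer and consumer and stores
# messages until the consumer retrieves them.
work_queue = queue.Queue()

def producer(n_messages):
    # The producer creates messages and delivers them to the queue.
    for i in range(n_messages):
        work_queue.put(f"message-{i}")

def consumer(results):
    # The consumer connects to the queue and retrieves messages
    # for processing; get() blocks until a message is available.
    while True:
        msg = work_queue.get()
        if msg is None:          # sentinel: no more work
            break
        results.append(msg)      # "process" the message
        work_queue.task_done()

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(3)
work_queue.put(None)             # tell the consumer to stop
t.join()
```

Note that the producer never waits for the consumer: it just puts messages on the queue and moves on, which is exactly the decoupling described below.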
A message queue provides an asynchronous communication protocol: a system that puts a message onto a message queue does not require an immediate response to continue processing. Email is probably the best example of asynchronous messaging: when an email is sent, the sender can continue with other work without waiting for a response from the receiver. This way of handling messages decouples the producer from the consumer; the two do not need to interact with the message queue at the same time.
So here is what happened.
We had a distributed architecture with a load balancer passing each request to one of the nodes. That didn't work well, because these requests vary from user to user and can take anywhere from 1 second to 15 seconds to process, due to computational complexity and bottlenecks in the downstream systems.
We could have scaled up or scaled horizontally, but we noticed that on many occasions 2 out of 5 nodes were at 100% CPU usage while the others were at 50% or even less. The nodes at 100% were occupied with hundreds of threads performing complex data analytics, while the other nodes were serving requests with low computational complexity and relatively simple, fast downstream responses. The distribution of work was entirely the load balancer's decision.
At the worst times, there were cases where customers failed to log in and the page timed out, simply because the load balancer had assigned the job to a node that was already too busy to serve the request. After a significant amount of analysis, we decided to try out a different architecture.
Rather than pushing the workload, we wanted the nodes to pull the workload based on server capacity and on heuristics derived from each user's historical data. We even managed to apply a bit of machine learning here, using a very simple statistical classification, and it nailed it. I will go into more detail on how we did it.
We took the brave decision to remove the load balancer and replace it with a custom application we wrote ourselves, which queues the requests as users try to log into the system and lets the worker nodes pick them up from the queue one by one.
This fixed our problem to a certain extent; however, we still could not use CPU evenly across the nodes, there were still cases of bottlenecks in downstream systems, and the response time didn't improve drastically. We knew there was room for improvement, and with the business growing we would hit a dead end, and that was not far away.
In phase two of the work, we re-architected again on top of phase one. Rather than blindly depending on the nodes to pick up work, each request is now queued based on historical data from that user's previous logins.
We created four different queues based on the weighted cumulative average time taken by the last 10 logins of that particular user. The highest weight is given to the most recent logins: if the latest login is n, then the weight of n is 10, the weight of n-1 is 9, the weight of n-2 is 8, and so on. We found that this weighted cumulative average gave us the closest approximation. There is also a heuristic mechanism on each node that estimates the time remaining for all the work currently in progress on that node. This gives us great flexibility in picking up work from the defined queues. The diagram below shows how the queues are formed and categorized by the time it generally takes to process the analytical data, based on previous login sessions.
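As a sketch, computing the weighted cumulative average over the last 10 logins (weights 10 down to 1, most recent first) and mapping it to one of the four queues might look like the following. The queue boundary values here are hypothetical; the article does not give the actual thresholds.

```python
def weighted_avg_login_time(durations):
    """durations: login durations in seconds, most recent first (up to 10).
    The most recent login gets weight 10, the next 9, and so on."""
    weights = list(range(10, 10 - len(durations), -1))  # 10, 9, 8, ...
    total = sum(w * d for w, d in zip(weights, durations))
    return total / sum(weights)

# Hypothetical boundaries (seconds) separating the four queues.
QUEUE_BOUNDS = [2, 5, 10]

def pick_queue(avg_seconds):
    """Return 0 (fastest logins) through 3 (slowest logins)."""
    for i, bound in enumerate(QUEUE_BOUNDS):
        if avg_seconds <= bound:
            return i
    return len(QUEUE_BOUNDS)
```

For a user whose last three logins took 4, 5, and 6 seconds (most recent first), the weighted average is (10·4 + 9·5 + 8·6) / 27 ≈ 4.9 seconds, so the request lands in the second queue.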
The underlying workers have access to all four queues, and they intelligently pick a work item based on the following constraints:
- Current CPU load on the node.
- Approximate time remaining to complete each of the tasks running on the node.
- Total number of tasks running on the node.
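A minimal sketch of how a worker might score itself against a pending job using those three signals. The 80% CPU threshold is from this article; the task cap and the decision structure are assumptions, not our exact production logic.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    cpu_pct: float             # current CPU load on the node
    secs_to_finish_all: float  # approx time remaining for in-progress work
    running_tasks: int         # total number of tasks on the node

CPU_THRESHOLD = 80.0           # below this, the node pulls work right away

def should_pick_up(node: NodeStatus, job_estimate_secs: float,
                   max_tasks: int = 50) -> bool:
    """Decide whether this node should pull the next job from a queue."""
    if node.running_tasks >= max_tasks:    # hypothetical hard cap
        return False
    if node.cpu_pct < CPU_THRESHOLD:       # CPU load is the prioritized signal
        return True
    # Busy node: only take the job if the current work should drain
    # before the new job would need the CPU.
    return node.secs_to_finish_all <= job_estimate_secs
```

Running this against the four-node example below: Node 1 (60% CPU) and Node 2 (20% CPU) qualify immediately, Node 3 qualifies because its 8 seconds of remaining work will drain in time, and Node 4 (90% CPU, 15 seconds remaining) declines.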
CPU load is the prioritized signal. If CPU usage is below 80%, the node picks up work right away, knowing that the other tasks are consuming few resources and may finish soon, or are blocked on a downstream system, so we can compute our task in between. There is also an edge case where the historical login time turns out to be a poor predictor, and a task takes far more time or CPU than expected.
For instance, at a given moment, the status of the 4 nodes might look like this:
- Node 1 (CPU: 60%, Time to complete all remaining tasks: 15 secs, Task about to finish: 4 secs, Number of threads: 20)
- Node 2 (CPU: 20%, Time to complete all remaining tasks: 12 secs, Task about to finish: 7 secs, Number of threads: 19)
- Node 3 (CPU: 100%, Time to complete all remaining tasks: 8 secs, Task about to finish: 2 secs, Number of threads: 5)
- Node 4 (CPU: 90%, Time to complete all remaining tasks: 15 secs, Task about to finish: 15 secs, Number of threads: 8)
Now, suppose we have a request to queue that will take 8 seconds to finish. Here is how each node stands:
- Node 1 – qualified to pick up this job.
- Node 2 – qualified to pick up this job.
- Node 3 – CPU is at 100%, but it will most probably be freed in the next couple of seconds.
- Node 4 – Not much CPU left, and it may take time to finish the remaining work.
The best matches are Node 1 and Node 2, and both are ready to pick up. Now consider the case where Node 1 and Node 2 try to pick up the same job at the same time. As a rule of thumb, a message queue channel should not be shared between threads or instances running asynchronously.
What we did was leverage Redis, which we were already using, as a thread-safety lock across the nodes. Before we pop an item from the queue, we first peek at it and write the relevant information to Redis, marking that we are going to pop and process that item. If, in the meantime, another thread or node comes to pick up work, it checks whether the item has already been claimed; if not, it proceeds to peek and pop the job for processing.
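A sketch of this claim-before-pop pattern, using Redis's `SET ... NX` as a cross-node lock. The key names are hypothetical, and the tiny in-memory class below stands in for Redis only so the sketch runs anywhere; the real `redis-py` client exposes the same `set`, `lindex`, and `lpop` methods.

```python
class FakeRedis:
    """In-memory stand-in for the few Redis calls used below."""
    def __init__(self):
        self.kv = {}
        self.lists = {}
    def set(self, name, value, nx=False):
        # SET with NX: only succeeds if the key does not already exist,
        # so exactly one node can create a given claim key.
        if nx and name in self.kv:
            return None
        self.kv[name] = value
        return True
    def lindex(self, name, index):
        lst = self.lists.get(name, [])
        return lst[index] if -len(lst) <= index < len(lst) else None
    def lpop(self, name):
        lst = self.lists.get(name, [])
        return lst.pop(0) if lst else None

r = FakeRedis()

def try_claim_and_pop(queue_name):
    """Peek at the head of the queue; pop it only if we win the claim."""
    item = r.lindex(queue_name, 0)         # peek without removing
    if item is None:
        return None                        # queue is empty
    if r.set(f"claim:{queue_name}:{item}", "claimed", nx=True):
        return r.lpop(queue_name)          # we own the claim; pop and process
    return None                            # another node claimed it first
```

In production you would also give the claim key a TTL (the `ex` parameter of `SET`) so that a crashed worker does not hold a claim forever; and as noted below, there is still a small window between the peek and the pop, which is why we never considered this a perfect solution.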
This is probably not a perfect solution, but it is the closest thread-safe mechanism we could provide across the nodes.
This solution gave our login process a significant boost. We went from many unhappy customers who believed they would not be able to log in on the first try and would need to try again, to zero failures since the implementation, with the average login time cut down to 8 seconds. We have since leveraged this approach in many other components to provide a better user experience to our customers.