Unicorn Workers Timing Out Because of a Stalled Redis Connection
We ran into a strange problem this week: unicorn workers were timing out. In
this post we describe (1) how we found that Redis was the cause and (2) how we
fixed the problem.
Prelude
The story started when timeout errors were found in the unicorn log. It seemed
that workers were killed due to timeout by the master process while they were
waiting for something to complete.
Finding The Cause
The first problem was that the workers were killed without leaving any log. So
we patched unicorn and modified its config to dump the stack trace when a
worker is killed. The idea is to have the master kill workers with INT instead
of KILL (which cannot be trapped), then trap INT in the worker and dump the
stack trace before it exits.
Thanks to this patch we could get the stack trace.
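The worker-side half of that idea can be sketched roughly as follows. This is not the actual patch: the helper names format_backtraces and install_stack_dump_trap, and the exit_after flag, are made up for illustration. It assumes the master has already been changed to send INT rather than KILL.

```ruby
# Hypothetical helper (not part of unicorn): render the backtrace of every
# live thread as plain text.
def format_backtraces(threads = Thread.list)
  threads.map do |t|
    frames = (t.backtrace || []).map { |line| "  #{line}" }
    (["Thread #{t.object_id}"] + frames).join("\n")
  end.join("\n")
end

# Hypothetical helper: trap the given signal, write the stack traces to `io`,
# then terminate immediately. `exit_after` exists only so the sketch can be
# exercised without killing the current process.
def install_stack_dump_trap(signal = "INT", io = $stderr, exit_after: true)
  Signal.trap(signal) do
    # Keep the handler simple: plain IO writes only, no Logger or Mutex,
    # because locking is not allowed inside a trap handler.
    io.write("worker #{Process.pid} received SIG#{signal}\n")
    io.write(format_backtraces + "\n")
    exit!(1) if exit_after
  end
end
```

In a unicorn config file, a helper like this would presumably be called from the after_fork hook so that every worker installs the handler, for example `after_fork { |server, worker| install_stack_dump_trap("INT") }`.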
From the log we figured out that the workers were blocked while trying to
write to a socket. The Redis server disconnects clients that have been idle
for more than 300 seconds, and it seems the Ruby redis client gem doesn't
handle that disconnection.
Fix
For the time being, we updated the Redis config so that it no longer
disconnects idle clients, by setting timeout 0 in redis.conf. Of course,
fixing the library itself is the ideal solution, and we might blog about it in
the future.
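For reference, the workaround corresponds to a redis.conf fragment like the one below. The comment is ours; the rest of the file is assumed to be unchanged.

```
# Close the connection after a client is idle for N seconds (0 disables this)
timeout 0
```

The same setting can also be applied to a running server with `CONFIG SET timeout 0`, though it will not survive a restart unless redis.conf is updated as well.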