kernel ext4 lockup causing nginx slowdown #177
Comments
It looks like if the average CPU utilization is above 20% for many sequential intervals, then they need a reboot. FWIW, in my experience the only way to build fully reliable systems is to detect bad states and restart. |
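For illustration, here is a minimal sketch of the "detect a bad state" half of that idea. The 20% threshold comes from the comment above; the class name, the number of required intervals, and the way samples are fed in are all assumptions, not anything that exists in this project.

```python
class SustainedLoadDetector:
    """Flags a bad state once CPU utilization has stayed above a threshold
    for a given number of consecutive sampling intervals."""

    def __init__(self, threshold_pct=20.0, required_intervals=5):
        # 20% is taken from the discussion; required_intervals is a guess
        # and would need tuning against real load-balancer data.
        self.threshold_pct = threshold_pct
        self.required_intervals = required_intervals
        self._consecutive = 0

    def observe(self, cpu_pct):
        """Record one interval's average CPU; return True if the bad state holds."""
        if cpu_pct > self.threshold_pct:
            self._consecutive += 1
        else:
            # Any interval at or below the threshold resets the streak.
            self._consecutive = 0
        return self._consecutive >= self.required_intervals


if __name__ == "__main__":
    detector = SustainedLoadDetector(threshold_pct=20.0, required_intervals=3)
    for sample in [25, 30, 10, 21, 22, 23]:
        print(sample, detector.observe(sample))
```

Requiring a streak rather than a single high sample keeps short bursts of legitimate traffic from triggering a restart.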
Also, in this case nginx is not something we really need to be in the business of ensuring the reliability of... |
Should we switch from Nginx to something else? |
I don't think this is nginx's fault, I'm pretty sure this is a kernel bug. |
Do you have a full backtrace, and what's the filesystem? |
The root filesystem is |
I thought you said it was in fsync? |
Indeed I did, but I must have crossed some wires, because it was actually a
|
Ok, there are a couple of kernel bugs with backtraces that look like this, but nothing particularly recent. I would recommend a kernel upgrade, and if it still happens, let me know next time and I can try to diagnose on a live system. |
We upgraded from |
In my ideal world we would have monitoring that detects this, records whatever dmesg info it needs to diagnose this, notifies us (on #infrastructure?), and automatically reboots the machine. Or would it be better to keep this machine around for debugging and replace it with a new machine? |
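As a rough sketch of what the detection part of such monitoring could look like: the snippet below scans `/proc` for processes with a given name that are in state `D` (uninterruptible sleep), which is the symptom described in this issue. The function names and the `proc_root` parameter are hypothetical, and the notification, dmesg capture, and reboot steps are deliberately left out.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def find_stuck_processes(name="nginx", proc_root="/proc"):
    """Return sorted PIDs of processes called `name` currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if comm == name and state == "D":
            stuck.append(int(entry))
    return sorted(stuck)


if __name__ == "__main__":
    print(find_stuck_processes("nginx"))
```

A process is briefly in state `D` during normal disk I/O, so a real check would presumably only alert when the same PID shows up as stuck across several consecutive scans.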
We had a discussion about the `us-east` loadbalancer getting slow. Initial inspection showed that the network interface was rarely achieving more than 180Mbps out. Diving deeper, it was found that some `nginx` processes had become stuck in "uninterruptible sleep" (`D` in `ps` output). Looking in `dmesg` after an `echo w > /proc/sysrq-trigger` showed that they were stuck in the kernel during an `fsync`.

This is a fairly pathological failure, but there are a few things we could do to ameliorate it:

- More `nginx` processes. Right now we only have 2 processes on the loadbalancer; we could probably double this and go up to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert `$(($(nproc) * 2))` into the `worker_processes` directive.

I am loath to do something drastic like auto-reboot the loadbalancer because it is supposed to be the piece that you don't have to reboot. If this happens again, I'll consider it.
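The templating step for `worker_processes` mentioned above could be sketched roughly as follows. The `{{WORKER_PROCESSES}}` placeholder and the function name are made up for illustration; the actual "optimized" config and its deployment tooling are not shown in this issue. Note also that `os.cpu_count()` reports all CPUs on the machine, whereas `nproc` reports only those available to the calling process, so the two can differ under CPU affinity or container limits.

```python
import os


def render_worker_processes(template, per_core=2, cores=None):
    """Fill a hypothetical {{WORKER_PROCESSES}} placeholder with cores * per_core."""
    if cores is None:
        # Fall back to 1 if the core count cannot be determined.
        cores = os.cpu_count() or 1
    return template.replace("{{WORKER_PROCESSES}}", str(cores * per_core))


if __name__ == "__main__":
    # With 2 cores this yields the 4 processes discussed above.
    print(render_worker_processes("worker_processes {{WORKER_PROCESSES}};", cores=2))
```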