USB/coredump hangs in 8 and 9
Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
Re: System hang in USB umass module while processing panic (originally on freebsd-usb)
Hello Andriy and Hans,
Sorry for tying in so many discussions on this topic, but I think I have
an explanation for the problems we have been reporting* with hanging
coredumps on multicore systems on 8.2-RELEASE, and it has implications
for Andriy's proposed scheduler patch** and for USB.
In today's 8.X and 9.X branches, nothing that I can find stops the other
CPUs when the kernel panics, but many parts of the locking code get
disabled (grep on 'panicstr'). The 'bufwrite: buffer is not busy???'
panic is caused by the syncer encountering an error. If that happens
when it's on the dumping CPU, everything hangs. If it's running on a
different CPU, it will be blocked and hidden by the panic_cpu spinlock
in panic(), and the dump continues, polling every attached keyboard for
a Ctrl-C.
But the new 8.X USB stack relies on multithreading. (I think the new
stack is the variable that broke coredumps for us in the 7.1->8.2
transition.) SVN 224223 fixes a hang that would happen when dumpsys()
polls the USB keyboard (IPMI KVM, in our case). That helps, but the dump
then only gets as far as usb_process(), where it hangs in a loop around
a cv_wait() call. This is easy to reproduce by adding code to the
watchdog to break into the debugger if panicstr is set.
I am experimenting with Andriy's patch** to stop the scheduler, and it
seems to be most of the way there: it stops the CPUs and disables the
rest of locking. There are a few places that still reference panicstr,
but that's minor. These are the changes I made to the patch:
* Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED()
  is true, so that we don't hang up in USB. ukbd_yield() locks up in
  DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks
  up trying to drop mutexes.
* Changed the call to spinlock_enter() back to critical_enter(), so
  that interrupts stay enabled and the hardclock still functions.
* Added code at the beginning of panic() to switch to CPU 0, so that
  we're able to service the hardclock interrupts and so that watchdog
  panics get through.
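
For illustration, the first change above can be sketched roughly as
follows. This is a user-space mock, not the actual diff: the
SCHEDULER_STOPPED() macro is replaced by a plain flag, and
usb_poll_stub(), polls_run, ukbd_do_poll_sketch() and demo() are
hypothetical names standing in for the real driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stand-in for the patch's SCHEDULER_STOPPED(); a plain flag here. */
static bool scheduler_stopped;
#define SCHEDULER_STOPPED()	(scheduler_stopped)

/* Hypothetical counter: how many times the poll body actually ran. */
static int polls_run;

/* Hypothetical stub for the real polling work (ukbd_yield(), usbd_transfer_poll()). */
static void
usb_poll_stub(void)
{
	polls_run++;
}

/* Sketch of the early return added to ukbd_do_poll(). */
static void
ukbd_do_poll_sketch(void)
{
	if (SCHEDULER_STOPPED())
		return;		/* after a panic: skip USB polling entirely */
	usb_poll_stub();
}

/* Small self-check: polling runs normally and is skipped once stopped. */
static int
demo(void)
{
	scheduler_stopped = false;
	polls_run = 0;
	ukbd_do_poll_sketch();		/* normal operation: poll runs */
	assert(polls_run == 1);
	scheduler_stopped = true;
	ukbd_do_poll_sketch();		/* scheduler stopped: poll is skipped */
	assert(polls_run == 1);
	return (polls_run);
}
```

The point of returning before any of the polling work is that neither
DROP_GIANT() nor the mutex drops in usbd_transfer_poll() are ever
reached once the scheduler is stopped.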
This has worked 100% for me so far, although anyone using a USB keyboard
or dump device would still be out of luck.
Thoughts? It seems like stopping all of the other CPUs is the right
thing to do on a panic (what are they doing otherwise?). Are the USB
issues fixable? If Andriy's patch gets committed, it might just involve
short-circuiting all of the locking in the polling path, but I haven't
gotten that far yet. I bet dumping to NFS will have the same problem.
Thanks,
Andrew
* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff
--------------------------------------------------
Andrew Boyer aboyer@averesystems.com
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"