Re: Handle kernel module crashes
On 10 Jun 2013, at 15:40, Florent Peterschmitt <florent@peterschmitt.fr> wrote:
> Ok and isn't it a "bad" thing ? I mean, even if the video driver
> crashes, I still want to have the ability to reboot the right way,
> avoiding corrupted files and WIP lose.
>
> Another thing is a non-critical module that can crash, but because not
> used by all apps on the machine, letting them ones that can continue
> run.
>
> But I don't know what is the approach of FreeBSD and devs about that.

Yes, it's a bad thing. If we had privilege domain crossing that was as
cheap as a function call (or, at least, almost as cheap) then we could
implement fine-grained separation within the kernel and not incur any
performance penalty. Unfortunately, this is not possible without some
fairly significant changes to current CPU instruction sets (which,
actually, several of us in FreeBSD land are working on, but that's
unlikely to be seen in any mainstream processor for at least 5-10
years).

In the current world, we have a fairly poor selection of choices for
isolation. On i386, we had 4 protection rings, but on the 486 and newer
the cost of transitions to and from rings 1 and 2 became increasingly
expensive because most operating systems only used rings 0 and 3
(NetWare and OS/2 are the two exceptions that I know of). On other
architectures we just have privileged and unprivileged modes. Code in
privileged mode can't be isolated from other code in privileged mode,
and code that is in unprivileged mode incurs some overhead for calls
into privileged mode.

There are some tricks that you can do to enforce some weaker protection.
For example, on 64-bit platforms every driver could be written to use
32-bit pointers, be given a 4GB segment of privileged-mode virtual
memory to work in, and have to go through special gates to do anything
with the rest of the kernel's address space. You'd then end up with a
lot more TLB churn, but you'd gain protection against a number of kinds
of pointer error (protection faults inside the 32-bit window would just
result in that module being killed and restarted).
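
To make the 32-bit window idea a little more concrete, here is a toy
model in Python. It is not kernel code, and the names (DriverWindow,
ProtectionFault, run_isolated, buggy_driver) are made up for
illustration. The mask stands in for the fact that a 32-bit pointer
cannot name anything outside the window, and the page check stands in
for what the MMU would do in hardware.

```python
PAGE_SIZE = 4096
MASK32 = (1 << 32) - 1   # a 32-bit pointer can only name 4GB


class ProtectionFault(Exception):
    """Stands in for a hardware fault on an unmapped page in the window."""


class DriverWindow:
    """Toy model of one driver's private 4GB window (hypothetical)."""

    def __init__(self, base):
        self.base = base            # where the window sits in 64-bit space
        self.mapped_pages = set()   # pages the driver is allowed to touch
        self.memory = {}            # sparse stand-in for memory contents

    def map_page(self, pointer):
        self.mapped_pages.add((pointer & MASK32) // PAGE_SIZE)

    def translate(self, pointer):
        # Truncation to 32 bits: even a wild value stays inside the window.
        offset = pointer & MASK32
        if offset // PAGE_SIZE not in self.mapped_pages:
            raise ProtectionFault(hex(offset))
        return self.base + offset

    def store(self, pointer, value):
        self.memory[self.translate(pointer)] = value


def run_isolated(driver_entry, base, max_restarts=1):
    """Run driver_entry in a fresh window; restart it after a fault."""
    faults = 0
    while True:
        window = DriverWindow(base)
        try:
            return driver_entry(window, faults), faults
        except ProtectionFault:
            faults += 1
            if faults > max_restarts:
                raise


def buggy_driver(window, attempt):
    """Hypothetical driver: wild pointer on the first run, fine afterwards."""
    window.map_page(0x1000)
    pointer = 0xDEAD0000 if attempt == 0 else 0x1010
    window.store(pointer, 42)
    return window.translate(pointer)
```

Calling run_isolated(buggy_driver, base) faults once on the unmapped
address and then succeeds in a fresh window. The model says nothing
about the hard part, which is cleaning up whatever the module was in the
middle of doing when it died.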

Unfortunately, there are several problems with this. The most obvious
is that killing a module is not always trivial. For example, a module
may hold various locks, but it's not always clear which module owns a
lock. Locks are held by kernel threads, but a thread can have a call
stack spanning several modules. Working out exactly which driver holds
the lock is not always trivial, and there is also the question of what
you do about a thread that contains some call frames belonging to the
module that you've just killed. You'd need to provide some
exception-like mechanism for handling this case (and unwinding the stack
in the case where it is potentially corrupt is also nontrivial).
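
The ownership problem can be sketched with a small Python model. All of
the names here (KThread, Lock, candidate_owners, threads_needing_unwind,
and the module and function names on the stacks) are invented for the
example. It assumes, as described above, that a lock records only the
thread that holds it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KThread:
    name: str
    # Call stack, outermost frame first; each frame is (module, function).
    stack: list = field(default_factory=list)


@dataclass
class Lock:
    name: str
    owner: Optional[KThread] = None   # a thread is recorded, not a module


def candidate_owners(lock):
    """Every module on the holder's stack might be the one that took it."""
    if lock.owner is None:
        return []
    modules = []
    for module, _function in lock.owner.stack:
        if module not in modules:
            modules.append(module)
    return modules


def threads_needing_unwind(threads, dead_module):
    """Threads with at least one call frame in the module being killed."""
    return [t.name for t in threads
            if any(module == dead_module for module, _ in t.stack)]


# Made-up example state: one I/O thread whose stack crosses three modules.
io = KThread("io", [("vfs", "read"), ("blockdev", "submit"),
                    ("exampledrv", "start_io")])
net = KThread("net", [("netstack", "input"), ("examplenic", "rx")])
disk_lock = Lock("disk_queue", owner=io)

print(candidate_owners(disk_lock))
print(threads_needing_unwind([io, net], "exampledrv"))
```

The first call can only narrow the answer down to three modules, which
is the ambiguity in question. The second finds the thread that would
need some exception-like unwinding if exampledrv were killed, but does
nothing about whether its stack can still be trusted.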

An alternative is to run the driver entirely, or mostly, in userspace.
The 'mostly' option is often better. For example, certain categories of
USB devices are exposed by the FreeBSD kernel as USB generic devices
(ugen driver) and some userspace component sends USB commands to it.
This involves some extra copying, but means that most of the
(potentially buggy) driver logic is in the application. If it crashes,
you lose the application state (which, in a desktop setting, is only
slightly better than crashing the kernel), but not the whole kernel.
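
The containment property is easy to demonstrate with ordinary
processes. The sketch below is a stand-in only: it does not talk to
ugen or to any real device, and DRIVER_SOURCE and supervise are
hypothetical names. It runs some 'driver logic' in a child process that
dies on its first run, and shows that the parent survives and can start
a fresh copy.

```python
import subprocess
import sys

# Hypothetical driver logic, run as a separate process. It exits with a
# nonzero status on its first run to simulate a crash.
DRIVER_SOURCE = """
import sys
attempt = int(sys.argv[1])
if attempt == 0:
    sys.exit(139)   # simulated crash in the buggy driver logic
print("transfer complete")
"""


def supervise(max_restarts=3):
    """Run the driver process, starting a new one each time it dies."""
    for attempt in range(max_restarts + 1):
        result = subprocess.run(
            [sys.executable, "-c", DRIVER_SOURCE, str(attempt)],
            capture_output=True, text=True)
        if result.returncode == 0:
            return attempt, result.stdout.strip()
        # The dead process's state is lost, but the supervisor is intact.
    raise RuntimeError("driver kept crashing")


if __name__ == "__main__":
    restarts, output = supervise()
    print(f"restarts={restarts} output={output!r}")
```

Whatever state the first process had built up is gone after the
restart, which is the 'you lose the application state' cost mentioned
above.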

In the case of certain modern network interfaces (InfiniBand in
particular) and modern GPUs, the kernel handles even less. The device
has some hardware support for multiplexing and isolation and so all that
the kernel has to do is set up some memory that both the device and the
userspace code can access - including the device registers for
controlling a command queue - and then delegate most of the operation to
the userspace code. This requires an IOMMU to actually provide
isolation; otherwise an errant DMA request can still result in accessing
or modifying kernel memory.
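
As a rough picture of why the IOMMU matters, here is a toy Python model.
ToyIOMMU, DMAFault, and dma_write are hypothetical names, 'physical
memory' is just a bytearray, and real IOMMUs work on page tables rather
than a list of ranges. The only point is the difference between a
device address that is checked and translated and one that is used as a
physical address directly.

```python
class DMAFault(Exception):
    """Raised when the modelled IOMMU rejects a device access."""


class ToyIOMMU:
    """Maps device-visible address ranges onto physical ones."""

    def __init__(self):
        self.mappings = []   # (device_addr, phys_addr, length) triples

    def map(self, device_addr, phys_addr, length):
        self.mappings.append((device_addr, phys_addr, length))

    def translate(self, device_addr, length):
        for dev, phys, size in self.mappings:
            if dev <= device_addr and device_addr + length <= dev + size:
                return phys + (device_addr - dev)
        raise DMAFault(hex(device_addr))


def dma_write(memory, addr, data, iommu=None):
    """Device writes data at addr; with no IOMMU, addr is physical."""
    if iommu is not None:
        addr = iommu.translate(addr, len(data))
    memory[addr:addr + len(data)] = data


# Made-up layout: kernel data in bytes 0..31, shared queue in 32..63.
memory = bytearray(64)
iommu = ToyIOMMU()
iommu.map(device_addr=0, phys_addr=32, length=32)

dma_write(memory, 0, b"cmd", iommu)       # lands in the shared queue
try:
    dma_write(memory, 40, b"bad!", iommu)  # outside the mapping: rejected
except DMAFault:
    pass
dma_write(memory, 0, b"oops")             # no IOMMU: hits kernel data
```

With the IOMMU in the path the errant request is refused and the first
32 bytes stay untouched. Without it, the same kind of request
overwrites them.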

Even with this kind of isolation, there are still potential problems.
Many devices react poorly to bad input and can be left in a state that
is hard to recover from, even if the driver itself is easy to restart.
A lot of OS instability (I saw a number as high as 20% of OS crashes
quoted at MSR recently) is caused by drivers reacting poorly to
intermittent hardware errors. Just restarting the driver (an approach
that they tried) solved some, but not all, of these cases.

Of course, there are a lot of things in the kernel that are not drivers.
For example, FUSE allows us to run filesystems in userspace instead of
in the kernel. This comes with a performance penalty as a result of
having to copy data from the kernel's buffer cache into the filesystem
process, then back into the kernel, and then into the destination
process (for a read - the same sequence in the opposite order on write).
Similarly, we have CUSE for character devices, which is used by a lot
of webcam drivers. These are a relatively good use-case for userspace
drivers, because they are typically a streaming interface (data comes
just from the device and there isn't a lot of need for latency-sensitive
round trips from the app to the driver) and the latency that users care
about is on the order of 1/24th of a second, which is a very long time
on a modern computer. There are other examples, such as Netmap for
pushing network packets directly into userspace, which can be combined
with something like Ilias Marinos' userspace network stack to run the
entire TCP/IP stack in userspace.
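
A back-of-the-envelope calculation shows how much room 1/24th of a
second leaves. Only the frame rate comes from the text above. The frame
size, the copy bandwidth, and the idea that a webcam frame is copied
three times (borrowed from the FUSE read path described above) are
assumptions picked for illustration, not measurements.

```python
FRAME_RATE = 24                        # frames per second, from the text
frame_budget_ms = 1000 / FRAME_RATE    # time available per frame

# Assumptions for illustration only (not from the original message):
FRAME_BYTES = 1920 * 1080 * 2          # one 1080p frame at 2 bytes/pixel
COPY_BANDWIDTH = 5e9                   # bytes per second for a memory copy
COPIES = 3                             # copies per frame, as in the FUSE path


def copy_cost_ms(copies, nbytes=FRAME_BYTES, bandwidth=COPY_BANDWIDTH):
    """Time spent copying one frame `copies` times, in milliseconds."""
    return copies * nbytes / bandwidth * 1000


if __name__ == "__main__":
    cost = copy_cost_ms(COPIES)
    print(f"budget per frame: {frame_budget_ms:.1f} ms")
    print(f"copy cost:        {cost:.2f} ms")
    print(f"share of budget:  {cost / frame_budget_ms:.0%}")
```

Under these assumptions the copies take roughly 2.5 ms out of a budget
of about 41.7 ms, so the extra copying is a small fraction of the time
the user cares about. The same arithmetic looks much worse for a
workload that needs many small, latency-sensitive round trips.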

Moving drivers into userspace is not a panacea. It adds more
asynchronous behaviour, which makes reasoning about the code harder and
makes deadlocks far easier to introduce (for example, any userspace
process has a lot of implicit interactions with the VM subsystem, which
are more explicit in the kernel, and doesn't have a shared global
namespace for locks). Most of the code in the kernel is there because,
when the code was written, it was the most sensible place for it. In
most cases, that is still true, although as CPU and software
architectures evolve that may change.

David