Re: rc and smf

Board: DFBSD_kernel | Posted: 2005/02/25 05:01 | Thread message 43 of 53
I think what Bill is trying to say, not very diplomatically, is that the truly important pieces of software out there in the world don't rely on simple-stupid little monitoring programs to deal with failures. They do far more sophisticated tests, and the consequences of a failure are far more robust than a worker coming in at 8:00 a.m. and finding that the system restarted service X at 4:00 a.m. With these systems, if a failure occurs, alarm bells ring, people get paged, and the system goes into a failsafe mode. Sophisticated systems have a lot more going on than an easily restartable web server.

I have my own example. I designed the hardware and software for the telemetry system that Tahoe Donner PUD uses. This is a medium-sized water district serving the Truckee, California area. The system monitors tanks, controls pumps, and records 20-40 data points on a two-minute basis across 35 sites 24x7, and has done so for the last 17 years without a software-caused failure.

The base stations are running FreeBSD. They handle the UI, data collection, and reporting only. The field units are running a completely autonomous custom-designed RTOS with memory protection and a hardware watchdog. They are responsible for monitoring tanks and other things, controlling pumps, buffering data, and sending alarm pages. The system still works 100% if a base station goes down.

The boards have a hardware watchdog. The RTOS abstracts the hardware watchdog out to the processes running on the boards. If any process fails to hit its virtualized watchdog, the OS doesn't hit the actual hardware watchdog, logs the failure, turns off the pumps, and the entire board goes through a hard reset.

There are multiple layers of redundancy and failsafes, everything from handling a blown transducer, to turning off the pumps if a tank level gets too low (or too high), to making sure that failure modes from lightning strikes do not report false readings.
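To make the virtualized-watchdog arrangement concrete, here is a rough sketch in Python. It is not the actual RTOS code, which is not shown in this post. The names `VirtualWatchdog`, `Supervisor`, and `tick` are made up for illustration, the timeouts are arbitrary, and the hard reset is only modeled as a flag (on real hardware the board would reset because the hardware timer stops being serviced).

```python
from dataclasses import dataclass, field


@dataclass
class VirtualWatchdog:
    """Per-process deadline tracker (hypothetical name, not from the post)."""
    timeout: float           # seconds a process may go without checking in
    last_kick: float = 0.0   # time of the most recent check-in

    def kick(self, now: float) -> None:
        self.last_kick = now

    def expired(self, now: float) -> bool:
        return now - self.last_kick > self.timeout


@dataclass
class Supervisor:
    """Stands in for the OS side: owns the one real watchdog (assumed model)."""
    watchdogs: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    pumps_on: bool = True
    hardware_kicks: int = 0        # how often the real watchdog was serviced
    reset_requested: bool = False  # stand-in for the board's hard reset

    def register(self, name: str, timeout: float, now: float) -> None:
        self.watchdogs[name] = VirtualWatchdog(timeout, last_kick=now)

    def kick(self, name: str, now: float) -> None:
        self.watchdogs[name].kick(now)

    def tick(self, now: float) -> bool:
        """Service the hardware watchdog only if every process is on time."""
        if self.reset_requested:
            return False  # already failing: never service the hardware again
        late = sorted(n for n, w in self.watchdogs.items() if w.expired(now))
        if not late:
            self.hardware_kicks += 1
            return True
        # Failure path: log it, fail safe, and let the hardware timer lapse.
        self.log.append(f"watchdog missed by: {', '.join(late)}")
        self.pumps_on = False
        self.reset_requested = True
        return False


# Example run: pump_control stops checking in after t=1.0.
sup = Supervisor()
sup.register("tank_monitor", timeout=2.0, now=0.0)
sup.register("pump_control", timeout=2.0, now=0.0)
sup.kick("tank_monitor", now=1.0)
sup.kick("pump_control", now=1.0)
sup.tick(now=2.0)                   # everyone on time: hardware is serviced
sup.kick("tank_monitor", now=3.0)
sup.tick(now=4.0)                   # pump_control is late: pumps off, reset
```

The point of the design is that one hung process is enough to withhold the hardware kick, so a partial failure cannot hide behind the processes that are still healthy.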
What I am saying here is that when one is building a highly reliable system, there's a lot more to it than writing a little service restarter. I get the feeling, Dan, that you are trying to find a magic bullet to solve these problems. No such bullet exists, believe me. It certainly isn't this 'overcommit' stuff. It isn't an auto-restarter, not alone anyway. What it is, ultimately, is running reliable software AND hardware and screaming bloody hell if something goes wrong, and then taking further action depending on the situation (e.g. hard reset, failsafe, fallback, etc.).

-Matt
Matthew Dillon <dillon@backplane.com>
Article ID (AID): #127a2N00 (DFBSD_kernel)