Re: ECC memory driver in FreeBSD 10?
On Apr 9, 2012, at 6:04 AM, O. Hartmann wrote:
> On 04/08/12 14:53, Miroslav Lachman wrote:
>> Nikolay Denev wrote:
>>> On Apr 6, 2012, at 2:48 PM, O. Hartmann wrote:
>>>
>>>> I'm looking for a way to force FreeBSD 10 to maintain/watch ECC errors
>>>> reported by UEFI (or BIOS).
>>>> Since ECC is said to be essential for server systems both in business
>>>> and science and I do not question this, I was wondering if I cannot
>>>> report ECC errors via a watchdog or UEFI (ACPI?) report to syslog
>>>> facility on FreeBSD.
>>>> FreeBSD is supposed to be a server operating system, as far as I know,
>>>> so I believe there must be something which hasn't revealed itself
>>>> to me yet.
>>
>>>
>>> If the hardware supports it, such errors should be logged as MCEs
>>> (Machine Check Exceptions).
>>> I can say for sure it works pretty well with Dell servers, as I had
>>> one with a failing RAM module, and
>>> it reported the corrected ECC errors in dmesg.
>>
>> Memory ECC errors are logged in messages, and you can decode them with
>> sysutils/mcelog. I did it in the past on one of our Sun Fire X2100 M2
>> with FreeBSD 8.x.
>>
>> Miroslav Lachman
>
> Seems that I have been blessed with non-faulty memory over the past
> three or four years. Last time I saw errors was around 2000. All of our
> 24/7 servers do have ECC RAM.
>
> So, your replies all imply that if I log the system's messages via syslog
> properly (as we do remotely on a centralized server), then ECC errors
> should be reported by the FreeBSD kernel in a canonical way as the UEFI/BIOS
> reports them?
> Without special drivers/tools, scripts which scan for those errors
> should report occurrences?
>
> That my (FreeBSD) boxes haven't shown errors of that kind - a colleague's
> Linux boxes did once! - doesn't imply missing capabilities.
> This is nice to hear/read.
>
> Thanks a lot,
>
> Oliver
>
This is what you see in syslog when sys/x86/x86/mca.c detects a memory error:
> Mar 16 12:37:33 hostname kernel: MCA: Bank 8, Status 0x8c0000400001009f
> Mar 16 12:37:33 hostname kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> Mar 16 12:37:33 hostname kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> Mar 16 12:37:33 hostname kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
> Mar 16 12:37:33 hostname kernel: MCA: Address 0xb43ca6240
> Mar 16 12:37:33 hostname kernel: MCA: Misc 0x4ac8111000064808
mcelog will help you figure out which DIMM is affected.
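
If you want a script that watches the log for these lines, the following
is a minimal sketch in Python, not a replacement for mcelog. It assumes
the "MCA: Bank N, Status 0x..." line format shown above and the Intel
IA32_MCi_STATUS bit layout (flag bits 63-57, corrected error count in
bits 52:38, MCA error code in the low 16 bits). The names FLAGS,
decode_status and scan are made up for this example.

```python
import re

# Flag bit positions per the Intel IA32_MCi_STATUS layout (assumption:
# an Intel CPU; other vendors define some of these bits differently).
FLAGS = {
    "VAL": 63,    # register contains valid data
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # Misc register is valid
    "ADDRV": 58,  # Address register is valid
    "PCC": 57,    # processor context corrupt
}

# Matches the Bank/Status line as it appears in the syslog excerpt.
MCA_LINE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def decode_status(status):
    """Split a 64-bit bank status value into its main fields."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": status & 0xFFFF,                 # bits 15:0
    }


def scan(lines):
    """Yield (bank, decoded status) for every MCA Bank/Status line."""
    for line in lines:
        match = MCA_LINE.search(line)
        if match:
            yield int(match.group(1)), decode_status(int(match.group(2), 16))
```

Feeding it the Bank 8 line above gives a corrected count of 1, which
agrees with the "COR (1)" the kernel printed. Something like
scan(open("/var/log/messages")) would cover the whole log, assuming that
is where your syslog writes kernel messages.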
Also, if your server includes an IPMI controller, the BIOS should be set
up to log memory errors to the IPMI system event log (SEL). You can look
at the SEL with ipmitool from the ports collection. 'ipmitool sel list'
will show you if any errors have been reported.
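
To check the SEL from a script, a rough sketch along these lines might
do. It assumes 'ipmitool sel list' prints one event per line with fields
separated by '|', and that your BMC describes memory events with the
words "memory" or "ecc"; the wording varies by vendor, so adjust
KEYWORDS. The function names are made up for this example.

```python
import subprocess

# Assumption: the BMC uses one of these words in memory-related events.
KEYWORDS = ("ecc", "memory")


def memory_events(sel_text):
    """Return the SEL entries (as lists of fields) that mention memory."""
    hits = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if any(key in field.lower() for field in fields for key in KEYWORDS):
            hits.append(fields)
    return hits


def read_sel():
    """Run 'ipmitool sel list' and return its output as text.

    Needs ipmitool installed and access to the BMC (usually root).
    """
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling memory_events(read_sel()) from cron and mailing any hits would
give a simple alert on top of the syslog messages.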
-Andrew
--------------------------------------------------
Andrew Boyer aboyer@averesystems.com
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"