Re: bad sector in gmirror HDD
On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:
> On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote:
>>
>> On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote:
>>
>>> On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:
>>>> System in question: FreeBSD 8.2-STABLE #3: Thu Mar 3 04:52:04 GMT 2011
>>>>
>>>> After a recent power failure, I'm seeing this in my logs:
>>>>
>>>> Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
>>>
>>> I doubt this is related to a power failure.
>>>
>>>> Searching on that error message, I was led to believe that identifying the bad sector and
>>>> running dd to read it would cause the HDD to reallocate that bad block.
>>>>
>>>> http://smartmontools.sourceforge.net/badblockhowto.html
>>>
>>> This is incorrect (meaning you've misunderstood what's written there).
>>>
>>> Unreadable LBAs can be a result of the LBA being actually bad (as in
>>> uncorrectable), or the LBA being marked "suspect". In either case the
>>> LBA will return an I/O error when read.
>>>
>>> If the LBAs are marked "suspect", the drive will perform re-analysis of
>>> the LBA (to determine if the LBA can be read and the data re-mapped, or
>>> if it cannot then the LBA is marked uncorrectable) when you **write** to
>>> the LBA.
>>>
>>> The above smartd output doesn't tell me much. Providing actual SMART
>>> attribute data (smartctl -a) for the drive would help. The brand of the
>>> drive, the firmware version, and the model all matter -- every drive
>>> behaves a little differently.
>>
>> Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php
>
> Yes, perfect. Thank you. First things first: upgrade smartmontools to
> 5.41. Your attributes will be the same after you do this (the drive is
> already in smartmontools' internal drive DB), but I often have to remind
> people that they really need to keep smartmontools updated as often as
> possible. The changes between versions are vast; this is especially
> important for people with SSDs (I'm responsible for submitting some
> recent improvements for Intel 320 and 510 SSDs).
Done.
> Anyway, the drive (albeit an old PATA Maxtor) appears to have three
> anomalies:
>
> 1) One confirmed reallocated LBA (SMART attribute 5)
>
> 2) One "suspect" LBA (SMART attribute 197)
>
> 3) A very high temperature of 51C (SMART attribute 194). If this drive
> is in an enclosure or in a system with no fans this would be
> understandable, otherwise this is a bit high. My home workstation which
> has only one case fan has a drive with more platters than your Maxtor,
> and it idles at ~38C. Possibly this drive has been undergoing constant
> I/O recently (which does greatly increase drive temperature)? Not sure.
> I'm not going to focus too much on this one.
This is an older system. I suspect insufficient ventilation. I'll look at getting
a new case fan, if not some HDD fans.
> The SMART error log also indicates an LBA failure at the 26000 hour mark
> (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether
> that LBA is the remapped one or the suspect one is unknown. The LBA was
> 5566440.
>
> The SMART tests you did didn't really amount to anything; no surprise.
> Short and long tests usually do not test the surface of the disk. There
> are some drives which do it on a long test, but as I said before,
> everything varies from drive to drive.
>
> Furthermore, on this model of drive, you cannot do surface scans via
> SMART. Bummer. That's indicated in the "Offline data collection
> capabilities" section at the top, where it reads:
>
> No Selective Self-test supported.
>
> So you'll have to use the dd method. This takes longer than if surface
> scanning was supported by the drive, but is acceptable. I'll get to how
> to go about that in a moment.
FWIW, I've done a dd read of the entire suspect disk already. Just two errors.
From the URL mentioned above:
[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #
That seems to indicate two problems. Are those the values I should be using
with dd?
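For reference, the first error in the bs=1m run can be turned into an LBA range with a little arithmetic. This is only a sketch: it assumes 512-byte sectors and that the "2717+0 records in" count means 2717 full records were read before the failing one. The variable names are mine, not from any tool. It does not try to interpret the second error.

```shell
#!/bin/sh
# Figures taken from the first dd error above (bs=1m, 2717 full records read).
records_ok=2717
bs=1048576      # bs=1m expressed in bytes
sector=512      # assumed logical sector size

# Sectors per dd record, then the LBA range covered by the failing record.
sectors_per_record=$(( bs / sector ))
first_lba=$(( records_ok * sectors_per_record ))
last_lba=$(( first_lba + sectors_per_record - 1 ))

echo "failing record spans LBA ${first_lba}..${last_lba}"
```

Under those assumptions the range works out to 5564416..5566463, which contains the LBA 5566440 reported in the SMART error log.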
I did some more precise testing:
# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)
real 0m5.429s
user 0m0.000s
sys 0m0.010s
NOTE: that's 9 blocks later than mentioned in smartctl
The above generated this in /var/log/messages:
Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449
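The dd result and the kernel message can be cross-checked. A rough sketch, assuming the "9+0 records in" line means nine 512-byte sectors were read successfully starting at iseek=5566440; the log line is copied from above and the variable names are only for illustration.

```shell
#!/bin/sh
# Kernel message as logged above.
log_line='Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449'

start_lba=5566440   # iseek value passed to dd
records_ok=9        # sectors read before the I/O error

# The first sector dd could not read is the one right after the good ones.
expected_lba=$(( start_lba + records_ok ))

# Pull the LBA number out of the kernel message.
logged_lba=$(printf '%s\n' "$log_line" | sed -n 's/.*LBA=\([0-9][0-9]*\).*/\1/p')

if [ "$logged_lba" -eq "$expected_lba" ]; then
    echo "dd and kernel agree: LBA $logged_lba"
else
    echo "mismatch: dd suggests $expected_lba, kernel logged $logged_lba"
fi
```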
> [stuff snipped]
> That said:
>
> http://jdc.parodius.com/freebsd/bad_block_scan
>
> If you run this on your ad2 drive, I'm hoping what you'll find are two
> LBAs which can't be read -- one will be the remapped LBA and one will be
> the "suspect" LBA. If you only get one LBA error then that's fine too,
> and will be the "suspect" LBA.
> Once you have the LBA(s), you can submit writes to them to get the drive
> to re-analyse them (assuming they're "suspect"):
>
> dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
>
> Where XXX is the device and NNNNN is the LBA number.
>
> If this works properly, the dd command should sit there for a little bit
> (as the drive does its re-analysis magic) and then should complete.
ad2 is part of a gmirror with ad0. Does this change things?
I haven't tried the dd yet.
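For the record, here is one way to build the write command quoted above without mistyping the LBA. This is a sketch only: zero_lba_cmd is a hypothetical helper name, it just prints the command rather than running it, and it says nothing about whether the gmirror changes the procedure. Running the printed command overwrites that sector with zeros.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the dd command that writes one
# 512-byte sector of zeros at the given LBA on the given device.
zero_lba_cmd() {
    dev=$1
    lba=$2
    case $lba in
        ''|*[!0-9]*)
            echo "LBA must be a non-negative integer: $lba" >&2
            return 1
            ;;
    esac
    printf 'dd if=/dev/zero of=%s bs=512 count=1 seek=%s\n' "$dev" "$lba"
}

# Example with the device and LBA from the kernel message above.
zero_lba_cmd /dev/ad2 5566449
```

Printing first and pasting after a second look seems safer than typing a destructive dd by hand.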
>
> You'll want to check SMART stats after that; you should see
> Current_Pending_Sector drop to 0. If Offline_Uncorrectable incremented
> then the LBA could not be re-read/remapped.
It did increment:
197 Current_Pending_Sector 0x0032 100 100 020 Old_age Always - 2
[was 1]
> If Reallocated_Sector_Ct
> incremented then you now have a total of 2 LBAs which are remapped.
It did increment:
$ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
< 5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 1
> 5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 2
Full output of smartctl has been appended to http://beta.freebsddiary.org/smart-fixing-bad-sector.php
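An alternative to diff and grep is to pull just the raw value out of each saved smartctl output. The sketch below assumes the usual attribute table layout where the raw value is the tenth whitespace-separated field; raw_value is a hypothetical helper, and the two temporary files stand in for the saved before/after outputs, holding only the lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the raw value (assumed to be field 10) of a
# named SMART attribute from a saved smartctl output file.
raw_value() {   # usage: raw_value ATTRIBUTE_NAME FILE
    awk -v name="$1" '$2 == name { print $10 }' "$2"
}

# Stand-ins for the saved before/after outputs, using the lines quoted above.
tmpdir=$(mktemp -d)
printf '%s\n' '  5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 1' > "$tmpdir/before"
printf '%s\n' '  5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 2' > "$tmpdir/after"

before=$(raw_value Reallocated_Sector_Ct "$tmpdir/before")
after=$(raw_value Reallocated_Sector_Ct "$tmpdir/after")
echo "Reallocated_Sector_Ct: $before -> $after"

rm -rf "$tmpdir"
```

The same helper should work for Current_Pending_Sector, as long as the column layout matches.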
> In
> the case of remapping, you get to deal with the UFS/FFS thing above.
> To get the stats to update in this situation you *might* (but probably
> not) have to run "smartctl -t offline /dev/XXX".
I didn't try that...
>
> You might also be wondering "that dd command writes 512 bytes of zero to
> that LBA; what about the old data that was there, in the case that the
> drive remaps the LBA?" This is a great question, and one I've never
> actually taken the time to answer because at this present time I have
> absolutely *no* bad disks in my possession. I'm under the impression
> that the write does in fact write zeros if the LBA is remapped, but that
> might not be true at all. I've been waiting to test this for quite some
> time and document it/write about it.
>
> I still suggest you replace the drive, although given its age I doubt
> you'll be able to find a suitable replacement. I tend to keep disks
> like this around for testing/experimental purposes and not for actual
> use.
I have several unused 80GB HDD I can place into this system. I think that's
what I'll wind up doing. But I'd like to follow this process through and get it documented
for future reference.
-- 
Dan Langille - http://langille.org
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"