Re: SU+J systems do not fsck themselves
On Dec 28, 2011, at 12:34 AM, David Thiel wrote:
> On Tue, Dec 27, 2011 at 11:54:20PM -0700, Scott Long wrote:
>> The first run of fsck, using the journal, gives results that I would
>> expect. The second run seems to imply that the fixes made on the
>> first run didn't actually get written to disk. This is definitely an
>> oddity. I see that you're using geli, maybe there's some strange
>> side-effect there. No idea. Report as a bug, this is definitely
>> undesired behavior.
>
> Not impossible, but I was seeing similar issues on two non-geli systems
> as well, i.e. tons of errors fixed when doing a single-user
> non-journalled fsck, but journalled fsck not fixing stuff. I'll try to
> replicate on a test machine, as I already lost data on the last
> (non-geli) machine this happened to.
>
>> For the love that is all good and holy, don't ever run fsck on a live
>> filesystem. It's going to report these kinds of problems! It's
>> normal; filesystem metadata updates stay cached in memory, and fsck
>> bypasses that cache.
>
> Ok. I expected fsck would be softupdate-aware in that way, but I
> understand it not doing so.
>
>>> - SU+J and fsck do not work correctly together to fix corruption on
>>> boot, i.e. bgfsck isn't getting run when it should
>>
>> The point of SUJ is to eliminate the need for bgfsck. Effectively,
>> they are exclusive ideas.
>
> This is surprising to me. It is my impression that under Linux at least,
> ext3fs is checked against the journal, and gets a full e2fsck if it
> finds it's still dirty. Additionally, there's a periodic fsck after 180
> days continuous runtime or x number of mounts (see tune2fs -i and -c).
> Is SU+J somehow implemented in such a way that this is unnecessary? What
> does it do that the ext3fs people have missed?
>
SUJ isn't like ext3 journaling; it doesn't do 100% metadata logging.
Instead, it's an extension of softupdates. Softupdates (SU) is still
responsible for ordering dependent writes to the disk to maintain
consistency. What SU can't handle is the Unix/POSIX idiom of unlinking
a file from the namespace but keeping its inode active through
refcounts. When you have an unclean shutdown, you wind up with stale
blocks allocated to orphaned inodes. The point of bgfsck was to scan
the filesystem for these allocations and free them, just like fsck does,
but to do it in the background so that the boot could continue. SUJ is
basically just an intent log for this case; it tells fsck where to find
these allocations so that fsck doesn't have to do the lengthy scan.
FWIW, this problem is present in almost any journaling implementation
and is usually solved via the use of intent records in a journal, not
unlike SUJ.
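
For anyone unfamiliar with the unlink-while-open idiom mentioned above,
here is a minimal Python sketch of it. It is only an illustration, not
anything taken from SU or SUJ; the function name unlink_while_open is
made up for this example, and it assumes a POSIX system where an
unlinked but still-open file reports a link count of 0.

```python
import os
import tempfile


def unlink_while_open():
    """Create a file, unlink its name, and keep using the open descriptor.

    Hypothetical helper for illustration only; assumes POSIX semantics.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                     # the name leaves the namespace
        name_exists = os.path.exists(path)  # no directory entry remains
        links = os.fstat(fd).st_nlink       # link count of the open inode
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)              # blocks are still allocated
    finally:
        os.close(fd)                        # last reference: inode is freed
    return name_exists, links, data
```

If the machine crashes between the unlink and the close, nothing in the
namespace points at the inode any more, yet its blocks are still marked
allocated on disk. That is the orphaned allocation that bgfsck used to
hunt for and that the SUJ intent log is meant to point fsck at directly.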

So, there's an assumption with SUJ+fsck that SU is keeping the
filesystem consistent. Maybe that's a bad assumption, and I'm not
trying to discredit your report. But the intention with SUJ is to
eliminate the need for anything more than a cursory check of the
superblocks and a processing of the SUJ intent log. If either of these
fails, then fsck reverts to a traditional scan. In the same vein, ext3
and most other traditional journaling filesystems assume that the
journal is correct and is preserving consistency, and don't do anything
more than a cursory data structure scan and journal replay either, but
then revert to a full scan if that fails (ZFS seems to be an exception
here, as no actual fsck is available for it).
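
The decision flow described above can be sketched roughly as follows.
This is not fsck_ffs code; run_check and its three callable parameters
are hypothetical stand-ins for "cursory superblock check", "process the
intent log" and "traditional full scan", chosen only to show the
fallback logic.

```python
from typing import Callable


def run_check(superblock_ok: Callable[[], bool],
              replay_intent_log: Callable[[], bool],
              full_scan: Callable[[], None]) -> str:
    """Return which path was taken: "journal" or "full-scan".

    All three callables are hypothetical placeholders, not real fsck APIs.
    """
    # Fast path: cursory superblock check, then process the intent log.
    # The log is not touched at all if the superblock check fails.
    if superblock_ok() and replay_intent_log():
        return "journal"
    # Either step failed, so fall back to the traditional full scan.
    full_scan()
    return "full-scan"
```

For example, run_check(lambda: True, lambda: False, do_scan) would call
do_scan and return "full-scan", since a failed log replay is treated the
same as a failed superblock check.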

As for the 180 day forced scan on ext3, I have no public comment. SU
has matured nicely over the last 10+ years, and I'm happy with the
progress that SUJ has made in the last 2-3 years. If there are bugs,
they need to be exposed and addressed ASAP.

Scott
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"