Re: git: HAMMER - Add live dedup sysctl and support

On Tue, Jan 04, 2011 at 10:00:18AM +0100, Thomas Nikolajsen wrote:

> Thank you!
> This looks really interesting, I will have to play with it right away.

Bear in mind that while it basically works, it is marked experimental -
extensive testing is still ahead (obscure races, etc). But any feedback
is very welcome!

> The commit message is a bit short IMHO, but code has extensive comments;
> I was enlightened quite a bit by the one below.

I wanted Matt to write the commit message, but he refused ;)

> How is the relation between online (aka live) and offline HAMMER dedup?
> Is typical use to run both if online is used?

The thing is that, as the quoted comment says, there is an on-media
requirement which, while simplifying the code tremendously, limits the
number of situations in which duplicate data will actually be dedup'ed.
The primary use case is cp and cpdup of files and directory hierarchies.
If you want to get the maximum out of it, set sysctl
vfs.hammer.live_dedup=2, but note that this will slightly impact the
normal write path performance-wise.

As for the relation, both of them can and should be used together. Live
dedup can be turned on at all times, while offline dedup should be run
periodically to pick up all the leftovers. In combination this
arrangement gives us full-fledged deduplication support in HAMMER
without major complications in the implementation.
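By the way, if you want to flip the knob from a program rather than
with sysctl(8), something along these lines should do. This is only a
minimal sketch using sysctlbyname(3): 2 is the populate-on-writes mode
described above; treating 0 as off and 1 as the reads-only default in
the comment below is my shorthand, not gospel.

/*
 * Set vfs.hammer.live_dedup and report the previous value.
 * Needs root, like any sysctl write.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int mode = 2;	/* 0 = off, 1 = populate cache on reads,
			 * 2 = populate on writes too (see above) */
	int old;
	size_t oldlen = sizeof(old);

	if (sysctlbyname("vfs.hammer.live_dedup", &old, &oldlen,
	    &mode, sizeof(mode)) == -1)
		err(1, "sysctlbyname");
	printf("vfs.hammer.live_dedup: %d -> %d\n", old, mode);
	return (0);
}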
> Is offline HAMMER dedup always more efficient (disk space wise)?
> Will online HAMMER dedup dedup data between PFSs (offline is per PFS, right)?

Yes, offline is per-PFS. Online is fs-wide, that is, it will dedup data
between PFSs (cp pfs1/a pfs2/b will be dedup'ed).

> What are your plans/thoughts to further enhance HAMMER dedup?

Apart from testing and closing races in live dedup:

1) The main issue with both offline and online dedup is the reblocker.
   Under certain (rare) circumstances it may partially undo dedup ;)
   So the reblocker has to be made aware of dedup'ed data, but that is
   pretty much a separate project.

2) A 'hammer dedup-everything' directive - fs-wide (as opposed to
   per-PFS) offline dedup. Actually I consider the per-PFS separation
   more of a feature than a drawback, but for people who think they may
   have duplicates across PFSs it will be useful.

3) A per-file (and possibly per-directory) nodedup flag.

4) Make the live dedup cache size a tunable (for now I think I'll just
   make it a sysctl, but it clearly has to scale automatically).

This is all I can remember off the top of my head; any thoughts and
comments are welcome.

Thanks,
Ilya

> http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/507df98a152612f739140d9f1ac5b30cd022eea2
> ..
> +/************************************************************************
> + *                             LIVE DEDUP                               *
> + ************************************************************************
> + *
> + * HAMMER Live Dedup (aka as efficient cp(1) implementation)
> + *
> + * The dedup cache is operated in a LRU fashion and destroyed on
> + * unmount, so essentially this is a live dedup on a cached dataset and
> + * not a full-fledged fs-wide one - we have a batched dedup for that.
> + * We allow duplicate entries in the buffer cache, data blocks are
> + * deduplicated only on their way to media. By default the cache is
> + * populated on reads only, but it can be populated on writes too.
> + *
> + * The main implementation gotcha is on-media requirement - in order for
> + * a data block to be added to a dedup cache it has to be present on
> + * disk. This simplifies cache logic a lot - once data is laid out on
> + * media it remains valid on media all the way up to the point where the
> + * related big block the data was stored in is freed - so there is only
> + * one place we need to bother with invalidation code.
> + */
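P.S. To make the quoted comment a bit more concrete, here is a toy
userland sketch of the cache discipline it describes: entries keyed by
a data block CRC point at an on-media offset, and an LRU list bounds
the cache. Every name below is invented for illustration - this is not
the in-tree code, which differs in the details (lookup structure,
locking, invalidation):

#include <sys/queue.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache entry - field names invented for this example. */
struct dedup_ent {
	TAILQ_ENTRY(dedup_ent)	lru;		/* LRU list linkage */
	uint32_t		crc;		/* data block CRC (lookup key) */
	uint64_t		data_off;	/* on-media offset of the block */
	uint32_t		bytes;		/* block size */
};

TAILQ_HEAD(dedup_lru, dedup_ent);

static struct dedup_lru	cache = TAILQ_HEAD_INITIALIZER(cache);
static int		cache_count;
#define	CACHE_MAX	64	/* stand-in for the planned tunable (item 4) */

/*
 * Look up a CRC; on a hit, move the entry to the LRU head.  A real
 * implementation would use a tree, not a linear scan.
 */
static struct dedup_ent *
dedup_lookup(uint32_t crc)
{
	struct dedup_ent *ent;

	TAILQ_FOREACH(ent, &cache, lru) {
		if (ent->crc == crc) {
			TAILQ_REMOVE(&cache, ent, lru);
			TAILQ_INSERT_HEAD(&cache, ent, lru);
			return (ent);
		}
	}
	return (NULL);
}

/*
 * Insert a block that has just been laid out on media (the on-media
 * requirement); evict from the LRU tail when the cache is full.
 */
static void
dedup_insert(uint32_t crc, uint64_t data_off, uint32_t bytes)
{
	struct dedup_ent *ent;

	if (cache_count == CACHE_MAX) {
		ent = TAILQ_LAST(&cache, dedup_lru);
		TAILQ_REMOVE(&cache, ent, lru);
		free(ent);
		cache_count--;
	}
	if ((ent = calloc(1, sizeof(*ent))) == NULL)
		return;
	ent->crc = crc;
	ent->data_off = data_off;
	ent->bytes = bytes;
	TAILQ_INSERT_HEAD(&cache, ent, lru);
	cache_count++;
}

int
main(void)
{
	dedup_insert(0xdeadbeef, 8192, 16384);
	return (dedup_lookup(0xdeadbeef) != NULL ? 0 : 1);
}

On the way to media the write path would consult such a cache: on a
CRC hit the new block can be verified against the existing on-media
data and, if identical, simply referenced instead of written out again.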