Re: GSOC: Device mapper mirror target

Board: DFBSD_kernel, posted 2011/05/17 02:32
On Thu, Apr 7, 2011 at 10:27 AM, Adam Hoka <adam.hoka@gmail.com> wrote:
> Please see my proposal:
>
> http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/ahoka/1#

Hi!

I'll take a look at your proposal in just a bit. Here are some things you might want to think about when looking at RAID1, though -- some details about how I planned to do dmirror, and why I think RAID1 is a much more difficult problem than it seems at first glance.

Imagine a RAID1 of two disks, A and B; you have an outstanding set of I/O operations: buf1, buf2, buf3, buf4*, buf5, buf6, buf7, buf8*. The bufs are a mix of READs and WRITEs. At some point, your friendly neighborhood DragonFly developer walks over and pulls the plug on your system (you said you were running NetBSD! It's a totally valid excuse! :)).

Each of the write bufs could be totally written, partially written, or not written at all on each of the disks. More importantly, each disk could have seen and completed (or not completed) the requests in a different order. And this reordering can happen after the buf has been declared done and biodone() has been called (and we've reported success to userland) -- because of reordering or coalescing in the drive controller or the drive itself, for example.

So in this case, let's say disk A had seen and totally written buf2 and partially written buf1; disk B had seen and totally written buf1 and never seen buf2. And we'd already reported success to the filesystem above.

So when we've chastised the neighborhood DragonFly developer and powered on the system, we have a problem: we have two halves of a RAID mirror that are not in sync. The simplest way to sync them would be to declare one of the two disks correct and copy it over the other (possibly optimizing the copy with a block bitmap, as you suggested and as Linux's MD RAID1 (among many others) implements; block bitmaps are more difficult than they seem at first [1]). So let's declare disk A correct and copy it over disk B. Now disk B's stale copy of buf2->block is overwritten with the correct copy from disk A -- and disk B's correct, up-to-date copy of buf1->block is overwritten with a scrambled version of buf1->block. This is not okay, because we'd already reported success at writing both buf1 and buf2 to the filesystem above. Oops.

This failure mode has always been possible in single-disk configurations where write reordering is possible; file systems have long had a solitary tool to fight the chaos, BUF_CMD_FLUSH. A FLUSH BUF acts as a barrier: it does not return until all prior requests have completed and hit media, and it does not allow requests from beyond the FLUSH point to proceed until all requests prior to the barrier are complete [2]. However, the problem multi-disk arrays face is that disks FLUSH independently [3: important sidebar if you run UFS]. A FLUSH on disk X says nothing about the state of disk Y, and says nothing about which disk to select after a power cycle.

---

The dmirror design I was working on solved the problem through overwhelming force -- adding a physical journal and a header sector to each device. Each device would log all of the blocks it was going to write to its journal. It would then complete a FLUSH request to ensure those journal blocks had hit disk. Only then would we update the blocks we'd meant to write; after we updated the target blocks, we would issue another FLUSH. Then we'd update a counter in a special header sector. [Assumption: writes to single sectors on disk are atomic and survive DragonFly developers removing power.]
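To pin down that ordering, here is a minimal userspace sketch of one write cycle to a single backing device. fsync() stands in for BUF_CMD_FLUSH, and the on-disk layout, offsets, and helper names (journaled_write(), cksum()) are hypothetical illustrations, not the real dmirror format:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512
#define HEADER_OFF  0                   /* header sector holding the counter */
#define JOURNAL_OFF SECTOR_SIZE         /* journal area */
#define DATA_OFF    (64 * SECTOR_SIZE)  /* data area */

struct journal_entry {
    uint64_t counter;           /* header counter + 1 for this update */
    uint64_t blkno;             /* target block in the data area */
    uint32_t cksum;             /* checksum over the payload */
    uint8_t  data[SECTOR_SIZE];
};

/* Toy checksum; a real design would want CRC32 or better. */
static uint32_t
cksum(const uint8_t *p, size_t len)
{
    uint32_t s = 0;

    while (len--)
        s = (s << 1 | s >> 31) ^ *p++;
    return (s);
}

/*
 * One full write cycle to a single backing device:
 * journal write -> FLUSH -> data write -> FLUSH -> header write -> FLUSH.
 */
static int
journaled_write(int fd, uint64_t hdr_counter, uint64_t blkno,
    const uint8_t *buf)
{
    struct journal_entry je;
    uint64_t new_counter = hdr_counter + 1;

    memset(&je, 0, sizeof(je));
    je.counter = new_counter;
    je.blkno = blkno;
    memcpy(je.data, buf, SECTOR_SIZE);
    je.cksum = cksum(je.data, SECTOR_SIZE);

    /* 1. Log the intended update; 2. barrier. */
    if (pwrite(fd, &je, sizeof(je), JOURNAL_OFF) != (ssize_t)sizeof(je) ||
        fsync(fd) != 0)
        return (-1);

    /* 3. Overwrite the target block; 4. barrier. */
    if (pwrite(fd, buf, SECTOR_SIZE,
        DATA_OFF + blkno * SECTOR_SIZE) != SECTOR_SIZE || fsync(fd) != 0)
        return (-1);

    /* 5. Bump the header counter (single sector, assumed atomic); 6. barrier. */
    if (pwrite(fd, &new_counter, sizeof(new_counter), HEADER_OFF) !=
        (ssize_t)sizeof(new_counter) || fsync(fd) != 0)
        return (-1);

    return (0);
}

int
main(void)
{
    uint8_t block[SECTOR_SIZE] = { 0x42 };
    int fd = open("diskA.img", O_RDWR | O_CREAT, 0644);

    if (fd < 0 || journaled_write(fd, 0, 3, block) != 0)
        return (1);
    close(fd);
    return (0);
}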
Each journal entry would contain (the value of the counter)+1 before the operations were complete. To know whether a journal entry was correctly written, each entry would also include a checksum of the update it was going to carry out. The recovery path would use the header's counter field to determine which disk was most current. It would then replay the necessary journal entries (entries with a counter > header->counter) to bring that device into sync (perhaps it would only replay these into in-memory overlay blocks; I'd not decided) and then sync that disk onto all of the others.

Concretely, from dmirror_strategy:

/*
 * dmirror_strategy()
 *
 * Initiate I/O on a dmirror VNODE.
 *
 * READ: disk_issue_read -> disk_read_bio_done -> (disk_issue_read)
 *
 * The read path uses push_bio to get a new BIO structure linked to
 * the BUF and ties the new BIO to the disk and mirror it is issued
 * on behalf of. The callback is set to disk_read_bio_done.
 * In disk_read_bio_done, if the request succeeded, biodone() is called;
 * if the request failed, the BIO is reinitialized with a new disk
 * in the mirror and reissued till we get a success or run out of disks.
 *
 * WRITE: disk_issue_write -> disk_write_bio_done(..) -> disk_write_tx_done
 *
 * The write path allocates a write group and transaction structures for
 * each backing disk. It then sets up each transaction and issues them
 * to the backing devices. When all of the devices have reported in,
 * disk_write_tx_done finalizes the original BIO and deallocates the
 * write group.
 */

A write group was the term for all of the state associated with a single write to all of the devices. A write transaction was the term for all of the state associated with a single write cycle to one disk. Concretely, for write groups and write transactions:

enum dmirror_write_tx_state {
    DMIRROR_START,
    DMIRROR_JOURNAL_WRITE,
    DMIRROR_JOURNAL_FLUSH,
    DMIRROR_DATA_WRITE,
    DMIRROR_DATA_FLUSH,
    DMIRROR_SUPER_WRITE,
    DMIRROR_SUPER_FLUSH,
};

A write transaction was guided through a series of states by issuing I/O via vn_strategy() and transitioning on biodone() calls. At DMIRROR_START, it was not yet issued to the disk, just freshly allocated. Journal writes were issued and the tx entered the DMIRROR_JOURNAL_WRITE state. When the journal writes completed, we entered the JOURNAL_FLUSH state and issued a FLUSH bio. When the flush completed, we entered the DATA_WRITE state; next the DATA_FLUSH state, then SUPER_WRITE, and then SUPER_FLUSH. When the superblock flush completed, we walked to our parent write group and marked this disk as having successfully completed all of the necessary steps. When all of the disks had reported, we finished the write group and finally called biodone() on the original bio.

struct dmirror_write_tx {
    struct dmirror_write_group  *write_group;
    struct bio                  bio;
    enum dmirror_write_tx_state state;
};

The write_tx_done path was the biodone call for a single write request. The embedded bio was initialized via initbiobuf().

enum dmirror_wg_state {
    DMIRROR_WRITE_OK,
    DMIRROR_WRITE_FAIL
};

struct dmirror_write_group {
    struct lock         lk;
    struct bio          *obio;
    struct dmirror_dev  *dmcfg;     /* Parent dmirror */
    struct kref         ref;
    /* some kind of per-mirror linkages */
    /* some kind of per-disk linkages */
};

The write group tracked the state of a write to all of the devices; the embedded lockmgr lock prevented concurrent write_tx_done()s from operating. The obio pointer was to the original write request. The ref (kref no longer exists, so this would be a plain counter now) was the number of outstanding devices. The per-mirror and per-disk linkages allowed a fault on any I/O operation to a disk in the mirror to prevent any future I/O from being issued to that disk; on a fault, the code would walk all of the requests and act as though that particular write tx had finished with a B_ERROR buffer.
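To make the walk through those states concrete, here is a minimal compilable sketch of the completion-driven state machine. The issue() helper is a hypothetical stand-in for vn_strategy() on the backing device; in the kernel the completions would arrive asynchronously via biodone(), while here they complete synchronously so the sketch runs in userspace:

#include <stdio.h>

enum dmirror_write_tx_state {
    DMIRROR_START,
    DMIRROR_JOURNAL_WRITE,
    DMIRROR_JOURNAL_FLUSH,
    DMIRROR_DATA_WRITE,
    DMIRROR_DATA_FLUSH,
    DMIRROR_SUPER_WRITE,
    DMIRROR_SUPER_FLUSH,
};

struct dmirror_write_tx {
    enum dmirror_write_tx_state state;
};

static void dmirror_write_tx_done(struct dmirror_write_tx *tx);

/*
 * Hypothetical stand-in for vn_strategy() on the backing device;
 * completes immediately instead of asynchronously via biodone().
 */
static void
issue(struct dmirror_write_tx *tx, const char *what)
{
    printf("issue: %s\n", what);
    dmirror_write_tx_done(tx);
}

/* Advance one state per completion; this is the biodone callback. */
static void
dmirror_write_tx_done(struct dmirror_write_tx *tx)
{
    switch (tx->state) {
    case DMIRROR_START:
        tx->state = DMIRROR_JOURNAL_WRITE;
        issue(tx, "journal write");
        break;
    case DMIRROR_JOURNAL_WRITE:
        tx->state = DMIRROR_JOURNAL_FLUSH;
        issue(tx, "journal FLUSH");
        break;
    case DMIRROR_JOURNAL_FLUSH:
        tx->state = DMIRROR_DATA_WRITE;
        issue(tx, "data write");
        break;
    case DMIRROR_DATA_WRITE:
        tx->state = DMIRROR_DATA_FLUSH;
        issue(tx, "data FLUSH");
        break;
    case DMIRROR_DATA_FLUSH:
        tx->state = DMIRROR_SUPER_WRITE;
        issue(tx, "superblock write");
        break;
    case DMIRROR_SUPER_WRITE:
        tx->state = DMIRROR_SUPER_FLUSH;
        issue(tx, "superblock FLUSH");
        break;
    case DMIRROR_SUPER_FLUSH:
        /* Here we would mark this disk done in the parent write group. */
        printf("tx complete\n");
        break;
    }
}

int
main(void)
{
    struct dmirror_write_tx tx = { DMIRROR_START };

    dmirror_write_tx_done(&tx);     /* kick the state machine */
    return (0);
}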
The disk read path was simpler -- a single disk in the mirror was selected and vn_strategy() called. The biodone callback checked whether there was a read error; if so, we faulted the disk and kept selecting mirrors to issue to until we found one that worked. Each faulted disk had its outstanding I/Os killed.

I had not given thought to what to do when a mirror was running in a degraded configuration, or with an unsynced disk trying to catch up; the latter requires care in that the unsynced disk can take writes but must not serve reads. Also what to do to live-remove a disk, or how to track all of the disks in a mirror. (It'd be nice to have each disk know all the other mirror components via UUID or something, and to record the last counter value it knew about for the other disks. This would prevent disasters where each disk in a mirror is run independently in a degraded setup and then brought back together.)

AFAIK, no RAID1 is this paranoid (sample set: Linux MD, gmirror, ccd). And it is a terrible design from a performance perspective -- three FLUSH BIOs for every set of block writes. But it does give you a hope of correctly recovering your RAID1 in the event of a power cycle, crash, or disk failure...

Please tell me if this sounds crazy, overkill, or is just wrong! Or if you want to work on this, or would like to work on a classic bitmap + straight-mirror RAID1.

-- vs

[1]: A block bitmap of touched blocks requires care because you must be sure that before any block is touched, the bitmap has that block marked -- sure in the sense that the bitmap block update has hit media. (A short sketch of the required ordering follows at the end of this mail.)

[2]: I've never seen spelled out exactly what you can assume about BUF_CMD_FLUSH (or BIO_FLUSH, as it is known in other BSDs)... this is a strong set of assumptions; I'd love to hear if I'm wrong.

[3]: UFS in DragonFly and in FreeBSD does not issue any FLUSH requests. I have no idea how this can be correct... I'm pretty sure it is not.
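Illustrating [1]: a minimal sketch of the bitmap-before-data ordering, with fsync() again standing in for a FLUSH barrier; the layout and the mark_then_write() helper are hypothetical:

#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define BITMAP_OFF 0
#define DATA_OFF   BLOCK_SIZE

static int
mark_then_write(int fd, uint64_t blkno, const void *buf)
{
    uint8_t byte;
    off_t bmoff = BITMAP_OFF + blkno / 8;

    /* Read-modify-write the bitmap byte covering this block... */
    if (pread(fd, &byte, 1, bmoff) != 1)
        return (-1);
    byte |= 1 << (blkno % 8);
    if (pwrite(fd, &byte, 1, bmoff) != 1)
        return (-1);

    /* ...and barrier: the dirty bit must hit media first. */
    if (fsync(fd) != 0)
        return (-1);

    /* Only now is it safe to touch the data block itself. */
    if (pwrite(fd, buf, BLOCK_SIZE,
        DATA_OFF + blkno * BLOCK_SIZE) != BLOCK_SIZE)
        return (-1);
    return (0);
}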