Re: Description of the Journaling topology

Board: DFBSD_kernel, posted 2004/12/31 17:32 (thread 34/42)
:I'm using a simplistic, though, I think, accurate perspective. When I
:used the term atomic, I simply meant when a file is (being) modified it
:is not available until that change is complete, i.e. no chance of reading
:a half-written file. This entire process is what I meant by transaction.
:I'm not sure of a better way to describe that than "atomic transaction"
:even though those words often have other meanings. I'd also call that
:"transaction" a "commit" to the filesystem for similar reasons. Also,
:I mean a context above the VFS. Simply, it's imperative that in user
:space, a file read cannot complete if that file is open for write;
:similarly, a file write should not alter the outcome of a read that
:has already started. I could be wrong about this, but it seems a good
:requirement to have.

These are the types of guarantees that a cache coherency protocol is able
to make.

:the fundamental requirements. Is that process above something we can
:expect from the kernel? Or, will the need to write to a temporary file
:on the device and mv it to the actual name when the write is complete,
:continue in user space, for the "atomic" effect? Which is really what I
:was getting at.
:...
:If this read blocking while file write process is handled by the
:kernel (VFS?), wouldn't it greatly simplify a cache-coherent messaging

The atomic effect is accomplished by having a cache coherency protocol
which locks out conflicting operations until the conflict is resolved.
So, for example, if you were to do a 2GB write() and it took 30 seconds
to accomplish, then the cache coherency mechanism might prevent anyone
else from reading any portion of the seek range involved in the write
until the write completes.
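
To make that lock-out concrete, here is a minimal sketch, assuming a
simple byte-range lock table (the class and method names are hypothetical,
not anything in DragonFly): the writer registers its range, and conflicting
readers are refused until the range is released.

    # Illustrative sketch only: a byte-range lock table of the kind a
    # cache coherency layer could use to hold off conflicting readers
    # until a large write completes.

    class RangeLockTable:
        def __init__(self):
            self.locks = []    # list of (start, end, owner), half-open ranges

        def try_lock(self, start, end, owner):
            """Grant the range only if it overlaps no existing lock."""
            for (s, e, _owner) in self.locks:
                if start < e and s < end:     # ranges overlap
                    return False              # caller must wait and retry
            self.locks.append((start, end, owner))
            return True

        def unlock(self, start, end, owner):
            self.locks.remove((start, end, owner))

    table = RangeLockTable()
    GB = 1 << 30
    assert table.try_lock(0, 2 * GB, "writer A")        # the 2GB write takes the range
    assert not table.try_lock(GB, GB + 4096, "reader")  # reader in the range is held off
    table.unlock(0, 2 * GB, "writer A")                 # write completes
    assert table.try_lock(GB, GB + 4096, "reader")      # now the read may proceed
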
:> This 'mastership' requires communication with the other machine in the
:> cluster, but the communication may have ALREADY occurred sometime in the
:> past. That is, your machine might ALREADY have mastership of the
:> necessary resources, which means that your machine can conclude the
:> operation without any further communication.
:
:Indeed it's the unpredictable order by which the warm mirrors get in
:sync that gives value to the term coherent!
:
:So, what about conflicts? What happens when a node tries to write
:a file, but another node also tries to write the same file, a tick
:earlier?
:...
:Do you mean a warm mirror can commit to local disk and trigger the 3 hot
:mirrors to sync up and propagate to the other 20? Slick!
:...
:<other questions about cache coherency mechanisms>

Ok, you are asking a lot of good questions here and most of them relate
to the same mechanism, so I will tackle all of them at once.

First, let's consider the case where you have 3 machines in a
multi-master configuration. That is, you do NOT have a designated
'master' machine, and now several of these machines wish to issue
conflicting I/Os to a common set of data that all three machines have
independent copies of. Let's say that all three copies are synchronized
to start with.

From any given machine's point of view, let's say from machine A's point
of view, the first order of business is to gain mastership of the zone in
which it intends to do the I/O. None of the machines currently have
mastership of the zone in question, so this requires some communication.
Machine A sends a mastership request to itself, to machine B, and to
machine C, and then awaits responses:

    machine A (responding to itself): Ok, you can have mastership.
    machine B (responding to A): Ok, A, you can have mastership.
    machine C (responding to A): I'd really rather have mastership,
        so I would like to deny you (A) mastership.

(Say machine C started its own conflicting I/O at near the same time and
wants mastership too.) Machine C does the same thing... it gives itself
an OK and it asks A and B whether it can have the OK, but in this case A
and B say NO, you can't. So you have this situation:

    machine A: A and B say A can have mastership. C says no.
    machine C: C says C can have mastership. A and B say no.

So who wins? The answer is actually simple... you use what is called a
QUORUM VOTING algorithm. Whichever machine gets a quorum of affirmative
responses is the machine that gains mastership of the zone. In this case,
the quorum for 3 machines is 2 (it's basically half the machines
negotiating the request, plus one). Since A has an affirmative response
from itself and from B, it has the quorum and thus mastership, and thus A
can proceed with the operation. In fact, A is able to proceed the moment
it receives affirmation of mastership from EITHER B or C. It does not
have to wait for BOTH B and C to affirm mastership. It only needs a
quorum.
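
To make the vote counting concrete, here is a minimal sketch (the function
names and vote structure are hypothetical, not the actual protocol): a
requester wins mastership once it has collected floor(n/2)+1 affirmative
responses, counting its own.

    # Sketch of quorum counting for a mastership request (illustrative only).

    def quorum_size(n_nodes):
        # "half the machines negotiating the request, plus one"
        return n_nodes // 2 + 1

    def grants_mastership(votes, n_nodes):
        """votes maps node name -> True (affirm) or False (deny)."""
        affirmative = sum(1 for ok in votes.values() if ok)
        return affirmative >= quorum_size(n_nodes)

    # The A/B/C example above: A affirms itself, B affirms A, C denies A.
    print(quorum_size(3))                                             # 2
    print(grants_mastership({"A": True, "B": True, "C": False}, 3))   # True: A wins
    # C's competing request gets only its own vote.
    print(grants_mastership({"A": False, "B": False, "C": True}, 3))  # False: C backs off
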
Once mastership has been assigned, in this case to A, you then have to
deal with the fact that A will now have the only up-to-date copy of the
data and that B and C's copies will be stale and unusable. If B now wants
mastership and performs the same quorum negotiation, C, with its stale
copy, *CANNOT* be allowed to return an affirmative response to B's
request, because that would make another quorum (B and C) and B would
wind up operating with stale data (because A's journal update may not
have reached B and/or C yet). So another part of the cache coherency
protocol is to ensure that operations only occur with the correct,
non-stale data.

The way that is done is also quite simple in concept, but a very complex
implementation in reality. What happens is that during A's initial
negotiation, A supplies a TRANSACTION ID to represent the zone being
modified. A includes this transaction ID in its journal updates, but keep
in mind that journal updates are performed asynchronously. But since a
quorum of machines must have responded affirmatively to A's original
request, that also means that a quorum of machines now *KNOW* what A's
transaction ID is. This affects the quorum negotiation in several ways:

* First, a machine which knows it has out-of-date data (i.e. it knows
  what A's transaction ID was but it has not yet received the journal
  update from A with that transaction ID)... machines in this state MAY
  NOT PARTICIPATE in a quorum negotiation. E.g. when B now asks for
  mastership, C will NOT respond in the affirmative to B, so B will not
  be able to get a quorum yet. In fact B will not even respond in the
  affirmative to its own request, because it also knows that its local
  copy of the data is stale.

* As A's modifications are asynchronously propagated to other machines in
  the cluster, these other machines will be able to begin to participate
  in quorum negotiations for the conflicting zone. Once a quorum of
  machines has the update, B will get sufficient affirmative responses to
  be able to start its own transaction.

* Now, once B has enough responses, it is still possible for B's own
  local copy of the data to be stale. But B will know this, because in
  the affirmative responses the other machines have told B what the
  transaction ID of the latest data is. So B can opt to either wait for
  the asynchronous journal update from one of the other machines, or
  proactively request the latest data from one of the other machines.
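
As a rough illustration of that may-not-participate rule (a hypothetical
sketch, not the real implementation), a node could track, per zone, the
newest transaction ID it has heard about during negotiations and the
newest one it has actually applied from the journal stream, and simply
refuse to vote while the two disagree:

    # Sketch of the staleness rule: a node that knows of a newer
    # transaction ID than it has applied must sit out quorum votes for
    # that zone.  Names are made up for illustration.

    class ZoneState:
        def __init__(self):
            self.latest_known_txn = 0   # learned during mastership negotiations
            self.last_applied_txn = 0   # advanced by asynchronous journal updates

        def saw_negotiation(self, txn_id):
            self.latest_known_txn = max(self.latest_known_txn, txn_id)

        def applied_journal(self, txn_id):
            self.last_applied_txn = max(self.last_applied_txn, txn_id)

        def may_vote(self):
            # Stale nodes do not participate until the journal catches up.
            return self.last_applied_txn >= self.latest_known_txn

    zone_on_C = ZoneState()
    zone_on_C.saw_negotiation(42)   # C affirmed A's request carrying txn 42
    print(zone_on_C.may_vote())     # False: txn 42's journal entry not here yet
    zone_on_C.applied_journal(42)   # A's asynchronous journal update arrives
    print(zone_on_C.may_vote())     # True: C may participate again
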
The cache coherency protocol also has to deal with the case where NO
machine gets a quorum of affirmative responses. This can occur if several
machines try to request mastership of the same zone at the same time.
E.g. if A, B, and C all simultaneously want mastership, they each respond
affirmatively to themselves and negatively to the others, so none of the
machines winds up with a quorum. There are many ways to 'solve' this
deadlock. For example, they could abort, wait a random period of time,
and retry, or they could further negotiate based on a fixed priority, or
negotiate based on a moving token, or any number of possible ways.

Keep in mind here that what I am talking about is the CACHE COHERENCY
protocol, *NOT* the JOURNALING protocol. In fact, in all my descriptions
of the cache coherency protocol above you will note that the actual act
of sending the journal out to other machines in the cluster does not get
any more sophisticated... the journal is still written out
asynchronously, and it can still be buffered by any amount and with any
latency. That is, the journaling mechanism is virtually independent of
the cache coherency protocol.

Also note that the cache coherency mechanism I describe above is only one
of many possible ways of doing cache coherency, and it is also very
complex. High-performance cache coherency mechanisms are by definition
very complex beasts. There is no simple solution or simple
implementation. Even my description above leaves out about half a dozen
other cases that have to be dealt with. For example, I have not described
crash recovery issues with quorum systems at all, and those can get
pretty complex. I have not discussed how nodes are added to and deleted
from the cluster from the point of view of the protocol, either, and I
have not discussed how temporary (memory) caches are maintained. It gets
very, very, VERY complex, very quickly.

:If mastership can freely (with restrictions!) move to any box in the
:system, doesn't that dictate the cache coherent system must be organized
:_outside_ of all the systems? Since we can't expect something from
:nothing, the steady state should be just that, nothing. All the disks
:...

No, multi-master cache coherency protocols (with no single master) are
definitely possible. I described one above.

:...
:signals: warning, a journal entry in the pipe (coming, to be followed
:by an IO). Each box listens for external coherency signals (inode by
:UDP, ICMP?) and prepares for a journal and IO. Maybe better than silent

Not necessarily. Your thinking processes are tending towards the
assumption that systems have to be synchronized in order to be able to
begin executing an operation. In fact, systems do not have to be
synchronized. The negotiation of mastership can be independent of the
data synchronization mechanism. A machine can negotiate mastership
without having the most up-to-date data. The trick is that it has to
KNOW that it doesn't have the most up-to-date data and either wait for it
or proactively fetch it.

:steady state, each node listens for a 'synchronized and ready' heart
:beat from all the other nodes, waits a short interval and broadcasts its
:own 'synchronized and ready' heart beat. Normally (during no operation),
:the beats would align themselves on the delay interval and function as
:an indicator the coherency system is up and ready. If a signal is missed
:the data propagation path (mesh) is worked out to originate from the box
:with the missing heartbeat, before the data comes. (kinda like AMD NUMA
:RAM bus discovery at power-up)

Again, your thinking process here is that of a system maintaining a fully
synchronized state. This isn't a bad way to think about it, but fully
synchronous systems do not equate to high-performance systems, so you
really have to think 'semi-synchronous' or 'asynchronous' rather than
'synchronous' when you think about these protocols.

:> There is no simple way to do this. Cache coherency protocols are complex
:> beasts. Very complex beasts. I've written such protocols in the past
:> and it takes me several man-months to do one (and this is me we are
:> talking about, lightning-programmer-matt-dillon). Fortunately having
:> done it I pretty much know the algorithms by heart now :-)
:
:How's my algorithm? ...I never said it was easy!

Not a bad start. Better than most people would have put together having
never written one of these beasts before. I am impressed, in fact! :-)

:> If you are trying to mirror data in a clustered system, ignoring
:> robustness issues,
:
:What sort of robustness issues do you think I'm ignoring? Diskless
:clients aren't susceptible to disk failure, nor do they need expensive
:local RAID. The idea behind a RAID NFS/GFS for diskless clients is all
:the availability robustness is applied, cost effectively, in one place,
:more bang-for-buck as the number of nodes increases. But, maybe you are
:considering another type of robustness?

Which is more robust: three systems talking to a single RAID box, or
three systems each with their own independent copy of the data which use
a quorum cache coherency algorithm to stay golden with each other?

RAID method: Two of the three systems can go down and you are still
golden, but if the RAID hardware goes down you are screwed.

QUORUM method: One of the three systems can go down and you are still
golden.

But what if you had 5 systems? 7? 31? Is a quorum of 31 (i.e. 16 machines
have to be up) more robust than a single piece of RAID hardware? But you
know, there is no single 'correct' answer to that question. For example,
the RAID hardware might be so reliable that it would normally be more
reliable than a 2-out-of-3-systems-are-up configuration, but on the other
hand the RAID hardware exists in one physical locale and it is going to
be toast if the building burns down, whereas with the QUORUM method, with
three machines in three separate buildings, you could actually have any
one of those three buildings burn down and still have a working system.
The network also affects the answer. A LAN is far more reliable than a
WAN, and that can be a major factor.
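
For what it's worth, the arithmetic behind that comparison can be written
down directly (a back-of-the-envelope sketch, not part of any protocol):
a quorum of floor(n/2)+1 keeps working as long as no more than n minus
the quorum size machines are down.

    # Back-of-the-envelope: how many machine failures a quorum cluster survives.

    def quorum_size(n):
        return n // 2 + 1

    def tolerated_failures(n):
        return n - quorum_size(n)

    for n in (3, 5, 7, 31):
        print(n, "machines: quorum", quorum_size(n),
              "-> survives", tolerated_failures(n), "failures")
    # 3 machines:  quorum 2  -> survives 1 failure  (the QUORUM case above)
    # 31 machines: quorum 16 -> survives 15 failures
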
:> One of the things a good cache coherency protocol does is reduce the
:> amount of duplicate data being transferred between boxes. Duplicate
:> information is a real killer. So in that sense a good cache
:> coherency algorithm can help a great deal.
:
:sorta like auto-discovery mesh networks? or do you mean rsync like IO?
:Regards,
:// George

Typically this is a combination of strategic mirroring of data and also
the caching of data in memory (similar to how a normal disk cache works).

-Matt
Matthew Dillon <dillon@backplane.com>

Article code (AID): #11rHoN00 (DFBSD_kernel)