Re: Description of the Journaling topology

Board: DFBSD_kernel · Date: 2004/12/30 16:01
:...
:level aren't we concerned about one thing? Atomic transactions. However
:many "hot" physical devices there are across whatever network, shouldn't
:they all finish before the exit 0?
:
:Minimizing the data to transfer across the slowest segment to a physical
:device will lower transfer times, unless that procedure (eg compression)
:outweighs the delay. (I wonder if it is possible to send less data by
:only transmitting the _changes_ to a block device...)

Your definition of what constitutes an 'atomic transaction' is not quite
right, and that is where the confusion is stemming from. An atomic
transaction (that is, a cache-coherent transaction) does not necessarily
need to push anything out to other machines in order to complete the
operation. All it needs is mastership of the data or meta-data involved.
For example, if you are trying to create a file O_CREAT|O_EXCL, all you
need is mastership of the namespace representing that file name. Note
that I am NOT talking about a database 'transaction' in the traditional
hard-storage sense, because that is NOT what machines need to do most of
the time.

This 'mastership' requires communication with the other machines in the
cluster, but the communication may have ALREADY occurred sometime in the
past. That is, your machine might ALREADY have mastership of the
necessary resources, which means that your machine can conclude the
operation without any further communication. In other words, your
machine would be able to execute the create operation and return from
the open() without having to synchronously communicate with anyone else,
and still maintain a fully cache coherent topology across the entire
cluster.

The management of 'mastership' of resources is the responsibility of the
cache coherency layer in the system. It is not the responsibility of the
journal. The journal's only responsibility is to buffer the operation
and shove it out to the other machines, but that can be done
ASYNCHRONOUSLY, long after your machine's open() has returned. It can do
this because the other machines will not be able to touch the resource
anyway, whether the journal has written it out or not, because they do
not have mastership of the resource... they would have to talk to your
machine to gain mastership of the resource before they could mess with
the namespace, which means that your machine then has the opportunity to
ensure that the related data has been synchronized to the requesting
machine (via the journal) before handing over mastership of the data to
that machine.

There is no simple way to do this. Cache coherency protocols are complex
beasts. Very complex beasts. I've written such protocols in the past and
it takes me several man-months to do one (and this is me we are talking
about, lightning-programmer-matt-dillon). Fortunately, having done it, I
pretty much know the algorithms by heart now :-)
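To make the fast path / slow path split concrete, here is a rough C
sketch of the idea. Every name in it (cache_master_held(),
cache_acquire_master(), local_create(), journal_append_async()) is
invented for illustration; this is not actual DragonFly code:

    #include <errno.h>

    struct cluster_node;

    /* Hypothetical cluster primitives, invented for this sketch. */
    int  cache_master_held(struct cluster_node *, const char *);
    int  cache_acquire_master(struct cluster_node *, const char *);
    int  local_create(const char *);
    void journal_append_async(struct cluster_node *, const char *);

    int
    coherent_create(struct cluster_node *self, const char *name)
    {
        int error;

        if (!cache_master_held(self, name)) {
            /*
             * Slow path: synchronously negotiate mastership of the
             * namespace entry with its current master.  The master
             * flushes any related journal data to us before handing
             * the name over.
             */
            if (cache_acquire_master(self, name) != 0)
                return (EIO);
        }

        /*
         * Fast path from here on: this node masters the name, so no
         * other node can touch it without asking us for mastership
         * first.  The create completes locally...
         */
        error = local_create(name);

        /*
         * ...and the journal record is merely buffered.  It streams
         * out to the other machines asynchronously, long after
         * open() has returned to the caller.
         */
        if (error == 0)
            journal_append_async(self, name);
        return (error);
    }

If mastership was acquired on some earlier operation, the entire create
runs without a single synchronous network round trip.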
:But here are a few things to ponder, will a 1Gb nfs or 10Gb fiber to a
:GFS on a fast raid server just be better and cheaper than a bunch of
:warm mirrors? How much of a performance hit will the journaling code
:be, especially on local partitions with kernels that only use it for a
:"shared" mount point? djbdns logging is a good example, even if you log
:to /dev/null, generation of the logged info is a significant performance
:hit for the app. I guess all I'm saying is, if the journaling is not
:being used, bypass it!

Well, what is the purpose of the journaling in this context?

If you are trying to have an independent near-realtime backup of your
filesystem, then obviously you can't consolidate it onto the same
physical hardware you are normally running from; that would kinda kill
the whole point.

As for performance... well, if you are journaling over a socket then the
overhead from the point of view of the originating machine is basically
the overhead of writing the journal to a TCP socket (read: not very high
relative to other mechanisms). Bandwidth is always an issue, but only if
there is actual interference with other activities.

If you are trying to mirror data in a clustered system, ignoring
robustness issues, then the question is whether the clustered system is
going to be more efficient with three mirrored sources for the
filesystem data or with just one source. If the cost of getting the data
to the other mirrors is small, then there is an obvious performance
benefit to being able to pull it from several places rather than from
just one. For one thing, you might not need 10G links to a single
consolidated server... 1G links to multiple machines might be
sufficient. Cost is no small issue here, either. There are lots of
variables in the equation, and no simple answer.

:As far as a coherent VFS cache protocol, I'm reminded of wise words from
:Josh, a db programmer, "the key to performance is in minimizing the
:quantity of data," ie use bit tokens instead of keywords in the db. And,
:it was Ike that put the Spread toolkit in my "try someday" list,
:...
:// George
:--
:George Georgalis, systems architect, administrator Linux BSD IXOYE

The key to performance is multifold. It isn't just minimizing the amount
of data transferred... it's minimizing latency, it's being able to
asynchronize data transfers so programs do not have to stall waiting for
things, and it's being able to already have the data present locally
when it is requested rather than having to go over the wire to get it
every time.

What is faster: gzip'ing a 1G file and transporting it over a GigE
network, or transporting the uncompressed 1G file over a GigE network?
The answer is: it depends on how fast your cpu is, and has little to do
with how fast your network is (beyond a certain point). Now, of course,
it *IS* true that less data wins if you are comparing a brute-force
algorithm with one that only transports the data actually needed. E.g.
if a program is trying to read 4KB out of a 10GB file, it is obviously
faster to just ship the 4KB over rather than ship the entire 10GB file
over.

One of the things a good cache coherency protocol does is reduce the
amount of duplicate data being transferred between boxes. Duplicate
information is a real killer. So in that sense a good cache coherency
algorithm can help a great deal.

You can think of the cache coherency problem somewhat like the way cpu
caches work in SMP systems. Obviously any given cpu does not have to
synchronously talk to all the other cpus every time it does a memory
access. The reason: the cache coherency protocol gives that cpu certain
guarantees in various situations that allow the cpu to access a great
deal of data from its cache instantly, without communicating with the
other cpus, yet still maintain an illusion of atomicity. For example,
I'm sure you've heard the comments about the overhead of getting and
releasing a mutex in FreeBSD: "It's fast if the cpu owns the cache
line." "It's slow if several cpus are competing for the same mutex, but
fast if the same cpu is getting and releasing the mutex over and over
again."
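That fast/slow split is easy to demonstrate with a toy
test-and-test-and-set lock. A rough sketch using the GCC/Clang __atomic
builtins (an illustration only, not FreeBSD's actual mutex code):

    /*
     * Toy spinlock illustrating the cache-line effect.  Not the real
     * FreeBSD mutex implementation.
     */
    typedef struct { volatile int locked; } toy_mtx;

    static void
    toy_lock(toy_mtx *m)
    {
        /*
         * If this cpu already owns the cache line holding m->locked
         * (e.g. it locked and unlocked recently with no competitors),
         * the exchange is satisfied from its own cache and is very
         * fast.  If another cpu owns the line, the hardware must ship
         * the line over first -- that is the 'slow' case.
         */
        while (__atomic_exchange_n(&m->locked, 1, __ATOMIC_ACQUIRE)) {
            while (m->locked)
                ;       /* spin read-only; keeps the line shared */
        }
    }

    static void
    toy_unlock(toy_mtx *m)
    {
        __atomic_store_n(&m->locked, 0, __ATOMIC_RELEASE);
    }

Hammer toy_lock()/toy_unlock() in a loop on one cpu and the exchange
hits local cache every time; run the same loop on two cpus and every
acquisition forces the cache line to migrate between them.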
There's a reason why atomic bus operations are sometimes 'fast' and
sometimes 'slow'.

					-Matt
					Matthew Dillon
					<dillon@backplane.com>