File Consistency
When we involve caching and replication in a distributed file system, we need to consider a lot of issues which can affect file consistency and concurrency (as in concurrent updates):
To name a few:
- Should we cache on server side only or client side or both?
- Shall we cache the whole file or a chunk of it?
- Shall we cache the file content or the file attributes too?
- When and how to do cache validation? What is the cache consistency protocol?
- What to do in case one client has a file opened and another one deleted it?
- Concurrent updates in case where multiple clients have opened a file for writing.
- Should we bother about the synchronization or let the application worry about the same?
- For synchronization, a full cache consistency protocol simply eliminates the network issues and reduces the problem to what it was on time-sharing systems.
- Stateless or state-based? Whether a server should keep any state information about the clients accessing the file?
- Binding decisions: binding filename to the actual location of the file. Static binding, setup binding, binding at the time of opening the file or binding at each access.
- How the crash recovery will work?
Sun NFS
It uses a TTL (time to live) based approach at the client-side to invalidate caches. Since NFS is stateless, the server does not maintain any locks between requests and a write may span several RPC requests and hence two clients writing to the same remote file may get intermixed data on long writes.
Cache-Validation:
1. On first reference after TTL expires.
2. File is opened or a new block is fetched.
As far as file consistency is concerned, it is not always guaranteed. Consider a case where a client modifies a file, closes it and hence updates the server. The latest data will still not be available to another client sharing the file until the TTL period is over.
The design of NFS involved simplicity and hence they didn.t take into consideration any of the complex concurrent read/write sharing issues. They assumed that clients will not need to do any concurrent updates.
Sprite
Sprite maintains the list of clients and files each client is accessing. In case of Read-sharing, multiple clients can cache the data and need not check for updates. In case of Read-write sharing, we have two cases:
a. Concurrent write sharing
It occurs when multiple clients are accessing the file and at least one has it opened for writing. Sprite deals with this situation by disabling all the clients caching for the file, so that all reads and writes for the file go through to the server.
b. Sequential write sharing
It occurs when one file is modified by one client, closed and then opened by some other client. There are two potential problems. First, the second client has some stale data in its cache. This is handled through version numbering done at the server end and checked by clients on each open. The second problem could be that the last client might still have some dirty blocks in its cache. It might not have flushed all the dirty blocks yet. It notifies this writing client to flush all the latest updates. This approach is similar to NFS and AFS-1.
The result of such an approach is that file access under Sprite has exactly the same semantics as if all of the processes on all of the workstations were executing on a single timesharing system.
Sprite is the strictest in terms of consistency. It almost guarantees the consistency of single time-sharing system. It didn.t make any assumptions about the conflicts but still expects that concurrent write-sharing will be infrequent and hence the consistency algorithm is optimized for the case where there is no sharing. Though in today.s time when network access has become cheaper than disk access, it might be optimized for concurrent write-sharing too.
AFS-2
AFS-2 uses an optimistic approach and assumes the cache is valid on the client side until notified by the server. The server promises to notify the client before allowing modification by any other workstation. This is known as callback. In case of packet loss, the consistency is not exactly achieved. The client who did not get such a message from server would continue to use its cache. In case of concurrent writes, since the whole file is cached and no intermediate updates are sent, updates are sent only at the closing of file, the granularity of synchronization is file. That means whoever closes the file last will have its changes marked as the latest. All the other concurrent changes will be lost.
In AFS-2, they assumed that most of the shared files are executables and they do not change quite often and hence a callback approach would give much better performance.
CODA
It.s a descendant of AFS-2. The main idea of coda was constant data availability and hence they assumed that consistency and concurrent updates are not of prime importance. Client caches entire files on their local drives. It also uses callbacks for cache coherency. It supports disconnected operations . when the client can not contact any of the servers. It works even if there is a network partitioning. Arguments similar to AFS-2 would apply here for consistency. Coda has read-one and write-all approach. The Coda uses a 2-phase commit-protocol and detects conflicts in case of concurrent conflicting updates. It made an assumption that sequential write sharing between users is relatively rare in UNIX environments, so conflicting updates are likely to be rare.
Google File System
GFS introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. GFS has a relaxed consistency model that supports highly distributed applications well but remains relatively simple and efficient to implement.
File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master. Namespace locking guarantees atomicity and correctness; the master.s operation log defines a global total order of these operations.
When an update/mutation succeeds without interference from concurrent writers (means no overlapping in time), the state is defined and consistent. If interference occurs, then state is undefined (means the order is not known) but consistent (means the same view is available to all clients) by maintaining the order of operations on all the replicas. GFS also uses a 2-phase write protocol to achieve consistency among replicas.
Since the clients cache chunk locations, they may read from a stale replica before the information is refreshed, hence GFS.s consistency is not strict. However this window is limited by the cache entry.s timeout and the next open of the file.
GFS is built on many assumptions which are relevant here:
- Files are generally big in sizes. It does not try to optimize for smaller files.
- The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. Successive operations from same client read contiguous region of file (this assumption is similar to employed in NFS, read-ahead). Applications will do batch processing of smaller reads to optimize the access.
- The work load has large sequential writes that append data. Seldom modified. Small writes at random location is supported but need not be efficient.
- For multiple concurrent writes, GFS provides atomic append with minimal synchronization overhead. It can be read at the same time by other clients.
All these assumptions led to such a consistency protocol design.
Comparison
If we try to order these distributed file systems in the order of increasing strictness towards consistency and concurrent updates, most probably the order would be:
1. Coda, minimum because it even supports disconnected operations where the concept of consistency and concurrent updates almost vanishes.
2. AFS-2, next because it is only slightly different than Coda and has callbacks which can not guarantee much consistency and concurrency.
3. Sun NFS, on third because it has a stronger cache validation algorithm which is more immune to network failures.
4. GFS, it guarantees a greater degree of consistency and concurrency through mechanisms such 2-phase commit protocol and operation logs. But still it leaves out a window where it might serve stale data to a client.
5. Sprite, strongest because it is stringent. As soon as a write-sharing starts, it immediately invalidates all the client caching and all the updates go through to the server and all the read go through the server as well.
Write Performances
File systems face the issue of disk access time being way more than memory access time while trying to improve performance. Hence they try to reduce the disk access time by:
- Reducing the number of disk accesses or number of seeks (as seeks form a major component in the total access time).
- Reduce access time by performing parallel access.
Log-Structured File System
Log Structured File System takes the first approach. Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. A log-structured file system writes all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by eliminating almost all seeks.
Impact of large file caches is that they can serve as write buffers where large numbers of modified blocks can be collected before writing any of them to disk. Buffering may make it possible to write the blocks more efficiently, for example by writing them all in a single sequential transfer with only one seek.
LSFS assumes that crashes are infrequent and that it is acceptable to lose a few seconds or minutes of work in each crash.
LFS tries to improve both latency and throughput. Latency is improved by cutting down heavily on the number of seeks performed while throughput is improved by buffering the updates and writing them all in a single sequential update.
To sum it up for LSFS, we can say that the fundamental idea of a log-structured file system is to improve write performance by buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in a single disk write operation.
Redundant Array of Inexpensive Disks
RAID takes the second approach of parallelism. In case of RAID, different raid levels try to improve latency and throughput in a different. Some are successful in improving only latency; some are successful in improving only throughput and some improve both. For example, in case of RAID Level 1, by seeking in parallel, it tries to improve latency but then writing in parallel improves bandwidth too. Raid Level 2, 3 and 4 involve one or more separate check disks per group and needs to be updated along with every write. Hence these RAID levels could not improve a lot in terms of write performance because every write required a read from the check-disk. Raid Level 5, on the other hand, brings the best of both worlds by spreading the checksum across all the disks and hence allowing simultaneous writes.
In RAID, if we are able to do seek in parallel, it provides us with improvement in latency, because the seek-time reduces by a factor of n, where n is the number of disks being accessed in parallel. On the other hand, if we are able to do writes in parallel then it improves throughput. In general RAID tries to improve performance by doing concurrent access to independent disks.
Google File System
Google File System tries to combine the best of both the worlds by:
- Having Append as the primary write operation instead of random writes. GFS application involves append primarily.
- Distributing chunks of files over multiple machines and different clients uses different chunk-servers. Although it does not hold true completely because when number of clients increase, they end up writing to the same chunk-server, but still an overall improvement in the performance in achieved.
GFS is built on an assumption that file sizes are going to be big. Following this assumption, it uses another technique to improve the performance, having large chunk size. GFS chose 64MB as the chunk size which is much larger than the typical file system block sizes. A large chunk size offers several important advantages:
- Less number of requests to the master for chunk locations.
- It can reduce the network overhead by keeping a persistent TCP connection to the chunk-server.
- In total, lesser number of chunks in the system, hence less meta-data stored at the master. This allows GFS to keep the meta-data in memory, hence faster access.
There is a disadvantage too for large chunk size. Small files end up having one chunk allocated to them, and hence all the clients end up going to the same chunk-server. Such servers are called hot-spots. This can be avoided with higher replication for smaller files.
GFS again tries to decrease latency by having Append as the primary operation and improves throughput by dividing files in large chunks and distributing them over multiple chunk-servers.
References:
1. http://homepage.ntlworld.com/gedmurphy/oreilly/ror/nfs/ch07_04.htm
2. All the papers in the readings, class notes and study group discussion we had.
The University of Southern California does not screen or control the content on this website and thus does not guarantee the accuracy, integrity, or quality of such content. All content on this website is provided by and is the sole responsibility of the person from which such content originated, and such content does not necessarily reflect the opinions of the University administration or the Board of Trustees