Fault tolerant locking for shared disk filesystems
Abstract
Shared disk filesystems are essential for enabling direct, high-performance access over block-based Storage Area Networks (SANs). A critical component of such systems is the distributed lock manager, which ensures synchronized access to data and metadata across multiple hosts. To enhance system availability, this work presents a fault-tolerant, multicast-based lock manager for the open-source GFS filesystem on Linux, which currently lacks such a mechanism.
The proposed protocol minimizes network hops during lock acquisition, which is particularly beneficial when dirty data is transferred over the network instead of being written to disk. It relies on a group communication toolkit for ordered message delivery and failure handling. A group communication system (GCS) was ported to the kernel to support this functionality, simplifying the protocol’s design and implementation. Performance evaluations on a Fibre Channel SAN setup show that the fault-tolerant system performs comparably to the existing non-fault-tolerant solution.
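To illustrate why ordered delivery can simplify a lock protocol, the following Python sketch keeps one replica of a lock table per node and updates it only from a totally ordered message stream. It is not the GFS lock manager or the kernel GCS: the class `LockTable`, the message tuples, and the lock name `inode-7` are hypothetical, and the total-order layer is simulated by replaying a single list to every replica. Recovery steps a real shared disk filesystem would need before re-granting a failed node's locks are omitted.

```python
from collections import deque


class LockTable:
    """Per-node replica of the lock state, driven only by ordered deliveries."""

    def __init__(self):
        self.holder = {}    # lock id -> node currently holding the lock
        self.waiters = {}   # lock id -> FIFO queue of nodes waiting for it

    def deliver(self, msg):
        """Apply one message handed up by the (assumed) total-order layer."""
        kind, node, lock_id = msg
        if kind == "acquire":
            if lock_id not in self.holder:
                self.holder[lock_id] = node
            else:
                self.waiters.setdefault(lock_id, deque()).append(node)
        elif kind == "release":
            # Ignore releases from nodes that do not hold the lock.
            if self.holder.get(lock_id) == node:
                self._grant_next(lock_id)
        elif kind == "fail":
            # Membership change: forget the failed node's queued requests,
            # then pass on every lock it held. Simplification: a real system
            # would first have to recover the failed node's state.
            for lid in list(self.waiters):
                self.waiters[lid] = deque(
                    n for n in self.waiters[lid] if n != node
                )
            for lid in [l for l, h in self.holder.items() if h == node]:
                self._grant_next(lid)

    def _grant_next(self, lock_id):
        """Hand the lock to the oldest waiter, or mark it free."""
        queue = self.waiters.get(lock_id)
        if queue:
            self.holder[lock_id] = queue.popleft()
        else:
            self.holder.pop(lock_id, None)


# Simulated total order: every replica sees the same messages in the same order.
ordered_log = [
    ("acquire", "A", "inode-7"),
    ("acquire", "B", "inode-7"),
    ("acquire", "C", "inode-7"),
    ("fail", "A", None),          # node A crashes while holding the lock
    ("release", "B", "inode-7"),
]

replicas = [LockTable() for _ in range(3)]
for message in ordered_log:
    for replica in replicas:
        replica.deliver(message)
```

Because each replica applies the same sequence deterministically, all of them reach the same view of who holds `inode-7` without exchanging further messages, and a failure notification is handled as just another ordered event.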