Fault-tolerant distributed algorithms for reconfiguration of rings using management tokens
Abstract
In this thesis, we propose two algorithms using slightly different approaches to detect failures in a ring, and to reconfigure the ring using a combination of multithreading and management tokens which circulate around the ring. The failures are detected using timeout values for management tokens at each node. The proposed ring-based algorithms are fully distributed, and can take care of node crash failures, link crash failures, timing failures, and message losses. These algorithms free the applications running on top of the ring from being worried about fault-tolerance issues.
Both the algorithms can handle any number of the above-mentioned failures provided every sequence of (k?1)(k - 1)(k?1) consecutive nodes affected by these failures is preceded and followed by up nodes which can communicate with each other, where kkk is the number of predecessor nodes known to each node in the ring. The algorithms assume that the underlying physical link failures do not lead to network partitioning.
In both the algorithms, a “regular” token circulates around the ring in the clockwise direction. In the first algorithm, each of the up nodes can generate a “reconfiguration” token which circulates in the anticlockwise direction during a reconfiguration process, but the reconfiguration token generated by the node having the highest node identity will eventually be elected and will complete its full circulation around the ring. So, this algorithm is essentially based on a leader election algorithm.
In the second algorithm, only one node can generate a reconfiguration token and this is the node which times out first for the “regular” token. After the reconfiguration of the ring is complete by the “reconfiguration” token, the “regular” token starts circulating around the ring again.
We have evaluated and compared the performances of the proposed algorithms using simulation studies. The second algorithm is found to be efficient in terms of number of messages in a reconfiguration round compared to the first algorithm in all situations and has comparable performance in terms of reconfiguration time. But the first algorithm is comparatively easier to understand and implement.