| dc.description.abstract | Designing cost-efficient cache coherence protocols has long been pursued in the context of Distributed Shared-memory Multiprocessors (DSM). With increasingly aggressive implementations of DSM systems that use high-performance processors to exploit Instruction Level Parallelism (ILP), devising a solution for the cache coherence problem becomes more challenging. Furthermore, the shift towards multiprocessors on a chip demands faster, more scalable, and more cost-efficient cache coherence protocols.
In addition, emerging applications in multimedia, internet computing, science, and engineering place increased demands for very high bandwidth, faster communication between processors, and stringent latency requirements.
Existing mechanisms to maintain cache coherence in DSM systems are either hardware directory-based or compiler-directed. Directory-based schemes suffer from large storage overhead and limited scalability. They incur increased memory access latency and network traffic, as coherence transactions lie in the critical path of shared accesses. Optimized directory schemes attempt to improve cost-effectiveness but inherit these disadvantages.
Compiler-assisted schemes with simple hardware support have been suggested as viable alternatives since they maintain cache coherence locally without the need for interprocessor communication and expensive hardware. However, their conservative approach results in inaccurate detection of stale data, leading to unnecessary cache misses.
This thesis presents a cache coherence scheme that is more cost-efficient than existing ones. By eliminating the directory and adopting a dynamic approach, our scheme addresses performance issues more effectively. It relies on local control to detect potential candidates for coherence enforcement and enforces coherence at release synchronizations.
The scheme uses software support in the form of program annotations to identify live shared variables between consecutive release synchronizations and to identify the synchronizations. Hardware support is provided by a small Coherence Buffer (CB) with an associated controller, local to each processor. We assume a release consistent memory model for our experiments with the CB-based scheme.
The scheme works as follows: the cache controller detects every shared reference between two consecutive release boundaries. The status of each distinct cache block corresponding to the detected accesses is recorded in the CB. A CB Controller (CBC) carries out all CB-related operations. If a CB update causes a capacity or conflict miss in the CB, the corresponding cache blocks are replaced, with only the dirty subblocks written back. These early coherence actions may seem disadvantageous because they add coherence transactions, but in practice performance improves because those transactions overlap with other memory operations.
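The CB bookkeeping described above can be sketched as a small set-associative buffer. This is an illustrative model only, not the thesis's simulator; the class and method names (CoherenceBuffer, record) and the LRU replacement choice are assumptions made for the example.

```python
# Hypothetical sketch of the Coherence Buffer (CB): shared references
# between release boundaries are recorded per cache block; a capacity or
# conflict miss in the CB triggers an early coherence action that writes
# back only the victim's dirty subblocks.

class CoherenceBuffer:
    def __init__(self, entries=8, ways=4):
        self.ways = ways
        # One LRU-ordered list per set (front = LRU, back = MRU).
        self.sets = [[] for _ in range(entries // ways)]
        # Record of (block address, dirty subblocks) written back early.
        self.early_writebacks = []

    def record(self, block_addr, dirty_subblocks=()):
        """Log a shared reference; evict early on a capacity/conflict miss."""
        s = self.sets[block_addr % len(self.sets)]
        for entry in s:
            if entry["addr"] == block_addr:        # CB hit: merge dirty bits
                entry["dirty"] |= set(dirty_subblocks)
                s.remove(entry)
                s.append(entry)                    # move to MRU position
                return
        if len(s) == self.ways:                    # CB miss with a full set
            victim = s.pop(0)                      # early coherence action:
            self.early_writebacks.append(          # write back dirty subblocks
                (victim["addr"], sorted(victim["dirty"])))
        s.append({"addr": block_addr, "dirty": set(dirty_subblocks)})
```

With 8 entries and 4 ways there are two sets, so a fifth block mapping to the same set forces an early write-back of the least-recently recorded entry.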
A release boundary is marked by release fence instructions in a program. When a release fence is ready to graduate, the processor issues a special memory request, INVL_WB, to the cache controller for flushing the CB. Unmodified cache blocks corresponding to CB entries are invalidated, and modified subblocks are written back and invalidated using special coherence requests specific to the CB-based scheme. The CBC selects a valid CB entry and sends the special coherence requests to the cache controller. Upon receiving acknowledgments for all coherence requests, the CBC informs the cache controller, which then signals the processor about the completion of the INVL_WB request, indicating that all coherence actions intended at the present release boundary are complete.
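The release-time flush can be sketched as below. All names (flush_at_release, the dictionary layout of cache blocks and CB entries) are illustrative assumptions; acknowledgments are modeled as immediate, whereas in the real protocol the CBC waits for them to arrive before signaling completion of the INVL_WB request.

```python
# Hedged sketch of the INVL_WB flush at a release boundary: the CBC walks
# the valid CB entries, writes back and invalidates modified subblocks,
# invalidates unmodified blocks, and reports completion once every
# coherence action has been acknowledged.

def flush_at_release(cb_entries, cache):
    """cb_entries: list of {'addr': int, 'dirty': set of subblock indices}.
    cache: dict mapping addr -> {'state': str, 'subblocks': dict}."""
    acks_needed = acks_received = 0
    for entry in cb_entries:
        block = cache.get(entry["addr"])
        if block is None:
            continue                        # already replaced by an early action
        for sb in sorted(entry["dirty"]):   # write back modified subblocks
            acks_needed += 1
            # memory.write(entry["addr"], sb, block["subblocks"][sb])
            acks_received += 1              # model an immediate acknowledgment
        block["state"] = "invalid"          # invalidate clean and dirty alike
        acks_needed += 1
        acks_received += 1                  # invalidation acknowledged
    cb_entries.clear()                      # CB is empty after the flush
    return acks_received == acks_needed     # INVL_WB completes when all acked
```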
Through detailed architectural simulation of DSM configurations with MIPS R10000-class processors, we find that the CB-based system with an 8-entry, 4-way associative CB achieves a speedup of 1.07 to 4.31 over a full-map, 3-hop directory-based system for five of the SPLASH-2 benchmarks (representative of migratory-sharing, producer-consumer, and write-many workloads) under the Release Consistency model.
By eliminating the directory, all coherence activities are removed from the critical path of shared accesses, thereby improving memory performance. Our study of scalability and latency reveals that the CB-based system scales well compared to a directory-based system.
Based on the performance study, we conclude that the CB-based cache coherence protocol offers the following advantages over directory-based protocols:
- Reduction in miss penalty
- Increased ILP
- Better support for compiler-based ILP optimizations
- Fewer network transactions
- Increased available communication bandwidth due to reduced network activity
- No false sharing, thanks to subblock-level replacement of cache lines
- Inherent scalability
- Considerable storage savings from the absence of a directory
With its capability for runtime coherence detection, fine-grained coherence maintenance, and support for multiword cache lines, the CB-based scheme is shown to be more efficient than existing compiler-directed schemes.
Overall, this study shows that the CB-based scheme is a promising approach for maintaining cache coherence in current-day DSM systems for emerging applications. | |