Improving Last-Level Cache Performance in Single and Multi-Core Processors
Abstract
With off-chip memory access taking hundreds of processor cycles, getting data to the processor in a timely fashion remains one of the key performance bottlenecks in current systems. With increasing core counts, this problem is aggravated and the memory access latency becomes even more critical in multi-core systems. The Last Level Cache (LLC) is thus of particular importance, as any miss experienced at the LLC translates into a costly off-chip memory access. A combination of on-chip caches and prefetchers is used to hide the off-chip memory access latency. While a hierarchy of caches focuses on exploiting locality by retaining useful data, prefetchers complement them by initiating data accesses early for blocks that are likely to be accessed in the future. In the first half of this thesis, we improve the performance of the LLC in single-core processors by focusing on prefetchers. In the case of multi-cores, the LLC is shared across many cores and therefore by many programs running on them. Thus, in the second half of this thesis, we focus on novel and efficient management mechanisms for the shared LLC to improve the performance of programs running on the various cores.
Prefetchers observe a training stream of primary misses in the cache and rely on the regularity present in it to predict and avoid future misses. We quantify the regularity present in the training stream using the information-theoretic measure of entropy, and study how the regularity changes when the training stream is extended to include secondary misses and accesses. We also consider triggering prefetches on secondary misses. We find that the extended histories are more regular in general and that it is beneficial to trigger prefetches on secondary misses as well. However, the best design choice varies on a per-benchmark and per-prefetcher basis, necessitating a dynamic approach to identify the best prefetcher configuration. We propose an inexpensive Bloom-filter-based dynamic mechanism to identify the best performing prefetch design point at run time. The adaptive scheme improves performance in terms of Instructions Per Cycle (IPC) by 4.6% on average over a baseline prefetcher. This performance improvement is achieved along with a reduction in memory traffic requirements.
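To make the entropy-based notion of regularity concrete, the sketch below (illustrative only; the stream contents and access granularity are assumptions, not the evaluated configuration) computes the Shannon entropy of the address-delta distribution of a training stream. A purely strided stream is perfectly regular and has zero entropy, while interleaved irregular accesses raise the entropy and make the stream harder to predict.

```python
from collections import Counter
from math import log2

def delta_entropy(addresses):
    """Shannon entropy (in bits) of the distribution of consecutive address deltas."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    counts = Counter(deltas)
    total = len(deltas)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A purely strided training stream has entropy 0; mixing in irregular
# addresses (here, a few arbitrary values) increases the entropy.
strided = [block * 64 for block in range(32)]
mixed = sorted(strided + [37, 4099, 10007, 65551])
print(delta_entropy(strided), delta_entropy(mixed))
```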
It is well known that aggressive prefetching can harm performance due to increased contention for memory bandwidth and cache pollution. Prefetchers treat all loads as equal and try to eliminate as many misses as possible, even though certain (static) load instructions are known to be more performance critical than others. As our second contribution, we propose Focused Prefetching, a generic mechanism to introduce performance awareness in prefetching. We identify that a small number of static loads, referred to as Loads Incurring Majority of Commit Stalls (LIMCOS), account for a majority of the commit stalls in processors. We propose a simple history-based classifier to identify LIMCOS with high accuracy, and use the classifier to focus the prefetching effort on LIMCOS. This is achieved in a generic, prefetcher-agnostic fashion by filtering the history used by the prefetchers. Focused Prefetching improves performance in terms of IPC by 9.8% for a set of memory-intensive SPEC2000 workloads. This performance gain is achieved along with a reduction in memory traffic and an improvement in prefetch accuracy.
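A minimal sketch of how such a history-based classifier and the history filtering could look is shown below; the table organization, counter width and threshold are illustrative assumptions rather than the exact design evaluated in the thesis.

```python
# Sketch: a saturating-counter table, indexed by load PC, flags loads that
# frequently block commit (LIMCOS). The prefetcher's training stream is then
# filtered so that it trains only on misses from LIMCOS loads.
class LimcosClassifier:
    def __init__(self, threshold=8, max_count=15):
        self.counters = {}          # load PC -> saturating stall counter
        self.threshold = threshold
        self.max_count = max_count

    def record_commit_stall(self, load_pc):
        """Called when a load stalls commit at the head of the reorder buffer."""
        count = self.counters.get(load_pc, 0)
        self.counters[load_pc] = min(count + 1, self.max_count)

    def is_limcos(self, load_pc):
        return self.counters.get(load_pc, 0) >= self.threshold

def filtered_training_stream(miss_stream, classifier):
    """Keep only misses from LIMCOS loads; any prefetcher can train on the result."""
    return [(pc, addr) for (pc, addr) in miss_stream if classifier.is_limcos(pc)]
```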
In the second part of the thesis, we focus on improving the performance of shared caches in multi-core systems. Last level caches suffer from a lack of temporal locality in their access stream, as the locality gets filtered out by the caches above them. In the case of multi-cores, the interleaving of accesses from the various cores further adds to the problem. To overcome this, we propose a PC-Centric Next-Use Aware Cache Organization (NUcache) for shared caches in multi-cores, with the ability to retain a subset of cache blocks longer. This is achieved by logically partitioning the associative ways of a cache set into Main Ways and Deli Ways. While all blocks have access to the Main Ways, blocks that are likely to be accessed in the near future (with shorter Next-Use distance) are candidates to be retained longer in the Deli Ways to eliminate future misses. We exploit the fact that a small number of PCs, referred to as delinquent PCs, bring in a majority of the cache blocks, and learn the Next-Use characteristics of the blocks brought in by them. We propose an intelligent cost-benefit based PC-selection mechanism to identify the best set of delinquent PCs that should have access to the Deli Ways so as to maximize cache hits. Performance evaluation reveals that NUcache improves the performance (in terms of Average Normalized Turnaround Time, ANTT) of multi-programmed workloads by 6.2%, 13.9%, 15.8% and 19.6% on dual-, quad-, eight- and sixteen-core machines respectively. NUcache also performs better than some of the state-of-the-art cache partitioning mechanisms.
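The sketch below illustrates the insertion decision in such an organization for a single cache set; the data structures and the policy for entering the Deli Ways are simplified assumptions for exposition, not the actual NUcache hardware or its replacement details.

```python
# Sketch: a set's ways are logically split into Main Ways and Deli Ways.
# Blocks brought in by the currently selected delinquent PCs are candidates
# to be placed in the Deli Ways and hence retained longer; all other blocks
# are managed in the Main Ways under ordinary LRU.
class NUcacheSet:
    def __init__(self, main_ways=8, deli_ways=8, selected_pcs=None):
        self.main = [None] * main_ways   # LRU-ordered (pc, tag); MRU at index 0
        self.deli = [None] * deli_ways   # longer-lived blocks from selected PCs
        self.selected_pcs = selected_pcs or set()

    def fill(self, pc, tag):
        """Insert a block on a miss; delinquent-PC blocks may use the Deli Ways."""
        if pc in self.selected_pcs and None in self.deli:
            self.deli[self.deli.index(None)] = (pc, tag)   # retained longer
        else:
            self.main.pop()                                # evict LRU of Main Ways
            self.main.insert(0, (pc, tag))                 # insert at MRU position
```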
The last part of the thesis deals with effective shared cache management in multi-core systems to achieve various performance objectives. Explicitly controlling the shared cache occupancy of competing applications is a flexible and practical way to achieve a variety of high-level performance goals. Existing solutions control cache occupancy at a coarse granularity, do not scale well to large core counts and, in some cases, lack the flexibility to support a variety of performance goals. To overcome this, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, scales to larger core counts, and is flexible enough to support a variety of performance goals such as hit-maximization, fairness and QoS. PriSM with Hit-Maximization improves the performance (of multi-programmed workloads) in terms of ANTT by 16.5%, 18.7% and 12.7% over baseline LRU on eight-, sixteen- and thirty-two-core machines respectively.
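The following sketch conveys the flavor of probabilistic occupancy control; the formula used to derive per-core eviction probabilities is an assumption made for exposition and is not PriSM's exact computation.

```python
# Sketch: a core's eviction probability is raised above its share of misses when
# it occupies more of the shared cache than its target share, and lowered when it
# occupies less. On a miss, the victim core is chosen probabilistically and the
# victim block is picked from that core's blocks in the set (e.g., its LRU block).
import random

def eviction_probabilities(occupancy, target, miss_fraction):
    """occupancy, target, miss_fraction: per-core fractions, each summing to 1.0."""
    raw = [max(0.0, m + (o - t)) for o, t, m in zip(occupancy, target, miss_fraction)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else miss_fraction

def pick_victim_core(probs):
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Example: core 0 is over its target share, so it is evicted from more often.
probs = eviction_probabilities(occupancy=[0.6, 0.4], target=[0.5, 0.5],
                               miss_fraction=[0.5, 0.5])
victim = pick_victim_core(probs)
```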