Efficient Cache Organization For Application Specific And General Purpose Processors

Rajan, Kaushik

dc.contributor.advisor	Govindarajan, R
dc.contributor.author	Rajan, Kaushik
dc.date.accessioned	2010-08-25T06:06:11Z
dc.date.accessioned	2018-07-31T05:08:50Z
dc.date.available	2010-08-25T06:06:11Z
dc.date.available	2018-07-31T05:08:50Z
dc.date.issued	2010-08-25
dc.date.submitted	2008
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/838
dc.description.abstract	The performance gap between processor and memory continues to remain a major performance bottleneck in both application specific and general purpose processors. This thesis strives to ease the above bottleneck by exploiting the characteristics of the application domain to improve the cache organization for two distinct processor architectures: (1) application specific processors for packet forwarding, (2) general purpose processors. Packet forwarding algorithms make use of a trie data structure to determine the forwarding route. We observe that the locality characteristics of the nodes at various levels of such a trie are different. Nodes that are closer to the root node, especially those that are immediate children of the root node (level-one nodes), exhibit higher temporal locality than nodes lower down the trie. Based on this observation we propose a novel Heterogeneously Segmented Cache Architecture (HSCA) that uses separate caches for level-one and lower-level nodes, each with carefully chosen sizes. We also propose a new replacement policy to enhance the performance of HSCA. Performance evaluation indicates that HSCA results in up to 32% reduction in average memory access time over a unified cache that shares the same cache space among all levels of the trie. HSCA also outperforms a previously proposed results cache. The use of a large root branching factor in a forwarding trie forcefully introduces a large number of nodes at level-one. Among these, only nodes that cover prefixes from the routing table are useful while the rest, are superfluous. We find that as many as 75% of the level-one nodes are superfluous. This leads to a skewed distribution of useful nodes among the cache sets of the level-one nodes cache. We propose a novel two-level mapping framework that achieves a better nodes to cache set mapping and hence incurs fewer conflict misses. Two-level mapping first aggregates nodes into Initial Partitions (IPs) using lower order bits and then remaps them from IPs into Refined Partitions (RPs), that form sets, based on some higher order bits. It provides flexibility in placement by allowing each IP to choose a different remap function. We propose three schemes conforming to the framework. A speedup in average memory access time of as much as 16% is gained over HSCA. In general purpose processor architectures, the design objectives of caches at various levels of the hierarchy are different. To ensure low access latencies, L1 caches are small and have low associativities, making them more susceptible to conflict misses. The extent of conflict misses incurred is governed by the placement function and the memory access patterns exhibited by the program. We propose a mechanism to learn the access characteristics of the program at runtime by analyzing the repetitive phases of program. We then make use of the two-level mapping framework to dynamically adapt the placement function. Further, we elegantly incorporate two-level mapping into the cache organization without increasing the cache access latency. Performance evaluation reveals that the proposed adaptive placement mechanism eliminates 32—36% of misses on average over a range of cache sizes. To prevent expensive off-chip accesses, L2 caches are larger and have higher associativities. Hence, the replacement policy plays a significant role in determining L2 cache performance. Further, as the inherent temporal locality in memory accesses is filtered out by the L1 cache, an L2 cache using the widely prevalent LRU replacement policy incurs significantly higher misses than the optimal replacement policy (OPT). We propose to bridge this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with a simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and shepherding the replacement decisions close to optimal for MC. Our proposed organization can cover 40% of the gap between LRU and OPT, resulting in 7% overall speedup.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G22459	en_US
dc.subject	Internet Cache Organization (Computer Science)	en_US
dc.subject	Internet Microprocessors	en_US
dc.subject	Processor Architectures	en_US
dc.subject	Packet Forwarding Engines (Routers)	en_US
dc.subject	General Purpose Processors	en_US
dc.subject	Heterogeneously Segmented Cache Architecture	en_US
dc.subject	Shephepherd Cache	en_US
dc.subject	Cache Architecture	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Efficient Cache Organization For Application Specific And General Purpose Processors	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name:: G22459.pdf
Size:: 3.040Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Supercomputer Education and Research Centre (SERC) [98]

Show simple item record