Power Efficient Last Level Cache For Chip Multiprocessors

Mandke, Aparna

dc.contributor.advisor	Srikant, Y N
dc.contributor.advisor	Amrutur, Bharadwaj
dc.contributor.author	Mandke, Aparna
dc.date.accessioned	2015-09-09T07:14:19Z
dc.date.accessioned	2018-07-31T04:38:33Z
dc.date.available	2015-09-09T07:14:19Z
dc.date.available	2018-07-31T04:38:33Z
dc.date.issued	2015-09-09
dc.date.submitted	2013
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/2485
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/3207/G25417-Abs.pdf	en_US
dc.description.abstract	The number of processor cores and the on-chip cache size have been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch off the over-allocated cache so as to reduce the leakage power it consumes. A large cache offers non-uniform access latency to the different cores on a CMP; such a cache is called a “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches with a single application executing on a uniprocessor. Our ideas for power-optimized caches are applicable to any memory technology and architecture for which the difference in leakage power between the on-state and off-state of an on-chip cache bank is significant. Switching off the last-level shared cache on a CMP is a challenging problem because of concurrently executing threads/processes and the large, dispersed NUCA cache. Hence, to determine the cache requirement on a CMP, we first propose a new, highly accurate method to estimate the working set size of an application, which we call the “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our adaptable associative cache is scalable with respect to the number of cores on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network accesses, unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher energy-delay product (EDP) savings than those obtained with the average memory access latency and cache miss ratio heuristics, respectively, on a static NUCA (SNUCA) platform. Cache misses increase with reduced cache associativity. 
Hence, we also propose to map some of the L2 slices onto the remaining L2 slices and switch off the mapped slices. An L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with fewer threads than available cores. In such applications, L2 slices that are farther from those threads are switched off and mapped onto L2 slices located nearer to those threads. By using nearer L2 slices under the remap policy, some applications show improved execution time in addition to reduced leakage power consumption in NUCA caches. To estimate the maximum possible gains obtainable with the remap policy, we statically determine the near-optimal remap configuration using genetic algorithms. We formulate this as an energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings that are, on average, within 5% of those obtained with the near-optimal remap configuration. The energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA (SNUCA and DNUCA) access policies. The suitability of a cache access policy depends on the data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify the data sharing properties of an application and use them to predict the more suitable cache access policy, SNUCA or DNUCA, for that application.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G25417	en_US
dc.subject	Processor Architecture	en_US
dc.subject	Chip Multiprocessors (CMPs)	en_US
dc.subject	Cache Memory	en_US
dc.subject	Cache (Computers)	en_US
dc.subject	Genetic Algorithms	en_US
dc.subject	Leakage Power Optimization	en_US
dc.subject	Working Set Size Optimization	en_US
dc.subject	Near Optimal Remap Configuration	en_US
dc.subject	Thread Contention Predictors	en_US
dc.subject	On-Chip Cache	en_US
dc.subject	Cache (Computers) Architecture	en_US
dc.subject	Non-Uniform Cache Architecture (NUCA)	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Power Efficient Last Level Cache For Chip Multiprocessors	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.discipline	Faculty of Engineering	en_US
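
Illustrative note on the energy-delay product (EDP) named in the abstract: the sketch below is not taken from the thesis. It only shows, under stated assumptions, how two cache configurations could be compared by EDP, defined as total energy multiplied by execution time. The class CacheConfig, its fields, and every numeric value are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical cache configuration; every field is an illustrative assumption."""
    name: str
    active_slices: int               # L2 slices left powered on
    leakage_watts_per_slice: float   # assumed leakage power per active slice
    dynamic_energy_joules: float     # assumed dynamic energy for the whole run
    exec_time_seconds: float         # assumed execution time


def energy_delay_product(cfg: CacheConfig) -> float:
    """Return EDP = (leakage energy + dynamic energy) * execution time."""
    leakage_energy = (
        cfg.active_slices * cfg.leakage_watts_per_slice * cfg.exec_time_seconds
    )
    total_energy = leakage_energy + cfg.dynamic_energy_joules
    return total_energy * cfg.exec_time_seconds


def best_by_edp(configs):
    """Pick the configuration with the lowest EDP."""
    return min(configs, key=energy_delay_product)


# Made-up inputs: switching off half the slices lowers leakage
# but is assumed to lengthen execution time.
configs = [
    CacheConfig("all_on", 16, 0.5, 2.0, 1.0),
    CacheConfig("half_off", 8, 0.5, 2.0, 1.25),
]

for cfg in configs:
    print(cfg.name, energy_delay_product(cfg))
print("lowest EDP:", best_by_edp(configs).name)
```

With these made-up inputs the slower configuration still has the lower EDP, because the leakage saved outweighs the longer run time; with different inputs the comparison could go the other way.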

Files in this item

Name: G25417.pdf
Size: 3.140Mb
Format: PDF


This item appears in the following Collection(s)

Computer Science and Automation (CSA) [561]
