dc.contributor.advisorSrikant, Y N
dc.contributor.advisorBharadwaj, Amrutur
dc.contributor.authorMandke, Aparna
dc.date.accessioned2015-09-09T07:14:19Z
dc.date.accessioned2018-07-31T04:38:33Z
dc.date.available2015-09-09T07:14:19Z
dc.date.available2018-07-31T04:38:33Z
dc.date.issued2015-09-09
dc.date.submitted2013
dc.identifier.urihttp://etd.iisc.ac.in/handle/2005/2485
dc.identifier.abstracthttp://etd.iisc.ac.in/static/etd/abstracts/3207/G25417-Abs.pdfen_US
dc.description.abstractThe number of processor cores and the on-chip cache size have been increasing on chip multiprocessors (CMPs). As a result, the leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch off the over-allocated cache so as to reduce the leakage power it consumes. A large cache offers non-uniform access latency to the different cores on a CMP; such a cache is called a “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches with a single application executing on a uniprocessor. Our ideas for power-optimized caches apply to any memory technology and architecture for which the difference between the leakage power of an on-chip cache bank in the on-state and in the off-state is significant. Switching off the last-level shared cache on a CMP is a challenging problem because of concurrently executing threads/processes and the large, dispersed NUCA cache. Hence, to determine the cache requirement on a CMP, we first propose a new, highly accurate method to estimate the working set size of an application, which we call the “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our adaptable associative cache scales with the number of cores on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access, unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher energy-delay product (EDP) savings than those obtained with the average memory access latency and cache miss ratio heuristics, respectively, on a static NUCA (SNUCA) platform. Cache misses increase with reduced cache associativity. 
Hence, we also propose to map some of the L2 slices onto the remaining L2 slices and switch off the mapped slices. An L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with fewer threads than available cores. In such applications, L2 slices that are farther from those threads are switched off and mapped onto L2 slices that are located nearer to those threads. By using nearer L2 slices through remapping, some applications show improved execution time in addition to reduced leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using genetic algorithms. We formulate this problem as an energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings that are, on average, within 5% of those obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static NUCA and dynamic NUCA (DNUCA) access policies. The suitability of a cache access policy depends on the data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify the data sharing properties of an application and use them to predict the more suitable cache access policy, SNUCA or DNUCA, for that application.en_US
dc.language.isoen_USen_US
dc.relation.ispartofseriesG25417en_US
dc.subjectProcessor Architectureen_US
dc.subjectChip Multiprocessors (CMPs)en_US
dc.subjectCache Memoryen_US
dc.subjectCache (Computers)en_US
dc.subjectGenetic Algorithmsen_US
dc.subjectLeakage Power Optimizationen_US
dc.subjectWorking Set Size Optimizationen_US
dc.subjectNear Optimal Remap Configurationen_US
dc.subjectThread Contention Predictorsen_US
dc.subjectOn-Chip Cacheen_US
dc.subjectCache (Computers) Architectureen_US
dc.subjectNon-Uniform Cache Architecture (NUCA)en_US
dc.subject.classificationComputer Scienceen_US
dc.titlePower Efficient Last Level Cache For Chip Multiprocessorsen_US
dc.typeThesisen_US
dc.degree.namePhDen_US
dc.degree.levelDoctoralen_US
dc.degree.disciplineFaculty of Engineeringen_US

