On the effectiveness of exploiting instruction level reuse in superscalar microprocessor: Power-performance perspective

Surendra, G

dc.contributor.advisor	Nandy, S K
dc.contributor.author	Surendra, G
dc.date.accessioned	2025-10-30T10:39:53Z
dc.date.available	2025-10-30T10:39:53Z
dc.date.submitted	2006
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/7255
dc.description.abstract	Modern microprocessors exploit Instruction Level Parallelism (ILP) by employing substantial on-chip resources and incorporating microarchitectural features that exploit properties of programs. Among program attributes, locality is probably the most commonly exploited phenomenon to improve processor performance, with units such as caches, branch predictors, and value predictors exploiting variants of this property. Value locality describes the likelihood of a previously seen value reoccurring at a storage location and is the basis for another program property called instruction repetition. Instruction repetition denotes the property of instructions being frequently executed with the same input values and producing the same results as a previous instance. Value reuse or Instruction Reuse (IR) mechanisms have been proposed in previous studies to exploit instruction repetition and reduce the number of dynamic instructions executed by a processor. IR is a non-speculative microarchitectural technique that reduces processor work and instruction execution time by buffering data associated with recently executed instructions in a lookup table called a Reuse Buffer (RB), so that future dynamic instances of instructions use the results from the RB if they have the same input operands. Previous research has predominantly focused on the performance potential of IR. In this thesis, we consider both power and performance as metrics and examine the effectiveness of IR in a superscalar processor for SPEC2000, media, and packet processing or network processing (NPU) benchmarks. The introduction of power as a metric essentially imposes several hardware constraints that have performance implications as well, and affects the degree of freedom with which IR can be exploited. Most prominent among these constraints are the RB size and the RB management policy, which determine which instructions access/update the RB. Both are extensively studied in this work. 
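The reuse test described above can be sketched in a few lines of Python. This is a minimal illustration, not the design evaluated in the thesis: the direct-mapped organization indexed by the program counter, the entry layout, the default size of 256 entries, and the names `RBEntry`, `ReuseBuffer`, and `execute` are all assumptions made for the example. It covers only simple ALU-style operations whose result depends solely on their operand values.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RBEntry:
    pc: int                    # tag: address of the instruction (assumed tag)
    operands: Tuple[int, ...]  # operand values seen by the buffered instance
    result: int                # result produced for those operand values


class ReuseBuffer:
    """Hypothetical direct-mapped reuse buffer indexed by the PC."""

    def __init__(self, num_entries: int = 256) -> None:
        self.num_entries = num_entries
        self.entries: List[Optional[RBEntry]] = [None] * num_entries
        self.lookups = 0
        self.hits = 0

    def _index(self, pc: int) -> int:
        # Direct-mapped: each PC maps to exactly one slot.
        return pc % self.num_entries

    def lookup(self, pc: int, operands: Tuple[int, ...]) -> Optional[int]:
        """Reuse test: hit only if the same instruction saw the same operands."""
        self.lookups += 1
        entry = self.entries[self._index(pc)]
        if entry is not None and entry.pc == pc and entry.operands == operands:
            self.hits += 1
            return entry.result
        return None

    def update(self, pc: int, operands: Tuple[int, ...], result: int) -> None:
        # Overwrite whatever occupied the slot (no replacement policy needed).
        self.entries[self._index(pc)] = RBEntry(pc, operands, result)


def execute(rb: ReuseBuffer, pc: int, operands: Tuple[int, ...],
            alu: Callable[..., int]) -> int:
    """Return the buffered result on a hit; otherwise compute and buffer it."""
    reused = rb.lookup(pc, operands)
    if reused is not None:
        return reused          # execution is skipped entirely
    result = alu(*operands)
    rb.update(pc, operands, result)
    return result
```

Calling `execute` twice with the same `pc` and operand tuple runs the `alu` callable only once; the second call is served from the buffer. Because the test compares actual operand values, a hit is always correct, which is what makes the technique non-speculative.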
This thesis makes three primary contributions: 1. It studies the opportunity available to exploit IR and characterizes “delay” and criticality properties of instructions that can perform the reuse test. These properties are used to devise new RB query and update policies with the intention of improving the energy efficiency of IR. IR is also examined from a criticality perspective, since the performance potential depends on the criticality of instructions being reused. The criticality study also highlights some important limitations of IR. Furthermore, the study sheds light on how instruction criticality varies with the processor front-end and motivates the need for implementing a combination of front-end throttling and IR to reduce total processor work, especially in future aggressive architectures. 2. To improve the energy efficiency of IR, this thesis advocates passing an index (instead of the actual result) from a reused producer instruction to dependent consumer instructions in the Instruction Window. This scheme, which we shall refer to as the resultbus optimization, exploits “communication reuse” (as opposed to “computation reuse”) and reduces the power dissipated over the high-capacitance result bus. We present a microarchitecture to exploit the above optimization, discuss its limitations, and present optimistic results for the same. 3. As a domain-specific study, the thesis considers network processing benchmarks and examines whether RB hit rates can be improved by exploiting packet “flow” information. Packet processing applications are unique in the sense that repetition in data values within “flows” is quite prevalent, especially in packet headers. We present a flow-based IR scheme in which packet flow information is used to select an RB (from among multiple RBs) and compare it against (i) an opcode-based and (ii) a random selection mechanism.
This study essentially quantifies the impact of external inputs and examines the effect of sharing the RB among threads in a multithreaded processor. Our simulation results indicate that IR is promising both in terms of power and performance provided that: (i) the RB has fewer than 512 entries and is organized as either a direct-mapped or 2-way associative structure, (ii) the RB is partitioned to reduce conflicts, with each partition catering to a specific set of opcodes, (iii) the RB is dynamically accessed/updated by all instructions capable of performing the reuse test instead of restricting accesses to only critical instructions, and (iv) IR is not exploited for load instructions. Specifically, by exploiting IR, average Energy-Delay Product (EDP) savings of 3.9%, 5.3%, and 2.7% are achieved in SPEC, media, and NPU benchmarks, respectively. By exploiting the resultbus optimization for ALU instructions, an additional saving of 4.2% is achieved in SPEC benchmarks. In summary, the effectiveness of IR depends on the underlying processor microarchitecture and in particular on: (i) the number of pipeline stages that can be bypassed by reused instructions, and (ii) instruction execution latency. Also, instruction repetition is a property of the program itself, so certain benchmarks are more amenable to reuse than others. This provides a motivation for exploiting IR in application-specific systems as well.
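The Energy-Delay Product quoted above is the product of the energy consumed and the execution time, so a saving in EDP can come from either factor or both. The short Python sketch below shows how a percentage EDP saving could be computed from a baseline run and a run with IR enabled. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not measurements from the thesis.

```python
def energy_delay_product(energy: float, delay: float) -> float:
    """EDP = energy x delay (any consistent units, e.g. joules and seconds)."""
    return energy * delay


def edp_savings_percent(base_energy: float, base_delay: float,
                        new_energy: float, new_delay: float) -> float:
    """Percentage by which the new configuration lowers EDP vs. the baseline."""
    base = energy_delay_product(base_energy, base_delay)
    new = energy_delay_product(new_energy, new_delay)
    return 100.0 * (base - new) / base


# Hypothetical, normalized inputs (baseline energy = delay = 1.0).
both_improve = edp_savings_percent(1.0, 1.0, 0.98, 0.98)
energy_worse = edp_savings_percent(1.0, 1.0, 1.01, 0.95)
print(f"{both_improve:.2f}% {energy_worse:.2f}%")
```

With these made-up inputs, a 2% reduction in both energy and delay gives an EDP saving of about 3.96%, while a 1% energy increase (for example, from the extra buffer accesses) combined with a 5% delay reduction still gives about 4.05%. The metric rewards a design only when the time saved outweighs the energy spent on the added hardware.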
dc.language.iso	en_US
dc.relation.ispartofseries	T06120
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject	Reuse Buffer Management
dc.subject	Flow-Based Instruction Reuse
dc.subject	Multithreaded Processor Optimization
dc.title	On the effectiveness of exploiting instruction level reuse in superscalar microprocessor: Power-performance perspective
dc.degree.name	PhD
dc.degree.level	Doctoral
dc.degree.grantor	Indian Institute of Science
dc.degree.discipline	Engineering

Files in this item

Name: T06120.pdf
Size: 114.3Mb
Format: PDF

This item appears in the following Collection(s)

Supercomputer Education and Research Centre (SERC) [113]
