| dc.description.abstract | Very Long Instruction Word (VLIW) processors are well known for their high compute capacity and simple hardware. Because of these properties, they are very popular in the embedded processing domain. With the architecture of superscalar processors approaching its complexity limits, the focus has shifted towards VLIW processors. A major hindrance to the successful use of VLIW processors for general-purpose computing is their inherent object code incompatibility across different generations of a processor family. VLIW processors are not object code compatible because their micro-architecture is exposed to the compiler, which generates hardware-specific binaries that exploit the maximum available instruction-level parallelism (ILP) in the application. The binaries are architecture-specific: even a minor change in the hardware configuration of a new generation renders existing binaries suboptimal.
The object code compatibility problem of the VLIW processor family needs to be addressed if these processors are to be employed in the general-purpose domain, because of the legacy software base from industry and academia. Another reason is that, with the rapidly growing popularity of the Internet and the use of VLIW processors for media processing, VLIW processors must be able to execute available application binaries as plug-ins. Existing methods of achieving object code compatibility in VLIW processors include the split issue technique, dynamic rescheduling, and binary translation. The split issue technique is a hardware solution to the compatibility problem and works on the principle of dynamic scheduling of code in VLIW processors. It handles the problem effectively, but it goes against the philosophy of keeping VLIW hardware extremely simple. Moreover, the ILP that this scheme can exploit is limited by the execution window of the code.
The software solutions are dynamic rescheduling and binary translation. The complex operating system support required for dynamic rescheduling often offsets its performance advantage. Binary translation is carried out with the help of a software layer that is the only mode of communication between the application and the underlying hardware; the application remains under the illusion that it is executing on its original hardware platform.
To execute binary-translated code with maximal ILP on a migrant VLIW architecture, this thesis proposes a new scheduling scheme named Speculative Trace Scheduling. The scheme consists of two phases: in the first phase, the application is divided into a number of decision trees, and in the second phase, these decision trees are split into their probable execution paths, which are then scheduled on the processor. These probable execution paths are termed "traces" in this thesis. Traces are carved out with the help of SpliTree, a tool developed for this project. During trace formation, decision points are removed from the body of a trace and extra code is inserted at its tail to check that the assumed conditions actually held. Removing the decision points exposes more ILP, and operations are scheduled as soon as their data dependences are satisfied. Our simulations show that this scheme considerably shortens the schedule lengths of the applications.
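As a rough illustration of the two phases, the sketch below enumerates the root-to-leaf paths of a decision tree as traces, dropping each decision point from the trace body and collecting the assumed branch conditions for a check at the tail. The data layout and names (Node, Trace, carve_traces) are illustrative assumptions only and do not reflect SpliTree's actual interface or the TriMedia toolchain.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        operations: list                                # straight-line operations in this block
        children: list = field(default_factory=list)    # list of (branch_condition, child Node) pairs

    @dataclass
    class Trace:
        body: list            # operations with all decision points removed
        tail_checks: list     # branch conditions re-verified at the tail of the trace
        probability: float = 0.0

    def carve_traces(root):
        """Enumerate every root-to-leaf path of a decision tree as a trace."""
        traces = []

        def walk(node, ops, conditions):
            ops = ops + node.operations
            if not node.children:
                # Leaf reached: one complete probable execution path becomes a trace.
                traces.append(Trace(body=ops, tail_checks=conditions))
                return
            for branch_condition, child in node.children:
                # The decision point itself is not copied into the trace body; the
                # assumed outcome is recorded so it can be checked at the tail.
                walk(child, ops, conditions + [branch_condition])

        walk(root, [], [])
        return traces

Because a trace body carries no control dependences, a scheduler is free to order its operations purely on data dependences, which is where the additional ILP comes from.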
During trace formation, each trace is annotated with its probability of execution, derived from the application's profile information (if available). During execution, instructions from the highest-probability trace are issued first. Because the check for the correct trace is performed at the end of the trace, the penalty for a trace misprediction is the length of the trace. Special hardware support is required to save the processor state at checkpoints, such as correctly executed trace boundaries, so that the processor can roll back on a misprediction. The hardware support required in our scheme differs from that of the split issue technique: it helps the processor roll back to the correct state when exceptions occur and assists the scheduling done in software. In the split issue technique, by contrast, complex compiler support is required along with complex hardware that performs dynamic run-time scheduling similar to superscalar processors. Unlike the split issue technique, our scheme does no scheduling in hardware. The simulations are performed on the Philips TriMedia SDE version 2.0. Speculative trace scheduling achieves a speedup of 1.41× over scheduling based on decision trees.
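The issue-and-rollback discipline described above can be summarised roughly as follows; save_checkpoint, restore_checkpoint, issue, and evaluate are hypothetical stand-ins for the special hardware support and are not actual TriMedia SDE facilities.

    def execute_decision_tree(traces, processor):
        """Issue traces of one decision tree in decreasing order of execution probability."""
        for trace in sorted(traces, key=lambda t: t.probability, reverse=True):
            checkpoint = processor.save_checkpoint()    # state at the trace boundary
            processor.issue(trace.body)                 # run the whole trace speculatively
            if all(processor.evaluate(check) for check in trace.tail_checks):
                return                                  # correct trace: results are kept
            # Misprediction: discard the trace's effects; the penalty paid is the
            # length of the trace, since the check sits at its tail.
            processor.restore_checkpoint(checkpoint)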
The performance achievable by speculative trace scheduling is limited by the penalty incurred on a trace misprediction. The overall misprediction penalty depends on how often mispredictions occur, and this frequency is low in our scheme because traces are speculatively scheduled according to their probability of execution, obtained by application profiling. A further reduction in the misprediction penalty is obtained when the processor's stall cycles on a cache miss and/or page fault during the execution of a trace are overlapped with the execution of the next probable trace of the same decision tree. On a misprediction, after rolling back to the checkpoint state, the processor can continue with the next probable trace, which is already ahead in its execution; this, in effect, reduces the overall misprediction penalty. If, on the other hand, the executed trace turns out to be the correct one, execution proceeds as normal and the ongoing execution of the next probable trace is abandoned.
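A back-of-the-envelope example of the overlap benefit, with made-up cycle counts that are not taken from the thesis:

    trace_length = 40        # cycles the next probable trace needs in total
    stall_cycles = 10        # cache-miss/page-fault stalls seen while the mispredicted trace ran
    head_start   = min(stall_cycles, trace_length)

    penalty_without_overlap = trace_length                # next trace starts from scratch after rollback
    penalty_with_overlap    = trace_length - head_start   # next trace is already head_start cycles in
    print(penalty_without_overlap, penalty_with_overlap)  # 40 vs. 30 cycles of residual penalty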
In establishing complexity-effective architectural support for multithreaded speculative trace scheduling on VLIW processors, we validate a mathematical model for this type of architecture in our thesis. Evaluation of this model shows an overall reduction of 8.39% in the misprediction penalty relative to the case in which stall cycles are not overlapped with the execution of the next probable trace. | |