Integrated Scheduling For Clustered VLIW Processors
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. To achieve better performance, the scheduler must reconcile two conflicting requirements: exploiting the parallelism offered by the hardware and limiting communication among clusters. Earlier proposals for cluster scheduling fall into two main categories, phase-decoupled scheduling and phase-coupled scheduling, and they focus on clustered architectures that provide inter-cluster communication through an explicit inter-cluster copy operation. However, modern commercial clustered architectures also provide snooping capabilities (in addition to inter-cluster communication using an explicit MV operation), allowing some functional units to read operands from the register files of some other clusters without any extra delay. The phase-decoupled approach suffers from the well-known phase-ordering problem, which becomes severe for such a machine model (with snooping) because communication and resource constraints are tightly coupled and are therefore exposed only during scheduling. Because of this tight coupling, binding an instruction effectively also requires taking into account the resource and communication requirements of the other instructions ready to be scheduled in the current cycle.
However, earlier proposals on integrated scheduling consider instructions and clusters for binding in a fixed order, and thus show widely varying performance characteristics in terms of execution time and code size. Other shortcomings of earlier integrated algorithms, which lead to suboptimal cluster scheduling decisions, stem from their failure to consider functional unit binding and the future communication that a binding may cause. In this thesis, we propose a pragmatic scheme and a generic graph matching based framework for cluster scheduling, both built on a generic and realistic clustered machine model. The proposed scheme uses exact knowledge of the available communication slots, functional units, and load on different clusters, as well as future resource and communication requirements known only at schedule time, to attain significant performance improvement over earlier algorithms without a code size penalty. The proposed graph matching based framework resolves the phase-ordering and fixed-ordering problems associated with scheduling on clustered VLIW architectures. The framework provides a mechanism to exploit the slack of instructions by dynamically varying the freedom available in scheduling an instruction, and hence the cost of scheduling that instruction using different alternatives, to reduce inter-cluster communication. An experimental evaluation of the proposed framework and some of the earlier proposals is presented in the context of a state-of-the-art commercial clustered architecture.
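
To make the idea of binding by matching more concrete, the following is a minimal illustrative sketch and not the algorithm developed in the thesis. It assumes a toy machine model in which each ready instruction must be bound to one free (cluster, unit) slot in the current cycle, and the cost of a binding is the number of operands that would need an explicit inter-cluster copy, with operands reachable by snooping costing nothing. All names (comm_cost, bind_ready_instructions, operand_home, snoopable) are hypothetical, functional unit types and instruction slack are ignored, and the minimum-cost matching is found by brute force, which is only practical for very small examples.

```python
from itertools import permutations


def comm_cost(instr, cluster, operand_home, snoopable):
    """Count operands of instr that need an explicit copy if it runs on cluster.

    operand_home maps an instruction to the clusters holding its operands.
    snoopable holds (reader_cluster, owner_cluster) pairs that can be read
    directly, without an explicit copy (an assumed model of snooping).
    """
    cost = 0
    for src in operand_home[instr]:
        if src != cluster and (cluster, src) not in snoopable:
            cost += 1
    return cost


def bind_ready_instructions(ready, slots, operand_home, snoopable):
    """Bind all ready instructions at once, minimizing total copy count.

    slots is a list of free (cluster, unit) pairs in the current cycle.
    Considering all ready instructions together avoids committing to a
    fixed instruction or cluster order. Returns (None, None) when there
    are more ready instructions than free slots.
    """
    best_cost, best_binding = None, None
    for chosen in permutations(slots, len(ready)):
        cost = sum(
            comm_cost(instr, slot[0], operand_home, snoopable)
            for instr, slot in zip(ready, chosen)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_binding = cost, dict(zip(ready, chosen))
    return best_cost, best_binding


# Hypothetical two-cluster example: units on cluster 1 can snoop cluster 0.
ready = ["add1", "mul1"]
slots = [(0, "ALU"), (1, "ALU")]
operand_home = {"add1": [0, 0], "mul1": [0, 1]}
snoopable = {(1, 0)}

cost, binding = bind_ready_instructions(ready, slots, operand_home, snoopable)
```

In this example, placing mul1 on cluster 0 would require one explicit copy for its operand on cluster 1, whereas placing it on cluster 1 needs none because cluster 0 can be snooped. A scheduler that bound add1 and mul1 one at a time in a fixed order could miss such interactions; evaluating the bindings jointly is the point the sketch is meant to illustrate.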