Efficient Compilation Of Stream Programs Onto Multi-cores With Accelerators

Udupa, Abhishek

dc.contributor.advisor	Govindarajan, R
dc.contributor.author	Udupa, Abhishek
dc.date.accessioned	2010-12-30T05:37:45Z
dc.date.accessioned	2018-07-31T04:40:04Z
dc.date.available	2010-12-30T05:37:45Z
dc.date.available	2018-07-31T04:40:04Z
dc.date.issued	2010-12-30
dc.date.submitted	2009
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/971
dc.description.abstract	Over the past two decades, microprocessor manufacturers have typically relied on wider issue widths and deeper pipelines to improve the performance of single-threaded applications. In recent years, however, power dissipation and wire delays have become primary design constraints, and this approach can no longer deliver effective performance improvements. Processor designers and vendors are therefore universally moving towards multi-core designs. Examples include commodity general-purpose multi-core processors, the CellBE accelerator from IBM, and the Graphics Processing Units from NVIDIA and ATI. Although these many-core and multi-core architectures can provide enormous performance benefits, they are difficult to program because of the complexity of writing explicitly parallel code. The ubiquity of computationally intensive media processing applications makes it imperative to consider new programming frameworks and languages that can express parallelism in an easy, portable manner. The StreamIt programming language has been proposed to efficiently exploit parallelism at various levels on general-purpose multi-core architectures and stream processors, and to allow media processing and DSP applications to be developed in an easy and portable fashion. The StreamIt model allows programmers to specify a program as a set of filters connected by FIFO communication channels. The graphs specified by StreamIt programs describe task, data and pipeline parallelism, which can potentially be exploited on modern Graphics Processing Units (GPUs). GPUs have emerged as powerful, commodity stream processors that support abundant parallelism in hardware. The first part of this thesis deals with the challenges in mapping StreamIt programs to GPUs and proposes an efficient technique to software pipeline the execution of stream programs on GPUs.
We formulate this problem, covering both the scheduling and the assignment of filters to processors, as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs that facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. We have evaluated our approach on a platform equipped with an NVIDIA GeForce 8800 GTS 512 GPU. It yields a (geometric) mean speedup of 5.02X, with a maximum speedup of 36.83X, across a set of StreamIt benchmarks, with the speedup measured relative to an optimized single-threaded CPU execution. While the approach of software pipelining the execution of stream programs on GPUs is efficient and performs well, it does not utilize the CPU cores to perform useful computation. Further, it does not support programs with stateful filters, that is, filters that are not data parallel because each firing depends on the previous one through the implicit state of the filter. The second part of the thesis addresses these issues and describes a novel method to orchestrate the execution of a StreamIt program on the multiple CPU cores of a system and on GPUs in a synergistic manner. The proposed approach uses profiling to identify the relative benefits of executing a task on the superscalar CPU cores and on the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU as an integrated Integer Linear Program (ILP), which can then be solved by an ILP solver. The formulation takes into account the latencies of data transfers, the limited DMA bandwidth available, and the buffer layout transformations required by the partitioning.
Since solving an ILP is NP-hard in the general case and may thus require a large amount of time, we also propose an efficient heuristic algorithm for partitioning the work between the CPU and the GPU. On average across the benchmark suite, its solutions are within 9.05% of the optimal solutions to the ILP formulation, while requiring 2–3 orders of magnitude less time than the ILP approach. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with eight CPU cores, of which four were used, and a GeForce 8800 GTS 512 GPU show a (geometric) mean speedup of 6.84X, with a maximum of 51.96X, over a single-threaded CPU execution across a set of StreamIt benchmarks.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G23383	en_US
dc.subject	Compilers	en_US
dc.subject	Stream Programs	en_US
dc.subject	Partitioning Algorithm	en_US
dc.subject	Stream Programs - Execution	en_US
dc.subject	Stream Programs - Compilation	en_US
dc.subject	Graphics Processing Units (GPUs)	en_US
dc.subject	Streaming Multiprocessors (SMs)	en_US
dc.subject	Accelerators	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Efficient Compilation Of Stream Programs Onto Multi-cores With Accelerators	en_US
dc.type	Thesis	en_US
dc.degree.name	MSc Engg	en_US
dc.degree.level	Masters	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name: G23383.pdf
Size: 2.508Mb
Format: PDF

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [545]

Related items

Benchmarking and Scheduling Strategies for Distributed Stream Processing

Efficient Frequent Closed Itemset Algorithms With Applications To Stream Mining And Classification

New Approaches And Experimental Studies On - Alegebraic Attacks On Stream Ciphers