Posts
Jan, 28
Scheduling on Manycore and Heterogeneous Graphics Processors
Through custom software schedulers that distribute work differently than built-in hardware schedulers, data-parallel and heterogeneous architectures can be retargeted towards irregular task-parallel graphics workloads. This dissertation examines the role of a GPU scheduler and how it may schedule complicated workloads onto the GPU for efficient parallel processing. This dissertation examines the scheduler through three different […]
Jan, 28
Automatic Resource-Constrained Static Task Parallelization
This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications […]
Jan, 28
GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which […]
Jan, 28
Performance-Correctness Challenges in Emerging Heterogeneous Multicore Processors
We are witnessing a tremendous amount of change in the design of the modern microprocessor. With dozens of CPU cores on-chip in recent multicore processors, the search for thread-level parallelism (TLP) is more significant than ever. In parallel, a very different processor architecture has emerged that aims to extract parallelism at an entirely different scale. Originally […]
Jan, 28
Autotuning Programs with Algorithmic Choice
The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally manually conducted by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to […]
Jan, 26
gem5-gpu: A Heterogeneous CPU-GPU Simulator
gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from […]
Jan, 26
A Dynamic Offload Scheduler for spatial multitasking on Intel Xeon Phi Coprocessor
The Intel Xeon Phi coprocessor fully supports multitasking, but multitasking alone does not automatically ensure high performance. A conventional task-level resource allocation scheduler could be used, but processor utilization of the Xeon Phi remains low because of idle time. In this paper, we propose a […]
Jan, 26
Platform-Specific Optimization and Mapping of Stencil Codes through Refinement
A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is, first, error-prone and time-consuming and, second, taints the code by exposing those […]
Jan, 26
Computing Best Possible Pseudo-Solutions to Interval Linear Systems of Equations
In the paper, we consider interval linear algebraic systems of equations Ax = b, with an interval matrix A and interval right-hand side vector b, as a model of imprecise systems of linear algebraic equations of the same form. We propose a new regularization procedure that reduces the solution of the imprecise linear system to […]
Jan, 26
Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services
In this work, we describe an application of convolutional networks to object classification and detection in images. The task of image-based object recognition is surveyed in the first chapter. Its application in internet advertising is one of the main motivations of this work. The architecture of the convolutional networks is described in detail in […]
Jan, 26
Optimizing Stencil Computations for NVIDIA Kepler GPUs
We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations with regular grids have been ported to older generations of NVIDIA GPUs with significant performance improvements, thanks to their higher memory bandwidth compared with conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of […]
Jan, 26
Hybrid strategy for stencil computations on the APU
Stencil computations are very regular and well adapted to GPU execution. However, the PCI-E bus that connects a discrete GPU to the system memory has a relatively low bandwidth compared to the GPU's compute power. The AMD APU architecture places both the CPU and GPU on the same chip, with memory shared between them, which […]