Posts
Jan 29
A Detailed GPU Cache Model Based on Reuse Distance Theory
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means […]
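The central quantity in this abstract, the reuse distance of a memory access, can be sketched in a few lines: it is the number of distinct addresses touched since the previous access to the same address, and a fully associative LRU cache of C lines hits exactly when that distance is below C. The trace format and function name below are illustrative assumptions, not the paper's model:

```python
def reuse_distances(trace):
    """For each access, the number of distinct addresses touched
    since the previous access to the same address (inf on first use).
    Minimal O(n^2) sketch; real tools use trees for efficiency."""
    last_seen = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses accessed strictly between the two uses
            window = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]
```

Under this model, both reuses above hit in any LRU cache with at least three lines.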
Jan 29
Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators
The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse and […]
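For reference, the Cholesky decomposition this abstract builds on factors a symmetric positive-definite matrix A into L·Lᵀ with L lower triangular. A textbook pure-Python sketch (illustrative only; the paper's hybrid CPU/GPU routines are far more elaborate):

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L * L^T, for a symmetric
    positive-definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # subtract contributions of already-computed columns
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)  # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

print(cholesky([[4.0, 2.0], [2.0, 3.0]]))  # [[2.0, 0.0], [1.0, 1.414...]]
```

The triangular solve against L (rather than a full inverse) is what makes this factorization the workhorse for Gaussian-distribution computations.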
Jan 29
Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems
By scheduling multiple applications with complementary resource requirements on a smaller number of compute nodes, we aim to improve performance, resource utilization, energy consumption, and energy efficiency simultaneously. In addition to our naive consolidation approach, which already achieves the aforementioned goals, we propose a new energy efficiency-aware (EEA) scheduling policy and compare its performance with […]
Jan 29
Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors
Wideband channelization is a computationally intensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy these requirements. While FPGAs allow for flexible implementations, realizing an FPGA implementation is a difficult and time-consuming process. On the […]
Jan 29
On the Programmability and Performance of Heterogeneous Platforms
General-purpose computing on an ever-broadening array of parallel devices has led to an increasingly complex and multi-dimensional landscape with respect to programmability and performance optimization. The growing diversity of parallel architectures presents many challenges to the domain scientist, including device selection, programming model, and level of investment in optimization. All of these choices influence the […]
Jan 29
A Performance Criteria for parallel Computation on basis of block size using CUDA Architecture
A GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device. Multiplication of matrices of large order can be computed in a few seconds on such a GPU. A modern GPU consists of 16 highly threaded streaming multiprocessors (SMs); the Fermi GPU consists of 32 SMs. These are compute-intensive devices. […]
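The block size this abstract studies corresponds to the tile a CUDA thread block stages through shared memory. The blocking idea itself can be shown in a pure-Python sketch (illustrative only; names and the `block` parameter are assumptions, and a real CUDA kernel maps tiles to thread blocks instead of loops):

```python
def blocked_matmul(A, B, block=2):
    """C = A * B computed tile by tile. The block size plays the
    role of the CUDA thread-block tile held in shared memory."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                # multiply one (block x block) tile pair into C
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        acc = 0.0
                        for k in range(kk, min(kk + block, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C

print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
# [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU the block size trades off shared-memory reuse against occupancy, which is why it is a natural tuning criterion.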
Jan 29
Impact of communication times on mixed CPU/GPU applications scheduling using KAAPI
High Performance Computing machines make increasing use of Graphics Processing Units, as they are very efficient for homogeneous computations such as matrix operations. However, before using these accelerators, one must transfer data from the processor to them, and such transfers can be slow. In this report, our aim is to study the impact of […]
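A first-order way to reason about the transfer cost discussed above is a break-even model: offloading pays off only when transfer time plus kernel time beats the CPU time. All names and parameters below are illustrative assumptions, not KAAPI's scheduler model:

```python
def offload_worthwhile(bytes_moved, bandwidth_bps, gpu_time_s, cpu_time_s):
    """True when moving the data and running on the GPU is
    faster than computing in place on the CPU."""
    transfer_s = bytes_moved / bandwidth_bps
    return transfer_s + gpu_time_s < cpu_time_s

# 1 GB over a 10 GB/s link costs 0.1 s of pure transfer:
print(offload_worthwhile(1e9, 1e10, gpu_time_s=0.01, cpu_time_s=0.2))  # True
print(offload_worthwhile(1e9, 1e10, gpu_time_s=0.15, cpu_time_s=0.2))  # False
```

A scheduler that ignores the transfer term can therefore offload tasks that run slower end to end, which is precisely the effect the report measures.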
Jan 28
Scheduling on Manycore and Heterogeneous Graphics Processors
Through custom software schedulers that distribute work differently than built-in hardware schedulers, data-parallel and heterogeneous architectures can be retargeted towards irregular task-parallel graphics workloads. This dissertation examines the role of a GPU scheduler and how it may schedule complicated workloads onto the GPU for efficient parallel processing. This dissertation examines the scheduler through three different […]
Jan 28
Automatic Resource-Constrained Static Task Parallelization
This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications […]
Jan 28
GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which […]
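Fault injection of the kind this abstract describes is commonly modeled as a single bit flip in an architectural value. A minimal sketch of that fault model (names are illustrative, not GPU-Qin's actual injector API):

```python
import random

def flip_bit(value, bit=None, width=32):
    """Single-bit-flip fault model: XOR one bit of a width-bit
    integer value, at a random position if none is given."""
    if bit is None:
        bit = random.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1)

print(flip_bit(0, bit=3))   # 8
print(flip_bit(8, bit=3))   # 0  (flipping twice restores the value)
```

A GPGPU injector must additionally choose which of thousands of threads, and which dynamic instruction, receives the flip, which is where the massive parallelism makes the engineering hard.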
Jan 28
Performance-Correctness Challenges in Emerging Heterogeneous Multicore Processors
We are witnessing a tremendous amount of change in the design of the modern microprocessor. With dozens of CPU cores on-chip in recent multicore processors, the search for thread-level parallelism (TLP) is more significant than ever. In parallel, a very different processor architecture has emerged that aims to extract parallelism at an entirely different scale. Originally […]
Jan 28
Autotuning Programs with Algorithmic Choice
The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally manually conducted by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to […]