Posts

Jan, 28

Autotuning Programs with Algorithmic Choice

The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally conducted manually by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to […]
Jan, 26

gem5-gpu: A Heterogeneous CPU-GPU Simulator

gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from […]
Jan, 26

A Dynamic Offload Scheduler for spatial multitasking on Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor fully supports multitasking, but it does not automatically ensure high performance in this case. A conventional task-level resource allocation scheduler could be used, but processor utilization of the Xeon Phi is then low because of idle time. In this paper, we propose a […]
Jan, 26

Platform-Specific Optimization and Mapping of Stencil Codes through Refinement

A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is, first, error-prone and time-consuming and, second, taints the code by exposing those […]
Jan, 26

Computing Best Possible Pseudo-Solutions to Interval Linear Systems of Equations

In the paper, we consider interval linear algebraic systems of equations Ax = b, with an interval matrix A and interval right-hand side vector b, as a model of imprecise systems of linear algebraic equations of the same form. We propose a new regularization procedure that reduces the solution of the imprecise linear system to […]
Jan, 26

Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services

In this work, we describe an application of convolutional networks to object classification and detection in images. The task of image-based object recognition is surveyed in the first chapter. Its application in internet advertising is one of the main motivations of this work. The architecture of the convolutional networks is described in detail in […]
Jan, 26

Optimizing Stencil Computations for NVIDIA Kepler GPUs

We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations with regular grids have been ported to older generations of NVIDIA GPUs with significant performance improvements, thanks to their higher memory bandwidth compared with conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of […]
Jan, 26

Hybrid strategy for stencil computations on the APU

Stencil computations are very regular and well adapted to GPU execution. However, the PCI-E bus that connects a discrete GPU to the system memory has a relatively low bandwidth when compared to the GPU compute power. The AMD APU architecture contains both CPU and GPU on the same chip and shared memory between them, which […]
Jan, 26

Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

The Active Appearance Model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern Graphics Processing Units (GPUs) that feature a […]
Jan, 26

GPU acceleration of Newton’s method for large systems of polynomial equations in double double and quad double arithmetic

In order to compensate for the higher cost of double double and quad double arithmetic when solving large polynomial systems, we investigate the application of NVIDIA Tesla C2050, K20C, and K40 general purpose graphics processing units. As the dimension equals several thousand, the cost to compute one QR decomposition is sufficiently large so that the […]
Jan, 26

GPU Monte Carlo scatter calculations for Cone Beam Computed Tomography

A GPU Monte Carlo code for x-ray photon transport has been implemented and extensively tested. The code is intended for scatter compensation of cone beam computed tomography images. The code was tested to agree with other well-known codes within 5% for a set of simple scenarios. The scatter compensation was also tested using an […]
Jan, 25

A High-productivity Framework for Multi-GPU computation of Mesh-based applications

The paper proposes a high-productivity framework for multi-GPU computation of mesh-based applications. In order to achieve high performance on these applications, we have to introduce complicated optimization techniques for GPU computing, which carry a relatively high implementation cost. Our framework automatically translates user-written functions that update a grid point and generates both GPU and CPU code. […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
