
Posts

Oct, 1

A Comprehensive Performance Comparison of CUDA and OpenCL

This paper presents a comprehensive performance comparison between CUDA and OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world ones. We make an extensive analysis of the performance gaps taking into account programming models, optimization strategies, architectural details, and underlying compilers. Our results show that, for most applications, CUDA performs at most […]
Oct, 1

Accelerating Vector Calculations on GPU

Multicore computational accelerators such as Graphics Processing Units (GPUs) have become common for high-performance computing at larger scale. Programming GPUs requires detailed knowledge of the underlying architecture in order to get maximum performance. In this paper we present a solution for vector distance calculation on NVIDIA’s parallel computing architecture CUDA (Compute Unified Device Architecture), where […]
Oct, 1

Large Scale DNA Sequence Alignment and Kernel Method Implemented with GPUs

Large-scale DNA sequence alignment and kernel methods in molecular biology play critical roles in bioinformatics. Both are successfully implemented on the Brook+ platform with AMD’s GPUs. Targeting the characteristics of graphics stream processors, we propose internal and external approaches that work cooperatively to improve the performance of the two algorithms. The experiments show […]
Oct, 1

Interactive Soft Tissue for Surgical Simulation

Medical simulation has the potential to revolutionise the training of medical practitioners. Advantages include reduced risk to patients, increased access to rare scenarios and virtually unlimited repeatability. However, in order to fulfil its potential, medical simulators require techniques to provide realistic user interaction with the simulated patient. Specifically, compelling real-time simulations that allow the trainee […]
Oct, 1

Image registration on GPU

Image registration is a fundamental step in many applications involving image analysis. It consists of optimizing a similarity metric to find a spatial transformation to match two images (in 3D). It has application in medical images to build atlases (registering a population), or to align a patient to a template to detect pathologies. The main […]
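
As a rough illustration of "optimizing a similarity metric to find a spatial transformation", here is a minimal sketch. It is not the method of the paper above: it assumes 2D images stored as nested lists (the abstract concerns 3D), restricts the transformation to integer translations, uses mean squared difference as the similarity metric, and searches exhaustively. The function names `mean_squared_difference` and `register_translation` are hypothetical.

```python
def mean_squared_difference(fixed, moving, dx, dy):
    """Mean squared intensity difference over the region where the
    fixed image and the moving image shifted by (dx, dy) overlap."""
    height, width = len(fixed), len(fixed[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx  # pixel of `moving` that lands on (x, y)
            if 0 <= src_y < height and 0 <= src_x < width:
                diff = fixed[y][x] - moving[src_y][src_x]
                total += diff * diff
                count += 1
    return total / count if count else float("inf")


def register_translation(fixed, moving, max_shift=2):
    """Exhaustively search integer shifts in [-max_shift, max_shift]
    and return the (dx, dy) that minimizes the similarity metric."""
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = mean_squared_difference(fixed, moving, dx, dy)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]
```

Calling `register_translation(fixed, moving)` returns the shift that best aligns `moving` to `fixed` within the search window. Practical registration replaces the exhaustive search with a continuous optimizer and a richer transformation model, and every candidate evaluation is independent, which is what makes the problem attractive for GPUs.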
Sep, 30

Exploring The Latency and Bandwidth Tolerance of CUDA Applications

CUDA applications represent a new body of parallel programs. Although several paradigms exist for programming distributed systems and many-core processors, many users struggle to achieve a program that is scalable across systems with different hardware characteristics. This paper explores the scalability of CUDA applications on systems with varying interconnect latencies, hiding a hardware detail from […]
Sep, 30

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high performance computing. The popularity of such systems is evident from the fact that three out of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look […]
Sep, 30

Real-Time Handling of GPU Interrupts in LITMUS^RT

Graphics processing units (GPUs) are becoming increasingly important in today’s platforms as their increased generality allows them to be used as powerful co-processors. However, unlike standard CPUs, GPUs are treated as I/O devices and require the use of interrupts to facilitate communication with the CPU. Interrupts cause delays in the execution of real-time tasks, […]
Sep, 30

Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control

Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to improve the data locality for the next period of execution. In this work, we examine […]
Sep, 30

Stack-less SIMT reconvergence at low cost

Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, […]
Sep, 30

A PTX Code Generator for LLVM

Today’s GPGPU architectures and corresponding high-level programming languages like CUDA replace the traditionally restricted GPU pipelines. Proprietary compilers translate these languages into native GPU assembly. Unfortunately, these compilers are non-customizable and restricted to static compilation. High-performance applications currently require particular manual optimizations. To overcome these cumbersome manual optimizations, this thesis develops […]
Sep, 30

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

For some classes of problems, the NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering - small fixed […]

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
