Posts
Feb, 20
Fast Exact String Matching on the GPU
We present a string-matching program that runs on the GPU. Our program, Cmatch, achieves a speedup of as much as 35x on a recent GPU over the equivalent CPU-bound version. String matching has a long history in computational biology with roots in finding similar proteins and gene sequences in a database of known sequences. The […]
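Cmatch's GPU kernel is not reproduced here, but the problem it parallelizes can be sketched with a minimal CPU baseline for exact matching. The function name and sequences below are illustrative, not from the paper:

```python
def find_all_exact(text: str, pattern: str) -> list[int]:
    """Return every index where `pattern` occurs exactly in `text`.

    A naive O(len(text) * len(pattern)) scan. A GPU matcher can
    parallelize this kind of work across thousands of threads,
    e.g. one candidate start position per thread.
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

# Matching a short motif against a DNA fragment:
print(find_all_exact("ACGTACGTTACGT", "ACGT"))  # → [0, 4, 9]
```

Because every start position is tested independently, the loop has no cross-iteration dependences, which is what makes this class of problem a natural fit for the GPU.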
Feb, 20
Program Optimization Study on a 128-Core GPU
The newest generations of graphics processing unit (GPU) architecture, such as the NVIDIA GeForce 8-series, feature new interfaces that improve programmability and generality over previous GPU generations. Using NVIDIA’s Compute Unified Device Architecture (CUDA), the GPU is presented to developers as a flexible parallel architecture. This flexibility introduces the opportunity to perform a wide variety […]
Feb, 20
How GPUs Can Improve the Quality of Magnetic Resonance Imaging
In magnetic resonance imaging (MRI), non-Cartesian scan trajectories are advantageous in a wide variety of emerging applications. Advanced reconstruction algorithms that operate directly on non-Cartesian scan data using optimality criteria such as least-squares (LS) can produce significantly better images than conventional algorithms that apply a fast Fourier transform (FFT) after interpolating the scan data onto […]
Feb, 20
MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores
The CUDA programming model, which is based on an extended ANSI C language and a runtime environment, allows the programmer to explicitly specify data-parallel computation. NVIDIA developed CUDA to open the architecture of their graphics accelerators to more general applications, but did not provide an efficient mapping to execute the programming model on any […]
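One way to map the model onto a CPU, in the spirit of MCUDA's approach, is to serialize each thread block: wrap the per-thread kernel body in an explicit loop over logical thread ids. A minimal sketch (the names `vec_add_kernel` and `run_block_serialized` are illustrative, not from the paper):

```python
def vec_add_kernel(a, b, c, tid):
    # Kernel body as one GPU thread would see it: tid plays the role
    # of blockIdx.x * blockDim.x + threadIdx.x.
    c[tid] = a[tid] + b[tid]

def run_block_serialized(kernel, n_threads, *args):
    # The "thread loop": iterate the kernel body over every logical
    # thread id instead of launching n_threads hardware threads.
    for tid in range(n_threads):
        kernel(*args, tid)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * 4
run_block_serialized(vec_add_kernel, 4, a, b, c)
print(c)  # → [11.0, 22.0, 33.0, 44.0]
```

Handling kernels that contain barrier synchronization requires splitting the body into multiple such loops, which is where the compiler work gets interesting.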
Feb, 20
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on […]
Feb, 20
XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines
There are two avenues for many-core machines to gain higher performance: increasing the number of processors, and increasing the number of vector units in one SIMD processor. A truly scalable algorithm should take advantage of both. However, most past research on scalable memory allocators scales well with the number of processors, but poorly with the […]
Feb, 20
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are […]
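One well-known instance of this kind of layout change is converting an array of structures (AoS) into a structure of arrays (SoA), so that neighboring threads reading the same field touch adjacent memory that the GPU can coalesce into wide accesses. A minimal sketch; the field names `pressure` and `velocity` are illustrative, and the paper's automatic transformation is more general:

```python
def aos_to_soa(cells):
    """Convert [{'pressure': p, 'velocity': v}, ...] (AoS) into
    {'pressure': [...], 'velocity': [...]} (SoA).

    In SoA form, all values of one field are contiguous, which favors
    coalesced memory access on a GPU."""
    soa = {'pressure': [], 'velocity': []}
    for cell in cells:
        soa['pressure'].append(cell['pressure'])
        soa['velocity'].append(cell['velocity'])
    return soa

grid = [{'pressure': 1.0, 'velocity': 0.5},
        {'pressure': 2.0, 'velocity': 0.25}]
print(aos_to_soa(grid))
# → {'pressure': [1.0, 2.0], 'velocity': [0.5, 0.25]}
```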
Feb, 19
Accelerating Particle Image Velocimetry Using Hybrid Architectures
High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating over high-speed interconnects. More computational power is harnessed with the addition of hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from […]
Feb, 19
Programmability: Design Costs and Payoffs using AMD GPU Streaming Languages and Traditional Multi-Core Libraries
GPGPUs and multi-core processors have come to the forefront of interest in scientific computing. Graphics processors have become programmable, allowing exploitation of their large amounts of memory bandwidth and thread level parallelism in general purpose computing. This paper explores these two architectures, the languages used to program them, and the optimizations used to maximize performance […]
Feb, 19
Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems
We describe the evaluation of several implementations of a simple image processing filter on an NVIDIA GTX 280 card. Our experimental results show that performance depends significantly on low-level details such as data layout and iteration space mapping which complicate code development and maintenance. We propose extending a CUDA or OpenCL like model with decoupled […]
Feb, 19
Compiler Support for High-level GPU Programming
We design a high-level abstraction of CUDA, called hiCUDA, using compiler directives. It simplifies the tasks in porting sequential applications to NVIDIA GPUs. This paper focuses on the design and implementation of a source-to-source compiler that translates a hiCUDA program into an equivalent CUDA program, and shows that the performance of CUDA code generated by […]
Feb, 19
High Performance Relevance Vector Machine on GPUs
The Relevance Vector Machine (RVM) algorithm has been widely utilized in many applications, such as machine learning, image pattern recognition, and compressed sensing. However, the RVM algorithm is computationally expensive. We seek to accelerate the RVM algorithm computation for time sensitive applications by utilizing massively parallel accelerators such as GPUs. In this paper, the computation […]