Posts
Feb, 20
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on […]
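To make the idea concrete, here is a minimal sketch (not the paper's actual compiler output; the kernel is invented) of the classic transformation such compilers apply: each fine-grained SPMD thread becomes one iteration of a loop, so a single CPU thread executes a whole thread block.

```cuda
// Fine-grained SPMD kernel as written for a GPU:
__global__ void scale(float *a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

// Equivalent multicore form: each logical GPU thread becomes one
// iteration of a "thread loop", so one CPU thread runs a whole block.
void scale_block_cpu(float *a, float s, int n,
                     int blockIdx_x, int blockDim_x) {
    for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
        int i = blockIdx_x * blockDim_x + threadIdx_x;
        if (i < n) a[i] *= s;
    }
}
```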
Feb, 20
XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines
There are two avenues for many-core machines to gain higher performance: increasing the number of processors, and increasing the number of vector units in one SIMD processor. A truly scalable algorithm should take advantage of both. However, most prior scalable memory allocators scale well with the number of processors but poorly with the […]
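A minimal sketch of the lock-free core idea, assuming a simple pre-allocated arena; XMalloc's actual design adds SIMD-level buffering and free lists on top of this:

```cuda
// Threads claim space from a shared arena with a single atomic add,
// so allocation needs no locks. This is a simplified illustration,
// not XMalloc itself.
__device__ char g_arena[1 << 20];      // pre-allocated heap arena
__device__ unsigned int g_offset = 0;  // next free byte

__device__ void *simple_malloc(unsigned int bytes) {
    // atomicAdd returns the old offset, which becomes this thread's
    // private region; contention is limited to one atomic operation.
    unsigned int old = atomicAdd(&g_offset, bytes);
    if (old + bytes > sizeof(g_arena)) return 0;  // arena exhausted
    return g_arena + old;
}
```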
Feb, 20
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are […]
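A hedged illustration of one such layout transformation, using invented field names: moving from array-of-structures to structure-of-arrays so that a warp's loads coalesce.

```cuda
// Before: array of structures (AoS). Thread i reads cell[i].p, so
// consecutive threads touch addresses 2*sizeof(float) apart.
struct Cell { float p, t; };
__global__ void step_aos(Cell *cell, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cell[i].p += 0.5f * cell[i].t;
}

// After: structure of arrays (SoA). Consecutive threads read
// consecutive floats, so the loads of p and t both coalesce.
__global__ void step_soa(float *p, float *t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 0.5f * t[i];
}
```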
Feb, 19
Accelerating Particle Image Velocimetry Using Hybrid Architectures
High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating using high speed interconnects. More computational power is harnessed with the addition of hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from […]
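As a rough sketch of why PIV parallelizes so well (the window size and thread mapping here are assumptions, not the paper's): every candidate shift of an interrogation window can be scored by an independent thread.

```cuda
#define W 16  // interrogation window size (illustrative)

// 'b' is assumed to point into a padded frame so that shifted reads
// for every (dx, dy) candidate stay in bounds.
__global__ void piv_xcorr(const float *a, const float *b, int stride,
                          float *score, int range) {
    int dx = threadIdx.x - range;  // candidate displacement in x
    int dy = threadIdx.y - range;  // candidate displacement in y
    float s = 0.0f;
    for (int y = 0; y < W; ++y)
        for (int x = 0; x < W; ++x)
            s += a[y * stride + x] * b[(y + dy) * stride + (x + dx)];
    // One correlation score per candidate shift; the peak gives the
    // particle displacement for this window.
    score[threadIdx.y * blockDim.x + threadIdx.x] = s;
}
// Launched with a (2*range+1) x (2*range+1) thread block per window.
```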
Feb, 19
Programmability: Design Costs and Payoffs using AMD GPU Streaming Languages and Traditional Multi-Core Libraries
GPGPUs and multi-core processors have come to the forefront of interest in scientific computing. Graphics processors have become programmable, allowing their large memory bandwidth and thread-level parallelism to be exploited in general-purpose computing. This paper explores these two architectures, the languages used to program them, and the optimizations used to maximize performance […]
Feb, 19
Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems
We describe the evaluation of several implementations of a simple image processing filter on an NVIDIA GTX 280 card. Our experimental results show that performance depends significantly on low-level details such as data layout and iteration space mapping, which complicate code development and maintenance. We propose extending a CUDA- or OpenCL-like model with decoupled […]
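A small sketch of the iteration-space-mapping effect the authors mention (array shapes invented): the two kernels below compute the same copy, but only the first maps threadIdx.x to the contiguous dimension and therefore coalesces.

```cuda
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // fast dimension
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = in[y * n + x];
}

__global__ void copy_strided(const float *in, float *out, int n) {
    int y = blockIdx.x * blockDim.x + threadIdx.x; // fast thread index
    int x = blockIdx.y * blockDim.y + threadIdx.y; // maps to slow dim
    if (x < n && y < n) out[y * n + x] = in[y * n + x];
}
```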
Feb, 19
Compiler Support for High-level GPU Programming
We design a high-level abstraction of CUDA, called hiCUDA, using compiler directives. It simplifies the task of porting sequential applications to NVIDIA GPUs. This paper focuses on the design and implementation of a source-to-source compiler that translates a hiCUDA program into an equivalent CUDA program, and shows that the performance of CUDA code generated by […]
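For flavor, here is the kind of rewrite such a source-to-source compiler automates; the directive spelling below is hypothetical, not hiCUDA's actual pragma syntax.

```cuda
// Hypothetical annotated source (directive syntax invented here):
//
//   #pragma gpu_kernel grid(n/256) block(256)
//   for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
//
// ...is translated into an equivalent CUDA kernel plus launch:
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Host side: saxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);
```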
Feb, 19
High Performance Relevance Vector Machine on GPUs
The Relevance Vector Machine (RVM) algorithm has been widely utilized in many applications, such as machine learning, image pattern recognition, and compressed sensing. However, the RVM algorithm is computationally expensive. We seek to accelerate the RVM computation for time-sensitive applications by utilizing massively parallel accelerators such as GPUs. In this paper, the computation […]
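For context, the expense comes largely from the posterior update at the heart of standard RVM training (Tipping's formulation; the notation is generic, not necessarily this paper's):

```latex
\[
  \Sigma = \bigl(A + \beta\,\Phi^{\top}\Phi\bigr)^{-1}, \qquad
  \mu = \beta\,\Sigma\,\Phi^{\top}\mathbf{t}, \qquad
  A = \operatorname{diag}(\alpha_1,\dots,\alpha_M)
\]
% The M-by-M inverse recomputed every iteration is the O(M^3)
% bottleneck that motivates GPU acceleration.
```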
Feb, 19
A Generic Approach for Developing Highly Scalable Particle-Mesh Codes for GPUs
We present a general framework for GPU-based low-latency data transfer schemes that can be used for a variety of particle-mesh algorithms [8]. This framework makes it possible to hide the latency of data transfers between GPU-accelerated computing nodes by interleaving them with kernel execution on the GPU. We discuss as an example the fully relativistic […]
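A minimal sketch of the interleaving idea using CUDA streams (the paper's framework is more general; the kernel and buffer handling are assumed):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for the real mesh update
}

// Copies for one chunk are issued in a second stream so they overlap
// with the kernel still processing the previous chunk. The host
// buffers h[] must be page-locked (cudaHostAlloc) for true overlap.
void pipeline(float *h[2], float *d[2], int n, int nchunks) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int c = 0; c < nchunks; ++c) {
        int b = c & 1;  // ping-pong between the two streams
        cudaMemcpyAsync(d[b], h[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(n + 255) / 256, 256, 0, s[b]>>>(d[b], n);
    }
    cudaDeviceSynchronize();
}
```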
Feb, 19
GPU Accelerated Scalable Parallel Random Number Generators
SPRNG (Scalable Parallel Random Number Generators) is widely used in computational science applications, particularly on parallel systems. The lagged Fibonacci generator (LFG) and the linear congruential generator (LCG) are two frequently used generators in this library. In this paper, LFG and LCG are implemented on GPUs in CUDA. As a library for providing random numbers to GPU scientific applications, GASPRNG […]
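A hedged sketch of the LCG recurrence x_{k+1} = (a·x_k + c) mod 2^64 with one generator per thread; the constants are Knuth's MMIX parameters, used here for illustration rather than SPRNG's parameterization:

```cuda
__global__ void lcg_fill(unsigned long long *seed, float *out,
                         int per_thread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long x = seed[tid];  // each thread owns one stream
    for (int k = 0; k < per_thread; ++k) {
        x = 6364136223846793005ULL * x + 1442695040888963407ULL;
        // Take the top 24 bits and scale to [0, 1).
        out[tid * per_thread + k] = (x >> 40) * (1.0f / 16777216.0f);
    }
    seed[tid] = x;  // persist state for the next launch
}
```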
Feb, 19
Faster File Matching using GPGPUs
We address the problem of file matching by modifying the MD6 algorithm, which is well suited to taking advantage of GPU computing. MD6 is a cryptographic hash function that is tree-based and highly parallelizable. When the message M is available initially, the hashing operations can be initiated at different starting points within the message and […]
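A sketch of the tree-hash pattern that makes MD6 parallel; the compression function below is a placeholder, not MD6's real compression function:

```cuda
typedef unsigned long long digest_t;

// Placeholder mixer standing in for MD6's compression function f.
__device__ digest_t compress(digest_t a, digest_t b) {
    return (a ^ (b * 0x9E3779B97F4A7C15ULL)) + (a << 7);
}

// Leaves are hashed independently; each level combines pairs of
// digests until one root digest remains.
__global__ void hash_level(const digest_t *in, digest_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 2)
        out[i] = compress(in[2 * i], in[2 * i + 1]);  // independent pairs
}
// Host loop: launch hash_level repeatedly, halving n, until n == 1.
```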
Feb, 19
Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms
The Cauchy variant of the Reed-Solomon algorithm is implemented on accelerator platforms including GPGPU, FPGA, CellBE and ClearSpeed, as well as on an x86 multi-core system. The sustained throughput performance and kernel rates are measured for a 5+3 Reed-Solomon scheme. To compare the different technology platforms, an efficiency metric is introduced and the platforms are categorized […]
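A simplified, host-side sketch of the XOR-only structure that makes Cauchy Reed-Solomon attractive on these platforms; the real scheme expands each GF(2^w) coefficient into a w×w bit matrix, which is collapsed to a single bit here for brevity:

```cuda
enum { K = 5, M = 3 };  // 5 data blocks, 3 parity blocks (5+3 scheme)

// Each parity word is the XOR of the data words selected by its row
// of the (assumed precomputed) bit matrix, so the encoder needs no
// Galois-field multiplications.
void cauchy_rs_encode(const unsigned int *data[K],
                      unsigned int *parity[M],
                      const unsigned char bitrow[M][K],  // 0/1 matrix
                      int words) {
    for (int p = 0; p < M; ++p)
        for (int w = 0; w < words; ++w) {
            unsigned int acc = 0;
            for (int d = 0; d < K; ++d)
                if (bitrow[p][d]) acc ^= data[d][w];  // XOR-only
            parity[p][w] = acc;
        }
}
```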