Posts

Feb, 21

APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for an RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our […]
Feb, 20

Final Project Implementing Extremely Randomized Trees in CUDA

In this paper, we present an implementation of extremely randomized trees (ERT), a supervised machine learning algorithm utilizing decision tree ensembles, in CUDA, NVIDIA’s GPU parallel programming extensions for C/C++. We describe the CUDA programming model and NVIDIA GPU architectures and explain the design tradeoffs that we made to exploit various forms of parallelism available […]
Feb, 20

Architecting graphics processors for non-graphics compute acceleration

This paper discusses the emergence of graphics processing units (GPUs) that contain architecture features for accelerating non-graphics (or GPGPU) applications. It provides an introduction for those interested in undertaking research at the intersection of manycore computing and GPU architecture. First, the motivation for using GPUs for non-graphics processing rather than developing specialized hardware is outlined. […]
Feb, 20

Design Space Exploration for GPU-Based Architecture

Recent advances in Graphics Processing Units (GPUs) provide opportunities to exploit GPUs for non-graphics applications. Scientific computation is inherently parallel, which makes it a good candidate for utilizing the computing power of GPUs. This report investigates QR factorization, which is an important building block of scientific computation. We analyze different mapping methods of QR factorization on […]
Feb, 20

Fast Exact String Matching on the GPU

We present a string-matching program that runs on the GPU. Our program, Cmatch, achieves a speedup of as much as 35x on a recent GPU over the equivalent CPU-bound version. String matching has a long history in computational biology with roots in finding similar proteins and gene sequences in a database of known sequences. The […]
Feb, 20

Program Optimization Study on a 128-Core GPU

The newest generations of graphics processing unit (GPU) architecture, such as the NVIDIA GeForce 8-series, feature new interfaces that improve programmability and generality over previous GPU generations. Using NVIDIA’s Compute Unified Device Architecture (CUDA), the GPU is presented to developers as a flexible parallel architecture. This flexibility introduces the opportunity to perform a wide variety […]
Feb, 20

How GPUs Can Improve the Quality of Magnetic Resonance Imaging

In magnetic resonance imaging (MRI), non-Cartesian scan trajectories are advantageous in a wide variety of emerging applications. Advanced reconstruction algorithms that operate directly on non-Cartesian scan data using optimality criteria such as least-squares (LS) can produce significantly better images than conventional algorithms that apply a fast Fourier transform (FFT) after interpolating the scan data onto […]
Feb, 20

MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores

The CUDA programming model, which is based on an extended ANSI C language and a runtime environment, allows the programmer to explicitly specify data-parallel computation. NVIDIA developed CUDA to open the architecture of their graphics accelerators to more general applications, but did not provide an efficient mapping to execute the programming model on any […]
Feb, 20

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on […]
Feb, 20

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines

There are two avenues for many-core machines to gain higher performance: increasing the number of processors, and increasing the number of vector units in one SIMD processor. A truly scalable algorithm should take advantage of both. However, most past research on scalable memory allocators scales well with the number of processors, but poorly with the […]
Feb, 20

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are […]
Feb, 19

Accelerating Particle Image Velocimetry Using Hybrid Architectures

High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating using high speed interconnects. More computational power is harnessed with the addition of hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
