high performance computing on graphics processing units: hgpu.org

Posts

Feb, 21

Data parallel loop statement extension to CUDA: GpuC

In recent years, Graphics Processing Units (GPUs) have emerged as a powerful accelerator for general-purpose computations. GPUs are attached to every modern desktop and laptop host CPU as graphics accelerators. GPUs have over a hundred cores with lots of parallelism. Initially, they were used only for graphics applications such as image processing and video games. […]

CUDA

Feb, 21

APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for a RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our […]

Feb, 20

Final Project Implementing Extremely Randomized Trees in CUDA

In this paper, we present an implementation of extremely randomized trees (ERT), a supervised machine learning algorithm utilizing decision tree ensembles, in CUDA, NVIDIA’s GPU parallel programming extensions for C/C++. We describe the CUDA programming model and NVIDIA GPU architectures and explain the design tradeoffs that we made to exploit various forms of parallelism available […]

CUDA

Feb, 20

Architecting graphics processors for non-graphics compute acceleration

This paper discusses the emergence of graphics processing units (GPUs) that contain architecture features for accelerating non-graphics (or GPGPU) applications. It provides an introduction for those interested in undertaking research at the intersection of manycore computing and GPU architecture. First, the motivation for using GPUs for non-graphics processing rather than developing specialized hardware is outlined. […]

Feb, 20

Design Space Exploration for GPU-Based Architecture

Recent advances in Graphics Processing Units (GPUs) provide opportunities to exploit GPUs for non-graphics applications. Scientific computation is inherently parallel, which is a good candidate to utilize the computing power of GPUs. This report investigates QR factorization, which is an important building block of scientific computation. We analyze different mapping methods of QR factorization on […]

CUDA

Feb, 20

Fast Exact String Matching on the GPU

We present a string-matching program that runs on the GPU. Our program, Cmatch, achieves a speedup of as much as 35x on a recent GPU over the equivalent CPU-bound version. String matching has a long history in computational biology with roots in finding similar proteins and gene sequences in a database of known sequences. The […]

CUDA

Feb, 20

Program Optimization Study on a 128-Core GPU

The newest generations of graphics processing unit (GPU) architecture, such as the NVIDIA GeForce 8-series, feature new interfaces that improve programmability and generality over previous GPU generations. Using NVIDIA’s Compute Unified Device Architecture (CUDA), the GPU is presented to developers as a flexible parallel architecture. This flexibility introduces the opportunity to perform a wide variety […]

CUDA

Feb, 20

How GPUs Can Improve the Quality of Magnetic Resonance Imaging

In magnetic resonance imaging (MRI), non-Cartesian scan trajectories are advantageous in a wide variety of emerging applications. Advanced reconstruction algorithms that operate directly on non-Cartesian scan data using optimality criteria such as least-squares (LS) can produce significantly better images than conventional algorithms that apply a fast Fourier transform (FFT) after interpolating the scan data onto […]

CUDA

Feb, 20

MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores

The CUDA programming model, which is based on an extended ANSI C language and a runtime environment, allows the programmer to specify explicitly data parallel computation. NVIDIA developed CUDA to open the architecture of their graphics accelerators to more general applications, but did not provide an efficient mapping to execute the programming model on any […]

CUDA

Feb, 20

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on […]

CUDA, OpenCL

Feb, 20

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines

There are two avenues for many-core machines to gain higher performance: increasing the number of processors, and increasing the number of vector units in one SIMD processor. A truly scalable algorithm should take advantage of both. However, most past research on scalable memory allocators scales well with the number of processors, but poorly with the […]

CUDA

Feb, 20

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are […]

CUDA
