high performance computing on graphics processing units: hgpu.org

Posts

Oct, 22

Accelerating Component-Based Dataflow Middleware with Adaptivity and Heterogeneity

This dissertation presents research into the development of high performance dataflow middleware and applications on heterogeneous, distributed-memory supercomputers. We present coarse-grained state-of-the-art ad-hoc techniques for optimizing the performance of real-world, data-intensive applications in biomedical image analysis and radar signal analysis on clusters of computational nodes equipped with multi-core microprocessors and accelerator processors, such as the […]

CUDA

Oct, 22

Implementing a Preconditioned Iterative Linear Solver Using Massively Parallel Graphics Processing Units

The research conducted in this thesis provides a robust implementation of a preconditioned iterative linear solver on programmable graphics processing units (GPUs). Solving a large, sparse linear system is the most computationally demanding part of many widely used power system analyses. This thesis presents a detailed study of iterative linear solvers with a focus on […]

CUDA

Oct, 22

GPU performance prediction using parametrized models

Compilation on modern architectures has become an increasingly difficult challenge with the evolution of computers and computing needs. In particular, programmers expect the compiler to produce optimized code for a variety of hardware, making the most of their theoretical performance. For years this was not a problem because hardware vendors consistently delivered increases in clock […]

CUDA

Oct, 22

CUDA Application Design and Development

As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with […]

CUDA

Oct, 22

Accelerating molecular docking and binding site mapping using FPGAs and GPUs

Computational accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) possess tremendous compute capabilities and are rapidly becoming viable options for effective high performance computing (HPC). In addition to their huge computational power, these architectures provide further benefits of reduced size and power dissipation. Despite their immense raw capabilities, achieving overall […]

CUDA

Oct, 22

Hardware Transactional Memory for GPU Architectures

Graphics processing units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing execution of thousands of concurrent threads on a relatively small set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks […]

CUDA • OpenCL

Oct, 22

Low-Impact Profiling of Streaming, Heterogeneous Applications

Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore’s Law) into architectures that allow computer scientists to accelerate application performance. As feature size continues to shrink, architects of commodity processors are placing more and more cores on a chip. While additional cores can operate independently on some tasks (e.g. […]

Oct, 22

Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems

Checkpointing is an effective fault-tolerance technique for improving the reliability of large-scale parallel computing systems. However, checkpointing causes a large number of computation nodes to store a huge amount of data in the file system simultaneously. This not only requires a huge amount of storage space for the system state, but also brings a tremendous […]

OpenCL

Oct, 22

Parallelization of the distinct lattice spring model

The distinct lattice spring model (DLSM) is a newly developed numerical tool for modeling rock dynamics problems, i.e. dynamic failure and wave propagation. In this paper, parallelization of DLSM is presented. With the development of parallel computing technologies in both hardware and software, parallelization of a code is becoming easier than before. There are many […]

Oct, 22

Mapping Iterative Medical Imaging Algorithm on Cell Accelerator

Algebraic reconstruction techniques require about half as many projections as Fourier backprojection methods, which makes them safer in terms of required radiation dose. The algebraic reconstruction technique (ART) and its variant OS-SART (ordered subset simultaneous ART) provide faster convergence with comparatively good image quality. However, the prohibitively long processing […]

Oct, 21

Concurrent Algorithms and Data Structures for Many-Core Processors

The convergence of highly parallel many-core graphics processors with conventional multi-core processors is becoming a reality. To allow algorithms and data structures to scale efficiently on these new platforms, several important factors need to be considered. (i) The algorithmic design needs to utilize the inherent parallelism of the problem at hand. Sorting, which is one […]

Oct, 21

Solving Linear Recurrences on Hybrid GPU Accelerated Manycore Systems

The aim of this paper is to show that linear recurrence systems with constant coefficients can be efficiently solved on hybrid GPU accelerated manycore systems with modern Fermi GPU cards. The main idea is to use the recently developed divide-and-conquer algorithm which can be expressed in terms of Level 2 and 3 BLAS operations. The […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
