high performance computing on graphics processing units: hgpu.org

Posts

Jun, 8

Cryptanalysis of the McEliece Cryptosystem on GPGPUs

The linear-code-based McEliece cryptosystem is potentially promising as a so-called "post-quantum" public key cryptosystem because it has so far resisted quantum cryptanalysis, but to be considered secure, the cryptosystem must resist other attacks as well. In 2011, Bernstein et al. introduced the "Ball Collision Decoding" (BCD) attack on McEliece, which is a significant […]

CUDA

Jun, 8

Bi-directional Path Tracing on GPU

Computer graphics renderers that create photo-realistic images mainly use unidirectional path tracing, which gives good results for scenes without caustics or other hard cases. There are also a few renderers that implement bi-directional path tracing; however, due to the complexity of the algorithm, they almost exclusively target sequential CPUs. The thesis proposes a way of implementation of […]

CUDA

Jun, 7

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus […]

CUDA

Jun, 7

Implementation of K-shortest Path Algorithm in GPU Using CUDA

The K-shortest path algorithm is a generalization of the shortest path algorithm. It is used in various fields, such as sequence alignment in molecular bioinformatics, robot motion planning, and path finding in gene networks, where the speed of path calculation plays a vital role. Parallel implementation is one of the best ways to fulfill the requirement of these […]

CUDA

Jun, 7

Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost of ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are […]

CUDA

Jun, 7

The implementation and optimization of Bitonic sort algorithm based on CUDA

This paper describes the bitonic sort algorithm in detail and implements it on the CUDA architecture. We also apply two effective optimizations to the implementation details according to the characteristics of the GPU, which greatly improve efficiency. Finally, we survey the optimized Bitonic sort algorithm on the GPU with the speedup of […]

CUDA

Jun, 7

A Parallel Implementation of the Galerkin Method for Solving Partial Differential Equations on a Triangular Mesh

Finite Element Methods are techniques for estimating solutions to boundary value problems for partial differential equations from an approximating subspace. These methods are based on weak or variational forms of the BVP that require less of the problem functions than what the original PDE would suggest in terms of order of differentiability and continuity. In […]

OpenCL

Jun, 5

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Heterogeneous computing, which combines devices with different architectures, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good […]

OpenCL

Jun, 5

Accelerated Nodal Discontinuous Galerkin Simulations for Reverse Time Migration with Large Clusters

Improving both accuracy and computational performance of numerical tools is a major challenge for seismic imaging and generally requires specialized implementations to make full use of modern parallel architectures. We present a computational strategy for reverse-time migration (RTM) with accelerator-aided clusters. A new imaging condition computed from the pressure and velocity fields is introduced. The […]

CUDA • OpenCL

Jun, 5

Blocks and Fuel: Frameworks for deep learning

We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano’s symbolic computational graph, and providing an extensive set of utilities to […]

CUDA

Jun, 5

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of […]

CUDA

Jun, 5

Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators

The Radon transform and its adjoint, the back-projection operator, can both be expressed as convolutions in log-polar coordinates. Hence, fast algorithms for the application of the operators can be constructed by using FFT, if data is resampled at log-polar coordinates. Radon data is typically measured on an equally spaced grid in polar coordinates, and reconstructions […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
