high performance computing on graphics processing units: hgpu.org

Posts

Jun, 7

Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are […]

CUDA

Jun, 7

The implementation and optimization of Bitonic sort algorithm based on CUDA

This paper describes in detail the bitonic sort algorithm,and implements the bitonic sort algorithm based on cuda architecture. At the same time,we conduct two effective optimization of implementation details according to the characteristics of the GPU, which greatly improve the efficiency. Finally,we survey the optimized Bitonic sort algorithm on the GPU with the speedup of […]

CUDA

Jun, 7

A Parallel Implementation of the Galerkin Method for Solving Partial Differential Equations on a Triangular Mesh

Finite Element Methods are techniques for estimating solutions to boundary value problems for partial differential equations from an approximating subspace. These methods are based on weak or variational forms of the BVP that require less of the problem functions than what the original PDE would suggest in terms of order of differentiability and continuity. In […]

OpenCL

Jun, 5

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Heterogeneous computing, which combines devices with different architectures, is rising in popularity, and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programing such systems, and offers functional portability. It does, however, suffer from poor performance portability, code tuned for one device must be re-tuned to achieve good […]

OpenCL

Jun, 5

Accelerated Nodal Discontinuous Galerkin Simulations for Reverse Time Migration with Large Clusters

Improving both accuracy and computational performance of numerical tools is a major challenge for seismic imaging and generally requires specialized implementations to make full use of modern parallel architectures. We present a computational strategy for reverse-time migration (RTM) with accelerator-aided clusters. A new imaging condition computed from the pressure and velocity fields is introduced. The […]

CUDA

•

OpenCL

Jun, 5

Blocks and Fuel: Frameworks for deep learning

We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano’s symbolic computational graph, and providing an extensive set of utilities to […]

CUDA

Jun, 5

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of […]

CUDA

Jun, 5

Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators

The Radon transform and its adjoint, the back-projection operator, can both be expressed as convolutions in log-polar coordinates. Hence, fast algorithms for the application of the operators can be constructed by using FFT, if data is resampled at log-polar coordinates. Radon data is typically measured on an equally spaced grid in polar coordinates, and reconstructions […]

CUDA

Jun, 5

7th International Conference on Signal Processing Systems (ICSPS), 2015

Topics: Adaptive Filtering & Signal Processing Ad-Hoc and Sensor Networks Analog and Mixed Signal Processing Array Signal Processing Audio and Electroacoustics Audio/Speech Processing and Coding Bioimaging and Signal Processing Biometrics & Authentification Biosignal Processing & Understanding Communication and Broadband Networks Communication Signal processing Computer Vision & Virtual Reality Cryptography and Network Security Design and Implementation […]

Jun, 3

A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems

Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages […]

Jun, 1

Genetically Improved BarraCUDA

BarraCUDA is a C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60 percent more accurate on a short […]

CUDA

Jun, 1

Research on the fast Fourier transform of image based on GPU

Study of general purpose computation by GPU (Graphics Processing Unit) can improve the image processing capability of micro-computer system. This paper studies the parallelism of the different stages of decimation in time radix 2 FFT algorithm, designs the butterfly and scramble kernels and implements 2D FFT on GPU. The experiment result demonstrates the validity and […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code

The implementation and optimization of Bitonic sort algorithm based on CUDA

A Parallel Implementation of the Galerkin Method for Solving Partial Differential Equations on a Triangular Mesh

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Accelerated Nodal Discontinuous Galerkin Simulations for Reverse Time Migration with Large Clusters

Blocks and Fuel: Frameworks for deep learning

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators

7th International Conference on Signal Processing Systems (ICSPS), 2015

A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems

Genetically Improved BarraCUDA

Research on the fast Fourier transform of image based on GPU

Recent source codes

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

Most viewed papers (last 30 days)