high performance computing on graphics processing units: hgpu.org

Posts

Oct, 29

CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code

Heterogeneous programming complicates software development. We present CLOP, a platform that embeds code targeting heterogeneous compute devices in a convenient and clean way, allowing unobstructed data flow between the host code and the devices and reducing the amount of source code by an order of magnitude. The CLOP compiler uses the standard facilities of the D […]

OpenCL

Oct, 29

Approximation of BEM matrices using GPGPUs

The efficiency of boundary element methods depends crucially on the time required for setting up the stiffness matrix. The far-field part of the matrix can be approximated by compression schemes like the fast multipole method or $\mathcal{H}$-matrix techniques. The near-field part is typically approximated by special quadrature rules like the Sauter-Schwab technique that can handle […]

OpenCL

Oct, 29

GPU Ray-Traced Collision Detection for Cloth Simulation

We propose a method to perform collision detection for cloth using ray tracing. Our method is able to perform collision detection between cloths and volumetric objects (rigid or deformable) as well as collision detection between cloths (including auto-collision). Our method casts rays between objects to perform collision detection, and an inversion-handling algorithm is introduced to correct […]

OpenCL

Oct, 29

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network

The Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents. Word embedding, meanwhile, has been demonstrated to be a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use BLSTM-RNN with word embedding for […]

CUDA

Oct, 27

CFP: Fourth International Workshop on OpenCL (IWOCL 2016)

Call for Papers: Now in its fourth year, the International Workshop on OpenCL (IWOCL) will be hosted by TU Wien in Vienna, Austria, at the C3 Convention Center on April 19th – 21st, 2016. April 19th is reserved for an Advanced Hands On OpenCL tutorial, with April 20th – 21st consisting of a […]

Oct, 27

The 1st International SYCL Workshop (SYCL), 2016

1st SYCL workshop (SYCL’16), co-located with PPoPP’16. Barcelona, Spain. Sunday, 13th March, 2016. http://conf.researchr.org/track/PPoPP-2016/SYCL-2016-papers

SYCL (sɪkəl, as in sickle) is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL enables single-source development where […]

Oct, 27

Evaluation of the Stability and Performance of a Multi-Stage Riemann Solver in Relativistic Hydrodynamic Simulations

The work deals with assessing the quality of a multi-stage Riemann solver for relativistic hydrodynamic simulations of heavy-ion collisions. The physical system is described using hydrodynamic conservation laws and then solved numerically. Because of the nature of such hydrodynamic simulations the numerical method has to cope with problems containing both strong discontinuities and smooth solutions, […]

CUDA

Oct, 27

Pairwise Sequence Alignment with Gaps with GPU

In this paper we consider the pairwise sequence alignment problem with gaps, which is motivated by the resequencing problem that requires assembling short read sequences into a genome sequence by referring to a reference sequence. The problem has been studied before for a single gap and for a bounded number of gaps. For a single gap, there was […]

CUDA

OpenCL

Oct, 27

Compiling and Optimizing Java 8 Programs for GPU Execution

GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as […]

CUDA

OpenCL

Oct, 27

Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

High-level languages such as Java increase both productivity and portability with productive language features such as managed runtime, type safety, and precise exception semantics. Additionally, Java 8 provides parallel stream APIs with lambda expressions to facilitate parallel programming for mainstream users of multi-core CPUs and many-core GPUs. These high-level APIs avoid the complexity of writing […]

CUDA

OpenCL

Oct, 27

Overlap fermions on GPUs

We report on our efforts to implement overlap fermions on NVIDIA GPUs using CUDA, commenting on the algorithms used, implementation details, and the performance of our code.

CUDA

Oct, 25

ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-Core and Many-Core Shared Memory Machines

Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent’s theorem to the task dependency graph implies that linear speedup […]

