high performance computing on graphics processing units: hgpu.org

Posts

Aug, 8

OpenCL-accelerated object classification in video streams using Spatial Pooler of Hierarchical Temporal Memory

We present a method to classify objects in video streams using a brain-inspired Hierarchical Temporal Memory (HTM) algorithm. Object classification is a challenging task where humans still significantly outperform machine learning algorithms due to their unique capabilities. We have implemented a system which achieves very promising performance in terms of recognition accuracy. Unfortunately, conducting more […]

OpenCL

Aug, 5

Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Adaptive Mesh Refinement methods reduce computational requirements of problems by increasing resolution for only areas of interest. However, in practice, efficient AMR implementations are difficult considering that the mesh hierarchy management must be optimized for the underlying hardware. Architecture complexity of GPUs can render efficient AMR to be particularity challenging in GPU-accelerated supercomputers. This paper […]

CUDA

Aug, 4

Parallel experiments with RARE-BLAS

Numerical reproducibility failures rise in parallel computation because of the non-associativity of floating-point summation. Optimizations on massively parallel systems dynamically modify the floating-point operation order. Hence, numerical results may change from one run to another. We propose to ensure reproducibility by extending as far as possible the IEEE-754 correct rounding property to larger operation sequences. […]

Aug, 4

RETURNN: The RWTH Extensible Training framework for Universal Recurrent Neural Networks

In this work we release our extensible and easily configurable neural network training software. It provides a rich set of functional layers with a particular focus on efficient training of recurrent neural network topologies on multiple GPUs. The source of the software package is public and freely available for academic research purposes and can be […]

CUDA

Aug, 4

TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization

We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk’s "work-first" principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the "work-together" principle that addresses the specific strengths […]

OpenCL

Aug, 4

A survey of sparse matrix-vector multiplication performance on large matrices

We contribute a third-party survey of sparse matrix-vector (SpMV) product performance on industrial-strength, large matrices using: (1) The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures. (2) NVIDIA GPUs and Intel multi-core CPUs (supported by each software package). (3) The CSR, […]

CUDA

Aug, 4

Programming Embedded Manycore: Refinement and Optimizing Compilation of a Parallel Action Language for Hierarchical State Machines

Modeling languages propose convenient abstractions and transformations to handle the com- plexity of today’s embedded systems. Based on the formalism of Hierarchical State Machine, they enable the expression of hierarchical control parallelism. However, they face two importants challenges when it comes to model data-intensive applications: no unified approach that also accounts for data-parallel actions; and […]

OpenCL

Aug, 4

A Gb/s Parallel Block-based Viterbi Decoder for Convolutional Codes on GPU

In this paper, we propose a parallel block-based Viterbi decoder (PBVD) on the graphic processing unit (GPU) platform for the decoding of convolutional codes. The decoding procedure is simplified and parallelized, and the characteristic of the trellis is exploited to reduce the metric computation. Based on the compute unified device architecture (CUDA), two kernels with […]

CUDA

Aug, 1

The ANTAREX Approach to Autotuning and Adaptivity for Energy Efficient HPC Systems

The ANTAREX project aims at expressing the application self-adaptivity through a Domain Specific Language (DSL) and to run-time manage and autotune applications for green and heterogeneous High Performance Computing (HPC) systems up to Exascale. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application […]

OpenCL

Aug, 1

Drug Drug Interaction Extraction from Biomedical Literature Using Syntax Convolutional Neural Network

MOTIVATION: Detecting drug-drug interaction (DDI) has become a vital part of public health safety. Therefore, using text mining techniques to extract DDIs from biomedical literature has received great attentions. However, this research is still at an early stage and its performance has much room to improve. RESULTS: In this paper, we present a syntax convolutional […]

CUDA

Aug, 1

3D visualization of astronomy data cubes using immersive displays

We report on an exploratory project aimed at performing immersive 3D visualization of astronomical data, starting with spectral-line radio data cubes from galaxies. This work is done as a collaboration between the Department of Physics and Astronomy and the Department of Computer Science at the University of Manitoba. We are building our prototype using the […]

OpenGL

Aug, 1

Automatic Loop Partitioning for Heterogeneous Systems

In this work, we implement a tool that automatically partitions loops and then executes these partitions on heterogeneous systems. Partitioning a loop is the process of dividing a loop to form two or more new loops, each iterating over a portion of the original loops iteration space. A heterogeneous system is a system that is […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

OpenCL-accelerated object classification in video streams using Spatial Pooler of Hierarchical Temporal Memory

Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Parallel experiments with RARE-BLAS

RETURNN: The RWTH Extensible Training framework for Universal Recurrent Neural Networks

TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization

A survey of sparse matrix-vector multiplication performance on large matrices

Programming Embedded Manycore: Refinement and Optimizing Compilation of a Parallel Action Language for Hierarchical State Machines

A Gb/s Parallel Block-based Viterbi Decoder for Convolutional Codes on GPU

The ANTAREX Approach to Autotuning and Adaptivity for Energy Efficient HPC Systems

Drug Drug Interaction Extraction from Biomedical Literature Using Syntax Convolutional Neural Network

3D visualization of astronomy data cubes using immersive displays

Automatic Loop Partitioning for Heterogeneous Systems

Recent source codes

XaaS containers

microSYCL: SYCL micro-benchmarks repository

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

Most viewed papers (last 30 days)