
Posts

Nov 1

Classification Performance of Convolutional Neural Networks

The purpose of this thesis is to determine the performance of convolutional neural networks, measured in classifications per millisecond rather than training time or accuracy, on the GTX 960 and the Tegra X1. This is done by varying parameters of the convolutional neural networks and using the function profiler of the Python framework Theano to measure the time taken by different networks. […]
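
As a rough illustration of this kind of measurement (not the thesis's actual networks or parameters; the layer shapes below are assumptions), Theano's profiler can be attached to a compiled function and reports the time spent per operation over repeated classification-only passes:

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

# A single convolutional layer with placeholder shapes (assumed, not the thesis's networks).
x = T.tensor4("x")                                  # (batch, channels, height, width)
w = theano.shared(np.random.randn(16, 3, 5, 5).astype("float32"), name="w")
y = T.nnet.relu(conv2d(x, w))

# profile=True makes Theano collect per-op timings every time f is called.
f = theano.function([x], y, profile=True)

batch = np.random.randn(64, 3, 32, 32).astype("float32")
for _ in range(100):                                # forward passes only, no training
    f(batch)

f.profile.summary()                                 # prints time spent per op and in f overall
```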
Nov 1

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (e.g., possibly beyond the memory capacity of a GPU device) and its training becomes very inefficient. In this work, we […]
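
A back-of-the-envelope calculation makes the memory claim concrete (the vocabulary size and embedding width below are assumptions):

```python
# Rough memory estimate for the embedding and output layers of a word-level RNN LM.
vocab = 10_000_000          # assumed vocabulary size
dim = 1024                  # assumed embedding / hidden width
bytes_per_float = 4         # float32

# The input embedding table and the output (softmax) projection each hold vocab x dim weights.
params = 2 * vocab * dim
print(f"{params * bytes_per_float / 2**30:.1f} GiB")   # about 76 GiB, beyond any single GPU
```

LightRNN's 2-component shared embedding tackles this by arranging the vocabulary in a 2-D word table, so that only on the order of 2*sqrt(|V|) shared embedding vectors are needed instead of |V|.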
Nov 1

Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System

In this paper, we present a data decomposition method for multi-dimensional data, aiming to realize multi-graphics processing unit (GPU) acceleration of a compute unified device architecture (CUDA) code written for a single GPU. Our multi-dimensional method extends a previous method that deals with one-dimensional (1-D) data. The method performs a sample run of selected […]
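
As a purely conceptual sketch of such a decomposition (the paper's method automates this from a sample run; the block shape, halo width, and helper below are assumptions), a 2-D grid can be split into per-GPU row blocks padded with ghost rows:

```python
import numpy as np

def decompose_rows(grid, n_gpus, halo=1):
    """Split a 2-D grid into per-GPU row blocks, each padded with ghost rows.

    Conceptual illustration only; the actual method derives the decomposition
    automatically from a sample run of the single-GPU CUDA code.
    """
    row_groups = np.array_split(np.arange(grid.shape[0]), n_gpus)
    blocks = []
    for rows in row_groups:
        lo = max(rows[0] - halo, 0)
        hi = min(rows[-1] + halo + 1, grid.shape[0])
        blocks.append(grid[lo:hi].copy())   # block plus ghost rows, to be copied to one GPU
    return blocks

grid = np.random.rand(1024, 1024).astype(np.float32)
print([b.shape for b in decompose_rows(grid, n_gpus=4)])
```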
Nov 1

Programming Heterogeneous Systems from an Image Processing DSL

Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, programming, and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language, Halide, so users can specify which portions of […]
Nov 1

Performance Optimization of 3-D Lattice Boltzmann Flow Solver on a GPU

The lattice Boltzmann method (LBM) is a powerful numerical simulation method for fluid flow. With its data-parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data-intensive and memory-bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs […]
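
A minimal NumPy sketch of the streaming step on a standard D2Q9 lattice shows why this phase is memory-bound: each distribution value is simply shifted to a neighbouring cell, so the work is almost entirely loads and stores (the array layout below is an assumption, not the paper's implementation):

```python
import numpy as np

# Standard D2Q9 velocity set: a rest direction plus eight neighbours.
velocities = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]

nx, ny = 256, 256
f = np.random.rand(9, nx, ny)      # one distribution slab per lattice direction

def stream(f):
    """Streaming step: move each slab to the adjacent cells along its velocity.

    There is no arithmetic beyond the copy, so performance is dominated by memory
    traffic -- the reason data layout matters so much in a GPU implementation.
    """
    out = np.empty_like(f)
    for i, (cx, cy) in enumerate(velocities):
        out[i] = np.roll(f[i], shift=(cx, cy), axis=(0, 1))
    return out

f = stream(f)
```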
Nov 1

Design and Analysis of Soft-Error Resilience Mechanisms for GPU Register File

Modern graphics processing units (GPUs) use an increasingly large register file (RF), which occupies a large fraction of the GPU core area and is accessed very frequently. This makes the RF vulnerable to soft errors (SE). In this paper, we present two techniques for improving the SE resilience of the GPU RF. First, we propose compressing the RF values for […]
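
As a toy illustration of what compressing register values can mean (a generic base-delta scheme; the paper's actual compression method and how it is used for resilience are not described here), consider one register across a warp's 32 threads:

```python
def compress_warp_register(values, delta_bits=8):
    """Toy base-delta compression of one register across 32 threads of a warp.

    Generic sketch only: per-thread values are often similar, so they can be stored
    as one full-width base plus narrow deltas, freeing RF bits for other uses.
    """
    base = values[0]
    deltas = [v - base for v in values]
    limit = 1 << (delta_bits - 1)
    if all(-limit <= d < limit for d in deltas):
        return ("compressed", base, deltas)
    return ("uncompressed", values)

warp_values = [1000 + t * 4 for t in range(32)]   # e.g. per-thread addresses with a small stride
print(compress_warp_register(warp_values)[0])     # -> "compressed"
```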
Oct 29

The 2nd International SYCL Workshop (SYCL 2017), 2017

Call for Papers The 2nd International SYCL Workshop (SYCL 2017) Held in conjunction with ACM PPoPP 2017, Austin, Texas – February 4-7, 2017 https://codeplaysoftware.github.io/sycl-ppopp2017 SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform Khronos specification facilitating a C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the […]
Oct 29

AQsort: Scalable Multi-Array In-Place Sorting with OpenMP

A new multi-threaded variant of the quicksort algorithm called AQsort and its C++/OpenMP implementation are presented. AQsort operates in place and was primarily designed for high-performance computing (HPC) runtime environments. It can work with multiple arrays at once; such functionality is frequently required in HPC and cannot be accomplished with standard C pointer-based or […]
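
A sequential Python sketch of the multi-array idea is given below (AQsort itself is a parallel C++/OpenMP implementation; this only shows the key point that every swap must be applied to all companion arrays in step, which a standard single-array comparison sort does not do):

```python
def multi_array_quicksort(keys, *others, lo=0, hi=None):
    """In-place quicksort of `keys` that applies every swap to all companion arrays.

    Sequential sketch of the multi-array idea only; AQsort itself is a multi-threaded
    C++/OpenMP implementation designed for HPC environments.
    """
    if hi is None:
        hi = len(keys) - 1
    if lo >= hi:
        return
    pivot = keys[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while keys[i] < pivot:
            i += 1
        while keys[j] > pivot:
            j -= 1
        if i <= j:
            for a in (keys, *others):          # swap the same positions in every array
                a[i], a[j] = a[j], a[i]
            i, j = i + 1, j - 1
    multi_array_quicksort(keys, *others, lo=lo, hi=j)
    multi_array_quicksort(keys, *others, lo=i, hi=hi)

# e.g. the three arrays of a COO sparse matrix, sorted together by row index
rows, cols, vals = [3, 1, 2], [0, 2, 1], [9.0, 7.0, 8.0]
multi_array_quicksort(rows, cols, vals)
print(rows, cols, vals)                        # [1, 2, 3] [2, 1, 0] [7.0, 8.0, 9.0]
```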
Oct 29

Hetero-Mark, A Benchmark Suite for CPU-GPU Collaborative Computing

Graphics Processing Units (GPUs) can easily outperform CPUs in processing large-scale data-parallel workloads, but are considered weak at processing serialized tasks and communicating with other devices. Pursuing a CPU-GPU collaborative computing model which takes advantage of both devices could provide an important breakthrough in realizing the full performance potential of heterogeneous computing. In recent […]
Oct 29

GPflow: A Gaussian process library using TensorFlow

GPflow is a Gaussian process library that uses TensorFlow for its core computations and Python for its front end. The distinguishing features of GPflow are that it uses variational inference as the primary approximation method, provides concise code through the use of automatic differentiation, has been engineered with a particular emphasis on software testing and […]
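
The role automatic differentiation plays in keeping such code concise can be sketched in plain TensorFlow (this is not GPflow's API; the kernel, data, and variable names below are assumptions): gradients of a Gaussian process log marginal likelihood with respect to its hyperparameters come for free from the gradient tape.

```python
import numpy as np
import tensorflow as tf

# Toy 1-D regression data (assumed; not from GPflow).
X = tf.constant(np.random.rand(50, 1))
Y = tf.constant(np.sin(6 * X.numpy()) + 0.1 * np.random.randn(50, 1))

log_lengthscale = tf.Variable(0.0, dtype=tf.float64)
log_noise = tf.Variable(-2.0, dtype=tf.float64)

def neg_log_marginal_likelihood():
    # Squared-exponential kernel matrix plus observation noise.
    d = X - tf.transpose(X)
    K = tf.exp(-0.5 * (d / tf.exp(log_lengthscale)) ** 2)
    K += tf.exp(log_noise) * tf.eye(50, dtype=tf.float64)
    L = tf.linalg.cholesky(K)
    alpha = tf.linalg.cholesky_solve(L, Y)
    # 0.5 * y^T K^{-1} y + 0.5 * log|K| (constant term omitted)
    return 0.5 * tf.reduce_sum(Y * alpha) + tf.reduce_sum(tf.math.log(tf.linalg.diag_part(L)))

# Automatic differentiation: no hand-derived gradients of the likelihood are needed.
with tf.GradientTape() as tape:
    loss = neg_log_marginal_likelihood()
grads = tape.gradient(loss, [log_lengthscale, log_noise])
print(loss.numpy(), [g.numpy() for g in grads])
```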
Oct 29

GPU Performance Modeling and Optimization

The last decade has witnessed the rapid emergence of general-purpose graphics processing unit (GPGPU) computing. With the exponential growth of cores and threads in a modern GPU processor, how to analyze and optimize its performance becomes a grand challenge. In this thesis, on the modeling side, we propose an analytic model for throughput-oriented parallel processors. The model […]
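
As a generic point of reference (this is a textbook roofline-style bound, not the analytic model proposed in the thesis; the peak numbers are assumptions), attainable throughput for a kernel of a given arithmetic intensity can be estimated as:

```python
def roofline_throughput(flops_per_byte, peak_gflops=2300.0, peak_bw_gbs=112.0):
    """Attainable GFLOP/s under a simple roofline bound.

    Illustration of analytic throughput modeling in general; the default peak
    compute and bandwidth values are assumed, roughly GTX 960-class numbers.
    """
    return min(peak_gflops, flops_per_byte * peak_bw_gbs)

for intensity in (0.25, 1, 4, 16, 64):   # arithmetic intensity in FLOPs per byte
    print(intensity, roofline_throughput(intensity))
```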
Oct 29

GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling

The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics and is well suited to GPUs. Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling, named GOTHIC, which […]
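
A minimal sketch of hierarchical (block) time stepping is given below; it is a generic illustration of the idea, not GOTHIC's GPU implementation, and the step-selection criterion is an assumption.

```python
import numpy as np

def block_timestep_levels(dt_required, dt_max, max_level=8):
    """Assign each particle the largest power-of-two fraction of dt_max that does
    not exceed its required time step (generic block time stepping, not GOTHIC's
    actual scheme)."""
    levels = np.ceil(np.log2(dt_max / dt_required)).astype(int)
    return np.clip(levels, 0, max_level)

dt_max = 1.0 / 64
dt_required = np.random.uniform(1e-4, 1e-2, size=10)   # e.g. from an acceleration criterion
levels = block_timestep_levels(dt_required, dt_max)

# March in sub-steps of the smallest time step; at sub-step s only the particles
# whose level is deep enough to "tick" at s need new force evaluations.
n_sub = 2 ** levels.max()
for s in range(n_sub):
    active = (s % (2 ** (levels.max() - levels))) == 0
    # ... compute forces and advance only the `active` particles here ...
```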

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org