high performance computing on graphics processing units: hgpu.org

Posts

Feb, 3

GPGPU and MIC in Accelerated Cluster for Remote Sensed Image Processing Software

Processing of remotely sensed Earth observation images requires increasingly powerful computing facilities. For several years, GPGPU (General-Purpose processing on Graphics Processing Units) technology has been used to perform massively parallel calculations. The French Space Agency (CNES) has therefore ported some IAS to assess their performance using this type […]

CUDA, OpenCL

Feb, 3

On the Accelerating of Two-dimensional Smart Laplacian Smoothing on the GPU

This paper presents a GPU-accelerated implementation of two-dimensional Smart Laplacian smoothing. This implementation is developed under the guideline of our paradigm for accelerating Laplacian-based mesh smoothing [13]. Two types of commonly used data layouts, Array-of-Structures (AoS) and Structure-of-Arrays (SoA), are used to represent triangular meshes in our implementation. Two iteration forms that have different choices […]

CUDA

Feb, 3

Scaling Recurrent Neural Network Language Models

This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much […]

CUDA

Feb, 2

Multi-GPU Support on Shared Memory System using Directive-based Programming Model

Existing and emerging studies show that using single Graphics Processing Units (GPUs) can lead to obtaining significant performance gains. These devices have tremendous processing capabilities. We should be able to achieve further orders of performance speedup if we use more than just one GPU. Heterogeneous processors consisting of multiple CPUs and GPUs offer immense potential […]

CUDA

Feb, 2

Characterizing and Enhancing Global Memory Data Coalescing on GPUs

Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem. In this paper, we address both […]

CUDA

Feb, 2

Performance Analysis and Optimization of Hermite Methods on NVIDIA GPUs Using CUDA

In this thesis we present the first, to our knowledge, implementation and performance analysis of Hermite methods on GPU accelerated systems. We give analytic background for Hermite methods; give implementations of the Hermite methods on traditional CPU systems as well as on GPUs; give the reader background on basic CUDA programming for GPUs; discuss performance […]

CUDA

Feb, 2

Reliable Initialization of GPU-enabled Parallel Stochastic Simulations Using Mersenne Twister for Graphics Processors

Parallel stochastic simulations tend to exploit more and more computing power, and they are now also developed for General-Purpose Graphics Processing Units (GP-GPUs). Consequently, they need reliable random sources to feed their applications. We propose a survey of the current Pseudo-Random Number Generators (PRNGs) available on GPU. We give a particular focus to […]

CUDA

Feb, 2

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. The library is based on the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. Acting as matrix library developers, using this model we do not have to explicitly deal with distribution of work and data or communication between computational nodes […]

CUDA

Feb, 2

Montblanc: GPU accelerated Radio Interferometer Measurement Equations in support of Bayesian Inference for Radio Observations

We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters […]

CUDA

Feb, 1

Optimized Data Transfers Based on the OpenCL Event Management Mechanism

In standard OpenCL programming, hosts such as CPUs are supposed to control their compute devices such as GPUs. Since compute devices are dedicated to kernel computation, only hosts can execute several kinds of data transfers such as inter-node communication and file access. These data transfers require one host to simultaneously play two or more roles […]

OpenCL

Feb, 1

In-Memory Data Analytics on Coupled CPU-GPU Architectures

In the big data era, in-memory data analytics is an effective means of achieving high performance data processing and realizing the value of data in a timely manner. Efforts in this direction have been spent on various aspects, including in-memory algorithmic designs and system optimizations. In this paper, we propose to develop the next-generation in-memory […]

OpenCL

Feb, 1

Mascar: Speeding up GPU Warps by Reducing Memory Pitstops

With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory intensive workloads are more scarce. […]

* * *

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler
