high performance computing on graphics processing units: hgpu.org

Posts

Jul, 16

Out-of-core singular value decomposition

Singular value decomposition (SVD) is a standard matrix factorization technique that produces optimal low-rank approximations of matrices. It has diverse applications, including machine learning, data science and signal processing. However, many common problems involve very large matrices that cannot fit in the main memory of commodity computers, making it impractical to use standard SVD algorithms […]

Jul, 14

On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]

CUDA • OpenCL

Jul, 14

HashGraph – Scalable Hash Tables Using A Sparse Graph Data Structure

Hash tables are ubiquitous and used in a wide range of applications for efficient probing of large and unsorted data. If designed properly, hash tables can enable efficient lookups in a constant number of operations, commonly referred to as O(1) operations. As data sizes continue to grow and data becomes less structured (as is […]

CUDA
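
The O(1) lookup mentioned in the HashGraph abstract above can be illustrated with a minimal sketch. This is a plain separate-chaining table written for illustration only; it is not the sparse-graph layout the paper proposes, and the class name `ChainedHashTable` and its methods are hypothetical.

```python
class ChainedHashTable:
    """Toy separate-chaining hash table (illustrative; not HashGraph)."""

    def __init__(self, num_buckets=8):
        # One Python list per bucket; colliding keys share a bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Map the key's hash onto a bucket index.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key, default=None):
        # Only one bucket is scanned, so the cost stays roughly constant
        # as long as the buckets remain short.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default


table = ChainedHashTable(num_buckets=16)
for n in range(10):
    table.insert(n, n * n)
```

Lookups stay close to constant time only while the number of keys per bucket is small, which is why the number of buckets is normally scaled with the data size.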

Jul, 14

A Translation Framework from RVC-CAL Dataflow Programs to OpenCL/SYCL based Implementations

Conventional programming languages nowadays still rely on sequential Models of Computation (MoC). However, the hardware makes more and more use of parallelism to increase the performance, e.g. an increasing number of cores. Nevertheless, programming languages that still rely on sequential MoCs are not well suited to completely utilise this hardware. Dataflow programming languages like RVC-CAL […]

OpenCL

Jul, 14

Implementation of high speed hash function Keccak on GPU

Nowadays, a hash function is used for password management. The hash function is desired to possess the following three characteristics: Pre-Image Resistance, Second Pre-Image Resistance, and Collision Resistance. They are set on the assumption that it is computationally difficult to find the original message from a given hash value. However, the security level of the […]

CUDA
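
As a small CPU-side illustration of the hash function discussed in the Keccak abstract above, the sketch below uses Python's standard `hashlib.sha3_256`. It assumes the standardized SHA-3 variant of Keccak rather than any GPU implementation from the paper, and the helper names `sha3_hex` and `differing_bits` are hypothetical.

```python
import hashlib


def sha3_hex(message: bytes) -> str:
    """Return the SHA3-256 digest of the message as a hex string."""
    return hashlib.sha3_256(message).hexdigest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 digest bits differ between two inputs."""
    da = hashlib.sha3_256(a).digest()
    db = hashlib.sha3_256(b).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(da, db))


# The same input always gives the same 64-hex-character digest.
digest = sha3_hex(b"password")

# Changing a single character of the input changes the digest.
flipped = differing_bits(b"password", b"passwore")
```

The digest reveals nothing obvious about the input, and near-identical inputs produce unrelated-looking digests, which is the behaviour the resistance properties listed in the abstract rely on.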

Jul, 14

Profiling based Out-of-core Hybrid Method for Large Neural Networks

GPUs are widely used to accelerate deep learning with neural networks (NNs). On the other hand, since GPU memory capacity is limited, it is difficult to implement efficient programs that compute large NNs on GPU. To compute NNs exceeding GPU memory capacity, data-swapping and recomputing methods have been proposed in existing work. However, in these […]

CUDA

Jul, 10

GPU-based Parallel Computation Support for Stan

This paper details an extensible OpenCL framework that allows Stan to utilize heterogeneous compute devices. It includes GPU-optimized routines for the Cholesky decomposition, its derivative, other matrix algebra primitives and some commonly used likelihoods, with more additions planned for the near future. Stan users can now benefit from speedups offered by GPUs with little effort […]

OpenCL

Jul, 10

Optimizing Xeon Phi for Interactive Data Analysis

The Intel Xeon Phi manycore processor is designed to provide high performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within data analysis environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory […]

Jul, 10

PANNA: Properties from Artificial Neural Network Architectures

Prediction of material properties from first principles is often a computationally expensive task. Recently, artificial neural networks and other machine learning approaches have been successfully employed to obtain accurate models at a low computational cost by leveraging existing example data. Here, we present a software package "Properties from Artificial Neural Network Architectures" (PANNA) that provides […]

Jul, 7

Exploring Portability and Performance of OpenCL FPGA Kernels on Intel HARPv2

FPGAs offer a heterogeneous compute solution to the continuous desire for increased performance by enabling the creation of application-specific hardware that accelerates computation. While the barrier to entry has historically been steep, advances in High Level Synthesis (HLS) are making FPGAs more accessible. Specifically, the Intel FPGA OpenCL SDK allows software designers to abstract away […]

OpenCL

Jul, 7

Efficient Spatial Anti-Aliasing Rendering for Line Joins on Vector Maps

The spatial anti-aliasing technique for line joins (intersections of the road segments) on vector maps is crucial to visual experience and system performance. Due to limitations of the OpenGL API, one common practice to achieve the anti-aliased effect is splicing multiple triangles at varying scale levels to approximate the fan-shaped line joins. However, this approximation […]

OpenGL

Jul, 7

Novel Methodologies for Predictable CPU-To-GPU Command Offloading

There is an increasing industrial and academic interest towards a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as General Purpose-Graphic Processing Units (GP-GPUs). In this paper, we analyze an important aspect that has not yet been considered in the […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
