high performance computing on graphics processing units: hgpu.org

Posts

Oct, 4

Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science, and needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different […]

CUDA
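
For readers unfamiliar with the CSR layout mentioned in the abstract above, here is a minimal sequential sketch of SpMV over a CSR matrix. It is a plain reference loop, not the paper's GPU algorithm; the names `csr_spmv`, `row_ptr`, `col_idx`, and `values` are illustrative assumptions.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x where A is stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and values.
    All names here are hypothetical, chosen for illustration.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (dense view):
# [[10,  0,  0],
#  [ 0, 20, 30],
#  [ 0,  0, 40]]
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
values = [10.0, 20.0, 30.0, 40.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

On a GPU the outer loop over rows is what gets distributed across threads; rows with very different nonzero counts then cause load imbalance, which is presumably the issue that dynamic assignment schemes target.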

Oct, 4

APL on GPUs: A TAIL from the Past, Scribbled in Futhark

This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. […]

OpenCL

Oct, 4

Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have gained significant traction in the field of machine learning, particularly due to their high accuracy in visual recognition. Recent works have pushed the performance of GPU implementations of CNNs to significantly improve their classification and training times. With these improvements, many frameworks have become available for implementing CNNs on both […]

OpenCL

Oct, 4

Explicit Fourth-Order Runge-Kutta Method on Intel Xeon Phi Coprocessor

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge-Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using […]
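
As a rough illustration of the method named in the abstract above, the sketch below applies one classical RK4 step to a linear system dy/dt = A y with A stored in CSR form. It is a sequential sketch under assumptions, not the authors' Xeon Phi or Intel MKL implementation; the linear form of the system, the tiny two-state example matrix, and all function names are hypothetical.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Return A @ x for a CSR matrix (hypothetical helper)."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def rk4_step(row_ptr, col_idx, values, y, h):
    """Advance dy/dt = A y by one step of size h with classical RK4."""
    def f(v):
        return csr_matvec(row_ptr, col_idx, values, v)

    k1 = f(y)
    k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
    k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
    k4 = f([yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


# Assumed two-state example: A = [[-1, 2], [1, -2]].
# Columns sum to zero, so the total probability is conserved.
row_ptr = [0, 2, 4]
col_idx = [0, 1, 0, 1]
values = [-1.0, 2.0, 1.0, -2.0]

y = [1.0, 0.0]           # start with all probability in state 0
h = 0.01
for _ in range(1000):    # integrate to t = 10
    y = rk4_step(row_ptr, col_idx, values, y, h)
print(y)
```

Each step costs four sparse matrix-vector products, so for matrices with very short rows the SpMV kernel dominates the running time.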

Oct, 4

Training a Feedback Loop for Hand Pose Estimation

We propose an entirely data-driven approach to estimating the 3D pose of a hand given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network trained to predict an estimate of the 3D pose by using a feedback loop. The components of this feedback loop are also Deep […]

CUDA

Sep, 30

GPU-based timetable generation

Throughout an academic year, educational institutions need to generate hundreds of different timetables; this complex task demands a considerable amount of time and human resources. In the past, timetables were generated by hand; today, as the complexity of the task increases, it is performed by specialized software that reduces time and costs. Since nearly 10 years […]

CUDA

Sep, 30

Programming Models and Tools for Many-Core Platforms

The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harness the computational power of heterogeneous many-core SoC. This thesis presents […]

OpenCL

Sep, 30

Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability

This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neuronal Networks (DNNs). The presented results show that the current state-of-the-art approach, using data-parallelized Stochastic Gradient Descent (SGD), is quickly turning into a vastly communication-bound problem. In addition, […]

CUDA

Sep, 30

Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs

Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of […]

OpenCL

Sep, 30

Combining Belief Propagation and Successive Cancellation List Decoding of Polar Codes on a GPU Platform

The decoding performance of polar codes depends strongly on the decoding algorithm used, as do the decoder's throughput and latency. In this work, we implement the powerful successive cancellation list (SCL) decoder on a GPU and identify the bottlenecks of this algorithm with respect to parallel computing and […]

CUDA

Sep, 27

Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives

It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity (RI) approximated form and the (T) triples correction to the coupled […]

Sep, 27

FastCollect: Offloading Generational Garbage Collection to Integrated GPUs

Generational Mark-Sweep Garbage Collection is a widely used garbage collection technique. However, the garbage collector has poor execution efficiency for large programs. Aggressive collection causes execution pauses in the program, while reducing the collection frequency leads to memory wastage. In this work, we develop FastCollect, a parallel version of the generational mark-sweep garbage collector running […]

OpenCL

