high performance computing on graphics processing units: hgpu.org

Posts

Oct, 21

How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API

Is transferring computation intensive calculations to external compute-units the next trend? This master’s thesis researches if it is worth the effort to transfer a matrix multiplication from an Android phone to a System-on-Chip (SoC), using Bluetooth or WebSocket as communication protocols. The SoC solution used in this work is an Intel Altera Cyclone V based […]

OpenCL

Oct, 21

Computation of gray-level co-occurrence matrix based on CUDA and its optimization

As in various fields like scientific research and industrial application, the computation time optimization is becoming a task that is of increasing importance because of its highly parallel architecture. The graphics processing unit is regarded as a powerful engine for application programs that demand fairly high computation capabilities. Based on this, an algorithm was introduced […]

CUDA

Oct, 15

Accelerating Genomics Research with OpenCL and FPGAs

With the rapid decrease in gene sequencing costs due to the emergence of second-generation sequencing equipment, the availability of genome sequence data is increasing dramatically. The ability to correlate the variations among genomes is enabling advances in a wide range of medical research and personalized care. Because each human genome comprises more than three billion […]

OpenCL

Oct, 15

Flexible FPGA design for FDTD using OpenCL

Compared to classical HDL designs, generating FPGA with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a […]

OpenCL

Oct, 15

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required. This work presents my solution to the performance portability problem. I argue […]

OpenCL

Oct, 15

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input […]

CUDA

Oct, 15

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes

The numerical study of physical problems often require integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they are carrying such as an identity, a position other. There are generally speaking two different possibilities for handling particles in high performance computing […]

CUDA

Oct, 3

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, […]

CUDA

Oct, 3

FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

The popularity of machine learning has increased dramatically in the last years and the possible applications varies from web search, speech recognition, object detection, etc. A big part of this development is due to the use of Convolutional Neural Networks (CNNs), where high performance Graphics Processing Units (GPUs) has been the most popular device. This […]

OpenCL

Oct, 3

An Efficient Load Balancing Method for Tree Algorithms

Nowadays, multiprocessing is mainstream with exponentially increasing number of processors. Load balancing is, therefore, a critical operation for the efficient execution of parallel algorithms. In this paper we consider the fundamental class of tree-based algorithms that are notoriously irregular, and hard to load-balance with existing static techniques. We propose a hybrid load balancing method using […]

Oct, 3

Computing Treewidth on the GPU

We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an O*(2^n)-time algorithm that explores the elimination orderings of the graph using a Held-Karp like dynamic programming approach. We use Bloom filters to detect […]

OpenCL

Oct, 3

Performance Evaluation of Container-based Virtualization for High Performance Computing Environments

Virtualization technologies have evolved along with the development of computational environments since virtualization offered needed features at that time such as isolation, accountability, resource allocation, resource fair sharing and so on. Novel processor technologies bring to commodity computers the possibility to emulate diverse environments where a wide range of computational scenarios can be run. Along […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API

Computation of gray-level co-occurrence matrix based on CUDA and its optimization

Accelerating Genomics Research with OpenCL and FPGAs

Flexible FPGA design for FDTD using OpenCL

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

An Efficient Load Balancing Method for Tree Algorithms

Computing Treewidth on the GPU

Performance Evaluation of Container-based Virtualization for High Performance Computing Environments

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)