high performance computing on graphics processing units: hgpu.org

Posts

Dec, 15

Easy-to-Use On-the-Fly Binary Program Acceleration on Many-Cores

This paper introduces Binary Acceleration At Runtime (BAAR), an easy-to-use on-the-fly binary acceleration mechanism which aims to tackle the problem of enabling existing software to automatically utilize accelerators at runtime. BAAR is based on the LLVM Compiler Infrastructure and has a client-server architecture. The client runs the program to be accelerated in an environment which […]

Dec, 15

Performance Comparison of GPUs with a Genetic Algorithm based on CUDA

Generally, a genetic algorithm (GA) has the disadvantage of taking a lot of computation time, so it is worth reducing the execution time while maintaining the quality of the results. Comparative experiments are conducted with one CPU and four GPUs using CUDA (Compute Unified Device Architecture) and a generational GA. We implement the fitness functions of the GA which […]

CUDA

Dec, 15

Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form

Communication remains a significant barrier to scalability on distributed-memory systems. At present, the trend in architectural system design, which focuses on enhancing node performance, exacerbates the communication problem, since the relative cost of communication grows as the computation rate increases. This problem will be more pronounced at the exascale, where computational rates will be orders […]

CUDA

Dec, 14

Heuristics for Conversion Process of GPU’s Kernels for Multiples Kernels with Concurrent Optimization Divergence

Graphics Processing Units were created to accelerate the construction and processing of graphic images. Over the course of their evolution, because of their large inherent computational capacity, these devices came to be used for general purposes. However, the design of GPUs does not work well with divergent algorithms, mainly conditionals and repetitions. […]

CUDA

Dec, 14

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

The use of GPUs for general purpose computation has increased dramatically in recent years due to the rising demand for computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity […]

CUDA

Dec, 14

Acceleration of Hessenberg Reduction for Nonsymmetric Matrix

The value of finding a general solution for nonsymmetric eigenvalue problems is evident in many areas of engineering and scientific computation, such as reducing noise for a quieter ride in automotive engineering or calculating the natural frequency of a bridge in civil engineering. The main objective of this thesis is to design a […]

CUDA

Dec, 14

Graph Processing on GPU

Graph mining and data management have become a significant area as more and more applications to data mining problems in social networking, computational biology, chemical data analysis and drug discovery emerge. Although traditional mining methods have been extended to process graphs, many graph applications still confront huge challenges due to continuous […]

CUDA

Dec, 13

C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++

Capitalize on the faster GPU processors in today’s computers with the C++ AMP code library—and bring massive parallelism to your project. With this practical book, experienced C++ developers will learn parallel programming fundamentals with C++ AMP through detailed examples, code snippets, and case studies. Learn the advantages of parallelism and get best practices for harnessing […]

Dec, 12

Real-Time Grasp Detection Using Convolutional Neural Networks

We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can […]

CUDA

Dec, 12

A Survey Paper on Solving TSP using Ant Colony Optimization on GPU

Ant Colony Optimization (ACO) is a nature-inspired meta-heuristic algorithm for solving combinatorial optimization problems such as the Travelling Salesman Problem (TSP). Many versions of ACO are used to solve TSP, including Ant System, Elitist Ant System, Max-Min Ant System, and Rank-based Ant System. For improved performance, these methods can be implemented in […]

CUDA

Dec, 12

cuLGT: Lattice Gauge Fixing on GPUs

We adopt CUDA-capable Graphics Processing Units (GPUs) for Landau, Coulomb and maximally Abelian gauge fixing in 3+1 dimensional SU(3) and SU(2) lattice gauge field theories. A combination of simulated annealing and overrelaxation is used to aim for the global maximum of the gauge functional. We use a fine-grained degree of parallelism to achieve the […]

CUDA

Dec, 12

Compiler-Level Explicit Cache for a GPGPU Programming Framework

GPUs are widely used for high-performance computing. However, standard programming frameworks such as CUDA and OpenCL require low-level specifications, so programming is difficult and performance is not portable. Therefore, we are developing a new framework named MESI-CUDA. Providing virtual shared variables accessible from both CPU and GPU, MESI-CUDA hides complex memory architecture and eliminates […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container