Posts
Aug, 25
GPU-acceleration for Moving Particle Semi-implicit Method
The MPS (Moving Particle Semi-implicit) method has proven useful in computing free-surface hydrodynamic flows. Despite its applicability, one of its drawbacks in practice is its high computational load. On the other hand, the Graphics Processing Unit (GPU), which was originally developed to accelerate computer graphics, now provides unprecedented capability for scientific computation. The […]
Aug, 24
Parallel computation of spherical parameterizations for mesh analysis
Mesh parameterization is central to a broad spectrum of applications. In this paper, we present a novel approach to spherical mesh parameterization based on an iterative quadratic solver that is efficiently parallelizable on modern massively parallel architectures. We present an extensive analysis of performance results on both GPU and multicore architectures. We introduce a number […]
Aug, 24
A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction
GPGPU programmers face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular combinations of application kernel, programming language, and GPU hardware, that significant improvements in price/performance and energy/performance over general-purpose processors are achievable. But these demonstrations are […]
Aug, 24
CnC-CUDA: declarative programming for GPUs
The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. Instead, future computer systems are expected to be built using homogeneous and heterogeneous many-core processors with tens to hundreds of cores per chip, and complex hardware designs to address […]
Aug, 24
WAYPOINT: scaling coherence to thousand-core architectures
In this paper, we evaluate a set of coherence architectures in the context of a 1024-core chip multiprocessor (CMP) tailored to throughput-oriented parallel workloads. Based on our analysis, we develop and evaluate two techniques for scaling coherence to thousand-core CMPs. We find that a broadcast-based probe filtering scheme provides reasonable performance up to 128 cores […]
Aug, 24
Implementation of a programming environment with a multithread model for reconfigurable systems
Reconfigurable systems are known to achieve higher performance than traditional microprocessor architectures in many application fields. However, to extract the full potential of a reconfigurable system, programmers often have to design and describe code best suited to their target architecture, which requires specialized knowledge. The aim of this paper is […]
Aug, 24
Experience of parallelizing cryo-EM 3D reconstruction on a CPU-GPU heterogeneous system
Heterogeneous architectures are becoming an important way to build massively parallel computer systems, e.g. the CPU-GPU heterogeneous systems ranked in the Top500 list. However, it is a challenge to efficiently utilize the massive parallelism of both applications and architectures on such heterogeneous systems. In this paper we present a practice on how to exploit and orchestrate […]
Aug, 24
An open framework for rapid prototyping of signal processing applications
Embedded real-time applications in communication systems have significant timing constraints, thus requiring multiple computation units. Manually exploring the potential parallelism of an application deployed on multicore architectures is highly time-consuming. This paper presents an open-source Eclipse-based framework which aims to facilitate the exploration and development processes in this context. The framework includes a generic graph […]
Aug, 24
SCF: a device- and language-independent task coordination framework for reconfigurable, heterogeneous systems
Heterogeneous computing systems, composed of accelerators such as FPGAs, GPUs, and Cell processors coupled with standard microprocessors, are becoming an increasingly popular way to build future computing systems. Although programming languages and tools have evolved to simplify device-level design, programming such systems is still difficult and time-consuming due to system-level challenges involving synchronization and communication […]
Aug, 24
Precise dynamic analysis for slack elasticity: adding buffering without adding bugs
Increasing the amount of buffering for MPI sends is an effective way to improve the performance of MPI programs. However, for programs containing non-deterministic operations, this can result in new deadlocks or other safety assertion violations. Previous work did not provide any characterization of the space of slack elastic programs: those for which buffering can […]
Aug, 24
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
One of the major problems with GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by […]
Aug, 23
JCudaMP: OpenMP/Java on CUDA
We present an OpenMP framework for Java that can exploit an available graphics card as an application accelerator. Dynamic languages (Java, C#, etc.) pose a challenge here because of their write-once-run-everywhere approach. This renders it impossible to make compile-time assumptions about whether, and which type of, accelerator or graphics card might be available in the […]