high performance computing on graphics processing units: hgpu.org

Posts

Sep, 16

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for […]

OpenCL

Sep, 16

Programmable and Scalable Architecture for Graphics Processing Units

Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPUs) are often implemented as multiprocessing systems with high performance floating point processing and application specific hardware stages for maximizing the graphics throughput. In this paper we evaluate the suitability of […]

OpenCL • OpenGL

Sep, 16

Searching for Concurrent Design Patterns in Video Games

The transition to multicore architectures has dramatically underscored the necessity for parallelism in software. In particular, while new gaming consoles are by and large multicore, most existing video game engines are essentially sequential and thus cannot easily take advantage of this hardware. In this paper we describe techniques derived from our experience parallelizing an open-source […]

Sep, 16

A Light-Weight Approach to Dynamical Runtime Linking Supporting Heterogenous, Parallel, and Reconfigurable Architectures

When targeting hardware accelerators and reconfigurable processing units, the question of programmability arises, i.e. how different implementations of individual, configuration-specific functions are provided. Conventionally, this is resolved either at compilation time with a specific hardware environment being targeted, by initialization routines at program start, or decision trees at run-time. Such techniques are, however, hardly applicable […]

Sep, 16

Optimizations and Performance of a Robotics Grasping Algorithm Described in Geometric Algebra

The usage of Conformal Geometric Algebra leads to algorithms that can be formulated in a very clear and easy to grasp way. But it can also increase the performance of an implementation because of its capabilities to be computed in parallel. In this paper we show how a grasping algorithm for a robotic arm is […]

CUDA

Sep, 16

Parallel Medical Image Reconstruction: From Graphics Processors to Grids

We present a variety of possible parallelization approaches for a real-world case study using several modern parallel and distributed computer architectures. Our case study is a production-quality, time-intensive algorithm for medical image reconstruction used in computer tomography. We describe how this algorithm can be parallelized for the main kinds of contemporary parallel architectures: shared-memory multiprocessors, […]

Sep, 16

A Generic Approach to Topic Models

This article contributes a generic model of topic models. To define the problem space, general characteristics for this class of models are derived, which give rise to a representation of topic models as "mixture networks", a domain-specific compact alternative to Bayesian networks. Besides illustrating the interconnection of mixtures in topic models, the benefit of this […]

Sep, 16

Implicit and dynamic trees for high performance rendering

Recent advances in GPU architecture and programmability have enabled the computation of ray casted or ray traced images at interactive frame rates. However, the rapid performance gains of the hardware cannot by themselves address the challenge posed by the steady growth in the geometric and temporal complexity of computer graphics datasets. In this paper we […]

CUDA

Sep, 16

Fast Monte Carlo Simulation for Patient-specific CT/CBCT Imaging Dose Calculation

Recently, X-ray imaging dose from computed tomography (CT) or cone beam CT (CBCT) scans has become a serious concern. Patient-specific imaging dose calculation has been proposed for the purpose of dose management. While Monte Carlo (MC) dose calculation can be quite accurate for this purpose, it suffers from low computational efficiency. In response to this […]

CUDA

Sep, 15

Analytical motion blur rasterization with compression

We present a rasterizer, based on time-dependent edge equations, that computes analytical visibility in order to render accurate motion blur. The theory for doing the computations in a rasterization framework is derived in detail, and then implemented. To keep the frame buffer requirements low, we also present a new oracle-based compression algorithm for the time […]

Sep, 15

Processing data streams with hard real-time constraints on heterogeneous systems

Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency — to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput — to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs […]

CUDA

Sep, 15

Strategies for preparing computer science students for the multicore world

Multicore computers have become standard, and the number of cores per computer is rising rapidly. How does the new demand for understanding of parallel computing impact computer science education? In this paper, we examine several aspects of this question: (i) What parallelism body of knowledge do today's students need to learn? (ii) How might these […]

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)