high performance computing on graphics processing units: hgpu.org

Posts

May, 9

Enabling task-level scheduling on heterogeneous platforms

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. […]

OpenCL

May, 4

Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems

Graphics processing units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging […]

OpenCL

May, 2

GPU Acceleration for the C++ Standard Template Library

Modern programmers must exploit parallelism for performance gains, possibly through the use of an attached or on-chip GPU. To take advantage of the GPU in C++ programs, the programmer must use either a new language (CUDA or OpenCL) or an external library (Thrust). Rather than requiring that programmers learn new tools, modify existing code, and […]

CUDA • OpenCL

Apr, 25

Radio Astronomy Beam Forming on Many-Core Architectures

Traditional radio telescopes use large steel dishes to observe radio sources. The largest radio telescope in the world, LOFAR, uses tens of thousands of fixed, omnidirectional antennas instead, a novel design that promises ground-breaking research in astronomy. Where traditional telescopes use custom-built hardware, LOFAR uses software to do signal processing in real time. This leads […]

CUDA • OpenCL

Apr, 25

An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs

This paper presents a novel work-distribution strategy for GPUs that efficiently convolves radio-telescope data onto a grid, one of the most time-consuming processing steps to create a sky image. Unlike existing work-distribution strategies, this strategy keeps the number of device-memory accesses low, without incurring the overhead from sorting or searching within telescope data. Performance measurements […]

CUDA • OpenCL

Apr, 25

The Bones Source-to-Source Compiler Manual

Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers. To address the challenge of parallel programming, we introduce […]

CUDA • OpenCL

Apr, 25

Comparison of Different Parallel Implementations of the 2+1-Dimensional KPZ Model and the 3-Dimensional KMC Model

We show that efficient simulations of the Kardar-Parisi-Zhang interface growth in 2 + 1 dimensions and of the 3-dimensional Kinetic Monte Carlo of thermally activated diffusion can be realized both on GPUs and modern CPUs. In this article we present results of different implementations on GPUs using CUDA and OpenCL and also on CPUs using […]

CUDA • OpenCL

Apr, 21

Multicore Processing for Classification and Clustering Algorithms

Data mining algorithms such as classification and clustering are the future of computation, though multidimensional data processing is required. People are using multicore processors with GPUs. Most programming languages do not provide multiprocessing facilities, which wastes processing resources. Clustering and classification algorithms are more resource consuming. In this paper we have shown strategies […]

CUDA • OpenCL

Apr, 19

Algorithm Construction for GPGPU

Today every personal computer and almost every work-related computer has a GPU powerful enough to be used as a supplementary computational device. One framework which enables utilization of this is called OpenCL. We asked the question how one writes efficient algorithms on these GPGPU devices. We found that there are two major ways to run […]

OpenCL

Apr, 18

Maximize Performance on GPUs Using the Rake-based Optimization: A Case Study

In this paper, we analyze the trade-offs encountered when minimizing the total execution time using the rake-based applications on GPUs. We use clustering data streams as a case study, and present a rake-based implementation for it, making it more efficient in terms of memory usage. In order to maximize performance for different problem sizes and […]

CUDA • OpenCL

Apr, 17

Auto-tuning interactive ray tracing using an analytical GPU architecture model

This paper presents a method for auto-tuning interactive ray tracing on GPUs using a hardware model. Getting full performance from modern GPUs is a challenging task. Workloads which require a guaranteed performance over several runs must select parameters for the worst performance of all runs. Our method uses an analytical GPU performance model to predict […]

OpenCL

Apr, 16

Fast GPU-based fluid simulations using SPH

Graphical Processing Units (GPUs) are massive floating-point stream processors, and through the recent development of tools such as CUDA and OpenCL it has become possible to fully utilize them for scientific computing. We have developed an open-source CUDA-based acceleration framework for 3D Computational Fluid Dynamics (CFD) using Smoothed Particle Hydrodynamics (SPH). This paper describes the […]

CUDA

* * *


Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler
