high performance computing on graphics processing units: hgpu.org

Posts

Apr, 3

A New GPU-Based Neighbor Search Algorithm for Fluid Simulations

Fluid simulations based on Smoothed Particle Hydrodynamics (SPH) have been widely used for generating complex fluid motion. However, implementations of particle neighbor search on graphics processing units (GPUs) have not been satisfactory so far. In this paper, we present a new grid-based neighbor search method on the GPU for GPU-based SPH fluid simulation. Using this new […]

Apr, 3

Efficient Parallel Algorithm for Nonlinear Dimensionality Reduction on GPU

Advances in nonlinear dimensionality reduction provide a way to understand and visualize the underlying structure of complex data sets. The performance of large-scale nonlinear dimensionality reduction is of key importance in data mining, machine learning, and data analysis. In this paper, we concentrate on improving the performance of nonlinear dimensionality reduction using large-scale data sets […]

Apr, 3

Accelerate Smoothed Particle Hydrodynamics using GPU

Physics-based fluid simulation is used extensively nowadays; however, the traditional serial algorithm cannot meet real-time requirements because it is complex and compute-intensive. Modern GPUs make this possible. In this paper, a Smoothed Particle Hydrodynamics (SPH) method for incompressible fluid was implemented using CUDA on a GPU. Since the algorithm was executed on […]

CUDA

Apr, 3

GPU acceleration of MOLAR for HRRT List-Mode OSEM reconstructions

The Siemens ECAT HRRT PET scanner has the potential to produce images of the human brain with spatial resolution better than 3 mm. MOLAR (a motion-compensation OSEM List-mode Algorithm for Resolution-recovery) was developed to provide reconstructions of HRRT data with the best possible accuracy and precision. However, a computer cluster is required to generate reconstructions […]

CUDA

Apr, 3

A Light-weight API for Portable Multicore Programming

Multicore nodes have become ubiquitous in just a few years. At the same time, writing portable parallel software for multicore nodes is extremely challenging. Widely available programming models such as OpenMP and Pthreads are not useful for devices such as graphics cards, and more flexible programming models such as RapidMind are only available commercially. OpenCL […]

CUDA • OpenCL

Apr, 3

Mobile visual computing

Summary form only given. I will talk about camera phones, how you can use the camera as a sensor that gives natural access to information about the real world around you (mobile augmented reality), and how you can use general computation capability to combine several input images into better or more interesting output images (mobile […]

OpenCL • OpenGL

Apr, 3

Energy consumption of Graphic Processing Units with respect to automotive use-cases

With the introduction of APIs like CUDA, Stream+ or OpenCL, modern Graphics Processing Units (GPUs) can be easily employed for general purpose computing. In addition, their comparatively low price per GFLOP makes them interesting candidates for coprocessors in future embedded Electronic Control Units (ECUs). Yet, as car manufacturers strive to reduce the Thermal Design Power (TDP) […]

CUDA • OpenCL

Apr, 2

Throughput-Effective On-Chip Networks for Manycore Accelerators

As the number of cores and threads in manycore compute accelerators such as Graphics Processing Units (GPUs) increases, so does the importance of on-chip interconnection network design. This paper explores throughput-effective networks-on-chip (NoCs) for future manycore accelerators that employ bulk-synchronous parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is “throughput-effective” if […]

CUDA

Apr, 2

MARC: A Many-Core Approach to Reconfigurable Computing

We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip, high-efficiency many-core microarchitectures. The key benefits of MARC are that […]

OpenCL

Apr, 2

Real-time particle filtering with heuristics for 3D motion capture by monocular vision

Particle filtering is known as a robust approach for vision-based motion tracking, at the cost of heavy computation in a high-dimensional pose space. In this work, we describe a number of heuristics that we demonstrate to jointly improve the robustness and real-time performance of motion capture. 3D human motion capture by monocular vision without markers […]

OpenCL

Apr, 2

Parallel discrete wavelet transform using the Open Computing Language: a performance and portability study

The discrete wavelet transform (DWT) is a powerful signal processing technique used in the JPEG 2000 image compression standard. The multi-resolution sub-band encoding provided by DWT allows for higher compression ratios, avoids blocking artifacts and enables progressive transmission of images. However, these advantages come at the expense of additional computational complexity. Achieving real-time or interactive […]

OpenCL

Apr, 2

Parallel implementation of the Finite-Difference Time-Domain method in Open Computing Language

In this paper we evaluate the usability and performance of Open Computing Language (OpenCL) targeted for implementation of the Finite-Difference Time-Domain (FDTD) method. The simulation speed was compared to implementations based on alternative techniques of parallel processor programming. Moreover, the portability of OpenCL FDTD code between modern computing architectures was assessed. The average speed of […]

CUDA • OpenCL

