high performance computing on graphics processing units: hgpu.org

Posts

Nov, 29

Electric polarizability of hadrons with overlap fermions on multi-GPUs

Electric polarizability is an important parameter for the internal structure of hadrons. Previous studies of polarizabilities have been done at relatively heavy pion masses, leaving the chiral region largely unexplored. In this report, we use overlap fermions which are known to be computationally demanding to properly capture the chiral dynamics. We present an implementation strategy […]

Nov, 29

A GPU-based survey for millisecond radio transients using ARTEMIS

Astrophysical radio transients are excellent probes of extreme physical processes originating from compact sources within our Galaxy and beyond. Radio frequency signals emitted from these objects provide a means to study the intervening medium through which they travel. Next generation radio telescopes are designed to explore the vast unexplored parameter space of high time resolution […]

Nov, 28

On the numerical sensitivity of computer simulations on hybrid and parallel computing systems

The numerical accuracy of simulation results depends not only on the precision of the floating-point arithmetic. Results are also sensitive to differences between the floating-point arithmetic implementations of different hybrid and parallel computing systems such as CPUs, GPUs, dedicated processors like the Cell processor or the GRAPE special-purpose computer […]

CUDA

Nov, 28

Accelerating the Hough Transform with CUDA on Graphics Processing Units

Circle detection has been widely applied in image processing applications. The Hough transform, the most popular method of shape detection, normally takes a long time to achieve reasonable results, especially for large images. Such performance makes it almost impossible to conduct real-time image processing with sequential algorithms on community computers. Recently, NVIDIA developed the CUDA programming paradigm […]

CUDA

Nov, 28

Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses the summed absolute difference error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation, we compared the execution […]

CUDA

Nov, 28

Anytime Algorithms for GPU Architectures

Most algorithms are run-to-completion: they provide one answer upon completion and no answer if interrupted before completion. Anytime algorithms, on the other hand, have a utility that increases monotonically with execution time. Our investigation focuses on the development of time-bounded anytime algorithms on Graphics Processing Units (GPUs) to trade off the quality of output […]

CUDA

Nov, 28

A hybrid parallel framework for computational solid mechanics

A novel, hybrid parallel C++ framework for computational solid mechanics is developed and presented. The modular and extensible design of this framework allows it to support a wide variety of numerical schemes including discontinuous Galerkin formulations and higher order methods, multiphysics problems, hybrid meshes made of different types of elements and a number of different […]

CUDA

Nov, 28

Molecular Dynamics Simulation Based on Hadoop MapReduce

Molecular Dynamics (MD) simulation is a computationally intensive application used in multiple fields. It can exploit a distributed environment due to inherent computational parallelism. However, most of the existing implementations focus on performance enhancement. They may not provide fault-tolerance for every time-step. MapReduce is a framework first proposed by Google for processing huge amounts of […]

CUDA

Nov, 28

Computation of the Spatial Impulse Response for Ultrasonic Fields on the Graphics Processing Units (GPU)

The goal of the internship was to develop a linear wave-based simulation of ultrasonic fields. The theory was based on the Tupholme-Stepanishen formalism, explained in the Jensen course, for calculating pulsed ultrasound fields. The Field II Simulation Program developed at the Technical University of Denmark performs this simulation, but the program runs slowly due to […]

CUDA

Nov, 28

Efficient Parallel Nonnegative Least Squares on Multicore Architectures

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson [Solving Least Squares Problems, Prentice-Hall, 1974] on multicore architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original […]

CUDA

Nov, 28

Parallel Pseudo-Random Number Generation

This is a preliminary report on parallel pseudo-random number generation. It was written under tight time constraints, so it makes no claim to being an exhaustive survey of the field, which is already extensive and in a state of flux as new computer architectures are introduced.

CUDA

OpenCL

Nov, 28

Domain-Specific Optimizations Supporting Real-Time Image Compression

The work focuses on the use of massively parallel processors to accelerate image compression. The text studies GPU architecture, common GPU programming frameworks, and domain-specific languages providing higher-level programming abstraction. The aim of the PhD thesis is to contribute to effective software development for massively parallel processors through a domain specific […]

CUDA
