Posts
Oct, 30
Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures
Recently, general-purpose GPU (GPGPU) programming has spread rapidly, following the introduction of CUDA, which made it possible to write parallel programs for NVIDIA GPUs in a high-level language. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited as multi-threaded programs on a multicore CPU. For such a heterogeneous platform that consists of a multicore CPU […]
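A minimal sketch of this division of labor, assuming nothing about the paper's actual framework: a data-parallel CUDA kernel runs on the GPU while an independent task runs concurrently on a CPU thread. The kernel and task names are hypothetical.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Data-parallel work: one GPU thread per element.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Task-level work that stays on a CPU core.
void cpuTask(const char *name) {
    std::printf("CPU task %s running on its own thread\n", name);
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch the data-parallel part on the GPU (asynchronous)...
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // ...while an independent task runs concurrently on the CPU.
    std::thread t(cpuTask, "A");
    t.join();

    cudaDeviceSynchronize();   // wait for the GPU side to finish
    cudaFree(d_data);
    return 0;
}
```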
Oct, 30
Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures
Real-time processing for the four-meter Advanced Technology Solar Telescope (ATST) adaptive optics (AO) system, with approximately 1750 sub-apertures and 1900 actuators, requires massively parallel processing to complete the task. This parallelism is harnessed by adding hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). We […]
Oct, 30
A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs
Several of the top-ranked supercomputers are based on a hybrid architecture consisting of a large number of CPUs and GPUs. Very high performance has been obtained for problems with special structure, such as FFT-based image processing or N-body particle calculations. However, for the class of problems described by partial differential equations discretized by […]
Oct, 30
Efficient Implementation of the eta_T Pairing on GPU
Recently, efficient implementation of cryptographic algorithms on graphics processing units (GPUs) has attracted a lot of attention in the cryptologic research community. In this paper, we deal with efficient implementation of the $\eta_T$ pairing on supersingular curves over finite fields of characteristic 3. We report the performance results of implementations on NVIDIA GTX 285, GTX […]
Oct, 30
Optimizing and Auto-tuning Belief Propagation on the GPU
A CUDA kernel will utilize high-latency local memory for storage when there are not enough registers to hold the required data or if the data is an array that is accessed using a variable index within a loop. However, accesses from local memory take longer than accesses from registers and shared memory, so it is […]
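A minimal illustration of the spilling behavior described above (kernel names are illustrative): registers are not indexable, so a per-thread array read with a runtime-variable subscript is typically placed in local memory, while an array whose subscripts are all compile-time constants after unrolling can live entirely in registers.

```cuda
#include <cuda_runtime.h>

// A per-thread array indexed with a runtime-variable subscript is
// typically spilled to high-latency local memory, because registers
// are not indexable.
__global__ void variableIndex(const int *idx, float *out) {
    float buf[8];                       // candidate for local memory
    for (int k = 0; k < 8; ++k)
        buf[k] = k * 1.0f;
    int i = idx[threadIdx.x];           // index unknown at compile time
    out[threadIdx.x] = buf[i];          // forces an indexable (local) array
}                                       // (assumes 0 <= i < 8)

// If every subscript is a compile-time constant after full unrolling,
// the compiler can keep the array entirely in registers.
__global__ void constantIndex(float *out) {
    float buf[8];
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        buf[k] = k * 1.0f;
    out[threadIdx.x] = buf[0] + buf[7]; // constant subscripts -> registers
}
```

Compiling with `nvcc -Xptxas -v` reports per-kernel spill stores and loads, which is how such spills are typically confirmed.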
Oct, 30
Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs
In this work we discuss the benefits of using massively parallel architectures for the optimization of Virtual Screening methods. We empirically demonstrate that the GPU is a well-suited architecture for accelerating non-bonded interaction kernels, obtaining a sustained speedup of up to 260x over the sequential counterpart.
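The excerpt does not show the kernel itself; the following is a hedged sketch of a generic non-bonded kernel (a 12-6 Lennard-Jones term plus a Coulomb term, one thread per probe atom, unit parameters), not the authors' implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical non-bonded interaction kernel: one thread per probe
// atom, looping over all receptor atoms. Names and parameters are
// illustrative only.
__global__ void nonBonded(const float4 *probe,     // x, y, z + charge in w
                          const float4 *receptor,
                          float *energy,
                          int nProbe, int nReceptor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nProbe) return;

    float4 p = probe[i];
    float e = 0.0f;
    for (int j = 0; j < nReceptor; ++j) {
        float4 r = receptor[j];
        float dx = p.x - r.x, dy = p.y - r.y, dz = p.z - r.z;
        float r2 = dx*dx + dy*dy + dz*dz + 1e-6f;  // avoid divide-by-zero
        float inv6 = 1.0f / (r2 * r2 * r2);        // r^-6
        // 12-6 Lennard-Jones term plus a Coulomb term (unit parameters).
        e += inv6 * inv6 - inv6 + p.w * r.w * rsqrtf(r2);
    }
    energy[i] = e;
}
```

Every thread reads the same receptor stream independently, which is the data-parallel structure that makes this class of kernel map so well onto the GPU.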
Oct, 30
Intermediate Language Extensions for Parallelism
An Intermediate Language (IL) specifies a program at a level of abstraction that includes precise semantics for state updates and control flow, but leaves unspecified the low-level software and hardware mechanisms that will be used to implement the semantics. Past ILs have followed the von Neumann execution model by making sequential execution the default, and […]
Oct, 30
High-Level Synthesis for FPGAs: From Prototyping to Deployment
Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoption of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. […]
Oct, 30
Exploring Many-Core Design Templates for FPGAs and ASICs
We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of […]
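For concreteness, this is the shape of data-parallel kernel such a template consumes, written here as the CUDA analogue of an OpenCL kernel: each thread (work-item) produces one output element with no cross-thread ordering the hardware must preserve, which is what lets the template scale across processing elements.

```cuda
// CUDA analogue of a minimal data-parallel OpenCL kernel (SAXPY):
// one output element per thread, no inter-thread dependencies.
__global__ void saxpy(float a, const float *x, const float *y,
                      float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item id
    if (i < n)
        out[i] = a * x[i] + y[i];
}
```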
Oct, 30
Improving Energy Efficiency of GPU based General-Purpose Scientific Computing through Automated Selection of Near Optimal Configurations
Modern GPUs have been rapidly and increasingly used as a powerful engine for a variety of general-purpose computing applications due to their enormous parallelism and throughput capabilities. However, GPU power consumption remains high, since ever more transistors are integrated into the chip. Until now, how to increase and optimize energy efficiency (e.g., performance-per-Watt […]
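A minimal sketch of one axis of such a configuration search, assuming a simple kernel and using CUDA event timing; a real selector in the spirit of this paper would additionally fold measured power into a performance-per-Watt ranking. The kernel and sweep are illustrative, not the authors' method.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 22;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep one configuration axis (threads per block); a full tuner
    // would also rank configurations by measured performance-per-Watt.
    for (int tpb = 64; tpb <= 1024; tpb *= 2) {
        cudaEventRecord(start);
        work<<<(n + tpb - 1) / tpb, tpb>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("%4d threads/block: %.3f ms\n", tpb, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```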
Oct, 30
Leveraging Binary Translation for Heterogeneous Profiling
Heterogeneous systems, such as those including a graphics processor for general computation, are becoming increasingly common. While this increases the potential computing power that can be leveraged, it also increases the complexity of the system. This in turn complicates the task of understanding the system's behavior, which is important when developing new software as […]
Oct, 30
GPU Computations in Heterogeneous Grid Environments
This thesis describes how the performance of job management systems on heterogeneous computing grids can be increased with Graphics Processing Units (GPUs). The focus lies on describing what is required to extend the grid to support the Open Computing Language (OpenCL) and how an OpenCL application can be implemented for the heterogeneous grid. Additionally, already […]
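A sketch of the device-discovery step a grid job manager needs before it can schedule GPU work, shown with the CUDA runtime for consistency with the sketches above; the OpenCL path the thesis targets would use clGetPlatformIDs/clGetDeviceIDs instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate GPUs on a grid node so the job manager can decide whether
// to schedule accelerated work here or fall back to CPU resources.
int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no GPU available: schedule on CPU resources\n");
        return 0;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, %d SMs\n",
                    i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```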