high performance computing on graphics processing units: hgpu.org

Posts

Mar, 24

2014 5th International Conference on Software and Computing Technology, ICSCT 2014

Submission Deadline: 2014-06-10. Publication: All accepted papers that are registered and presented at ICSCT 2014 will be published in WIT Transactions on Information and Communication Technologies (ISSN: 1743-3517), which will be indexed by EI Compendex, Scopus and ISI. Call for Papers: AI and Knowledge based software engineering; Artificial Intelligence; Aspect-orientation and feature interaction; Business Process […]

Mar, 24

The orthorectified technology for UAV aerial remote sensing image based on the Programmable GPU

CUDA

Mar, 24

Demonstrating Self-Learning Algorithm Adaptivity in a Hardware-Oblivious Database Engine

The increasingly heterogeneous modern hardware landscape is forcing database vendors to rethink basic design decisions: With more and more architectures to support, the traditional approach of building on hand-tuned operators might simply become too cost- and labor-intensive. With this problem in mind, we introduced the notion of a hardware-oblivious database engine, which avoids device-specific optimizations […]

OpenCL

Mar, 24

High-Performance Image Synthesis for Radio Interferometry

A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore, there are distortions that have to be corrected. In this thesis, a new high-performance image synthesis tool (imaging tool) […]

CUDA

Mar, 23

An unsupervised parallel genetic cluster algorithm for graphics processing units

During times of stock market turbulence, monitoring the intraday clustering behaviour of financial instruments allows one to better understand market characteristics and systemic risks. While genetic algorithms provide a versatile methodology for identifying such clusters, serial implementations are computationally intensive and can take a long time to converge to the global optimum. We implement a […]

CUDA

Mar, 23

Computer Vision Accelerators for Mobile Systems based on OpenCL GPGPU Co-Processing

In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm, an image inpainting-based object removal algorithm, on mobile devices. To take advantage of the computation power of the mobile processor, the algorithm workflow is partitioned between the CPU and the GPU based on the profiling results on mobile devices, so […]

OpenCL

Mar, 23

Fast GPGPU-Based Elliptic Curve Scalar Multiplication

This paper presents a fast implementation to compute the scalar multiplication of elliptic curve points based on a General-Purpose computing on Graphics Processing Units (GPGPU) approach. A GPU implementation using Dan Bernstein’s Curve25519, an elliptic curve over a 255-bit prime field complying with the new 128-bit security level, computes the scalar multiplication in less than […]

OpenCL

Mar, 22

Accelerating Low-Fidelity Aerodynamic Codes

Low-fidelity aerodynamic codes, including panel, lifting line and vortex lattice methods, are used in preliminary aerodynamic studies in the early stages of aircraft design and constitute an important and compute-intensive part of the aircraft design process. This preliminary design phase is usually very time consuming as it involves parametric studies counting tens of Excellent […]

CUDA

Mar, 22

CUDA Implementation of a Lattice Boltzmann Method and Code Optimization

We study fluid flow in a 2D lid-driven cavity for large Reynolds numbers using the multi-relaxation-time Lattice Boltzmann Method (LBM). LBM is an alternative to conventional CFD methods that solve Navier-Stokes equations to simulate incompressible fluid dynamics. In LBM, one solves the linearized Boltzmann equation on a discrete lattice to study spatio-temporal evolution of […]

CUDA

Mar, 22

Applying GPU Dynamic Parallelism to High-Performance Normalization of Gene Expressions

High-density oligonucleotide microarrays allow several millions of genetic markers in a single experiment to be observed. Current bioinformatics tools for gene expression quantile data normalization are unable to process such huge data sets. In parallel with this reality, the huge volume of molecular data produced by current high-throughput technologies in modern molecular biology has increased […]

CUDA

Mar, 21

GPU Accelerated Process Planning For CNC-Machined Parts: Industrial Components to Bone Implants

For manufacturing a part using a conventional 3-axis CNC machining process, one must determine a set of machining orientations. Generally this process planning task is carried out manually by the machinist, considering decision parameters such as part visibility, machinability, machining depths, tool geometry, etc. In this work, we modelled this as a linear optimization problem; the […]

CUDA

Mar, 21

Concurrent learning of a Probabilistic Graphical Model on the GPU

We introduce an algorithm for determining optimal transition paths between given configurations. The solution is obtained by solving variational equations for Freidlin–Wentzell action functionals. One of the applications of the method presented is a system controlling motion and redeployment between unit’s formations. The efficiency of the algorithm has been evaluated in a simple sandbox environment […]

CUDA
