high performance computing on graphics processing units: hgpu.org

Posts

Sep, 28

From OpenCL to Gates: the FFT

The FFT plays a fundamental role in OFDM programmable digital baseband communication systems under the SDR context. The core nature of this algorithm marks it as a primary target for acceleration. Since long frame lengths of the FFT are desirable in order to achieve higher bitrates, the computational complexity becomes even more significant. In this […]

OpenCL

Sep, 28

Performance Improvement of Optical Algorithms on Multicore Platforms

ASML is one of the world’s largest suppliers of lithography systems for the semiconductor industry. ASML designs and develops machines that are used to print circuits on silicon wafers, to produce IC chips. These circuits have to be printed with accuracy of up to 2nm. For this purpose, the machines incorporate several measurement systems. The […]

CUDA

Sep, 28

Efficient Computation of k-Nearest Neighbour Graphs for Large High-Dimensional Data Sets on GPU Clusters

This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to […]

CUDA

Sep, 27

Clustering on GPU – A Brief Survey

Clustering, as a process of partitioning data elements with similar properties, is an essential task in many application areas. Due to technological advances, the amount as well as the dimensionality of data sets in general is steadily growing. Graphics Processing Units in today’s desktops can be thought of as high-performance parallel processors. As […]

CUDA

Sep, 27

CUD@ASP: Experimenting with GPUs in ASP solving

This paper illustrates the design and implementation of a prototype ASP solver that is capable of exploiting the parallelism offered by general purpose graphical processing units (GPGPUs). The solver is based on a basic conflict-driven search algorithm. The core of the solving process develops on the CPU, while most of the activities, such as literal […]

CUDA

Sep, 27

OpenCL Parallel Programming Development Cookbook

Welcome to the OpenCL Parallel Programming Development Cookbook! Whew, that was more than a mouthful. This book was written by a developer, that’s me, and for a developer, hopefully that’s you. This book will look familiar to some and distinct to others. It is a result of my experience with OpenCL, but more importantly in […]

OpenCL

Sep, 27

GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs

Recently GPUs have risen as one important parallel platform for general purpose applications, both in HPC and cloud environments. Due to the special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming efforts, some research has proposed automatically generating parallel […]

CUDA

Sep, 27

Java with Auto-Parallelization on Graphics Coprocessing Architecture

GPU-based many-core accelerators have gained a footing in supercomputing. Their widespread adoption, however, hinges on better parallelization and load scheduling techniques to utilize the hybrid system of CPU and GPU cores easily and efficiently. This paper introduces a new user-friendly compiler framework and runtime system, dubbed Japonica, to help Java applications harness the full power […]

CUDA

Sep, 27

Multi-Scale, Multi-Level, Heterogeneous Features Extraction and Classification of Volumetric Medical Images

This paper articulates a novel method for the heterogeneous feature extraction and classification directly on volumetric images, which covers multi-scale point feature, multi-scale surface feature, multi-level curve feature, and blob feature. To tackle the challenge of complex volumetric inner structure and diverse feature forms, our technical solution hinges upon the integrated approach of locally-defined diffusion […]

CUDA

Sep, 27

Trellis: Portability Across Architectures with a High-level Framework

The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, […]

CUDA

Sep, 25

OpenCL Task Partitioning in the Presence of GPU Contention

Heterogeneous multi- and many-core systems are increasingly prevalent in the desktop and mobile domains. On these systems it is common for programs to compete with co-running programs for resources. While multi-task scheduling for CPUs is a well-studied area, how to partition and map computing tasks onto the heterogeneous system in the presence of GPU contention […]

OpenCL

Sep, 25

Large Graphs on multi-GPUs

We studied the problem of developing an efficient BFS algorithm to explore large graphs having billions of nodes and edges. The size of the problem requires a parallel computing architecture. We proposed a new algorithm that performs a distributed BFS and the corresponding implementation on multi-GPU clusters. As far as we know, this is the […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?