Posts
Sep, 29
A Design Framework for Mapping Dataflow Graphs onto Heterogeneous Multiprocessor Platforms
Dataflow models are valuable tools for representing, analyzing, and synthesizing embedded systems. Heterogeneous computing platforms with multi-core CPUs and Graphics Processing Units (GPUs) provide a low-cost platform for high-performance computation. In this report, we present a dataflow-based automated design framework that incorporates analysis, optimization, and synthesis tools for embedded systems. Our framework […]
Sep, 29
Solving prime-field ECDLPs on GPUs with OpenCL
The intractability of the ECDLP is part of what makes many cryptographic applications work. As such, viewing this problem from as many angles as possible is worthwhile. In this thesis, we explore the angle of creating a GPU ECDLP solver using OpenCL. In the process, we discuss the many issues, limitations, and solutions we encounter. […]
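The excerpt does not describe the solver's internals, but the problem itself is easy to state. Below is a minimal, CPU-only sketch of an ECDLP solver (baby-step giant-step) on a textbook-sized prime-field curve; it is unrelated to the thesis's OpenCL implementation and exists only to make the problem concrete. The curve parameters and secret scalar are illustrative.

```python
# Toy ECDLP: given Q = k*P on y^2 = x^3 + 2x + 2 over F_17, recover k.
p, a, b = 17, 2, 2          # toy-sized prime-field curve
P = (5, 1)                  # base point on the curve
INF = None                  # point at infinity

def ec_add(p1, p2):
    """Add two affine points on the curve."""
    if p1 is INF: return p2
    if p2 is INF: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if p1 == p2:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def ec_mul(k, pt):
    """Scalar multiplication by double-and-add."""
    acc = INF
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

# Order n of P, found by brute force (fine at toy sizes).
n, R = 1, P
while R is not INF:
    R = ec_add(R, P)
    n += 1

def ecdlp_bsgs(Q):
    """Find k with Q = k*P via baby-step giant-step."""
    m = int(n ** 0.5) + 1
    baby = {ec_mul(j, P): j for j in range(m)}   # j*P -> j
    step = ec_mul(n - m, P)                      # equivalent to -m*P
    giant = Q
    for i in range(m):
        if giant in baby:
            return (i * m + baby[giant]) % n
        giant = ec_add(giant, step)              # Q - (i+1)*m*P
    return None

secret = 13
Q = ec_mul(secret, P)
print(ecdlp_bsgs(Q))   # recovers 13
```

At real key sizes this table-based search is hopeless, which is why the thesis turns to massively parallel GPU approaches instead.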
Sep, 29
TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning
A growing number of commercial and enterprise systems rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. To accommodate the needs of machine learning algorithms, Field Programmable Gate Arrays (FPGAs) provide a promising path forward and represent an intermediate point […]
Sep, 26
From Pixels to Torques: Policy Learning using Deep Dynamical Convolutional Networks
Data-efficient learning in continuous state-action spaces using high-dimensional observations remains an elusive challenge in developing fully autonomous systems. An instance of this challenge is the pixels to torques problem, which identifies key elements of an autonomous agent: autonomous thinking and decision making using sensor measurements only, learning from mistakes, and applying past experiences to novel […]
Sep, 26
A GPU accelerated Barnes-Hut Tree Code for FLASH4
We present a GPU accelerated CUDA-C implementation of the Barnes-Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and is therefore fully MPI-parallel. We describe the algorithm and present test results that demonstrate its […]
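The excerpt only names the algorithm, so here is a hedged, plain-Python sketch of the Barnes-Hut idea for the gravitational potential of point masses: distant groups of bodies are replaced by their total mass at their centre of mass whenever a cell subtends a small enough opening angle. It is not the FLASH4/CUDA octree-mesh implementation; the Node structure, opening angle, and particle data are assumptions for this sketch.

```python
import numpy as np

class Node:
    def __init__(self, center, size):
        self.center, self.size = center, size   # cube centre and edge length
        self.mass, self.com = 0.0, np.zeros(3)  # total mass and centre of mass
        self.children, self.body = None, None   # octant children or a single body

def insert(node, pos, m):
    if node.body is None and node.children is None and node.mass == 0.0:
        node.body = (pos, m)                    # empty leaf: store the body
    else:
        if node.children is None:               # occupied leaf: split into octants
            node.children = {}
            old, node.body = node.body, None
            if old is not None:
                insert(_child(node, old[0]), *old)
        insert(_child(node, pos), pos, m)
    node.com = (node.com * node.mass + pos * m) / (node.mass + m)
    node.mass += m

def _child(node, pos):
    octant = tuple(pos > node.center)
    if octant not in node.children:
        offset = (np.array(octant, float) - 0.5) * node.size / 2
        node.children[octant] = Node(node.center + offset, node.size / 2)
    return node.children[octant]

def potential(node, pos, theta=0.5, G=1.0, eps=1e-9):
    """Potential at pos from all masses stored under node."""
    if node.mass == 0.0:
        return 0.0
    d = np.linalg.norm(pos - node.com)
    if node.children is None or node.size / (d + eps) < theta:
        return 0.0 if d < eps else -G * node.mass / d   # leaf or far-away cell
    return sum(potential(c, pos, theta, G, eps) for c in node.children.values())

# Toy usage: 1000 random unit masses in a unit box.
rng = np.random.default_rng(1)
pts = rng.random((1000, 3))
root = Node(center=np.array([0.5, 0.5, 0.5]), size=1.0)
for q in pts:
    insert(root, q, 1.0)
phi = potential(root, np.array([0.5, 0.5, 0.5]))
```

The GPU version in the paper evaluates the same walk over tree nodes built from AMR blocks rather than individual particles.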
Sep, 26
Efficient Simulation Techniques for Large-Scale Applications
Architecture simulation is an important performance modeling approach. Modeling hardware components with sufficient detail helps architects to identify both hardware and software bottlenecks. However, the major issue with architectural simulation is its huge slowdown compared to native execution. The slowdown grows even larger for emerging workloads that feature high throughput and massive parallelism, such as […]
Sep, 26
Fast Exact Bayesian Inference for High-Dimensional Models
In this text, we present the principles that allow tractable implementation of exact inference for several widespread classes of Bayesian generative models, which until recently were deemed intractable whenever formulated using high-dimensional joint distributions. We will demonstrate the usefulness of such a principled approach with an example of real-time […]
Sep, 26
A Survey of CUDA-based Multidimensional Scaling on GPU Architecture
The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction, defined as the process of mapping data from a high-dimensional space into a low-dimensional one. One of the most popular methods for handling this problem is multidimensional scaling. Due to technological advances, the dimensionality of the input data as […]
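As a rough illustration of what such GPU codes parallelize, here is a minimal NumPy sketch of classical (Torgerson) multidimensional scaling via double centering and an eigendecomposition. The survey's CUDA implementations and iterative variants (e.g. SMACOF) differ in detail; the function name and toy data below are illustrative.

```python
import numpy as np

def classical_mds(D, k=2):
    """D: (n, n) matrix of pairwise Euclidean distances; returns an (n, k) embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]              # keep the k largest
    L = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * L                       # low-dimensional coordinates

# Toy usage: recover a 2-D configuration (up to rotation/reflection) from its distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
```

The dense matrix products and the eigendecomposition are exactly the kernels that map well onto CUDA for large n.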
Sep, 24
A Parallel Framework for Parametric Maximum Flow Problems in Image Segmentation
This paper presents a framework that supports the implementation of parallel solutions for the parametric maximum flow routines widely used in image segmentation algorithms. The framework is based on supergraphs, a special construction combining several image graphs into a larger one, and works on various architectures (multi-core or GPU), either locally or remotely in […]
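The supergraph construction itself is not detailed in the excerpt, so the sketch below only shows the underlying building block such a framework parallelizes: a binary image-segmentation graph whose terminal capacities depend on a parameter lam, solved as an s-t minimum cut (networkx here, on the CPU). The weights, 4-neighbourhood smoothness term, and toy image are assumptions for illustration, not the paper's construction.

```python
import networkx as nx
import numpy as np

def segment(img, lam):
    """Foreground/background split of a grayscale image via s-t minimum cut."""
    h, w = img.shape
    G = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            v = (y, x)
            # Unary terms: brighter pixels attract the source (foreground).
            G.add_edge('s', v, capacity=lam * img[y, x])
            G.add_edge(v, 't', capacity=lam * (1.0 - img[y, x]))
            # Pairwise smoothness terms between 4-neighbours.
            for ny, nxx in ((y + 1, x), (y, x + 1)):
                if ny < h and nxx < w:
                    G.add_edge(v, (ny, nxx), capacity=0.1)
                    G.add_edge((ny, nxx), v, capacity=0.1)
    cut_value, (source_side, _) = nx.minimum_cut(G, 's', 't')
    mask = np.zeros((h, w), bool)
    for node in source_side:
        if node != 's':
            mask[node] = True
    return mask

img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0   # toy image with a bright square
for lam in (0.5, 1.0, 2.0):                   # a small parametric sweep over lam
    print(lam, segment(img, lam).sum())
```

The parametric aspect is that the terminal capacities vary monotonically with lam; solving the family of cuts jointly, rather than one lam at a time, is what the supergraph-based framework targets.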
Sep, 24
Adaptive and Transparent Cache Bypassing for GPUs
Over the last decade, GPUs have become widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs have integrated a multilevel cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular memory accesses. However, performance frequently suffers due to serious congestion […]
Sep, 24
Overcomplete Dictionary Learning with Jacobi Atom Updates
Dictionary learning for sparse representations is traditionally approached with sequential atom updates, in which an optimized atom is used immediately for the optimization of the next atoms. We propose instead a Jacobi version, in which groups of atoms are updated independently, in parallel. Extensive numerical evidence for sparse image representation shows that the parallel algorithms, […]
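A hedged sketch of the contrast the abstract draws, using an approximate K-SVD-style atom update (the paper's exact rule may differ): in the sequential version each refreshed atom immediately influences the residual seen by the next atoms, while in the Jacobi version all atoms are updated independently from the same residual, so the loop parallelizes naturally on a GPU. The sparse codes X are assumed given (e.g. from OMP); shapes and names are illustrative.

```python
import numpy as np

def update_atoms_sequential(Y, D, X):
    """Sequential (Gauss-Seidel-style) atom updates: Y (m, N) signals,
    D (m, K) dictionary, X (K, N) sparse codes."""
    D = D.copy()
    for j in range(D.shape[1]):
        support = np.nonzero(X[j])[0]            # signals that use atom j
        if support.size == 0:
            continue
        # Residual with atom j's contribution removed, restricted to its support.
        E = Y[:, support] - D @ X[:, support] + np.outer(D[:, j], X[j, support])
        d = E @ X[j, support]
        D[:, j] = d / (np.linalg.norm(d) + 1e-12)  # visible to the next atoms
    return D

def update_atoms_jacobi(Y, D, X):
    """Jacobi atom updates: every atom sees the same residual, so the loop body
    is independent across j and maps directly onto a parallel kernel."""
    D_new = D.copy()
    for j in range(D.shape[1]):                   # independent iterations
        support = np.nonzero(X[j])[0]
        if support.size == 0:
            continue
        E = Y[:, support] - D @ X[:, support] + np.outer(D[:, j], X[j, support])
        d = E @ X[j, support]
        D_new[:, j] = d / (np.linalg.norm(d) + 1e-12)
    return D_new                                  # all atoms swapped in together
```

The trade-off studied in the paper is that Jacobi updates use slightly stale information per sweep but expose group-level parallelism that sequential updates cannot.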
Sep, 24
A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor
Energy efficiency is one of the most important metrics in embedded processor design. The use of a wide SIMD architecture is a promising approach to building energy-efficient, high-performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The […]