
Posts

Oct 19

Efficient fine grained shared buffer management for multiple OpenCL devices

OpenCL provides full code portability between different hardware platforms, making it a good candidate for programming heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage diverse OpenCL-enabled devices explicitly, including distributing […]
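The buffer-distribution problem this abstract alludes to can be illustrated with a small sketch (plain Python, not the paper's OpenCL implementation): splitting a shared buffer into contiguous per-device sub-ranges sized by assumed per-device throughput weights. The function name and weights are illustrative, not from the paper.

```python
def partition_buffer(n_elems, weights):
    """Split [0, n_elems) into contiguous per-device sub-ranges,
    sized in proportion to each device's throughput weight."""
    total = sum(weights)
    ranges, start = [], 0
    for i, w in enumerate(weights):
        # The last device absorbs the rounding remainder so the
        # sub-ranges tile the whole buffer with no gaps or overlaps.
        end = n_elems if i == len(weights) - 1 else start + round(n_elems * w / total)
        ranges.append((start, end))
        start = end
    return ranges

# Example: a CPU and two GPUs with relative throughputs 1 : 4 : 3
print(partition_buffer(1000, [1, 4, 3]))  # -> [(0, 125), (125, 625), (625, 1000)]
```

Each sub-range would then back one device-local buffer (e.g. an OpenCL sub-buffer), so devices work on disjoint regions of the shared data.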
Oct 19

Construction of a Virtual Cluster by Integrating PCI Pass-Through for GPU and InfiniBand Virtualization in Cloud

At present, NVIDIA’s CUDA enables programmers to develop highly parallel applications. It builds on several parallel constructs: hierarchical thread blocks, shared memory, and barrier synchronization. Programs developed with CUDA can achieve impressive acceleration. The graphics processor can play an important role in cloud computing in a cluster environment, because it […]
Oct 19

Early Experiences in Running Many-Task Computing Workloads on GPGPUs

This work aims to enable Swift to efficiently use accelerators (such as NVIDIA GPUs) to further accelerate a wide range of applications. This work presents preliminary results in the costs associated with managing and launching concurrent kernels on NVIDIA Kepler GPUs. We expect our results to be applicable to several XSEDE resources, such as Forge, […]
Oct 18

VDBSCAN+: Performance Optimization Based on GPU Parallelism

Spatial data mining techniques enable knowledge extraction from spatial databases. However, the high computational cost and the complexity of algorithms are some of the main problems in this area. This work proposes a new algorithm, referred to as VDBSCAN+, which is derived from the algorithm VDBSCAN (Varied Density Based Spatial Clustering of Applications with Noise) […]
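For context, plain DBSCAN — which VDBSCAN extends with varied density thresholds — can be sketched in a few lines of Python. This is a minimal illustrative version, not the paper's GPU-parallel implementation; the function and parameter names are the standard DBSCAN ones.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: return a cluster id per point, or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:                # grow the cluster from core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reachable from a core point -> border
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

VDBSCAN's key change is that `eps` is not one global constant but varies per density level; the neighborhood queries above are also the part that parallelizes naturally on a GPU.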
Oct 18

Progressive Photon Mapping on GPUs

Physically based rendering using ray tracing is capable of producing realistic images of much higher quality than other methods. However, the computational costs associated with exploring all paths of light are huge; it can take hours to render high quality images of complex scenes. Using graphics processing units has emerged as a popular way to […]
Oct 18

OpenACC-based Snow Simulation

In recent years, the GPU platform has risen in popularity in high performance computing due to its cost effectiveness and the high computing power offered by its many parallel cores. The GPU's computing power can be harnessed using the low-level GPGPU programming APIs CUDA and OpenCL. While both CUDA and OpenCL give the programmer fine-grained control […]
Oct 18

Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors

This paper reports on our experience with a heterogeneous cluster execution environment, in which a distributed parallel application utilizes two types of compute devices: those employing general-purpose processors, and those based on computing accelerators known as Intel Xeon Phi coprocessors. Unlike general-purpose graphics processing units (GPGPUs), Intel Xeon Phi coprocessors are able to execute native […]
Oct 18

Towards Code Generation from Design Models for Embedded Systems on Heterogeneous CPU-GPU Platforms

The complexity of modern embedded systems is ever increasing and the selection of target platforms is shifting from homogeneous to more heterogeneous and powerful configurations. In our previous works, we exploited the power of model-driven techniques to deal with such complexity by enabling the automatic generation of full-fledged functional code from UML models enriched with […]
Oct 18

Heterogeneous FTDT for Seismic Processing

In the early days of computing, scientific calculations were done by specialized hardware. More recently, increasingly powerful CPUs took over and have been dominant for a long time. Now, though, scientific computation is no longer confined to the general CPU environment. GPUs are specialized processors with their own memory hierarchy, requiring more effort to program, […]
Oct 18

Efficient SVM Training Using Parallel Primal-Dual Interior Point Method on GPU

The training of an SVM can be viewed as a Convex Quadratic Programming (CQP) problem, which becomes difficult to solve when dealing with large-scale data sets. Traditional methods such as Sequential Minimal Optimization (SMO) for SVM training solve a sequence of small-scale sub-problems, which costs a large amount of […]
Oct 18

Using CUDA GPU to Accelerate the Ant Colony Optimization Algorithm

Graphics Processing Units (GPUs) have recently evolved into super multi-core, fully programmable architectures. With the CUDA programming model, programmers can straightforwardly implement parallel versions of a task on GPUs. The purpose of this paper is to accelerate Ant Colony Optimization (ACO) for Traveling Salesman Problems (TSP) with GPUs. In this paper, […]
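To show what is being accelerated, here is a minimal sequential ACO-for-TSP sketch in plain Python. It is not the paper's CUDA code; the parameter names (`alpha`, `beta`, `rho`) follow the standard ACO formulation, and all defaults are illustrative.

```python
import math
import random

def aco_tsp(coords, n_ants=10, n_iters=50, alpha=1.0, beta=2.0, rho=0.5, q=1.0, seed=0):
    """Minimal sequential Ant Colony Optimization for the TSP.
    Returns (best_tour, best_length)."""
    rng = random.Random(seed)
    n = len(coords)
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    tau = [[1.0] * n for _ in range(n)]          # pheromone matrix

    def tour_len(t):
        return sum(dist[t[i]][t[(i + 1) % n]] for i in range(n))

    best, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            cur = rng.randrange(n)
            tour, unvisited = [cur], set(range(n)) - {cur}
            while unvisited:
                cand = list(unvisited)
                # Transition rule: pheromone^alpha * (1/distance)^beta
                w = [tau[cur][j] ** alpha * (1.0 / dist[cur][j]) ** beta for j in cand]
                cur = rng.choices(cand, weights=w)[0]
                tour.append(cur)
                unvisited.remove(cur)
            tours.append(tour)
        # Evaporate, then deposit pheromone proportional to tour quality.
        for i in range(n):
            for j in range(n):
                tau[i][j] *= (1 - rho)
        for t in tours:
            length = tour_len(t)
            if length < best_len:
                best, best_len = t, length
            for i in range(n):
                a, b = t[i], t[(i + 1) % n]
                tau[a][b] += q / length
                tau[b][a] += q / length
    return best, best_len

# Four cities on a unit square; the optimal tour is the perimeter, length 4.
tour, length = aco_tsp([(0, 0), (0, 1), (1, 1), (1, 0)])
print(tour, length)
```

The per-ant tour construction is independent across ants, which is exactly the parallelism a GPU implementation maps to thread blocks.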
Oct 18

Dynamic Load Balancing in GPU-Based Systems – Early Experiments

The dynamic load-balancing framework in Charm++/AMPI, developed at the University of Illinois, is based on using processor virtualization to allow thread migration across processors. This framework has been successfully applied to many scientific applications in the past, such as BRAMS, NAMD, ChaNGa, and others. Most of these applications use only CPUs to perform their operations. […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org