high performance computing on graphics processing units: hgpu.org

Posts

May, 20

An Efficient, Automatic Approach to High Performance Heterogeneous Computing

Users of heterogeneous computing systems face two problems: first, understanding the trade-offs among the observable characteristics of their applications, such as latency and quality of the result; and second, exploiting knowledge of these characteristics to allocate work to distributed resources efficiently. A domain-specific approach addresses both of these problems. By considering […]

OpenCL

May, 19

CHO: Towards a Benchmark Suite for OpenCL FPGA Accelerators

Programming FPGAs with OpenCL-based high-level synthesis frameworks is gaining attention, with a number of commercial and research frameworks announced. However, there are no benchmarks for evaluating these frameworks. To this end, we present the CHO benchmark suite, an extension of CHStone, a commonly used C-based high-level synthesis benchmark suite, for OpenCL. We characterise CHO at various […]

OpenCL

May, 19

Optimizing Full Correlation Matrix Analysis of fMRI Data on Intel Xeon Phi Coprocessors

Full correlation matrix analysis (FCMA) is an unbiased approach for exhaustively studying interactions among brain regions in functional magnetic resonance imaging (fMRI) data from human participants. In order to answer neuroscientific questions efficiently, we are developing a closed-loop analysis system with FCMA on a cluster of nodes with Intel Xeon Phi coprocessors. We have proposed […]

May, 19

Use of modern GPUs in Design Optimization

Graphics Processing Units (GPUs) are a promising hardware alternative to Central Processing Units (CPUs) for accelerating applications with a high computational power demand. In many fields, researchers are taking advantage of the high computational power of GPUs to speed up their applications. These applications span from data mining to machine learning and life sciences. […]

May, 19

A GPU-accelerated Navier-Stokes Solver for Steady Turbomachinery Simulations

Even a small improvement to modern turbomachinery components nowadays requires a large number of design evaluations. Every evaluation runs time-consuming simulations. Reducing the computational cost of the simulations allows more evaluations to be run, thus reaching a higher design improvement. In this work, an Nvidia Graphics Processing Unit (GPU) of the Kepler generation is used to accelerate […]

May, 19

An Interrupt-Driven Work-Sharing For-Loop Scheduler

In this paper we present a parallel for-loop scheduler which is based on work-stealing principles but runs under a completely cooperative scheme. POSIX signals are used by idle threads to interrupt left-behind workers, which in turn decide what portion of their workload can be given to the requester. We call this scheme Interrupt-Driven Work-Sharing (IDWS). […]

May, 18

A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems

As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression is a promising approach to increase effective memory system capacity and also provide performance and energy advantages. […]

May, 18

Workshop on Heterogeneous and Unconventional Cluster Architectures and Applications (HUCAA2015), 2015

CALL FOR PAPERS: 4th International Workshop on Heterogeneous and Unconventional Cluster Architectures and Applications (HUCAA 2015), http://www.hucaa-workshop.org/hucaa2015. Sept. 8-11, 2015, Chicago, IL, US. In conjunction with IEEE CLUSTER 2015, the IEEE International Conference on Cluster Computing. ABOUT THE WORKSHOP: The workshop on Heterogeneous and Unconventional Cluster Architectures and Applications gears to gather recent […]

May, 16

A Fast and Rigorously Parallel Surface Voxelization Technique for GPU-Accelerated CFD Simulations

This paper presents a fast surface voxelization technique for the mapping of tessellated triangular surface meshes to uniform and structured grids that provide a basis for CFD simulations with the lattice Boltzmann method (LBM). The core algorithm is optimized for massively parallel execution on graphics processing units (GPUs) and is based on a unique dissection […]

CUDA

May, 16

Multi-GPU Support on Single Node Using Directive-Based Programming Model

Existing studies show that using a single GPU can yield significant performance gains. We should be able to achieve further speedup if we use more than one GPU. Heterogeneous processors consisting of multiple CPUs and GPUs offer immense potential and are often considered leading candidates for porting complex scientific applications. Unfortunately […]

May, 16

Efficient Resource Scheduling for Big Data Processing on Accelerator-based Heterogeneous Systems

The involvement of accelerators is becoming widespread in the field of heterogeneous processing, performing computation tasks across a wide range of applications. In this paper, we examine the heterogeneity in modern computing systems, particularly how to achieve a good level of resource utilization and fairness when multiple tasks with different load and computation ratios are […]

OpenCL

May, 16

Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core – MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA