Posts
Aug, 19
Kernel Tuner: A search-optimizing GPU code auto-tuner
A very common problem in GPU programming is that some combinations of thread block dimensions and other code optimization parameters, such as tiling or unrolling factors, result in dramatically better performance than other kernel configurations. Obtaining highly efficient kernels therefore often requires searching vast and discontinuous spaces that consist of all possible combinations […]
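The discontinuous search this excerpt describes can be sketched as a brute-force enumeration over the cross-product of tuning parameters. The parameter names and the stand-in cost function below are illustrative assumptions, not Kernel Tuner's actual API; a real tuner would compile and time the kernel for each configuration:

```python
import itertools

# Hypothetical tunable parameters for a GPU kernel (illustrative values).
tune_params = {
    "block_size_x": [32, 64, 128, 256],
    "tile_size": [1, 2, 4],
    "unroll": [0, 1],
}

def benchmark(cfg):
    # Stand-in for compiling and timing a kernel with this configuration.
    # The real objective is measured runtime, which is discontinuous in cfg.
    return (abs(cfg["block_size_x"] - 128) / 128
            + 1.0 / cfg["tile_size"]
            + (0.1 if cfg["unroll"] == 0 else 0.0))

# Exhaustive search over the full cross-product of parameter values.
names = list(tune_params)
best_cfg, best_time = None, float("inf")
for values in itertools.product(*(tune_params[n] for n in names)):
    cfg = dict(zip(names, values))
    t = benchmark(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t
```

Even this tiny space has 4 × 3 × 2 = 24 configurations; realistic spaces are far larger, which is why search-optimizing strategies matter.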
Aug, 5
GPU schedulers: how fair is fair enough?
Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do […]
Jul, 1
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and difficulties in optimizing performance. In this paper, we present a […]
Jul, 1
Compiler Fuzzing through Deep Learning
Random program generation – fuzzing – is an effective technique for discovering bugs in compilers but successful fuzzers require extensive development effort for every language supported by the compiler, and often leave parts of the language space untested. We introduce DeepSmith, a novel machine learning approach to accelerating compiler validation through the inference of generative […]
Jun, 24
Synthesis of GPU Programs from High-Level Models
Modern graphics processing units (GPUs) provide high-performance general-purpose computation capabilities. They have massively parallel architectures that are suitable for executing parallel algorithms and operations. They are also throughput-oriented devices that are optimized to achieve high throughput for stream processing. Designing efficient GPU programs is a notoriously difficult task. The ForSyDe methodology is suitable to […]
Jun, 24
Strategies for the Heterogeneous Execution of Large-Scale Simulations on Hybrid Supercomputers
Massively-parallel devices of various architectures are being adopted by the newest supercomputers to overcome the current power constraints in the context of the exascale challenge. This progress leads to an increasing hybridisation of HPC systems and makes the design of computing applications a rather complex problem. Therefore, software efficiency and portability are of crucial […]
Jun, 13
Aspect-Driven Mixed-Precision Tuning Targeting GPUs
Writing mixed-precision kernels makes it possible to achieve higher throughput while keeping output precision within given limits. The recent introduction of native half-precision arithmetic in several GPUs, such as the NVIDIA P100 and AMD Vega 10, makes precision tuning even more relevant. However, it is not trivial to manually find which […]
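The core precision-tuning question — does a reduced-precision variant stay within a given error bound? — can be sketched in pure Python using the stdlib's IEEE 754 half-precision pack format (`struct` format `'e'`). The dot-product kernel and the tolerance are illustrative assumptions, not the paper's tool:

```python
import struct

def to_half(x):
    # Round a Python float to IEEE 754 half precision via struct's 'e' format.
    return struct.unpack('e', struct.pack('e', x))[0]

def dot(xs, ys, cast=lambda v: v):
    # Simple dot product; `cast` simulates the precision of every operation.
    acc = cast(0.0)
    for x, y in zip(xs, ys):
        acc = cast(acc + cast(x) * cast(y))
    return acc

xs = [0.1 * i for i in range(1, 9)]
ys = [0.05 * i for i in range(1, 9)]

exact = dot(xs, ys)               # double-precision reference
half = dot(xs, ys, cast=to_half)  # simulated half-precision kernel
rel_err = abs(half - exact) / abs(exact)
```

A precision tuner would accept the half-precision variant only if `rel_err` stays below the user-specified limit, and otherwise keep that operation in a wider type.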
Jun, 5
The Third International Workshop on GPU Computing and AI (GCA), 2018
==================================================== The Third International Workshop on GPU Computing and AI (GCA) http://is-candar.org/GCA18/ to be held in conjunction with The Sixth International Symposium on Computing and Networking (CANDAR’18), Hida Takayama, Japan, November 27-30, 2018 http://is-candar.org/ ==================================================== [Introduction] Built for massive parallelism, General-Purpose computing on Graphics Processing Units (GPGPU) has superseded high-performance CPUs in several important […]
Jun, 2
clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization
Alternating least squares (ALS) has proven to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core and many-core processors. Existing implementations are limited in either speed or portability. In this paper, we present an efficient and portable ALS […]
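The alternating structure of ALS can be illustrated with a minimal rank-1 factorization in plain Python: with one factor fixed, the least-squares update for the other has a closed form. This is a didactic sketch, not the paper's parallel clMF solver:

```python
def als_rank1(R, iters=20):
    # Factor R ≈ u·vᵀ by alternating closed-form least-squares updates:
    # with v fixed, the optimal u[i] minimizes sum_j (R[i][j] - u[i]*v[j])^2,
    # and symmetrically for v with u fixed.
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        denom_v = sum(vj * vj for vj in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / denom_v for i in range(m)]
        denom_u = sum(ui * ui for ui in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / denom_u for j in range(n)]
    return u, v

# An exactly rank-1 test matrix: R[i][j] = (i+1) * (j+1).
R = [[(i + 1) * (j + 1) for j in range(3)] for i in range(4)]
u, v = als_rank1(R)
err = max(abs(R[i][j] - u[i] * v[j]) for i in range(4) for j in range(3))
```

In a recommender setting R is sparse with higher rank, and the per-row updates become small independent linear systems, which is what makes ALS amenable to the fine-grained parallelization the excerpt describes.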
May, 26
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to […]
May, 12
EngineCL: Usability and Performance in Heterogeneous Computing
Heterogeneous systems composed of a CPU and a set of hardware accelerators have become one of the most common architectures today, thanks to their excellent performance and low energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is even harder. This paper presents EngineCL, a […]