high performance computing on graphics processing units: hgpu.org

Posts

Apr, 15

G-NET: Effective GPU Sharing in NFV Systems

Network Function Virtualization (NFV) virtualizes software network functions to offer flexibility in their design, management and deployment. Although GPUs have demonstrated their power in significantly accelerating network functions, they have not been effectively integrated into NFV systems for the following reasons. First, GPUs are severely underutilized in NFV systems with existing GPU virtualization approaches. Second, […]

CUDA

Apr, 15

Towards a Unified CPU-GPU code hybridization: A GPU Based Optimization Strategy Efficient on Other Modern Architectures

In this paper, we suggest a different methodology to shorten the code optimization development time while getting a unified code with good performance on different targeted devices. In the scope of this study, experiments are illustrated on a Discontinuous Galerkin code applied to Computational Fluid Dynamics. Tests are performed on CPUs, KNL Xeon-Phi and GPUs […]

CUDA

Apr, 7

Evaluating Performance Tradeoffs on the Radeon Open Compute Platform

GPUs have been shown to deliver impressive computing performance, while also providing high energy efficiency, across a wide range of high-performance and embedded system workloads. However, limited support for efficient communication and synchronization between the CPU and the GPU impacts our ability to fully exploit the benefits of heterogeneous systems. Recently, the Heterogeneous System Architecture […]

OpenCL

Apr, 7

A Survey of Techniques for Improving Security of GPUs

The graphics processing unit (GPU), although a powerful performance booster, also has many security vulnerabilities. Because of these, the GPU can act as a safe haven for stealthy malware and as the weakest "link" in the security "chain". In this paper, we present a survey of techniques for analyzing and improving GPU security. We classify the works on key […]

Apr, 7

High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases […]

Apr, 7

Verification of Program Parallelization

This thesis presents techniques to improve the reliability and prove the functional correctness of parallel programs. These requirements are especially crucial in critical systems, where system failures endanger human lives, cause substantial economic damage, or lead to security breaches. Today’s critical systems are expected to deliver more and more complex and computationally intensive functions. In many cases these cannot […]

OpenCL

Apr, 7

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So-called multi-level memories offer differing characteristics for each memory component, including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures — Intel’s Knights Landing processor […]

CUDA

Mar, 31

HDArray: Parallel Array Interface for Distributed Heterogeneous Devices

Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides a mechanism for managing communication between address spaces, and OpenCL provides a way to manage computation and communication within a process that has access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the […]

OpenCL

Mar, 31

A Comparison between GPU-based Volume Ray Casting Implementations: Fragment Shader, Compute Shader, OpenCL, and CUDA

Volume rendering is an important area of study in computer graphics because of its applications in areas such as medicine, physics simulations, and the oil and gas industries. Ray casting is currently the most widely used method for volume rendering. However, a variety of parallel APIs can be used to implement it. Thus, […]

CUDA • OpenCL • OpenGL

Mar, 31

Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs

Deep Convolutional Neural Networks have become a Swiss Army knife for solving critical artificial intelligence tasks. However, deploying deep CNN models for latency-critical tasks remains challenging because of the complex nature of CNNs. Recently, the FPGA has become a favorable device for accelerating deep CNNs thanks to its high parallel processing capability and energy efficiency. […]

Mar, 31

Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU)

A Python non-uniform fast Fourier transform (PyNUFFT) package has been developed to accelerate multidimensional non-Cartesian image reconstruction on heterogeneous platforms. Since scientific computing with Python encompasses a mature and integrated environment, the time efficiency of the NUFFT algorithm has been a major obstacle to real-time non-Cartesian image reconstruction with Python. The current PyNUFFT software enables […]

CUDA • OpenCL

Mar, 31

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both […]

CUDA
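For readers unfamiliar with the compressed-sparse-row (CSR) layout mentioned in the abstract above, the following is a minimal sequential sketch of sparse-matrix dense-matrix multiplication over CSR input. It is not the paper's GPU algorithm and makes no attempt at latency hiding or load balancing; the function name `csr_spmm` and the array names `row_ptr`, `col_idx` and `values` are illustrative assumptions, chosen to follow the usual CSR convention.

```python
def csr_spmm(row_ptr, col_idx, values, dense, n_cols):
    """Multiply a CSR sparse matrix A by a dense matrix B (list of rows).

    Hypothetical helper for illustration only:
      row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i,
      col_idx[k] is the column of the k-th nonzero,
      values[k] is its value.
    """
    n_rows = len(row_ptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Walk only the stored nonzeros of row i; no format conversion needed.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]
            # Each nonzero scales one row of B and accumulates into row i of C.
            for j in range(n_cols):
                out[i][j] += a * b_row[j]
    return out


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]] stored in CSR form
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

C = csr_spmm(row_ptr, col_idx, values, B, n_cols=2)
print(C)  # [[11.0, 14.0], [15.0, 18.0], [19.0, 28.0]]
```

The outer loop over rows is the natural unit of parallel work, and rows with very different nonzero counts are one reason load balancing becomes a concern on a GPU.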


