high performance computing on graphics processing units: hgpu.org

Posts

Nov, 17

A Computing Kernel for Network Binarization on PyTorch

Deep neural networks have achieved state-of-the-art results in a wide range of tasks, including image classification and object detection. However, they are both computationally expensive and memory intensive, making them difficult to deploy on low-power devices. Network binarization is an effective existing technique for model compression and acceleration, but there […]

CUDA

Nov, 17

Compiler-Driven Performance on Heterogeneous Computing Platforms

Modern parallel programming languages such as OpenMP provide simple, portable programming models that support offloading of computation to various accelerator devices. This, coupled with the increasing prevalence of heterogeneous computing platforms and the battle for supremacy in the co-processor space, gives rise to additional challenges placed on compiler/runtime vendors to handle the increasing complexity and diversity […]

CUDA

Nov, 17

word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement

Deep learning natural language processing models often use vector word embeddings, such as word2vec or GloVe, to represent words. A discrete sequence of words can be much more easily integrated with downstream neural layers if it is represented as a sequence of continuous vectors. Also, semantic relationships between words, learned from a text corpus, can […]

CUDA

Nov, 17

Deep Learning Based FPGA-CPU Acceleration

The purpose of this project is to continue exploring new ways of accelerating sequential computer code, and to find out whether the machine learning techniques available today can help with this task. The core idea is to parallelize, at run time and in a way completely transparent to the programmer, the code that’s being […]

Nov, 17

A Highly Parameterizable Framework for Conditional Restricted Boltzmann Machine Based Workloads Accelerated With FPGAs and OpenCL

The Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that CRBM is a suitable mechanism for […]

OpenCL

Nov, 10

Framework for Parallel Kernels Auto-tuning

The result of this thesis is a framework for auto-tuning parallel kernels written in either OpenCL or CUDA. The framework includes advanced functionality such as support for composite kernels and online auto-tuning. The thesis describes the API and internal structure of the framework and presents several examples of its utilization for kernel […]

CUDA, OpenCL

Nov, 10

Study of OpenCL Processing Models for FPGA Devices

In our study, we present the results of implementing the SHA-512 algorithm in FPGAs. The distinguishing element of our work is that we used OpenCL for FPGA, which is a relatively new development method for reconfigurable logic. We examine loop unrolling as an OpenCL performance optimization method and compare the […]

OpenCL

Nov, 10

CL-VIS: Visualization Platform for Understanding and Checking the OpenCL Programs

Due to the improved hardware performance of GPUs, many researchers have tried to utilize the GPU for computer vision, image processing, cryptography, and artificial intelligence. As a result, GPUs have sped up algorithms by tens to hundreds of times in many cases. However, GPU programming is still known to be difficult because of its different characteristics […]

OpenCL

Nov, 10

KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs

In this paper we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a new tool built on top of the LLVM Pass Framework and NVIDIA CUPTI API to dynamically determine the optimal values of kernel launch parameters of a CUDA program P. To be precise, we describe a novel technique to statically build (at the […]

CUDA

Nov, 10

Accelerating Stochastic Simulations on GPUs Using OpenCL

Since its introduction in 2008 with the 1.0 specification, OpenCL has steadily evolved over the decade to increase its support for heterogeneous parallel systems. In this paper, we accelerate stochastic simulation of biochemical reaction networks on modern GPUs (graphics processing units) by means of the OpenCL programming language. In implementing the OpenCL version of the […]

OpenCL

Nov, 9

8th International Workshop on OpenCL, including SYCLcon, 2020

Join us at the 8th International Workshop on OpenCL, including SYCLcon 2020, for three days of talks, workshops and community networking aimed at furthering collaboration and knowledge sharing among the international community of high-performance computing specialists working with OpenCL, SYCL, SPIR and Vulkan Compute. The event provides a rich mix of hands-on tutorials, technical […]

Nov, 3

JSDoop and TensorFlow.js: Volunteer Distributed Web Browser-Based Neural Network Training

In 2019, around 57% of the population of the world has broadband access to the Internet. Moreover, there are 5.9 billion mobile broadband subscriptions, i.e., 1.3 subscriptions per user. So there is an enormous interconnected computational power held by users all around the world. Also, it is estimated that Internet users spend more than six […]

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)