high performance computing on graphics processing units: hgpu.org

Posts

Jun, 20

Neural Code Comprehension: A Learnable Representation of Code Semantics

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program […]

CUDA • OpenCL

Jun, 17

Combining Multiple Optimised FPGA-based Pulsar Search Modules Using OpenCL

Field-Programmable Gate Arrays (FPGAs) are widely used in the central signal processing design of the Square Kilometre Array (SKA) as acceleration hardware. The frequency domain acceleration search (FDAS) module is an important part of the SKA1-MID pulsar search engine. To develop for yet-to-be-finalised hardware, for cross-discipline interoperability and to achieve fast […]

OpenCL

Jun, 17

Dank Learning: Generating Memes Using Deep Neural Networks

We introduce a novel meme generation system which, given any image, can produce a humorous and relevant caption. Furthermore, the system can be conditioned not only on an image but also on a user-defined label relating to the meme template, giving the user a handle on meme content. The system uses a pretrained Inception-v3 network […]

Jun, 17

Neural scene representation and rendering

Scene representation – the process of converting visual sensory data into concise descriptions – is a requirement for intelligent behaviour. Recent work has shown that neural networks excel at this task when provided large labelled datasets. However, removing the reliance on human labelling remains an important open problem. To this end, we introduce the Generative […]

OpenGL

Jun, 17

Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL

Field Programmable Gate Arrays (FPGAs) have been widely used for accelerating machine learning algorithms. However, the high design cost and time of implementing FPGA-based accelerators with traditional HDL-based design methodologies have discouraged users from designing them. In recent years, a new CAD tool called Intel FPGA SDK for OpenCL (IFSO) has allowed fast and efficient […]

OpenCL

Jun, 17

NCRF++: An Open-source Neural Sequence Labeling Toolkit

This paper describes NCRF++, a toolkit for neural sequence labeling. NCRF++ is designed for quick implementation of different neural sequence labeling models with a CRF inference layer. It provides users with an interface for building a custom model structure through a configuration file, with flexible neural feature design and utilization. Built on PyTorch, the core operations […]

CUDA

Jun, 13

Implementing general matrix-matrix multiplication algorithm on the Intel Xeon Phi Knights Landing Processor

This paper presents the design and implementation of a general matrix-matrix multiplication (GEMM) algorithm for the second-generation Intel Xeon Phi processor codenamed Knights Landing (KNL). We illustrate several development guidelines for achieving optimal performance with the C programming language and the Advanced Vector Extensions (AVX-512) instruction set. Further, we present several environment variable issues associated with […]

Jun, 13

Assessment of various GPU acceleration strategies in text categorization processing flow

Automatic text categorization presents many difficulties. Modern algorithms are getting better at extracting meaningful information from human language. However, they often significantly increase computational complexity. This increased demand for computational capability can be met by using hardware accelerators such as general-purpose graphics cards. In this paper we present a full processing flow […]

OpenCL

Jun, 13

Indigo: A Domain-Specific Language for Fast, Portable Image Reconstruction

Linear operators used in iterative methods like conjugate gradient have typically been implemented either as "matrix-driven" subroutines backed by explicit sparse or dense matrices, or as "matrix-free" subroutines that implement specific linear operations directly (e.g. FFTs). The matrix-driven approach is generally more portable because it can target widely available BLAS libraries, but it can be […]

CUDA

Jun, 13

Aspect-Driven Mixed-Precision Tuning Targeting GPUs

Writing mixed-precision kernels makes it possible to achieve higher throughput while keeping output precision within given limits. The recent introduction of native half-precision arithmetic in several GPUs, such as NVIDIA P100 and AMD Vega 10, makes precision tuning even more relevant. However, it is not trivial to manually find which […]

OpenCL

Jun, 13

Efficient Large-scale Approximate Nearest Neighbor Search on OpenCL FPGA

We present a new method for Product Quantization (PQ) based approximate nearest neighbor (ANN) search in high-dimensional spaces. Specifically, we first propose a quantization scheme for the codebooks of the coarse quantizer, product quantizer, and rotation matrix, to reduce the cost of accessing these codebooks. Our approach also combines a highly parallel k-selection method, which […]

OpenCL

Jun, 9

Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures

Sparse matrix-vector multiplication (SpMV) is one of the most common operations in scientific and high-performance applications, and is often the application's performance bottleneck. While the sparse matrix representation has a significant impact on the resulting application performance, choosing the right representation typically relies on expert knowledge and trial and error. This paper […]
