high performance computing on graphics processing units: hgpu.org

Posts

Oct, 8

GPU-Based Computation of 2D Least Median of Squares with Applications to Fast and Robust Line Detection

The 2D Least Median of Squares (LMS) is a popular tool in robust regression because of its high breakdown point: up to half of the input data can be contaminated with outliers without affecting the accuracy of the LMS estimator. The complexity of 2D LMS estimation has been shown to be $\Omega(n^2)$ where $n$ is […]

CUDA
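
To make the LMS criterion in the entry above concrete, here is a minimal CPU sketch in Python. It is not the GPU method from the paper: it only tries the line through every pair of points and keeps the candidate with the smallest median squared vertical residual, which is a common approximation rather than an exact LMS solver. The function name `lms_line_fit` and the sample data are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line_fit(points):
    """Approximate LMS fit: test the line through each pair of points and
    keep the one whose median squared vertical residual is smallest."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # skip vertical candidate lines
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        med = median((y - (slope * x + intercept)) ** 2 for x, y in points)
        if best is None or med < best[0]:
            best = (med, slope, intercept)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2], best[0]


# Hypothetical data: seven inliers on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(7)] + [(1, 40), (3, -25), (5, 60)]
slope, intercept, med = lms_line_fit(points)
print(slope, intercept, med)
```

Because the score is a median rather than a sum, the three outliers do not move the fitted line as long as more than half of the points agree with it. The pair enumeration alone is quadratic in the number of points, which hints at why a parallel implementation is attractive.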

Oct, 6

CVC: The Contourlet Video Compression algorithm for real-time applications

Nowadays, real-time video communication over the internet through video conferencing applications has become an invaluable tool in everyone’s professional and personal life. This trend underlines the need for video coding algorithms that provide acceptable quality at low bitrates and can support various resolutions inside the same stream in order to cope with limitations on computational […]

CUDA

Oct, 6

MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing

Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints for low power consumption and small sizes, makes the design of numerical libraries […]

CUDA

Oct, 6

Parallel Graph Algorithms on the Xeon Phi Coprocessor

Complex networks have received interest in a wide area of applications, ranging from road networks over hyperlink connections in the world wide web to interactions between people. Advanced algorithms are required for the generation as well as visualization of such graphs. In this work two graph algorithms, one for graph generation, the other for graph […]

Oct, 6

Optimizing GPU-accelerated Group-By and Aggregation

The massive parallelism and faster random memory access of Graphics Processing Units (GPUs) promise to further accelerate complex analytics operations such as joins and grouping, but also provide additional challenges to optimizing their performance. There are more implementation alternatives to consider on the GPU, such as exploiting different types of memory on the device and […]

CUDA

Oct, 6

A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs

Array-based languages such as MATLAB and Python (with NumPy) have become very popular for scientific computing. However, the performance of the implementations of these languages is often lacking. For example, some of the implementations are interpreted. Further, these languages were not designed with multi-core CPUs and GPUs in mind and thus don’t take full advantage […]

OpenCL

Oct, 3

Tuned and GPU-accelerated parallel data mining from comparable corpora

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on […]

Oct, 3

Fast Algorithms for Convolutional Neural Networks

We derive a new class of fast algorithms for convolutional neural networks using Winograd’s minimal filtering algorithms. Specifically, we derive algorithms for network layers with 3×3 kernels, which are the preferred kernel size for image recognition tasks. The best of our algorithms reduces arithmetic complexity by up to 4X compared with direct convolution, while using small […]
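
As an illustration of the minimal filtering idea mentioned above, the sketch below implements the textbook one-dimensional case F(2,3) in Python: two outputs of a 3-tap filter computed with four multiplications instead of six. It is only the 1D building block, not the paper's 2D algorithm for 3×3 kernels, and the function names `winograd_f23` and `direct_f23` are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms depend only on g, so they can be computed once per filter.
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    # The four data-dependent multiplications.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs by direct sliding dot products."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]


# Hypothetical sample input and filter.
d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -2.0]
fast = winograd_f23(d, g)
direct = direct_f23(d, g)
print(fast, direct)
```

The saving comes from trading multiplications for additions on the data side. Nesting the same transform along both image axes is the usual route to a 2D variant for 3×3 kernels.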

Oct, 3

Brute-Force k-Nearest Neighbors Search on the GPU

We present a brute-force approach for finding k-nearest neighbors on the GPU for many queries in parallel. Our program takes advantage of recent advances in fundamental GPU computing primitives. We modify a matrix multiplication subroutine in the MAGMA library [6] to calculate the squared Euclidean distances between queries and references. The nearest neighbors selection is accomplished […]

CUDA
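
The link between matrix multiplication and squared Euclidean distances in the k-nearest neighbors entry above rests on the identity ‖q − r‖² = ‖q‖² + ‖r‖² − 2 q·r, where the cross terms q·r for all pairs form a matrix product. The pure-Python sketch below shows that identity on the CPU. It is not the MAGMA-based kernel, the function names are hypothetical, and a plain sort stands in for the selection step, which the excerpt does not describe.

```python
def squared_distances(queries, refs):
    """All-pairs squared Euclidean distances via ||q||^2 + ||r||^2 - 2 q.r."""
    q_norms = [sum(x * x for x in q) for q in queries]
    r_norms = [sum(x * x for x in r) for r in refs]
    # cross[i][j] = q_i . r_j, i.e. the matrix product Q R^T.
    cross = [[sum(a * b for a, b in zip(q, r)) for r in refs] for q in queries]
    return [
        [q_norms[i] + r_norms[j] - 2 * cross[i][j] for j in range(len(refs))]
        for i in range(len(queries))
    ]


def knn(queries, refs, k):
    """Indices of the k nearest references per query (sort-based selection)."""
    dists = squared_distances(queries, refs)
    return [sorted(range(len(refs)), key=row.__getitem__)[:k] for row in dists]


# Hypothetical sample data.
refs = [(0, 0), (1, 0), (0, 3), (5, 5)]
queries = [(0, 1), (4, 4)]
neighbors = knn(queries, refs, 2)
print(neighbors)
```

Casting the distance computation as one large product is what lets a tuned matrix multiplication routine do most of the work. The norms are cheap by comparison and can be added afterwards.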

Sep, 30

Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The $\textit{split and parallelize}$ ($\texttt{SaP}$) approach seeks […]

CUDA

Sep, 29

Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic

Insider threat is one of the risks both government and private organizations have to deal with in protecting their important information. Data exfiltration and data leakage resulting from insiders’ activities can be very difficult to identify and quantify. Unfortunately, existing solutions that efficiently check whether data moving across a network is known to be sensitive […]

CUDA

Sep, 29

The Dynamical Kernel Scheduler – Part 1

Emerging processor architectures such as GPUs and Intel MICs provide a huge performance potential for high performance computing. However, developing software using these hardware accelerators introduces additional challenges for the developer such as exposing additional parallelism, dealing with different hardware designs and using multiple development frameworks in order to use devices from different vendors. The […]

CUDA • OpenCL

