high performance computing on graphics processing units: hgpu.org

Posts

Mar, 10

Pragma Directed Shared Memory Centric Optimizations on GPUs

GPUs have become a ubiquitous choice as coprocessors because of their excellent concurrent-processing ability. In GPU architectures, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain regular access patterns, optimizing for shared memory is […]

CUDA • OpenCL

Mar, 10

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures

One area of computing applications that poses a significant challenge to performance scalability on chip multiprocessors (CMPs) is irregular applications. Such applications have very little computation and unpredictable memory access patterns, making them memory-bound rather than compute-bound. Since the gap between processor and memory performance continues to exist, difficulty to hide and decrease this gap […]

CUDA

Mar, 10

Testing fine-grained parallelism for the ADMM on a factor-graph

There is an ongoing effort to develop tools that apply distributed computational resources to tackle large problems or reduce the time to solve them. In this context, the Alternating Direction Method of Multipliers (ADMM) arises as a method that can exploit distributed resources like the dual ascent method and has the robustness and improved convergence […]

CUDA

Mar, 8

D-face: Parallel Implementation of CNN Based Face Classifier using Drone Data On K40 & Jetson TK1

Convolutional Neural Networks (CNNs) have been shown to perform very well in areas such as video surveillance, object classification and face classification. Face classification has become pertinent to numerous applications, especially in this big data era of social platforms and social media. With the usage of unmanned air-borne vehicles like drones, the problem of face […]

CUDA

Mar, 8

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, […]

OpenCL

Mar, 8

Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures

The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices. However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times […]

OpenCL

Mar, 8

A Novel Mapping of Arbitrary Precision Integer Operations to the GPU

With modern processing hardware converging on the physical barrier in terms of transistor size and speed per single core, hardware manufacturers have shifted their focus to improve performance from raw clock power towards parallelization. Solutions to utilize the computation power of GPUs are published and supported by graphics card manufacturers. While there exist solutions for […]

OpenCL

Mar, 7

Topology optimization design of 3D electrothermomechanical actuators by using GPU as a co-processor

The topology optimization method (TOM) requires substantial computational resources, especially in multiphysics problems. The cost is high because TOM is an iterative technique in which the number of iterations ranges from tens to thousands. Furthermore, at each TOM iteration, it is necessary to execute several routines such as the finite element […]

CUDA

Mar, 5

Performance Analysis of kNN on large datasets using CUDA & Pthreads

Several organizations have large databases that grow rapidly day by day and need to be regularly maintained. Content-based searches are similarity searches based on certain features obtained from various multimedia data. For various applications like multimedia content retrieval, data mining, pattern recognition, etc., performing the nearest neighbor […]

CUDA

Mar, 5

Fast LZW compression using a GPU

LZW compression is a well-known patented lossless compression method used in the Unix file compression utility "compress" and in the GIF and TIFF image formats. It converts an input string of characters (or 8-bit unsigned integers) into a string of codes using a code table (or dictionary) that maps strings into codes. Since the code […]

CUDA
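
The dictionary-based encoding described in the excerpt above can be illustrated with a minimal sequential sketch. This is a textbook-style LZW over bytes, not the GPU method of the listed paper; the function names `lzw_compress` and `lzw_decompress` are hypothetical, and the code table is assumed to grow without a size limit (real formats such as GIF cap the code width).

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Encode bytes as a list of LZW codes (sequential sketch, unbounded table)."""
    # Assumption: the table starts with one code per 8-bit value (0..255).
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes: List[int] = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the current string while it is still in the table.
            current = candidate
        else:
            # Emit the code of the longest known string, then register the new one.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the original bytes by reconstructing the same table on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        output.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


if __name__ == "__main__":
    encoded = lzw_compress(b"ABABABA")
    print(encoded)
    print(lzw_decompress(encoded))
```

Each emitted code depends on the table state left by all earlier input, which is one reason a parallel version needs a different formulation than this loop.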

Mar, 5

Input Space Splitting for OpenCL

The performance of OpenCL programs suffers from memory and control flow divergence. Therefore, OpenCL compilers employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent, hence can be observed for some, but not all inputs. In these cases, vectorizing compilers have to generate […]

OpenCL

Mar, 5

Heterogeneous parallel algorithms for Computational Fluid Dynamics on unstructured meshes

The frontiers of computational fluid dynamics (CFD) are constantly expanding and eagerly demanding more computational resources. Currently, we are experiencing a rapid evolution in high performance computing systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as math co-processors for increasing the throughput and the FLOP per watt ratio. On […]

CUDA • OpenCL

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
