high performance computing on graphics processing units: hgpu.org

Posts

Sep, 2

Performance Evaluation and Tuning of An OpenCL based Matrix Multiplier

Matrix multiplication is one of the fundamental building blocks of numerical linear algebra. It requires computer systems to have huge computing capability and consumes much more power as the problem size increases. In this research, an OpenCL-based matrix multiplier is presented. When data are single-precision floating-point values, compared with the software simulations based on the Intel […]

OpenCL

Sep, 2

Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs

Conventional GPU implementations of Strassen’s algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, "squarish" matrices due to the extra memory overhead, and their usage is limited due to the considerable workspace. We present novel Strassen […]

CUDA
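To make the "trading space for time" remark concrete, here is a minimal sketch of the textbook Strassen recursion in plain Python. It is not the CUTLASS-based implementation from the paper; the helper names (`mat_add`, `mat_sub`, `naive_mul`, `split`, `strassen`) are hypothetical, and the sketch assumes square matrices whose size is a power of two. The seven temporary products and the operand sums illustrate the extra workspace the abstract refers to.

```python
def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def naive_mul(a, b):
    # Classic triple loop: 8 block products per 2x2 partition.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(m):
    # Return the four quadrants (m11, m12, m21, m22) of a square matrix.
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=1):
    """Textbook Strassen multiply for n x n matrices, n a power of two."""
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes n is a power of two")
    if n <= cutoff:
        return naive_mul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of eight; each needs temporary
    # operand sums, which is where the extra memory goes.
    m1 = strassen(mat_add(a11, a22), mat_add(b11, b22), cutoff)
    m2 = strassen(mat_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, mat_sub(b12, b22), cutoff)
    m4 = strassen(a22, mat_sub(b21, b11), cutoff)
    m5 = strassen(mat_add(a11, a12), b22, cutoff)
    m6 = strassen(mat_sub(a21, a11), mat_add(b11, b12), cutoff)
    m7 = strassen(mat_sub(a12, a22), mat_add(b21, b22), cutoff)
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_add(mat_sub(m1, m2), m3), m6)
    # Stitch the quadrants back together row by row.
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

In practice a larger `cutoff` would hand small blocks to a tuned GEMM, since the recursion only pays off for large matrices.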

Sep, 2

Full Speed Ahead: 3D Spatial Database Acceleration with GPUs

Many industries rely on visual insights to support decision-making processes in their businesses. In mining, the analysis of drills and geological shapes, represented as 3D geometries, is an important tool to assist geologists in the search for new ore deposits. Aeronautics manipulate high-resolution geometries when designing a new aircraft aided by the numerical simulation […]

CUDA

Sep, 2

A study of integer sorting on multicores

Integer sorting on multicores and GPUs can be realized by a variety of approaches that include variants of distribution-based methods such as radix-sort, comparison-oriented algorithms such as deterministic regular sampling and random sampling parallel sorting, and network-based algorithms such as Batcher’s bitonic sorting algorithm. In this work we present an experimental study of integer sorting […]
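As an illustration of the distribution-based family mentioned above, here is a minimal sequential sketch of least-significant-digit radix sort in Python. It is not taken from the study; the function name `radix_sort` and the `bits_per_pass` parameter are hypothetical, and the sketch assumes non-negative integers. Each pass has the histogram, prefix-sum, and scatter structure that parallel variants typically distribute across cores or threads.

```python
def radix_sort(values, bits_per_pass=8):
    """LSD radix sort for non-negative integers (sequential sketch)."""
    if not values:
        return []
    if min(values) < 0:
        raise ValueError("this sketch handles non-negative integers only")
    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(values)
    max_value = max(out)
    shift = 0
    while (max_value >> shift) > 0:
        # Histogram of the current digit.
        counts = [0] * radix
        for v in out:
            counts[(v >> shift) & mask] += 1
        # Exclusive prefix sum: start offset of each digit's bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Stable scatter into the next buffer.
        nxt = [0] * len(out)
        for v in out:
            d = (v >> shift) & mask
            nxt[offsets[d]] = v
            offsets[d] += 1
        out = nxt
        shift += bits_per_pass
    return out
```

A smaller `bits_per_pass` means more passes with smaller histograms; the right trade-off depends on the hardware and is not something this sketch tries to settle.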

Aug, 26

Deep learning: A guide for practitioners in the physical sciences

Machine learning is finding increasingly broad applications in the physical sciences. This most often involves building a model relationship between a dependent, measurable output, and an associated set of controllable, but complicated, independent inputs. We present a tutorial on current techniques in machine learning – a jumping-off point for interested researchers to advance their work. […]

Aug, 26

Optimizing Web Virtual Reality

Performance has always been a key factor in any virtual and augmented reality experience. Since virtual reality was conceived, performance has often slowed, and at times even halted, the adoption of virtual-reality-related technologies. More recently, the hardware advancements have caught up with the development so that […]

OpenGL

Aug, 26

Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU

The trend in computer architectures has for several years been heterogeneous systems consisting of a regular CPU and at least one additional, specialized processing unit, such as a GPU. The different characteristics of the processing units and the requirement of multiple tools and programming languages make programming of such systems a challenging task. Although there exist […]

CUDA • OpenCL

Aug, 26

Performance Evaluation of OpenMP’s Target Construct on GPUs – Exploring Compiler Optimizations

OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU […]

CUDA

Aug, 26

A Qualitative Comparison Study Between Common GPGPU Frameworks

Over the last decade, graphics processing units have improved significantly in performance while becoming cheaper. This has led to a new type of usage in which the massive parallelism available in modern GPUs is used for more general-purpose computing, also known as GPGPU. Frameworks have been […]

CUDA • OpenCL

Aug, 19

Kernel Tuner: A search-optimizing GPU code auto-tuner

A very common problem in GPU programming is that some combination of thread block dimensions and other code optimization parameters, like tiling or unrolling factors, results in dramatically better performance than other kernel configurations. To obtain highly efficient kernels, it is often necessary to search vast and discontinuous search spaces that consist of all possible combinations […]

CUDA • OpenCL

Aug, 19

Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, neural machine translation and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations including matrix-matrix multiplication formulation, and direct convolution primarily targeting […]

Aug, 19

CosmoFlow: Using Deep Learning to Learn the Universe at Scale

Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
