high performance computing on graphics processing units: hgpu.org

Posts

Nov, 19

SGPU 2: a runtime system for using large applications on clusters of hybrid nodes

In this article, we consider hybrid architectures that consist of standard CPU cores associated with accelerators (such as GPUs). These architectures are increasingly employed in large computing centers. We develop a strategy designed to deal with hybrid computing architectures from the computing performance and programmability points of view. We focus on hybrid computing clusters that […]

CUDA

Nov, 19

Predictive Modeling and Analysis of OP2 on Distributed Memory GPU Clusters

OP2 is an "active" library framework for the development and solution of unstructured mesh-based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the backend to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and […]

CUDA

Nov, 19

Teaching graphics processing and architecture using a hardware prototyping approach

Since its introduction over two decades ago, graphics hardware has continued to evolve to improve rendering performance and increase programmability. While most undergraduate courses in computer graphics focus on rendering algorithms and programming APIs, we have recently created an undergraduate senior elective course that focuses on graphics processing and architecture, with a strong emphasis on […]

OpenGL

Nov, 19

StreamMR: An Optimized MapReduce Framework for AMD GPUs

MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally focusing on the NVIDIA GPU. Our investigation reveals that the design and mapping of […]

OpenCL

Nov, 18

Design and Implementation of a PTX Emulation Library

Intel co-founder Gordon E. Moore observed in 1965 that transistor density, the number of transistors that could be placed in an integrated circuit per square inch, increased exponentially, doubling roughly every two years. This would be later known as Moore’s Law, correctly predicting the trend that governed computing hardware manufacturing for the late 20th century. […]

Nov, 18

Particle-based Visualization of Large Cosmological Datasets

Large quantities of simulated cosmological particle-based data cause considerable problems when it comes to real-time visualization. This paper considers an out-of-core approach for solving visualization problems on a single-desktop workstation. The approach proposed in this paper consists of two phases: the data preprocessing and its visualization. During the preprocessing, the cosmological data is hierarchically organized […]

OpenGL

Nov, 18

Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors

This paper shows how to build algorithms that use graphics processing units (GPUs) installed in most modern computers to solve dynamic equilibrium models in economics. In particular, we rely on the compute unified device architecture (CUDA) of NVIDIA GPUs. We illustrate the power of the approach by solving a simple real business cycle model with […]

CUDA

Nov, 18

The MOPED framework: Object recognition and pose estimation for manipulation

We present MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework. We address two main challenges in computer vision for robotics: robust performance in complex scenes, and low latency for real-time operation. We achieve robust performance […]

Nov, 18

Fast Gather-based Construction of Stereoscopic Images Using Reprojection

We developed a very fast reprojection technique to generate stereoscopic images from a 2D image with depth information. The technique is gather-based and therefore very fast on current graphics hardware. The depth information is sampled at a specific offset which provides the depth to reproject from the left or right camera to the center camera. […]

OpenGL

Nov, 18

Accelerating The Cloud with Heterogeneous Computing

Heterogeneous multiprocessors that combine multiple CPUs and GPUs on a single die are poised to become commonplace in the market. As seen recently from the high performance computing community, leveraging a GPU can yield performance increases of several orders of magnitude. We propose using GPU acceleration to greatly speed up cloud management tasks in VMMs. […]

OpenCL

Nov, 18

Auto-tunable GPU BLAS

OpenCL is fast becoming the preferred framework used to make programs for heterogeneous platforms consisting of at least one CPU and one or more accelerators. The GPU being readily available in almost all computers, it is the most common accelerator in use. Good libraries are important to reduce development time and to make particular development environments, […]

OpenCL

Nov, 18

The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing

Multi2Sim is a simulation framework for heterogeneous computing, including models for superscalar, multithreaded, multicore, and graphics processors. Multi2Sim is an application-only simulator, which allows one or more applications to be run on top of it without booting a guest operating system first. In this chapter, an introduction to Multi2Sim is presented, and it is shown […]

OpenCL

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)