high performance computing on graphics processing units: hgpu.org

Posts

Apr, 20

Discrete-event Execution Alternatives on General Purpose Graphical Processing Units (GPGPUs)

Graphics cards, traditionally designed as accelerators for computer graphics, have evolved to support more general-purpose computation. General Purpose Graphical Processing Units (GPGPUs) are now being used as highly efficient, cost-effective platforms for executing certain simulation applications. While most of these applications belong to the category of timestepped simulations, little is known about the applicability of […]

Apr, 20

An efficient GPU implementation of the revised simplex method

The computational power provided by the massive parallelism of modern graphics processing units (GPUs) has moved increasingly into focus over the past few years. In particular, general purpose computing on GPUs (GPGPU) is attracting attention among researchers and practitioners alike. Yet GPGPU research is still in its infancy, and a major challenge is to rearrange […]

Apr, 20

Tutorial 3: Methodologies and Performance Impacts of General Purpose Computing on GPUs

Graphics Processing Units (GPUs) have been applied to graphics applications to render realistic perspectives of virtual scenes, especially in the entertainment market. Driven by market demand for super-high-definition scenes at high frame rates that simulate physical phenomena naturally in visualization applications, the last decade brought drastic performance improvements in GPUs. […]

Apr, 20

Design and implementation of software-managed caches for multicores with local memory

Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to place their code and data. Programmers of such […]

Apr, 20

Compressing Floating-Point Number Stream for Numerical Applications

Clusters of commodity computers and general-purpose computers with accelerators such as GPGPUs are now common platforms for solving computationally intensive tasks like scientific simulations. Both technologies provide users with high performance at relatively low cost. However, the low bandwidth of the interconnect compared to the computing performance hinders efficient operation of both cluster and accelerator […]

Apr, 19

HPP-Controller: An intra-node controller designed for connecting heterogeneous CPUs

Heterogeneity is considered a solution for scaling supercomputers to petascale. Many systems composed of general CPUs and special processing units such as Cells, GPGPUs and FPGAs have been implemented. In these systems, CPUs need to interact with special processing units to process data together, thus communications between these heterogeneous processing units become […]

Apr, 19

Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also […]

CUDA

Apr, 19

Parallel Approaches for SWAMP Sequence Alignment

This document is a summary and overview of several approaches to implement the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model while maintaining speed, accuracy, and producing a […]

Apr, 19

A Hybrid Analytical DRAM Performance Model

As process technology scales, the number of transistors that can fit in a unit area has increased exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad […]

CUDA

Apr, 19

Extending the Scalability of Single Chip Stream Processors with On-chip Caches

As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a […]

Apr, 19

A Task-centric Memory Model for Scalable Accelerator Architectures

This paper presents a task-centric memory model for 1000-core compute accelerators. Visual computing applications are emerging as an important class of workloads that can exploit 1000-core processors. In these workloads, we observe data sharing and communication patterns that can be leveraged in the design of memory systems for future 1000-core processors. Based on these insights, […]

Apr, 19

Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today’s desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into […]

CUDA

