high performance computing on graphics processing units: hgpu.org

Posts

Oct, 15

Towards Utilizing Remote GPUs for CUDA Program Execution

The modern CPU has been designed to accelerate serial processing as much as possible. Recently, GPUs have been exploited to solve large parallelizable problems. As fast as a GPU is for general purpose massively parallel computing, some problems require an even larger scale of parallelism and pipelining. However, it has been difficult to scale algorithms […]

CUDA

Oct, 15

Functional High Performance Financial IT

The world of finance faces the computational performance challenge of massively expanding data volumes, extreme response time requirements, and compute-intensive complex (risk) analyses. Simultaneously, new international regulatory rules require considerably more transparency and external auditability of financial institutions, including their software systems. To top it off, increased product variety and customisation necessitates shorter software development […]

OpenCL

Oct, 15

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. […]

CUDA

Oct, 15

Effects of compression on data intensive algorithms

In recent years, the gap between bandwidth and computational throughput has become a major challenge in high performance computing (HPC). Data intensive algorithms are particularly affected by the limitations of I/O bandwidth and latency. In this thesis project, data compression is explored so that fewer bytes need to be read from disk. The computational capabilities […]

CUDA

Oct, 15

Bandwidth Reduction Through Multithreaded Compression of Seismic Images

One of the main challenges of modern computer systems is to overcome the ever more prominent limitations of disk I/O and memory bandwidth, which today are thousands-fold slower than computational speeds. In this paper, we investigate reducing memory bandwidth and overall I/O and memory access times by using multithreaded compression and decompression of large datasets. […]

CUDA

Oct, 15

Speeding up the MATLAB complex networks package using graphic processors

The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to […]

CUDA

Oct, 15

GPU fluids in production: a compiler approach to parallelism

Fluid effects in films require the utmost flexibility, from manipulating a small lick of flame to art-directing a huge tidal wave. While fluid solvers are increasingly making use of GPU hardware, one of the biggest challenges is taking advantage of this technology without compromising on either adaptability or performance. We developed the Jet toolset comprised […]

Oct, 15

Accelerating code on multi-cores with FastFlow

FastFlow is a programming framework specifically targeting cache-coherent shared-memory multi-cores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory fence free) synchronization mechanisms. Its philosophy is to combine programmability with performance. In this paper a new FastFlow programming methodology aimed at supporting parallelization of existing sequential code […]

Oct, 15

Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards

In the last decade, there has been a dramatic growth in research and development of massively parallel commodity graphics hardware both in academia and industry. Graphics card architectures provide an optimal platform for parallel execution of many number crunching loop programs from fields like image processing or linear algebra. However, it is hard to efficiently […]

CUDA

Oct, 14

An Analysis of Programmer Productivity versus Performance for High Level Data Parallel Programming

Data parallel programming provides an accessible model for exploiting the power of parallel computing elements without resorting to the explicit use of low level programming techniques based on locks, threads and monitors. The emergence of Graphics Processing Units (GPUs) with hundreds or thousands of processing cores has made data parallel computing available to a wider […]

CUDA

Oct, 14

Accelerating Large Scale Image Analyses on Parallel CPU-GPU Equipped Systems

General-purpose graphical processing units (GPGPUs) have transformed high-performance computing over the past decade. Making great computational power available with reduced cost and power consumption overheads, heterogeneous CPU-GPU-equipped systems have helped to make possible the emerging class of exascale data-intensive applications. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of […]

CUDA

Oct, 14

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available […]

CUDA
