high performance computing on graphics processing units: hgpu.org

Posts

Apr, 16

CISE 2014 – Asian Conference on Computer and Information Science and Engineering

The Asian Conference on Computer and Information Science and Engineering will cover all topics within the field of computer and information science. This inaugural event promises to attract experts in computer and information science and engineering, and to allow professors, researchers and university students to collaborate in this ever-growing field.

Apr, 16

Performance-aware component composition for GPU-based systems

This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems that contain multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address the programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations […]

CUDA
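The core idea in the abstract above, several implementations of one component with the choice made per call, can be illustrated with a small Python sketch. It is not the thesis's framework: the `Variant` class, `select_variant`, `call_component`, and the linear cost models (a fixed overhead plus a per-element cost, with made-up coefficients) are hypothetical, and both variants actually run on the CPU.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Variant:
    """One implementation of a component plus a model of its run time."""
    name: str
    run: Callable[[Sequence[float]], float]
    fixed: float      # hypothetical fixed overhead (e.g. launch, transfer)
    per_item: float   # hypothetical cost per input element

    def predicted_time(self, n: int) -> float:
        # Linear cost model: overhead plus per-element work.
        return self.fixed + self.per_item * n


def select_variant(variants: List[Variant], n: int) -> Variant:
    """Return the variant with the lowest predicted time for input size n."""
    return min(variants, key=lambda v: v.predicted_time(n))


def call_component(variants: List[Variant], data: Sequence[float]) -> float:
    """Dispatch one call to whichever variant the models predict is fastest."""
    return select_variant(variants, len(data)).run(data)


# Both variants compute the same result (a sum); only their cost models differ.
# The "gpu" variant is a stand-in that also runs on the CPU here.
variants = [
    Variant("cpu", lambda xs: float(sum(xs)), fixed=0.0, per_item=1.0),
    Variant("gpu", lambda xs: float(sum(xs)), fixed=1000.0, per_item=0.1),
]

print(select_variant(variants, 100).name)         # cpu
print(select_variant(variants, 10_000).name)      # gpu
print(call_component(variants, [1.0, 2.0, 3.0]))  # 6.0
```

With these placeholder models the CPU variant wins for small inputs and the offloaded one for large inputs, because its fixed overhead is amortized; a real system would calibrate such models from measurements.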

Apr, 16

On optimization techniques for the matrix multiplication on hybrid CPU+GPU platforms

The use of auto-tuning techniques in a matrix multiplication routine for hybrid CPU+GPU platforms is analyzed. Basic models of the execution time of the hybrid routine, together with information obtained during its installation, are used to optimize the execution time by balancing the assignment of computation among the computing components of the heterogeneous system. Satisfactory […]

CUDA
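One simple way to realize a balanced assignment is to split the rows of the left operand so that the predicted CPU and GPU times are equal. The Python sketch below assumes a linear per-row time model for each device (the paper's models may be richer) and uses hypothetical function names; it computes both shares on the CPU only to show that the partition preserves the product.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def balanced_split(rows: int, cpu_time_per_row: float,
                   gpu_time_per_row: float) -> Tuple[int, int]:
    """Split `rows` between CPU and GPU so their predicted times are balanced.

    Assumed model: t_cpu = r * cpu_time_per_row and
    t_gpu = (rows - r) * gpu_time_per_row (both per-row times positive).
    The makespan max(t_cpu, t_gpu) is smallest near r = rows * g / (c + g),
    so the two neighbouring integers are checked.
    """
    c, g = cpu_time_per_row, gpu_time_per_row
    r_real = rows * g / (c + g)
    candidates = sorted({min(rows, int(r_real)), min(rows, int(r_real) + 1)})
    best = min(candidates, key=lambda r: max(r * c, (rows - r) * g))
    return best, rows - best


def matmul_rows(a: Matrix, b: Matrix) -> Matrix:
    """Multiply the given rows of A by B with a plain triple loop."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def hybrid_matmul(a: Matrix, b: Matrix, cpu_time_per_row: float,
                  gpu_time_per_row: float) -> Matrix:
    """Row-partitioned product: top rows to the 'CPU', the rest to the 'GPU'."""
    r_cpu, _ = balanced_split(len(a), cpu_time_per_row, gpu_time_per_row)
    # A real system would run both shares concurrently on different devices;
    # here both run on the CPU to show that the partition preserves C = A * B.
    return matmul_rows(a[:r_cpu], b) + matmul_rows(a[r_cpu:], b)


print(balanced_split(1000, 3.0, 1.0))  # (250, 750): GPU assumed 3x faster per row
print(hybrid_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1.0, 1.0))
# [[19, 22], [43, 50]]
```

In an auto-tuned routine the per-row times would come from measurements taken at installation time rather than being passed in by hand.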

Apr, 16

Dynamic Instrumentation and Optimization for GPU Applications

Parallel architectures like GPUs are a tantalizing compute fabric for performance-hungry developers. While GPUs enable order-of-magnitude performance increases in many data-parallel application domains, writing efficient code that actually realizes those increases is a non-trivial endeavor, typically requiring developers to exercise specialized architectural features exposed directly in the programming model. Achieving good performance on GPUs […]

CUDA

Apr, 16

New Efficient Method To Solve Longest Overlap Region Problem For Noncoding DNA Sequence

Early hardware limitations of the GPU (lack of synchronization primitives and limited memory caching mechanisms) can make GPU-based computation inefficient, while emerging DNA sequencing technologies open up more opportunities for molecular biology. This paper presents the issues involved in a parallel implementation of the longest overlap region problem on a multiprocessor GPU using the Compute Unified Device Architecture […]

CUDA

Apr, 16

A Way For Accelerating The DNA Sequence Reconstruction Problem By CUDA

Traditionally, we use the shotgun method to cut a DNA sequence into pieces and then reconstruct the original sequence from those pieces; this is a widely used approach to DNA assembly. Emerging DNA sequencing technologies open up more opportunities for molecular biology. This paper introduces a new method to improve the […]

CUDA
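The abstract does not detail the new method, so the following Python sketch shows only the classic greedy overlap-merge baseline for reconstructing a sequence from shotgun reads: repeatedly merge the pair of reads with the longest suffix-prefix overlap. It assumes error-free reads, the function names are hypothetical, and greedy merging is not guaranteed to recover the true sequence when reads are repetitive. The all-pairs overlap search is the costly step and the natural candidate for parallelization.

```python
from itertools import permutations
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str]) -> str:
    """Greedily merge the pair of reads with the largest overlap until one remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Reads contained in another read add no information.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # All ordered pairs: this O(n^2) overlap search dominates the run time.
        for i, j in permutations(range(len(reads)), 2):
            k = overlap(reads[i], reads[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ATTAGACCTGCCGGAATAC
```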

Apr, 14

Fast Burrows Wheeler Compression Using CPU and GPU

In this paper, we present an all-core implementation of the Burrows-Wheeler compression algorithm that exploits all the computing resources of a system. Our focus is to provide significant benefit to everyday users on common end-to-end applications by exploiting the parallelism of the multiple CPU cores and many-core GPUs in their machines. The all-core framework is suitable for […]

CUDA
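For reference, the transform at the heart of Burrows-Wheeler compression can be sketched naively in Python by sorting all rotations of the input. This roughly O(n^2 log n) version is for illustration only and is not the paper's all-core implementation; practical compressors build the transform from a suffix array and follow it with move-to-front and entropy coding stages (as bzip2 does).

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # a unique end marker makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Rebuild the sorted-rotation table column by column, then pick the
    rotation that ends with the sentinel (the original text)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


encoded = bwt("banana")
print(encoded)               # annb$aa
print(inverse_bwt(encoded))  # banana
```

The output groups equal characters together ("nn", "aa"), which is what makes the later move-to-front and entropy coding stages effective.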

Apr, 14

Scheduling Dataflow Execution Across Multiple Accelerators

Dataflow execution engines such as MapReduce, DryadLINQ and PTask have enjoyed success because they simplify development for a class of important parallel applications. Expressing the computation as a dataflow graph allows the runtime, and not the programmer, to own problems such as synchronization, data movement and scheduling – leveraging dynamic information to inform strategy and […]

CUDA
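To make the idea of a runtime that owns scheduling concrete, here is a hypothetical greedy list scheduler in Python: tasks of a dataflow DAG are released in topological order, and each ready task is placed on whichever accelerator would finish it earliest. It ignores data-transfer costs and uses made-up per-device task times; it is a generic earliest-finish-time heuristic, not the scheduler described in the paper.

```python
from collections import deque
from typing import Dict, List, Tuple


def schedule(deps: Dict[str, List[str]],
             cost: Dict[str, Dict[str, float]],
             devices: List[str]) -> Dict[str, Tuple[str, float, float]]:
    """Greedy earliest-finish-time list scheduling of a dataflow DAG.

    deps[t] lists the tasks whose outputs t consumes (every task must be a key);
    cost[t][d] is the assumed run time of task t on device d. Transfer costs are
    ignored. Returns task -> (device, start, finish).
    """
    indegree = {t: len(preds) for t, preds in deps.items()}
    consumers: Dict[str, List[str]] = {t: [] for t in deps}
    for t, preds in deps.items():
        for p in preds:
            consumers[p].append(t)

    ready = deque(t for t in deps if indegree[t] == 0)
    device_free = {d: 0.0 for d in devices}
    placed: Dict[str, Tuple[str, float, float]] = {}
    while ready:
        t = ready.popleft()
        # A task may start once all of its inputs have been produced.
        inputs_ready = max((placed[p][2] for p in deps[t]), default=0.0)
        dev = min(devices,
                  key=lambda d: max(device_free[d], inputs_ready) + cost[t][d])
        start = max(device_free[dev], inputs_ready)
        finish = start + cost[t][dev]
        placed[t] = (dev, start, finish)
        device_free[dev] = finish
        # Release consumers whose inputs are now all available.
        for c in consumers[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(placed) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return placed


deps = {"load": [], "fft": ["load"], "filter": ["load"], "merge": ["fft", "filter"]}
times = {"load": 1.0, "fft": 4.0, "filter": 4.0, "merge": 1.0}
cost = {t: {"gpu0": s, "gpu1": s} for t, s in times.items()}
for task, slot in schedule(deps, cost, ["gpu0", "gpu1"]).items():
    print(task, slot)
```

Here `fft` and `filter` land on different accelerators and run concurrently, so `merge` can start at time 5.0 instead of 9.0.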

Apr, 14

A First Order Primal-Dual Algorithm for Nonconvex TV^q Regularization

We propose an efficient first-order primal-dual method for solving variational problems with nonconvex regularization such as TV^q. It is based on the recent idea in [1] to reformulate an existing primal-dual algorithm for convex optimization using Moreau’s identity. A systematic comparison to recent state-of-the-art algorithms for nonconvex optimization (iteratively reweighted l1 […]

CUDA
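For reference, Moreau's identity, which the abstract cites, is shown below for a proper, convex, lower semicontinuous function f with convex conjugate f^*; its second form lets the dual proximal step of a primal-dual method be computed from the proximal operator of f itself. The TV^q term is written in a common discrete form, which is an assumption here. For nonconvex f the identity does not hold in general, and the abstract does not say how the reformulation handles that case.

```latex
\begin{aligned}
\operatorname{prox}_{\tau f}(x) &= \arg\min_{y}\; f(y) + \tfrac{1}{2\tau}\lVert y - x\rVert_2^{2}, \quad \tau > 0,\\
x &= \operatorname{prox}_{\tau f}(x) + \tau\,\operatorname{prox}_{\tau^{-1} f^{*}}(x/\tau),\\
\operatorname{prox}_{\sigma f^{*}}(y) &= y - \sigma\,\operatorname{prox}_{\sigma^{-1} f}(y/\sigma),\\
\mathrm{TV}^{q}(u) &= \textstyle\sum_{i} \lVert (\nabla u)_i \rVert_2^{q}, \quad 0 < q < 1.
\end{aligned}
```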

Apr, 14

An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA

The paper presents a highly efficient way of simulating the dynamic behavior of deformable objects by means of the finite element method (FEM) with computations performed on Graphics Processing Units (GPU). The presented implementation reduces bottlenecks related to memory accesses by grouping the necessary data per node pairs, in contrast to the classical way done […]

CUDA

Apr, 14

A New Architecture for Games and Simulations Using GPUs

Multi-threaded architectures are the current trend for both PCs (multi-core CPUs and GPUs) and game consoles such as the Microsoft Xbox 360 and Sony PlayStation 3. GPUs (Graphics Processing Units) have evolved into extremely powerful and flexible processors, which allows them to be used for general data processing. This advantage can be used in game development to optimize […]

CUDA

•

OpenGL

Apr, 13

Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities

Graphics processing units (GPUs) are important components of modern computing devices, used not only for graphics rendering but also for efficient parallel computation. However, their security problems have been largely ignored despite their importance and popularity. In this paper, we first perform an in-depth security analysis of GPUs to detect security vulnerabilities. We observe that contemporary, widely used GPUs, both […]

CUDA

•

OpenCL
