high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Modern applications including graphics, multimedia, web search, and data analytics not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators […]

OpenCL

Dec, 10

Transforming C OpenMP Programs for Verification in CIVL

There are numerous ways to express parallelism, which can make it challenging for developers to verify these programs. Many tools only target a single dialect, but the Concurrency Intermediate Verification Language (CIVL) targets MPI, Pthreads, and CUDA. CIVL provides a general concurrency model that can represent programs in a variety of concurrency dialects. CIVL […]

CUDA • OpenCL

Dec, 10

MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs

High performance graph analytics are critical for a long list of application domains. In recent years, the rapid advancement of many-core processors, in particular graphical processing units (GPUs), has sparked a broad interest in developing high performance parallel graph programs on these architectures. However, the SIMT architecture used in GPUs places particular constraints on both […]

CUDA

Dec, 10

Join Execution Using Fragmented Columnar Indices on GPU and MIC

The paper describes an approach to parallel natural join execution on computing clusters with GPU and MIC coprocessors. This approach is based on a decomposition of the natural join relational operator using column indices and domain-interval fragmentation. This decomposition allows the resource-intensive relational operators to be executed in parallel without data transfers. All column index fragments are […]

CUDA

Dec, 10

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach […]

CUDA

Dec, 10

High Performance Histograms on SIMT and SIMD Architectures

Using the histogram procedure, this work studies performance-determining factors in computing in parallel on SIMD and SIMT devices. Modern graphics processing units (GPUs) support SIMT, multiple threads running the same instruction, whereas central processing units (CPUs) use SIMD, in which one instruction operates on multiple operands. As part of this work, a cross-technology framework […]

CUDA • OpenCL

Dec, 9

A Parallel Solver for Markov Decision Process in Crowd Simulations

Classic path-finding algorithms are not adequate in real-world path planning, where environment information is incomplete or dynamic, and Markov Decision Processes (MDPs) have been used as an alternative. The problem with the MDP formalism is that its state space grows exponentially with the number of domain variables, and its inference methods grow with the […]

CUDA

Dec, 8

A Semi-Automated Tool Flow for Roofline Analysis of OpenCL Kernels on Accelerators

We propose a tool-flow methodology that can be applied to analyze and track the performance of OpenCL applications on heterogeneous platforms. Using a case study on a datacenter representative workload, we evaluate our tool flow on three distinct heterogeneous platforms and demonstrate how it can be employed more widely to provide insight and track attainable […]

OpenCL

Dec, 8

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from […]

CUDA

Dec, 8

Towards Memory-Efficient Answering of Tree-Shaped SPARQL Queries using GPUs

We present an idea for efficient query answering over an RDF dataset that employs a consumer-grade graphics card for efficient computation. We consider tree-shaped SPARQL queries and static datasets, to facilitate data mining over RDF graphs in warehouse-like setups. Reasons to see the poster: a) presentation of the approach with examples; b) possibility of discussion […]

OpenCL

Dec, 8

Scaling Deep Learning on Multiple In-Memory Processors

Deep learning methods are proven to be state-of-the-art in addressing many challenges in machine learning domains. However, this comes at the cost of high computational requirements and energy consumption. The emergence of Processing In Memory (PIM) with die-stacking technology presents an opportunity to speed up deep learning computation and reduce energy consumption by providing low-cost […]

OpenCL

Dec, 8

Nonlinear Dynamic Analysis Efficiency by Using a GPU Parallelization

A graphics processing unit (GPU) parallelization approach was implemented to improve the efficiency of nonlinear dynamic analysis. The GPU parallelization approach sped up the computation of implicit time integration and reduced total calculation time. In addition, a parallel equation solver is introduced to solve the equation system. Numerical examples of reinforced concrete (RC) frames were […]

CUDA
