high performance computing on graphics processing units: hgpu.org

Posts

Feb, 23

First Int. Workshop on Pattern Recognition (IWPR 2016), 2016

Publication: Submitted and accepted papers will be published by SPIE. Indexing: Scopus, Ei Compendex, ISI, Inspec, Google Scholar. Sponsored by: University of Toyama, Japan; Hosei University, Japan; Kogakuin University, Japan; Teikyo University, Japan; North Carolina Agricultural and Technical State University, USA; Hainan University, China. Keynote Speakers: Prof. Chiharu Ishii, Hosei University, Japan; Prof. Genci Capi, University […]

Feb, 23

Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL

OpenCL is a portable interface that can be used to program cluster nodes with heterogeneous compute devices. The OpenCL specification tightly binds its workflow abstraction, or "command queue", to a specific device for the entire program. For best performance, the user has to find the ideal queue-device mapping at command queue creation time, an effort […]

OpenCL

Feb, 23

VirtCL: a framework for OpenCL device abstraction and management

The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running […]

OpenCL

Feb, 23

Deep Learning At Scale and At Ease

Recently, deep learning techniques have enjoyed success in various multimedia applications, such as image classification and multi-modal data analysis. Large deep learning models are developed for learning rich representations of complex data. There are two challenges to overcome before deep learning can be widely adopted in multimedia and other applications. One is usability, namely the […]

CUDA

Feb, 23

Sparse Convex Optimization on GPUs

Convex optimization is a fundamental mathematical framework used for general problem solving. The computational time taken to optimize problems formulated as Linear Programming, Integer Linear Programming or Quadratic Programming has an immediate impact on countless application fields, and it is critical to determining which problems we will be able to solve in the future. Since […]

CUDA

Feb, 23

Togpu: Automatic Source Transformation from C++ to CUDA using Clang/LLVM

Parallel processing using GPUs provides substantial increases in algorithm performance across many disciplines. As a result serial algorithms are commonly translated to parallel algorithms written in CUDA or OpenCL. To perform this translation a user must first overcome various barriers to entry. These obstacles change depending on the user but in general may include learning […]

CUDA

Feb, 19

Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems

General purpose GPU based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. A key feature of our scheme is that […]

OpenCL

Feb, 19

LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition

Personal mobile devices offer a growing variety of personalized services that enrich considerably the user experience. This is made possible by increased access to personal information, which to a large extent is extracted from user email messages and archives. There are, however, two main issues. First, currently these services can be offered only by large […]

OpenCL

Feb, 19

A GPU-based Large-scale Monte Carlo Simulation Method for Systems with Long-range Interactions

In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures. It adopts the sequential updating scheme of Metropolis algorithm, and makes no approximation in the computation of energy. It […]

CUDA

Feb, 19

HeSP: a simulation framework for solving the task scheduling-partitioning problem on heterogeneous architectures

In this paper we describe HeSP, a complete simulation framework to study a general task scheduling-partitioning problem on heterogeneous architectures, which treats recursive task partitioning and scheduling decisions on equal footing. Considering recursive partitioning as an additional degree of freedom, tasks can be dynamically partitioned or merged at runtime for each available processor type, exposing […]

CUDA

OpenCL

Feb, 19

Gravitational wave astrophysics, data analysis and multimessenger astronomy

This paper reviews gravitational wave sources and their detection. One of the most exciting potential sources of gravitational waves are coalescing binary black hole systems. They can occur on all mass scales and be formed in numerous ways, many of which are not understood. They are generally invisible in electromagnetic waves, and they provide opportunities […]

CUDA

Feb, 19

The 4th International Symposium on Computing and Networking

Following the success of past ICNC conferences, 2010 in Hiroshima, 2011 in Osaka, and 2012 in Okinawa, and CANDAR symposiums 2013 in Matsuyama, 2014 in Shizuoka, 2015 in Sapporo, CANDAR 2016 will be held in Hiroshima, Japan. CANDAR 2016 will serve as a forum for exchanging the latest findings and experiences ranging from theoretical research […]
