high performance computing on graphics processing units: hgpu.org

Posts

Mar, 15

DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model

The rising pressure for simultaneously improving performance and reducing power is driving more diversity into all aspects of computing devices. An algorithm that is well-matched to the target hardware can run multiple times faster and more energy efficiently than one that is not. The problem is complicated by the fact that a program’s input also […]

CUDA • OpenCL

Mar, 15

Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code

The current trend in next-generation exascale systems goes towards integrating a wide range of specialized (co-)processors into traditional supercomputers. However, the integration of different specialized devices increases the degree of heterogeneity and the complexity in programming such systems. Due to the efficiency of heterogeneous systems in terms of Watt and FLOPS per surface […]

OpenCL

Mar, 15

Melia: A MapReduce Framework on OpenCL-based FPGAs

MapReduce, originally developed by Google for search applications, has recently become a popular programming framework for parallel and distributed environments. This paper presents an energy-efficient architecture design for MapReduce on Field Programmable Gate Arrays (FPGAs). The major goal is to enable users to program FPGAs with simple MapReduce interfaces, and meanwhile to embrace automatic performance […]

OpenCL

Mar, 15

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs

Matrix factorization (MF) is employed by many popular algorithms, e.g., collaborative filtering. The emerging GPU technology, with massively multicore and high intra-chip memory bandwidth but limited memory capacity, presents an opportunity for accelerating MF much further when appropriately exploiting the GPU architectural characteristics. This paper presents cuMF, a CUDA-based matrix factorization library that implements memory-optimized […]

CUDA

Mar, 14

2nd IEEE International Conference on Computer and Communications (ICCC), 2016

Submission Date: Before July 1. History: Good News! All papers from ICCC 2015 have been included in IEEE Xplore. Supported by: ICCC 2016 is hosted by IEEE and Sichuan Institute of Electronics, co-organized by Southwest Jiaotong University and Xihua University. Publication: All accepted papers must be written in English and will be published into conference […]

Mar, 14

The First Int. Conference on Multimedia and Image Processing (ICMIP), 2016

ICMIP 2016 is organized by University of Brunei Darussalam, Brunei Darussalam. Publication: After a careful reviewing process, all accepted papers will be published in the Conference Proceedings, and sent to be reviewed by EI Compendex. Invited Speakers from International Prestigious University: Prof. Amine Bermak, IEEE Fellow, Hong Kong University of Science and Technology, Hong Kong […]

Mar, 14

6th Int. Workshop on Computer Science and Engineering (WCSE), 2016

All accepted papers of WCSE 2016 will be published in the conference proceedings, which will be indexed by EI & Scopus. Keynote & Plenary Speakers: Prof. Hayato Ohwada, Tokyo University of Science, Japan; Prof. Taku Harada, Tokyo University of Science, Japan; Prof. Akiko Aizawa, National Institute of Informatics, Japan; Prof. Hiroyuki Nishiyama, Tokyo University of Science, Japan. Conference Program […]

Mar, 12

Machine Learning at the Limit

Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining or approaching the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. In roofline design, one exposes […]

CUDA

Mar, 12

SGO: An ultrafast engine for atomic structure global optimization by differential evolution

This paper presents a fast method for global search of atomic structures. The structures global optimization (SGO) engine consists of a high-efficiency differential evolution algorithm, accelerated local relaxation methods and an ultrafast density functional theory plane-wave code run on GPU machines. It can search the global minimum configuration of crystals, two-dimensional materials and quantum clusters […]

Mar, 12

A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the […]

Mar, 12

Clinically applicable Monte Carlo-based biological dose optimization for the treatment of head and neck cancers with spot-scanning proton therapy

Purpose: To demonstrate the feasibility of fast Monte Carlo (MC) based inverse biological planning for the treatment of head and neck tumors in spot-scanning proton therapy. Methods: Recently, a fast and accurate Graphics Processor Unit (GPU)-based MC simulation of proton transport was developed and used as the dose calculation engine in a GPU-accelerated IMPT optimizer. […]

Mar, 12

Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models

The move from single core and processor systems to multi-core and many-processors systems comes with the requirement of implementing computations in a way that can utilize these multiple units efficiently. This task of writing efficient multi-threaded algorithms will not be possible without improving programming languages and compilers to provide the mechanisms to do so. Computer aided mathematical modeling […]

OpenCL
