high performance computing on graphics processing units: hgpu.org

Posts

Mar, 25

Scalable Breadth-First Search on a GPU Cluster

On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction […]

CUDA

Mar, 25

Optimization of Hierarchical Matrix Computation on GPU

The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more […]

CUDA

Mar, 25

A development of an accelerator board dedicated for multi-precision arithmetic operations and its application to Feynman loop integrals II

Evaluating a wide variety of Feynman diagrams with multi-loop integrals and physical parameters, and comparing the results with high-energy experiments, is expected to help investigate new physics beyond the Standard Model. We have been developing a direct computation method for multi-loop integrals of Feynman diagrams. One of the features of our method is that we adopt […]

OpenCL

Mar, 25

MALBEC: a new CUDA-C ray-tracer in General Relativity

A new CUDA-C code for tracing orbits around non-charged black holes is presented. This code is named MALBEC, and takes advantage of graphics processing units and the CUDA platform to track the geodesic motion of null and timelike test particles in Schwarzschild and Kerr spacetimes. Additionally, a new general set of equations that […]

CUDA

Mar, 25

Accelerating CNN inference on FPGAs: A Survey

Convolutional Neural Networks (CNNs) are currently adopted to solve an ever greater number of problems, ranging from speech recognition to image classification and segmentation. The large amount of processing required by CNNs calls for dedicated and tailored hardware support methods. Moreover, CNN workloads have a streaming nature, well suited to reconfigurable hardware architectures such as […]

OpenCL

Mar, 22

The VOLNA-OP2 Tsunami Code (Version 1.0)

In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite volume non-linear shallow water equations (NSWE) solver built on the OP2 domain specific language for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high performance computing platforms: CPUs, the Intel Xeon Phi, and GPUs. This is […]

CUDA

Mar, 22

FPGA in HPC: High Level Synthesis of OpenCL kernels for Molecular Dynamics

The overall goal of this thesis is to evaluate the feasibility of FPGA-based computer systems in HPC. This work is performed within ExaNeSt, an EU-funded project that aims to develop and prototype energy-efficient solutions for the production of exascale-level supercomputers. As a matter of fact, the current computer architectures need to be […]

OpenCL

Mar, 22

A multi-agent architecture for scheduling of high performance services in a GPU cluster

Nowadays, clusters containing multiple GPU nodes are widely used to execute high-performance computing applications. Diverse disciplines use these clusters to improve the performance of several services that consume high computational resources. The challenge of executing high-performance computing applications becomes harder when the applications are executed concurrently and each one of them may demand multiple GPU […]

Mar, 22

TBD: Benchmarking and Analyzing Deep Neural Network Training

The recent popularity of deep neural networks (DNNs) has generated a lot of research interest in performing DNN-related computation efficiently. However, the primary focus is usually very narrow and limited to (i) inference — i.e. how to efficiently execute already trained models and (ii) image classification networks as the primary benchmark for evaluation. Our primary […]

CUDA

Mar, 22

MACC: An OpenACC Transpiler for Automatic Multi-GPU Use

Graphics Processing Units (GPUs) perform the majority of computations in state-of-the-art supercomputers. Programming these GPUs is often assisted using a programming model such as (amongst others) the directive-driven OpenACC. Unfortunately, OpenACC (and other similar models) are incapable of automatically targeting and distributing work across several GPUs, which decreases productivity and forces needless manual labor upon […]

Mar, 18

International Conference on Biomedicine & Pharmacotherapy, 2018

The International Conference on Biomedicine & Pharmacotherapy will be held August 6-7, 2018 in Osaka, Japan. The conference focuses on topics such as Biomedicine, Biomedical Statistics, Biomedical Diagnosis, Frontiers in Biomedicine, Industrial Pharmacy, Pharmacotherapy, Molecular Biomedicine, Computational Biomedicine, Tissue Engineering, Medical Devices, Biomedical Model, Personalized Medicine, Biomedical Technology, Nanotechnology, Pharmaceutical Sciences, […]

Mar, 18

8th International Workshop on Computer Science and Engineering (WCSE’18), 2018

Meeting time: June 28-30, 2018. Meeting place: 1880 New Petchburi Road, Bangkok 10310, Thailand. Organized by the Science and Engineering Institute and co-organized by Bauman Moscow State Technical University (Russia), Tokyo University of Science (Japan), and China Agricultural University, the 8th International Workshop on Computer Science and Engineering (WCSE 2018) will be held in Bangkok, Thailand, during June 28-30, 2018. […]

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
