high performance computing on graphics processing units: hgpu.org

Posts

Aug, 1

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such […]

CUDA

Aug, 1

Deep Architectures for Neural Machine Translation

It has been shown that increasing model depth improves the quality of neural machine translation. However, different architectural variants to increase model depth have been proposed, and so far, there has been no thorough comparative study. In this work, we describe and evaluate several existing approaches to introduce depth in neural machine translation. Additionally, we […]

CUDA

Aug, 1

A GPU Based Memory Optimized Parallel Method For FFT Implementation

FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing, digital image processing and so on. However, in practice, FFT can become a bottleneck for processing efficiency, especially in remote sensing, where large amounts of data need to be processed with FFT. So shortening the FFT computation […]

CUDA

Aug, 1

Directive-Based Partitioning and Pipelining for Graphical Processing Units

The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication (copying […]

CUDA • OpenCL

Jul, 25

On Simplifying and Optimizing Programs for Heterogeneous Computing Systems

Today, with the growth of highly parallel and heterogeneous architectures, systems composed of a combination of multicore CPUs, GPUs, and accelerators are becoming more common in HPC. Although heterogeneous architectures bring considerable benefits from a performance and energy perspective, they also make application development very challenging, introducing the need for different parallel programming paradigms. Recently, […]

OpenCL

Jul, 25

FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems

The availability of digital X-ray detectors, together with advances in reconstruction algorithms, creates an opportunity for bringing 3D capabilities to conventional radiology systems. The downside is that reconstruction algorithms for non-standard acquisition protocols are generally based on iterative approaches that involve a high computational burden. The development of new flexible X-ray systems could benefit from […]

CUDA • OpenCL

Jul, 25

ParTeCL: parallel testing using OpenCL

With the growing complexity of software, the number of test cases needed for effective validation is extremely large. Executing these large test suites is expensive and time consuming, putting an enormous pressure on the software development cycle. In previous work, we proposed using Graphics Processing Units (GPUs) to accelerate test execution by running test cases […]

OpenCL

Jul, 25

OpenCL Library for Parallel Graph Search Algorithms

Graphs are a popular data structure for representing large amounts of data and the relationships between them. As serial hardware hits the wall in terms of computation speed, a lot of research has recently gone into parallelizing graph search algorithms such as Breadth First Search or the Single Source Shortest Path problem hence make […]

OpenCL

Jul, 25

Memory-Efficient Implementation of DenseNets

The DenseNet architecture is highly computationally efficient as a result of feature reuse. However, a naive DenseNet implementation can require a significant amount of GPU memory: If not properly managed, pre-activation batch normalization and contiguous convolution operations can produce feature maps that grow quadratically with network depth. In this technical report, we introduce strategies to […]

CUDA

Jul, 23

International Conference on Intelligent Autonomous Systems (ICIAS), 2018

The conference will be held in Singapore on March 1-3, 2018. The theme of ICIAS2018 is “Frontier of intelligent autonomous systems”, reflecting the ever-growing interest in research, development and applications in the dynamic and exciting areas of robotics. It also provides a premier interdisciplinary platform for researchers, practitioners and educators to present and discuss […]

Jul, 23

International Conference on Robotics and Intelligent System (ICRIS), 2018

Publication: All submissions will be peer reviewed by 2-3 reviewers, and accepted papers, after registration, will be published in the International Conference Proceedings Series by ACM, which will be indexed by Ei Compendex and Scopus. Submission: ICRIS 2018 is now accepting manuscript submissions. Please submit your full paper to us: icris@academic.net

Jul, 22

Scalability Study of Deep Learning Algorithms in High Performance Computer Infrastructures

Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
