high performance computing on graphics processing units: hgpu.org

Posts

Dec, 28

A Survey of FPGA Based Neural Network Accelerator

Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas such as image, speech, and video recognition. However, the high computation and storage complexity of neural-network-based algorithms makes them difficult to apply. CPU platforms […]

Dec, 28

Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms

Integrated CPU-GPU architectures provide excellent acceleration capabilities for data-parallel applications on embedded platforms while meeting size, weight, and power (SWaP) requirements. However, sharing main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by the GPU. For example, in the NVIDIA […]

CUDA

Dec, 24

Pass a Pointer: Exploring Shared Virtual Memory Abstractions in OpenCL Tools for FPGAs

Heterogeneous CPU-FPGA systems are gaining momentum in the embedded systems sector and in the data center market. While the programming abstractions for implementing the data transfer between CPU and FPGA (and vice versa) that are available in today’s commercial programming tools are well-suited for certain types of applications, the CPU-FPGA communication for applications that share […]

OpenCL

Dec, 24

Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

Heterogeneous systems have very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool for simplifying the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To […]

OpenCL

Dec, 24

An MPI-Based Python Framework for Distributed Training with Keras

We present a lightweight Python framework for distributed training of neural networks on multiple GPUs or CPUs. The framework is built on the popular Keras machine learning library. The Message Passing Interface (MPI) protocol is used to coordinate the training process, and the system is well suited for job submission at supercomputing sites. We detail […]

CUDA

Dec, 24

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit exploited GPUs to accelerate the training process. This has been primarily achieved by aggressive improvements in parallel hardware as well as through sophisticated software frameworks like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware and software have the potential to significantly enhance the […]

CUDA

Dec, 24

GAMER-2: a GPU-accelerated adaptive mesh refinement code — accuracy, performance, and scalability

We present GAMER-2, a GPU-accelerated adaptive mesh refinement (AMR) code for astrophysics. It provides a rich set of features, including adaptive time-stepping, several hydrodynamic schemes, magnetohydrodynamics, self-gravity, particles, star formation, chemistry and radiative processes with GRACKLE, data analysis with yt, and a memory pool for efficient object allocation. GAMER-2 is fully bitwise reproducible. For the performance […]

CUDA

Dec, 19

Molecular dynamics recipes for genome research

Molecular dynamics (MD) simulation allows one to predict the time evolution of a system of interacting particles. It is widely used in physics, chemistry, and biology to address specific questions about the structural properties and dynamical mechanisms of model systems. MD has had great success in genome research, as it proved to be beneficial in […]

CUDA • OpenCL

Dec, 19

Accelerated Sparse Matrix Operations in Nonlinear Least Squares Solvers

This thesis focuses on data structures for sparse block matrices and the associated algorithms for performing linear algebra operations that I have developed. Sparse block matrices occur naturally in many key problems, such as Nonlinear Least Squares (NLS) on graphical models. NLS is used by, e.g., Simultaneous Localization and Mapping (SLAM) in robotics, Bundle Adjustment […]

OpenCL

Dec, 19

OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing

To meet the requirements of railway track point cloud processing, an OpenCL-accelerated Point Feature Histogram method is proposed that uses heterogeneous computing to improve the low computational efficiency. In line with the parallel computing characteristics of OpenCL, the data structure for point cloud storage is reconfigured. With the kernel performance analysis by CodeXL, the data reading […]

OpenCL

Dec, 19

Tactics to Directly Map CNN graphs on Embedded FPGAs

Deep Convolutional Neural Networks (CNNs) are the state of the art in image classification. Since CNN feed-forward propagation involves highly regular parallel computation, it benefits from a significant speed-up when running on fine-grained parallel programmable logic devices. As a consequence, several studies have proposed FPGA-based accelerators for CNNs. However, because of the large computational power required […]

Dec, 19

Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors

CPU-based many-core processors present an alternative to multicore CPU and GPU processors. In particular, the 93-Petaflops Sunway supercomputer, built from clustered many-core processors, has opened a new era for high performance computing that does not rely on GPU acceleration. However, memory bandwidth remains the main challenge for these architectures. This motivates our endeavor for optimizing […]

OpenCL


