
Posts

Oct, 4

An OpenCL 3D FFT for Molecular Dynamics Simulations on Multiple FPGAs

3D FFTs are used to accelerate MD electrostatic force computations but are difficult to parallelize due to their communication requirements. We present a distributed OpenCL 3D FFT implementation on Intel Stratix 10 FPGAs for grids up to 128^3. We use FPGA hardware features such as HBM2 memory and multiple 100 Gbps links to provide scalable memory […]
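
The paper's implementation targets OpenCL on FPGAs, but the row-column decomposition underneath is the same everywhere: a 3D FFT factors into batched 1D FFTs along one axis, a transpose, and a repeat for the remaining axes. A minimal sketch of one such pass, using CUDA's cuFFT purely for illustration (the function name and array arguments are placeholders, not the paper's code):

```cuda
#include <cufft.h>

// One pass of the row-column decomposition: batched 1D FFTs of length nz
// along the contiguous axis, one transform per (x, y) line of the grid.
void fft_along_z(cufftComplex *d_data, int nx, int ny, int nz) {
  cufftHandle plan;
  cufftPlanMany(&plan, 1, &nz,
                NULL, 1, nz,          // input: unit stride, lines nz apart
                NULL, 1, nz,          // output: same layout
                CUFFT_C2C, nx * ny);  // nx*ny independent 1D transforms
  cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
  cufftDestroy(plan);
  // ... transpose so the next axis becomes contiguous, then repeat for y
  // and x. In the distributed setting those transposes become the
  // all-to-all exchanges that make multi-device 3D FFTs communication-bound.
}
```
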
Oct, 4

LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems

Computationally intensive applications usually consist of multiple nested or flattened loops. These loops are the main building blocks of the applications and each embodies a specific execution pattern. To reduce the running time of the loops, developers analyze the loops in the code and try to parallelize them on the target […]
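
As context for what parallelizing a loop on a GPU target means, the canonical transformation maps each iteration of a flattened loop to one thread. A minimal sketch (kernel name and loop body are placeholders):

```cuda
// Serial form:  for (int i = 0; i < total; ++i) out[i] = 2.0f*in[i] + 1.0f;
__global__ void flat_loop(const float *in, float *out, int total) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one iteration per thread
  if (i < total)                                  // guard the ragged tail
    out[i] = 2.0f * in[i] + 1.0f;                 // placeholder loop body
}
// launch: flat_loop<<<(total + 255) / 256, 256>>>(d_in, d_out, total);
```
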
Oct, 4

Flexible Performant GEMM Kernels on GPUs

General Matrix Multiplication (GEMM) kernels take a central place in high-performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA’s Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming, which implies low programmer productivity, or using libraries that only offer a limited set of […]
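
For reference, the low-level route that such libraries and kernel generators abstract over is CUDA's warp-level WMMA API, where one warp computes one 16x16 output tile on the Tensor Cores. A minimal sketch, assuming M, N, K are multiples of 16, A row-major and B column-major (this illustrates the general API, not this paper's kernels):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A*B on the Tensor Cores.
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
  int tm = blockIdx.y;  // tile row of C
  int tn = blockIdx.x;  // tile column of C
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
  wmma::fill_fragment(acc, 0.0f);
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a, A + tm * 16 * K + k, K);  // A tile (tm, k/16)
    wmma::load_matrix_sync(b, B + tn * 16 * K + k, K);  // B tile (k/16, tn)
    wmma::mma_sync(acc, a, b, acc);                     // acc += a * b
  }
  wmma::store_matrix_sync(C + tm * 16 * N + tn * 16, acc, N,
                          wmma::mem_row_major);
}
// launch: wmma_gemm<<<dim3(N / 16, M / 16), 32>>>(dA, dB, dC, M, N, K);
```
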
Oct, 4

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM challenging. Modern GPUs include Tensor Core Units (TCUs), which specialize in dense matrix multiplication. Our aim is to re-purpose TCUs for sparse matrices. […]
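
The abstract is truncated before the scheme is described, but a common way to re-purpose dense-tile hardware for sparse inputs is to store only the nonzero 16x16 tiles of each matrix as dense blocks and multiply matching tile pairs densely, which is exactly the product a TCU supplies. A CPU-side sketch of that idea (types, names, and the naive all-pairs scan are illustrative, not necessarily this paper's method):

```cpp
#include <array>
#include <map>
#include <utility>

// Store only the nonzero 16x16 tiles of each sparse matrix as dense blocks.
using Tile = std::array<float, 16 * 16>;
using TileMap = std::map<std::pair<int, int>, Tile>;  // (tileRow, tileCol)

// Multiply matching tile pairs with a dense 16x16 product -- the operation a
// Tensor Core performs in hardware. The all-pairs scan is O(|A| * |B|); a
// real implementation would index B's tiles by tile row instead.
TileMap tile_spgemm(const TileMap &A, const TileMap &B) {
  TileMap C;
  for (const auto &[ka, ta] : A)
    for (const auto &[kb, tb] : B) {
      if (ka.second != kb.first) continue;  // inner tile indices must match
      Tile &tc = C[{ka.first, kb.second}];  // zero-initialized on first touch
      for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
          for (int k = 0; k < 16; ++k)
            tc[i * 16 + j] += ta[i * 16 + k] * tb[k * 16 + j];
    }
  return C;
}
```
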
Sep, 27

Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters

Graphics processing units (GPUs) offer strong floating-point capability and high memory bandwidth for data-parallel workloads and have been widely used in high-performance computing (HPC). The Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model for GPUs that reduces the complexity of programming them. Programmable GPUs are becoming […]
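
The core communication pattern in hybrid MPI+CUDA CFD codes is the halo (ghost-cell) exchange between neighboring subdomains. A minimal 1D sketch, assuming a field with one ghost cell at each end and no CUDA-aware MPI, so boundary values are staged through the host (function and variable names are illustrative):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Exchange one ghost cell with the left/right neighbors in a 1D domain
// decomposition. d_field has n entries: ghost cells at indices 0 and n-1,
// interior cells in [1, n-2].
void halo_exchange(float *d_field, int n, int left, int right, MPI_Comm comm) {
  float send[2], recv[2];
  cudaMemcpy(&send[0], d_field + 1,     sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&send[1], d_field + n - 2, sizeof(float), cudaMemcpyDeviceToHost);
  // right boundary -> right neighbor; left ghost <- left neighbor
  MPI_Sendrecv(&send[1], 1, MPI_FLOAT, right, 0,
               &recv[0], 1, MPI_FLOAT, left,  0, comm, MPI_STATUS_IGNORE);
  // left boundary -> left neighbor; right ghost <- right neighbor
  MPI_Sendrecv(&send[0], 1, MPI_FLOAT, left,  1,
               &recv[1], 1, MPI_FLOAT, right, 1, comm, MPI_STATUS_IGNORE);
  cudaMemcpy(d_field,         &recv[0], sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_field + n - 1, &recv[1], sizeof(float), cudaMemcpyHostToDevice);
}
```

With a CUDA-aware MPI build, the device pointers could be passed to MPI_Sendrecv directly, removing the staging copies.
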
Sep, 27

Extending High-Level Synthesis for Task-Parallel Programs

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycle compared with the traditional register-transfer-level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt […]
Sep, 27

Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems

It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems, and existing work […]
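
The standard technique behind such algorithms is mixed-precision iterative refinement: factorize once in low precision, then repeatedly compute the residual in the working precision and solve a cheap correction with the low-precision factors. A dense, unpivoted sketch in plain C++ (function names are illustrative; the paper targets sparse systems, where the factorization would be a sparse LU or Cholesky):

```cpp
#include <vector>

// LU factorization in float (low precision), in place, no pivoting.
void lu_factor(std::vector<float> &A, int n) {
  for (int k = 0; k < n; ++k)
    for (int i = k + 1; i < n; ++i) {
      A[i*n+k] /= A[k*n+k];
      for (int j = k + 1; j < n; ++j) A[i*n+j] -= A[i*n+k] * A[k*n+j];
    }
}

// Solve (LU) d = r in float, overwriting r with the correction d.
void lu_solve(const std::vector<float> &LU, std::vector<float> &r, int n) {
  for (int i = 1; i < n; ++i)                 // forward: L y = r
    for (int j = 0; j < i; ++j) r[i] -= LU[i*n+j] * r[j];
  for (int i = n - 1; i >= 0; --i) {          // backward: U d = y
    for (int j = i + 1; j < n; ++j) r[i] -= LU[i*n+j] * r[j];
    r[i] /= LU[i*n+i];
  }
}

// Refine in double (working precision) using the float factors.
std::vector<double> refine(const std::vector<double> &A,
                           const std::vector<double> &b,
                           const std::vector<float> &LU, int n, int iters) {
  std::vector<double> x(n, 0.0);
  for (int it = 0; it < iters; ++it) {
    std::vector<float> r(n);
    for (int i = 0; i < n; ++i) {             // r = b - A x, in double
      double s = b[i];
      for (int j = 0; j < n; ++j) s -= A[i*n+j] * x[j];
      r[i] = static_cast<float>(s);
    }
    lu_solve(LU, r, n);                       // cheap correction, in float
    for (int i = 0; i < n; ++i) x[i] += r[i]; // accumulate in double
  }
  return x;
}
```
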
Sep, 27

Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments

In this work, we adapt a reconfigurable computing system based on FPGA technologies to OpenCL programming environments. The reconfigurable system is part of a compute prototype of the MANGO European project that includes 96 FPGAs. To optimize its use and obtain maximum performance, it is essential to adapt it to heterogeneous systems programming […]
Sep, 27

RoadRunner: a fast and flexible exoplanet transit model

I present RoadRunner, a fast exoplanet transit model that can use any radially symmetric function to model stellar limb darkening while still being faster to evaluate than the analytical transit model for quadratic limb darkening by Mandel & Agol (2002). CPU and GPU implementations of the model are available in the PyTransit transit modelling package, […]
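
For context, the quadratic law of Mandel & Agol referenced above describes the stellar intensity profile as a function of mu = cos(theta); RoadRunner's contribution is allowing an arbitrary radially symmetric profile in its place. The quadratic law itself is simply (function name is illustrative):

```cpp
// Quadratic limb-darkening law (Mandel & Agol 2002):
//   I(mu) / I(1) = 1 - u1*(1 - mu) - u2*(1 - mu)^2,   mu = cos(theta),
// where mu = 1 at disk center and mu -> 0 at the stellar limb.
double quadratic_ld(double mu, double u1, double u2) {
  double m = 1.0 - mu;
  return 1.0 - u1 * m - u2 * m * m;
}
```
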
Sep, 20

Applications of Deep Neural Networks

Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies […]
Sep, 20

Designing Efficient Barriers and Semaphores for Graphics Processing Units

General-purpose GPU applications that use fine-grained synchronization to enforce ordering between many threads accessing shared data have become increasingly popular. Thus, it is imperative to create more efficient GPU synchronization primitives for these applications. Accordingly, in recent years there has been a push to establish a single, unified set of GPU synchronization primitives. However, unlike […]
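
As background on what such a primitive looks like, here is a classic sense-reversing global barrier built from atomics; it is a sketch only (not this paper's design), and it assumes all thread blocks are co-resident on the device, since otherwise spinning blocks can deadlock waiting for blocks that were never scheduled:

```cuda
__device__ unsigned int d_count = 0;           // blocks arrived in this phase
__device__ volatile unsigned int d_sense = 0;  // global phase flag

// Device-wide barrier: every thread of every resident block must call this.
__device__ void global_barrier(unsigned int nblocks) {
  __syncthreads();                             // whole block has arrived
  if (threadIdx.x == 0) {
    unsigned int local_sense = d_sense ^ 1u;   // phase we will wait for
    if (atomicAdd(&d_count, 1u) == nblocks - 1u) {
      d_count = 0;                             // last block resets the count
      __threadfence();                         // publish writes, then release
      d_sense = local_sense;
    } else {
      while (d_sense != local_sense) { }       // spin until released
    }
  }
  __syncthreads();                             // whole block leaves together
}
```

Since CUDA 9, cooperative groups offer a supported equivalent (a cooperative launch plus this_grid().sync()), which also guarantees the co-residency this sketch merely assumes.
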
Sep, 20

A Comparison of Optimal Scanline Voxelization Algorithms

This thesis presents a comparison between different algorithms for optimal scanline voxelization of 3D models. As the optimal scanline relies on line voxelization, three such algorithms were evaluated. These were Real Line Voxelization (RLV), Integer Line Voxelization (ILV) and a 3D Bresenham line drawing algorithm. RLV and ILV were both based on voxel traversal by […]
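
For reference, the 3D Bresenham variant compared here extends the 2D algorithm with a second error term: the line is stepped along its dominant axis, and each error term decides when to step the other two axes. A sketch of the x-dominant case (function name is illustrative; the y- and z-dominant cases permute the roles of the axes):

```cpp
#include <cstdlib>

// Visit every voxel of a 3D Bresenham line with |dx| >= |dy| and |dx| >= |dz|.
template <typename Visit>
void bresenham3d_x(int x, int y, int z, int x1, int y1, int z1, Visit visit) {
  int dx = std::abs(x1 - x), dy = std::abs(y1 - y), dz = std::abs(z1 - z);
  int sx = x1 > x ? 1 : -1, sy = y1 > y ? 1 : -1, sz = z1 > z ? 1 : -1;
  int ey = 2 * dy - dx;                  // error term for the y axis
  int ez = 2 * dz - dx;                  // error term for the z axis
  for (int i = 0; i <= dx; ++i) {
    visit(x, y, z);                      // emit the current voxel
    if (ey > 0) { y += sy; ey -= 2 * dx; }
    if (ez > 0) { z += sz; ez -= 2 * dx; }
    ey += 2 * dy;
    ez += 2 * dz;
    x += sx;
  }
}
// usage: bresenham3d_x(0,0,0, 10,3,2, [](int x, int y, int z){ /* set voxel */ });
```
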

* * *


HGPU group © 2010-2020 hgpu.org

All rights belong to the respective authors
