high performance computing on graphics processing units: hgpu.org

Posts

Dec, 22

Parallel FDTD Arithmetic Simulation Based on Distributed Heterogeneous Cluster System

This paper puts forward a new parallel FDTD algorithm developed for a distributed platform. The algorithm was debugged on the GPU cluster of the high performance computing center at Shanghai Jiao-tong University, the "Rubik’s Cube" commercial supercomputer at the Shanghai Supercomputer Center, and the "divinity blue" domestic supercomputer platform at the National Supercomputing Center in […]

OpenCL

Dec, 22

Performance and Productivity of Parallel Python Programming: A study with a CFD Test Case

The programming language Python is widely used to rapidly create compact software. However, its low performance compared to low-level programming languages like C or Fortran prevents its use for HPC applications. Efficient parallel programming of multi-core systems and graphics cards is generally a complex task. Python with add-ons might provide a simple approach to program […]

CUDA

Dec, 22

A time-energy performance analysis of MapReduce on heterogeneous systems with GPUs

Motivated by the explosion of Big Data analytics, performance improvements in low-power (wimpy) systems and the increasing energy efficiency of GPUs, this paper presents a time-energy performance analysis of MapReduce on heterogeneous systems with GPUs. We evaluate the time and energy performance of three MapReduce applications with diverse resource demands on a Hadoop-CUDA framework. As […]

CUDA

Dec, 19

Autotuning Stencils Codes with Algorithmic Skeletons

The physical limitations of microprocessor design have forced the industry towards increasingly heterogeneous architectures to extract performance. This trend has not been matched with software tools to cope with such parallelism, leading to a growing disparity between the levels of available performance and the ability for application developers to exploit it. Algorithmic skeletons simplify parallel […]

OpenCL

Dec, 19

Study, Modelling and Implementation of the Level Set Method Used in Micromachining Processes

The main topic of the present thesis is the improvement of fabrication processes simulation by means of the Level Set (LS) method. The LS is a mathematical approach used for evolving fronts according to a motion defined by certain laws. The main advantage of this method is that the front is embedded inside a higher […]

CUDA

Dec, 19

Investigation of the SYCL for OpenCL Programming Model

OpenCL and SYCL for OpenCL are open-standard programming models which enable development of parallel programs which target heterogeneous hardware: systems which contain both general-purpose CPUs and accelerator devices such as GPGPUs or Intel Xeon Phi cards. While OpenCL provides a C API, SYCL provides a C++ API and allows programmers to take advantage of many […]

OpenCL

Dec, 19

Challenges Adapting CUDA PIC Codes to multiple GPUs

A Particle-In-Cell code is a common particle simulation method often used to simulate the behaviour of plasma. In this work, a parallel PIC code is developed in CUDA, with a focus on how to adapt the method for multiple GPUs. An electrostatic three dimensional PIC code is developed, with an FFT-based solver using the cuFFT […]

CUDA

Dec, 19

Efficient Query Processing in Co-Processor-accelerated Databases

Advancements in hardware have shifted the bottleneck of modern database systems from disk IO to main memory access and processing power. Since the performance of modern processors is primarily limited by a fixed energy budget, hardware vendors are forced to specialize processors. Consequently, processors are becoming increasingly heterogeneous, and such heterogeneity has already become a commodity in the form of accelerated […]

CUDA

Dec, 15

Origami: A Convolutional Network Accelerator

Today, advanced computer vision (CV) systems of ever-increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many object detection and localization records. These approaches are very flexible and can […]

Dec, 15

Adaptive algebraic multigrid on SIMD architectures

We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-alpha AMG on SIMD architectures, with particular emphasis on the Intel Xeon Phi processor (KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver code previously developed for the KNC in Regensburg. We optimized the remaining parts of […]

Dec, 15

A CUDA Kernel Scheduler Exploiting Static Data Dependencies

The CUDA execution model of Nvidia’s GPUs is based on the asynchronous execution of thread blocks, where each thread executes the same kernel in a data-parallel fashion. When threads in different thread blocks need to synchronise and communicate, the whole computation launched onto the GPU needs to be stopped and re-invoked in order to facilitate […]

CUDA

Dec, 15

Run-time support for multi-level disjoint memory address spaces

High Performance Computing (HPC) systems have become widely used tools in many industry areas and research fields. Research to produce more powerful and efficient systems has grown in step with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to provide systems with the highest levels of performance. This […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
