
Posts

Jun, 10

CUDA Kernel Design for GPU-Based Beam Dynamics Simulations

Efficient implementation of general-purpose particle tracking on GPUs can yield significant performance benefits for large-scale particle tracking and tracking-based accelerator optimization simulations. We present our work on accelerating Argonne National Lab’s accelerator simulation code ELEGANT [1, 2] using CUDA-enabled GPUs [3]. In particular, we provide an overview of beamline elements ported to […]
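
As a hedged illustration of the kind of per-particle kernel such a port relies on, the sketch below tracks particles through a drift element with one thread per particle. The coordinate layout, names, and element interface are assumptions for this example, not ELEGANT's actual implementation.

    #include <cuda_runtime.h>

    // Minimal drift-element tracking kernel: one thread per particle.
    __global__ void track_drift(double* x, double* xp, double* y, double* yp,
                                int n_particles, double length)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_particles) {
            // In a drift, transverse positions advance linearly with the slopes.
            x[i] += length * xp[i];
            y[i] += length * yp[i];
        }
    }
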
Jun, 10

S-buffer: Sparsity-aware Multi-fragment Rendering

This work introduces S-buffer, an efficient and memory-friendly GPU-accelerated A-buffer architecture for multi-fragment rendering. Memory is organized into variable-size contiguous regions for each pixel, thus avoiding the limitations of linked-list and fixed-array techniques. S-buffer exploits fragment distribution for precise allocation of the needed storage and pixel sparsity (empty pixel ratio) for computing the memory offsets […]
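
A minimal sketch of the counting-pass idea behind per-pixel contiguous regions, assuming a two-pass pipeline (count fragments per pixel, then prefix-sum the counts into offsets); function and buffer names are illustrative, not the paper's API.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Turn per-pixel fragment counts (from a counting render pass) into the start
    // offset of each pixel's contiguous region; a second pass can then scatter
    // fragments using an atomically incremented per-pixel cursor.
    void build_offsets(const thrust::device_vector<unsigned int>& frag_count,
                       thrust::device_vector<unsigned int>& frag_offset)
    {
        thrust::exclusive_scan(frag_count.begin(), frag_count.end(),
                               frag_offset.begin());
    }
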
Jun, 10

Measuring the Impact of Configuration Parameters in CUDA Through Benchmarking

The choice of threadblock size and shape is one of the most important user decisions when a parallel problem is coded to run on GPU architectures. In fact, the threadblock configuration has a significant impact on the overall performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of […]
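
A small harness in the spirit of such benchmarking, timing one launch configuration with CUDA events; the wrapper signature and the choice of what to launch are assumptions for illustration.

    #include <cuda_runtime.h>

    // Time a single kernel launch for a given grid/block configuration.
    // 'launch' is any host wrapper that launches the kernel under test.
    float time_config(void (*launch)(dim3, dim3), dim3 grid, dim3 block)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        launch(grid, block);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;   // milliseconds for this threadblock configuration
    }

Sweeping such a harness over candidate block shapes (e.g. 32x8, 16x16, 64x4) makes the performance impact of the configuration directly measurable.
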
Jun, 9

Scaling Fast Multipole Methods up to 4000 GPUs

The Fast Multipole Method (FMM) is a hierarchical N-body algorithm with linear complexity, high arithmetic intensity, high data locality, hierarchical communication patterns, and no global synchronization. The combination of these features allows the FMM to scale well on large GPU-based systems and to use their compute capability effectively. We present a 1 PFlop/s […]
Jun, 9

Fast Morphological Image Processing Open-Source Extensions for GPU processing with CUDA

GPU architectures offer a significant opportunity for faster morphological image processing, and the NVIDIA CUDA architecture offers a relatively inexpensive and powerful framework for performing these operations. However, the generic morphological erosion and dilation operations in the CUDA NPP library are relatively naive, and their cost grows steeply with increasing structuring element size. The objective of […]
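
To make the scaling issue concrete, here is a naive horizontal grayscale erosion in which the work per pixel grows linearly with the structuring-element radius; faster schemes such as van Herk/Gil-Werman need a near-constant number of operations per pixel regardless of width. The kernel is an illustrative sketch, not the NPP or the extension code.

    // Naive 1-D (horizontal) erosion with a flat structuring element of the
    // given radius; note the O(radius) loop per output pixel.
    __global__ void erode_row_naive(const unsigned char* src, unsigned char* dst,
                                    int width, int height, int radius)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int m = 255;
        for (int k = -radius; k <= radius; ++k) {
            int xs = min(max(x + k, 0), width - 1);   // clamp at image borders
            m = min(m, (int)src[y * width + xs]);
        }
        dst[y * width + x] = (unsigned char)m;
    }
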
Jun, 9

Autotuning Stencil-Based Computations on GPUs

Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the standard diagonal sparse matrix representation and define new matrix and vector data types […]
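
For reference, the standard diagonal (DIA) storage that such an extension starts from admits a very compact sparse matrix-vector product; the kernel below is a generic DIA SpMV sketch with illustrative names, not the paper's extended data types.

    // y = A * x with A stored in DIA format: 'offsets' holds the ndiags diagonal
    // offsets and 'data' stores each diagonal contiguously (n entries per diagonal).
    __global__ void spmv_dia(int n, int ndiags,
                             const int* offsets, const double* data,
                             const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;

        double sum = 0.0;
        for (int d = 0; d < ndiags; ++d) {
            int col = row + offsets[d];
            if (col >= 0 && col < n)
                sum += data[d * n + row] * x[col];
        }
        y[row] = sum;
    }
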
Jun, 9

Encapsulated synchronization and load-balance in heterogeneous programming

Programming models and techniques to exploit parallelism in accelerators, such as GPUs, are different from those used in traditional parallel models for shared- or distributed-memory systems. It is a challenge to blend different programming models to coordinate and exploit devices with very different characteristics and computational power. This paper presents a new extensible framework model […]
Jun, 9

Sparse LU Factorization for Parallel Circuit Simulation on GPU

The sparse solver has become the bottleneck of SPICE simulators. There has been little work on GPU-based sparse solvers because of the high data dependency. This strong data dependency means that parallel sparse LU factorization runs efficiently only on shared-memory computing devices, but the number of CPU cores sharing the same memory is often limited. The state of the […]
Jun, 8

Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines

Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. […]
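
The algorithm/schedule distinction can be sketched in plain C++ with a separable 3x3 box blur: the algorithm (what each output pixel is) stays fixed, while the schedule decides whether the horizontal pass is stored in a temporary or recomputed inside the vertical pass. This is a conceptual illustration, not the paper's actual language or syntax.

    // Schedule 1: compute the horizontal blur fully, store it, then blur vertically.
    void blur_stored(const float* in, float* tmp, float* out, int w, int h)
    {
        for (int y = 0; y < h; ++y)
            for (int x = 1; x < w - 1; ++x)
                tmp[y * w + x] = (in[y*w + x-1] + in[y*w + x] + in[y*w + x+1]) / 3.f;
        for (int y = 1; y < h - 1; ++y)
            for (int x = 1; x < w - 1; ++x)   // interior only; borders left untouched
                out[y * w + x] = (tmp[(y-1)*w + x] + tmp[y*w + x] + tmp[(y+1)*w + x]) / 3.f;
    }

    // Schedule 2: recompute the horizontal blur inside the vertical loop -- no
    // temporary buffer and better locality, at the cost of redundant arithmetic.
    void blur_fused(const float* in, float* out, int w, int h)
    {
        for (int y = 1; y < h - 1; ++y)
            for (int x = 1; x < w - 1; ++x) {
                float a = (in[(y-1)*w + x-1] + in[(y-1)*w + x] + in[(y-1)*w + x+1]) / 3.f;
                float b = (in[ y   *w + x-1] + in[ y   *w + x] + in[ y   *w + x+1]) / 3.f;
                float c = (in[(y+1)*w + x-1] + in[(y+1)*w + x] + in[(y+1)*w + x+1]) / 3.f;
                out[y * w + x] = (a + b + c) / 3.f;
            }
    }
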
Jun, 8

Ameliorating Memory Contention of OLAP operators on GPU Processors

Implementations of database operators on GPU processors have shown dramatic performance improvement compared to multicore-CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when threads read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to […]
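
A minimal sketch of the bank-conflict pattern at issue: a warp whose threads hit the same shared-memory bank serializes its accesses, and padding a shared tile by one element is the classic remedy. The transpose below is a generic illustration, not one of the paper's OLAP operators.

    #define TILE 32

    // Tiled transpose through shared memory; the +1 padding makes column-wise
    // reads of 'tile' fall into distinct banks, avoiding 32-way conflicts.
    __global__ void transpose_padded(const float* in, float* out, int n)
    {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
    }
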
Jun, 8

A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units

The influence of multi-core central processing units and graphics processing units on several algebraic multigrid methods is investigated in this work. Different performance metrics traditionally employed for algebraic multigrid are reconsidered and reevaluated on these novel computing architectures. Our benchmark results show that with the use of graphics processing units for the solver phase, it […]
Jun, 8

Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems

Heterogeneous CPU-GPU nodes are becoming popular in HPC clusters, and algorithms and optimization techniques for such systems need to be rethought depending on the relative performance of the CPU vs. the GPU. In this paper, we report a performance-optimized particle simulation code, "OTOO", based on the octree method for heterogeneous systems. The main applications of […]

