23548

Posts

Sep, 27

Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems

It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems, and existing work […]
Sep, 27

Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments

In this work, we adapt a reconfigurable computer system based on FPGA technologies to OpenCL programming environments. The reconfigurable system is part of a compute prototype of the MANGO European project that includes 96 FPGAs. To optimize the use and to obtain its maximum performance, it is essential to adapt it to heterogeneous systems programming […]
Sep, 27

RoadRunner: a fast and flexible exoplanet transit model

I present RoadRunner, a fast exoplanet transit model that can use any radially symmetric function to model stellar limb darkening while still being faster to evaluate than the analytical transit model for quadratic limb darkening by Mandel & Agol (2002). CPU and GPU implementations of the model are available in the PyTransit transit modelling package, […]
Sep, 20

Applications of Deep Neural Networks

Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies […]
Sep, 20

Designing Efficient Barriers and Semaphores for Graphics Processing Units

General-purpose GPU applications that use fine-grained synchronization to enforce ordering between many threads accessing shared data have become increasingly popular. Thus, it is imperative to create more efficient GPU synchronization primitives for these applications. Accordingly, in recent years there has been a push to establish a single, unified set of GPU synchronization primitives. However, unlike […]
Sep, 20

A Comparison of Optimal Scanline Voxelization Algorithms

This thesis presents a comparison between different algorithms for optimal scanline voxelization of 3D models. As the optimal scanline relies on line voxelization, three such algorithms were evaluated. These were Real Line Voxelization (RLV), Integer Line Voxelization (ILV) and a 3D Bresenham line drawing algorithm. RLV and ILV were both based on voxel traversal by […]
Sep, 20

WarpCore: A Library for fast Hash Tables on GPUs

Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In […]
Sep, 20

PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems

In the past decade, high performance compute capabilities exhibited by heterogeneous GPGPU platforms have led to the popularity of data parallel programming languages such as CUDA and OpenCL. Such languages, however, involve a steep learning curve as well as developing an extensive understanding of the underlying architecture of the compute devices in heterogeneous platforms. This […]
Sep, 13

Tools for GPU Computing–Debugging and Performance Analysis of Heterogenous HPC Applications

General purpose GPUs are now ubiquitous in high-end supercomputing. All but one (the Japanese Fugaku system, which is based on ARM processors) of the announced (pre-)exascale systems contain vast amounts of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC […]
Sep, 13

Accelerating High-Order Stencils on GPUs

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of proposed enhancements work well for high-order stencils, such […]
Sep, 13

HPX – The C++ Standard Library for Parallelism and Concurrency

The new challenges presented by exascale system architectures have resulted in difficulty achieving the desired scalability using traditional distributed-memory runtimes. Asynchronous many-task systems (AMT) are based on a new paradigm showing promise in addressing these challenges, providing application developers with a productive and performant approach to programming on next generation systems. HPX is a C++ […]
Sep, 13

GPA: A GPU Performance Advisor Based on Instruction Sampling

Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: