
Posts

Oct, 17

An Optimization for Fast Generation of Digital Hologram

Digital hologram generation commonly relies on the computer-generated hologram (CGH) algorithm, which requires complicated computation. This paper therefore proposes an optimization method for the fast generation of digital holograms. The proposed method uses CUDA and OpenMP to target multiple GPUs, and applies several optimization techniques (variable fixation, vectorization, and loop unrolling) to the CGH algorithm. […]
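For readers unfamiliar with the last technique named above, loop unrolling can be illustrated with a small, generic C++ sketch; this is a hypothetical example, not the paper's CGH kernel:

```cpp
#include <cstddef>

// Sum an array with 4-way manual loop unrolling. Four independent
// accumulators let the compiler and CPU overlap the additions
// instead of serializing them through one dependency chain.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    // Main unrolled loop: processes four elements per iteration.
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    // Remainder loop for n not divisible by 4.
    for (; i < n; ++i) s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

The same idea applies inside a CUDA or OpenMP-parallelized kernel, where each thread unrolls its own chunk of the iteration space.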
Oct, 17

Dynamic Fine-Grain Scheduling of Pipeline Parallelism

Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or […]
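The programming model described above, stages communicating explicitly through queues, can be sketched as a minimal blocking queue between two stages. This is an illustrative C++ fragment only, not the paper's scheduler:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// A minimal blocking queue connecting two pipeline stages.
// push() is called by the upstream stage, pop() by the
// downstream stage, which blocks until an item is available.
template <typename T>
class StageQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(v));
        }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};
```

A producer thread would push work items while a consumer stage pops them; the paper's contribution is the dynamic, fine-grain load balancing of such stages, which this sketch does not attempt.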
Oct, 17

Programming with Explicit Dependencies: A Framework for Portable Parallel Programming

Computational devices are rapidly evolving into massively parallel systems. Multicore processors are already standard; high performance processors such as the Cell/BE processor, graphics processing units (GPUs) featuring hundreds of on-chip processors, and reconfigurable devices such as FPGAs are all developed to deliver high computing power. They make parallelism commonplace, not only the privilege of expensive […]
Oct, 17

A High Performance Parallel Sparse Linear Equation Solver Using CUDA

The management of electric power systems requires continuously computing the power flow of a power system in real time. For large power systems, this task is often beyond the capabilities of modern CPUs. Concurrent computation is an attractive approach to accelerating it. However, the power-flow computation requires solving a large system of sparse linear equations. This problem […]
Oct, 16

Hard-Sphere Collision Simulations with Multiple GPUs, PCIe Extension Buses and GPU-GPU Communications

Simulating particle collisions is an important application for physics calculations as well as for various effects in computer games and movie animations. The increasing demand for physical correctness, and hence visual realism, calls for higher-order time-integration methods and more sophisticated collision-management algorithms. We report on the use of single and multiple Graphical Processing Units (GPUs) […]
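As background, the elementary building block of hard-sphere simulation is the elastic collision update, which conserves momentum and kinetic energy. Below is the standard textbook one-dimensional form in C++, not the paper's algorithm:

```cpp
#include <utility>

// One-dimensional elastic collision between two hard spheres with
// masses m1, m2 and incoming velocities v1, v2. Returns the
// post-collision velocities (u1, u2). Both total momentum
// m1*v1 + m2*v2 and total kinetic energy are conserved.
std::pair<double, double> elastic_collide(double m1, double v1,
                                          double m2, double v2) {
    double inv = 1.0 / (m1 + m2);
    double u1 = ((m1 - m2) * v1 + 2.0 * m2 * v2) * inv;
    double u2 = ((m2 - m1) * v2 + 2.0 * m1 * v1) * inv;
    return {u1, u2};
}
```

For equal masses the spheres simply exchange velocities; the challenge the paper addresses is doing many such updates, plus collision detection, across multiple GPUs and PCIe buses.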
Oct, 16

Bit-Packed Damaged Lattice Potts Model Simulations with CUDA and GPUs

Models such as the Ising and Potts systems lend themselves well to simulating the phase transitions that commonly arise in materials science. A particularly interesting variation is when the material being modelled has lattice defects, dislocations or broken bonds and the material experiences a Griffiths phase. The damaged Potts system consists of a set of […]
Oct, 16

Asynchronous Communication for Finite-Difference Simulations on GPU Clusters using CUDA and MPI

Graphical Processing Units (GPUs) are finding widespread use as accelerators in computer clusters, but it is not yet trivial to program applications that use multiple GPU-enabled cluster nodes efficiently. A key aspect of this is managing communication effectively between GPU memories on separate devices and separate nodes. We develop an algorithmic framework for finite-difference numerical simulations […]
Oct, 16

High performance finite difference PDE solvers on GPUs

We show how to implement highly efficient GPU solvers for one-dimensional PDEs based on finite-difference schemes. The typical use case is to price a large number of similar or related derivatives in parallel. Application scenarios include market making, real-time pricing, and risk management. The tridiagonal systems in the backward propagation of a […]
Oct, 16

An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently because of the limited size of on-chip shared memory. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various […]
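For context, the classic serial baseline for the problem that both tridiagonal papers above target is the Thomas algorithm, an O(n) forward-elimination/back-substitution pass. The sketch below is that textbook baseline, not the paper's multi-stage GPU method:

```cpp
#include <cstddef>
#include <vector>

// Thomas algorithm: solves the tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
// with a[0] and c[n-1] unused. Assumes the system is
// diagonally dominant so no pivoting is needed.
std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                           std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    // Forward elimination: zero out the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    // Back substitution.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

The forward/backward dependency chains are what make this algorithm inherently serial, motivating GPU approaches that split a large system into many small independent ones.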
Oct, 16

GPU-to-CPU callbacks

We present GPU-to-CPU callbacks, a new mechanism and abstraction for GPUs that offers them more independence in a heterogeneous computing environment. Specifically, we provide a method for GPUs to issue callback requests to the CPU. These requests serve as a tool for ease-of-use, future proofing of code, and new functionality. We classify the types of […]
Oct, 16

The Anatomy of High-Performance 2D Similarity Calculations

Similarity measures based on the comparison of dense bit vectors of two-dimensional chemical features are a dominant method in chemical informatics. For large-scale problems, including compound selection and machine learning, computing the intersection between two dense bit vectors is the overwhelming bottleneck. We describe efficient implementations of this primitive as well as example applications using […]
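The intersection primitive identified above as the bottleneck reduces to population counts over packed words. A generic C++ illustration of the Tanimoto similarity on 64-bit-packed bit vectors follows (using the GCC/Clang `__builtin_popcountll` intrinsic; this is not the paper's implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Tanimoto similarity between two equal-length dense bit vectors
// packed into 64-bit words: popcount(A & B) / popcount(A | B).
// The popcount of A & B is the intersection primitive that
// dominates the cost of large-scale 2D similarity calculations.
double tanimoto(const std::vector<std::uint64_t>& a,
                const std::vector<std::uint64_t>& b) {
    std::uint64_t inter = 0, uni = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        inter += __builtin_popcountll(a[i] & b[i]);
        uni   += __builtin_popcountll(a[i] | b[i]);
    }
    // Two all-zero fingerprints are conventionally fully similar.
    return uni ? static_cast<double>(inter) / uni : 1.0;
}
```

On hardware with a native popcount instruction (SSE4.2 `POPCNT`, or CUDA's `__popcll` on the GPU) this inner loop compiles to a handful of instructions per word.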
Oct, 16

Data Layout Pruning on GPU

This work is based on the NVIDIA GTX 280 using CUDA (Compute Unified Device Architecture). We classify datasets to be transferred into the CUDA memory hierarchy as SW (shared and must be written) or SR (shared but read-only), and the existing memory spaces (including shared memory, constant memory, texture memory and global memory) supported on CUDA-enabled GPU memory […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: