5668

Posts

Sep, 15

Processing data streams with hard real-time constraints on heterogeneous systems

Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency — to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput — to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs […]
Sep, 15

Strategies for preparing computer science students for the multicore world

Multicore computers have become standard, and the number of cores per computer is rising rapidly. How does the new demand for understanding of parallel computing impact computer science education? In this paper, we examine several aspects of this question: (i) What parallelism body of knowledge do todaya’s students need to learn? (ii) How might these […]
Sep, 15

Performing with CUDA

Recently a GPGPU application had to be redesigned to overcome performance problems. A number of software engineering lessons were learnt from this and other projects. We describe those about obtaining high performance from nVidia GPUs and practical aspects of CUDA C software development.
Sep, 15

Fast Mersenne prime testing on the GPU

The Lucas-Lehmer test for Mersenne primality can be efficiently parallelized for GPU-based computation. The gpuLucas project implements an irrational-base discrete weighted transform approach (IBDWT) using balanced-integers, non-power-of-two transforms, and carry-save radix representations. gpuLucas uses the CUDA programming language and requires the double-precision floating point capabilities of recent GPUs. Results show up to 7x speedups over […]
Sep, 15

Scaling Lattice QCD beyond 100 GPUs

Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte Carlo "analysis" phase which accounts for a substantial fraction of the workload in a typical LQCD calculation, the […]
Sep, 15

Scalable and deterministic timing-driven parallel placement for FPGAs

This paper describes a parallel implementation of the timing-driven VPR 5.0 simulated annealing engine. By restricting the move distance to a confined neighborhood, it is possible to consider a large number of non-conflicting moves in parallel and achieve a deterministic result. The full timing-driven algorithm is parallelized, including the detailed timing analysis updates done periodically […]
Sep, 15

A platform-independent tool for modeling parallel programs

Programming languages that can utilize the underlying parallel architecture in shared memory, distributed memory or Graphics Processing Units (GPUs) are used extensively for solving scientific problems. However, from our observation of studying multiple parallel programs from various domains, such programming languages have a substantial amount of sequential code mixed with the parallel code. When rewriting […]
Sep, 15

Ambient occlusion volumes

This paper introduces a new approximation algorithm for the near-field ambient occlusion problem. It combines known pieces in a new way to achieve substantially improved quality over fast methods and substantially improved performance compared to accurate methods. Intuitively, it computes the analog of a shadow volume for ambient light around each polygon, and then applies […]
Sep, 15

Visual simulation of shockwaves

We present an efficient method for visual simulations of shock phenomena in compressible, inviscid fluids. Our algorithm is derived from one class of the finite volume method especially designed for capturing shock propagation, but offers improved efficiency through physically-based simplification and adaptation for graphical rendering. Our technique is capable of handling complex, bidirectional object-shock interactions […]
Sep, 14

Auto-tuning of fast fourier transform on graphics processors

We present an auto-tuning framework for FFTs on graphics processors (GPUs). Due to complex design of the memory and compute subsystems on GPUs, the performance of FFT kernels over the range of possible input parameters can vary widely. We generate several variants for each component of the FFT kernel that, for different cases, are likely […]
Sep, 14

Shadowfax: scaling in heterogeneous cluster systems via GPGPU assemblies

Systems with specialized processors such as those used for accel- erating computations (like NVIDIA’s graphics processors or IBM’s Cell) have proven their utility in terms of higher performance and lower power consumption. They have also been shown to outperform general purpose processors in case of graphics intensive or high performance applications and for enterprise applications […]
Sep, 14

OptiX: a general purpose ray tracing engine

The NVIDIA OptiX ray tracing engine is a programmable system designed for NVIDIA GPUs and other highly parallel architectures. The OptiX engine builds on the key observation that most ray tracing algorithms can be implemented using a small set of programmable operations. Consequently, the core of OptiX is a domain-specific just-in-time compiler that generates custom […]

Recent source codes

* * *

* * *

HGPU group © 2010-2026 hgpu.org

All rights belong to the respective authors

Contact us: