Posts
Mar, 13
Parallel Branch and Bound on a CPU-GPU System
A hybrid CUDA implementation of a branch and bound method for knapsack problems is proposed. Branch and bound computations can be carried out either on the CPU or on the GPU, depending on the size of the branch and bound list, i.e. the number of nodes. Tests are carried out on a Tesla C2050 GPU. […]
Mar, 13
Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries
With GPU architectures becoming increasingly important due to their large number of parallel processors, NVIDIA’s CUDA environment is becoming widely used to support general purpose applications. To use this parallel processing power efficiently, programmers need to parallelize and map their algorithms efficiently. The difficulty of this task motivates an investigation of CUDA’s compiler. […]
Mar, 13
Real-time execution of image change detection
State-of-the-art video analysis systems feature multiple complex processing steps and operate on high resolution images. Substantial computing power is needed for real-time execution. In this project an image change detection application is mapped to a heterogeneous multicore CPU/GPU platform. The project investigates which hardware configuration is required to execute the application in real time. For optimal […]
Mar, 12
Dynamic Compilation of Data-Parallel Kernels for Vector Processors
Modern processors enjoy augmented throughput and power efficiency through specialized functional units leveraged via instruction set extensions. These functional units accelerate performance for specific types of operations but must be programmed explicitly. Moreover, applications targeting these specialized units will not take advantage of future ISA extensions and tend not to be portable across multiple ISAs. […]
Mar, 12
GPU Accelerated Computation of Fast Spectral Transforms
This paper discusses techniques for accelerated computation of several fast spectral transforms on graphics processing units (GPUs) using the Open Computing Language (OpenCL). We present a reformulation of fast algorithms which takes into account the particular properties of the transforms to make them suitable for GPU implementation. Special attention is paid to the organization of […]
Mar, 12
A GPU Algorithm for Greedy Graph Matching
Greedy graph matching provides us with a fast way to coarsen a graph during graph partitioning. Direct algorithms on the CPU which perform such greedy matchings are simple and fast, but offer few handholds for parallelisation. To remedy this, we introduce a fine-grained shared-memory parallel algorithm for maximal greedy matching, together with an implementation on […]
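For readers unfamiliar with the technique, the following is a minimal Python sketch of the simple sequential greedy matching that the abstract contrasts with its parallel algorithm. It is not the paper's GPU method; the function name `greedy_matching` and the edge-list input format are assumptions made for illustration.

```python
from typing import List, Tuple


def greedy_matching(num_vertices: int, edges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sequential greedy maximal matching (illustrative sketch only).

    Scans the edges in the given order and keeps an edge whenever both of
    its endpoints are still unmatched. The result is maximal: no remaining
    edge could be added without sharing a vertex with a chosen edge.
    """
    matched = [False] * num_vertices
    matching = []
    for u, v in edges:
        # Skip self-loops and edges touching an already matched vertex.
        if u != v and not matched[u] and not matched[v]:
            matched[u] = matched[v] = True
            matching.append((u, v))
    return matching


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3
    print(greedy_matching(4, [(0, 1), (1, 2), (2, 3)]))
```

Each decision depends on the matched state left by all earlier edges, so the loop is inherently ordered, and the outcome depends on the order in which edges are visited.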
Mar, 12
Hybrid general-purpose computation on GPU (GPGPU) and computer graphics synthetic aperture radar simulation for complex scenes
In this paper, a new hybrid general-purpose computation on GPU (GPGPU) and computer graphics synthetic aperture radar (SAR) simulation method for complex scenes is proposed. Previous SAR simulations for complex scenes use only the GPU’s graphics capabilities for scattering calculation in the graphical electromagnetic computing (GRECO) algorithm. The new hybrid method uses the GPU’s graphics and parallel computing […]
Mar, 12
A Study of Real-Time Lighting Effects
Realistic lighting is an incredibly complex problem. All surfaces scatter light to all other surfaces. Realistic lighting in volumes of fog or smoke is even more complex because each particle absorbs and scatters light. This problem has been approximated with many techniques but can take hours to produce a single image. Creating these images in […]
Mar, 11
GPU Accelerated Real-Time Object Detection on High Resolution Videos Using Modified Census Transform
This paper presents a novel GPU accelerated object detection system using CUDA. Because of its detection accuracy, speed and robustness to illumination variations, a boosting based approach with Modified Census Transform features is used. Results are given on the face detection problem for evaluation. Results show that even our single-GPU implementation can run in real-time […]
Mar, 11
Better speedups using simpler parallel programming for graph connectivity and biconnectivity
Speedups are demonstrated for finding the biconnected components of a graph: 9x to 33x on the Explicit Multi-Threading (XMT) many-core computing platform relative to the best serial algorithm, using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two recent NVIDIA […]
Mar, 11
NUMA Data-Access Bandwidth Characterization and Modeling
Clusters of seemingly homogeneous compute nodes are increasingly heterogeneous within each node due to replication and distribution of node-level subsystems. This intra-node heterogeneity can adversely affect program execution performance by inflicting additional data-access performance penalties when accessing non-local data. In many modern NUMA architectures, both memory and I/O controllers are distributed within a node and […]
Mar, 11
An Algorithm for Fast Edit Distance Computation on GPUs
Finding the edit distance between two sequences (and the closely related problem of the longest common subsequence) is an important problem with applications in many domains, such as virus scanners, security kernels, natural language translation and genome sequence alignment. The traditional dynamic-programming-based algorithm is hard to parallelize on SIMD processors as the algorithm is […]
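As background, here is a minimal Python sketch of the traditional dynamic-programming algorithm for edit distance (the Levenshtein variant with unit costs). It is the textbook CPU baseline, not the GPU algorithm proposed in the paper; the function name `edit_distance` and the unit-cost assumption are illustrative choices.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program.

    Assumes unit cost for insertion, deletion and substitution.
    """
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

In this formulation each cell reads its left, upper and upper-left neighbours, so the cells of a row cannot all be computed independently of one another.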

