Posts
Feb, 19
Memory-efficient Adaptive Subdivision for Software Rendering on the GPU
The adaptive subdivision step for surface tessellation is a key component of the Reyes rendering pipeline. While this operation has been successfully parallelized for execution on the GPU using a breadth-first traversal, the resulting implementations are limited by their high worst-case memory consumption and high global memory bandwidth utilization. This report proposes an alternate strategy […]
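As a rough, self-contained illustration of what a breadth-first (level-by-level) adaptive subdivision looks like, here is a minimal CPU-side Python sketch; the patch representation, size estimate and threshold are placeholders, not the report's implementation:

```python
from collections import namedtuple

# A parametric patch is just a (u, v) rectangle here; a real Reyes
# pipeline would carry control points and screen-space bounds instead.
Patch = namedtuple("Patch", "u0 u1 v0 v1")

def screen_extent(patch):
    # Placeholder size estimate: parametric area stands in for the
    # projected screen-space extent a real renderer would measure.
    return (patch.u1 - patch.u0) * (patch.v1 - patch.v0)

def subdivide_breadth_first(root, threshold, max_levels=16):
    """Split patches level by level until each is small enough to dice."""
    current, diceable, level = [root], [], 0
    while current and level < max_levels:
        next_level = []
        for p in current:                      # one level = one parallel pass
            if screen_extent(p) <= threshold:
                diceable.append(p)             # small enough: emit for dicing
            else:                              # too large: split into four
                um, vm = 0.5 * (p.u0 + p.u1), 0.5 * (p.v0 + p.v1)
                next_level += [Patch(p.u0, um, p.v0, vm),
                               Patch(um, p.u1, p.v0, vm),
                               Patch(p.u0, um, vm, p.v1),
                               Patch(um, p.u1, vm, p.v1)]
        current, level = next_level, level + 1
    return diceable + current                  # leftovers if max_levels was hit

patches = subdivide_breadth_first(Patch(0.0, 1.0, 0.0, 1.0), threshold=0.01)
print(len(patches), "diceable patches")
```

Note how the per-level worklist can grow by up to 4x per pass; this geometric growth is exactly the worst-case memory behaviour that breadth-first GPU implementations are criticized for.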
Feb, 19
NMF-mGPU: non-negative matrix factorization on multi-GPU systems
BACKGROUND: In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. […]
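For readers unfamiliar with the underlying factorization, a minimal CPU-only NumPy sketch of the classic multiplicative-update NMF (illustrative only; not the NMF-mGPU code) is:

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (m x n) as W @ H, with W (m x k)
    and H (k x n), using Lee & Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Update H, then W; eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((100, 40)))
W, H = nmf(V, k=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Each iteration is dominated by dense matrix products, which is what makes the method a natural candidate for (multi-)GPU acceleration.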
Feb, 13
NUPAR: A Benchmark Suite for Modern GPU Architectures
Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use among application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all […]
Feb, 13
Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures
Emerging hybrid accelerator architectures for high performance computing are often well suited to a data-parallel programming model. Unfortunately, programmers of these architectures face a steep learning curve that frequently requires learning a new language (e.g., OpenCL). Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides […]
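As background on the targeted class of computations (and not the paper's proposed programming model), a plain NumPy sketch of one five-point Jacobi stencil sweep is:

```python
import numpy as np

def jacobi_step(u):
    """One five-point stencil sweep: each interior cell becomes the
    average of its four neighbours (classic relaxation update)."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((64, 64))
u[0, :] = 1.0                 # fixed boundary condition on one edge
for _ in range(100):
    u = jacobi_step(u)
print(u[1:4, 1:4])
```

Distributing such a grid across the multi-level memory of an accelerator cluster requires partitioning the domain and exchanging halo cells at partition boundaries, which is precisely the bookkeeping a higher-level programming model aims to hide.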
Feb, 13
Quadratic Pseudo-Boolean Optimization for Scene Analysis using CUDA
Many problems in early computer vision, like image segmentation, image reconstruction, 3D vision or object labeling, can be modeled by Markov Random Fields (MRFs). General algorithms to optimize an MRF, like Simulated Annealing, Belief Propagation or Iterated Conditional Modes, are either slow or produce low-quality results [Rother 07]. On the other hand, in the […]
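For context, the pairwise MRF energy minimized in such labeling problems can be written, in standard notation (not taken from the paper), as:

```latex
E(\mathbf{x}) = \sum_{i \in \mathcal{V}} \theta_i(x_i)
              + \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(x_i, x_j),
\qquad x_i \in \{0, 1\}
```

For binary labels this energy is a quadratic pseudo-Boolean function; QPBO minimizes it via a graph construction that tolerates non-submodular pairwise terms, at the cost of possibly leaving some variables unlabeled.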
Feb, 13
Large-Scale Deep Learning on the YFCC100M Dataset
We present a work-in-progress snapshot of learning with a 15-billion-parameter deep learning network on HPC architectures, applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements […]
Feb, 13
Primal Dual Affine Scaling on GPUs
Here we present an implementation of the Primal-Dual Affine Scaling method for solving linear optimization problems on GPU-based systems. Strategies to convert the system generated by the complementary slackness theorem into a symmetric system are given. A new CUDA-friendly technique to solve the resulting symmetric positive definite subsystem is also developed. Various strategies to reduce […]
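To make the "symmetric system" remark concrete: in one standard textbook formulation (not necessarily the paper's exact derivation), the affine-scaling step for min cᵀx subject to Ax = b, x ≥ 0 solves

```latex
A\,\Delta x = 0, \qquad
A^{\mathsf{T}}\Delta y + \Delta s = 0, \qquad
S\,\Delta x + X\,\Delta s = -XSe,
\qquad X = \mathrm{diag}(x),\; S = \mathrm{diag}(s),
```

and eliminating Δx and Δs reduces this to the symmetric positive definite system (A X S⁻¹ Aᵀ) Δy = b for a primal-feasible x. Solving that subsystem dominates each iteration, which is why it is the natural target for a GPU-friendly solver.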
Feb, 12
A Real-time GPU Implementation of the SIFT Algorithm for Large-Scale Video Analysis Tasks
The SIFT algorithm is one of the most popular feature extraction methods and is therefore widely used in all sorts of video analysis tasks, such as instance search and duplicate/near-duplicate detection. We present an efficient GPU implementation of the SIFT descriptor extraction algorithm using CUDA. The major steps of the algorithm are presented, and for each step […]
Feb, 10
FSCL: Homogeneous programming, scheduling and execution on heterogeneous platforms
The last few years have seen activity towards programming models, languages and frameworks that address the increasingly wide range and broad availability of heterogeneous computing resources through raised programming abstraction and portability across different platforms. The effort spent in simplifying parallel programming across heterogeneous platforms is often outweighed by the need for low-level control over […]
Feb, 10
GPU-accelerated HMM for Speech Recognition
Speech recognition is used in a wide range of applications and devices such as mobile phones, in-car entertainment systems and web-based services. Hidden Markov Models (HMMs) are among the most popular algorithmic approaches applied in speech recognition. Training and testing an HMM is computationally intensive and time-consuming. Running multiple applications concurrently with speech recognition […]
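As a reminder of the core computation involved, a minimal NumPy sketch of the HMM forward algorithm (sequence evaluation; illustrative only, not the paper's GPU kernels) is:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | HMM).
    pi:  (N,)   initial state probabilities
    A:   (N, N) transition matrix, A[i, j] = P(state j | state i)
    B:   (N, M) emission matrix,  B[i, k] = P(symbol k | state i)
    obs: list of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # one time step: predict, then weight
    return alpha.sum()

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],
               [0.1, 0.9]])
print(forward(pi, A, B, obs=[0, 1, 1, 0]))
```

In practice the recursion is run in log space or with per-step rescaling to avoid numerical underflow on long observation sequences.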
Feb, 10
Analysis and Modeling of the Timing Behavior of GPU Architectures
Graphics processing units (GPUs) offer massive parallelism. For several years now, GPUs have also been usable for more general-purpose applications; a wide variety of applications can be accelerated efficiently using the CUDA and OpenCL programming models. Real-time systems frequently use many sensors that produce large amounts of data. GPUs […]
Feb, 10
Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performance or is […]
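To illustrate the flavour of such rewrite rules with a toy example (plain Python here; the actual system works on richer functional patterns and emits OpenCL), consider map fusion:

```python
# A high-level 'map' pattern written as plain Python for illustration.
def pmap(f, xs):
    return [f(x) for x in xs]

# Rewrite rule (map fusion):  map f (map g xs)  ==  map (f . g) xs.
# In a GPU backend the same rewrite removes an intermediate buffer and
# an extra kernel launch; here it merely removes an intermediate list.
def fuse_maps(f, g):
    return lambda x: f(g(x))

xs = list(range(8))
lhs = pmap(lambda x: x + 1, pmap(lambda x: x * x, xs))
rhs = pmap(fuse_maps(lambda x: x + 1, lambda x: x * x), xs)
assert lhs == rhs
print(rhs)
```

Encoding such rules explicitly lets a systematic search over rule applications derive device-specific variants from a single high-level program, rather than hand-tuning code for each target.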