Feb, 22

Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool (open-source code)

To enable the design of large caches, novel memory technologies (such as non-volatile memory) and novel fabrication approaches (e.g. 3D stacking) have been explored. The existing modeling tools, however, cover only a few memory technologies, CMOS technology nodes and fabrication approaches. We present DESTINY, a tool for modeling 3D (and 2D) cache designs using SRAM, […]
Feb, 19

Reproducible Triangular Solvers for High-Performance Computing

On modern parallel architectures, floating-point computations may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by leveraging the standard parallel triangular solver and our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and […]
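
The non-associativity mentioned in this abstract is easy to see in a few lines. The sketch below is only an illustration, not the authors' multi-level exact summation approach: it uses Python's standard `math.fsum` as a stand-in for an exact summation, and the helper `naive_sum` and the random test data are assumptions made for the example.

```python
import math
import random


def naive_sum(values):
    """Left-to-right floating-point accumulation (order-dependent)."""
    total = 0.0
    for v in values:
        total += v
    return total


# Grouping changes the rounded result, so addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Hypothetical test data with widely varying magnitudes.
random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
        for _ in range(10_000)]
shuffled = data[:]
random.shuffle(shuffled)  # mimics a different parallel reduction order

# The naive sums may differ between the two orderings.
print(naive_sum(data), naive_sum(shuffled))

# math.fsum tracks exact partial sums and rounds once at the end,
# so the result does not depend on the order of the inputs.
print(math.fsum(data) == math.fsum(shuffled))
```

Shuffling the input stands in for the varying reduction order of a parallel run; a summation that rounds only once gives the same answer regardless of that order.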
Feb, 19

Fast, Memory-Efficient Construction of Voxelized Shadows

We present a fast and memory-efficient algorithm for generating Compact Precomputed Voxelized Shadows. By performing much of the common sub-tree merging before identical nodes are ever created, we improve construction times by several orders of magnitude for large data structures, and require much less working memory. We also propose a new set of rules […]
Feb, 19

Auto-tuning Shallow water simulations on GPUs

Graphics processing units (GPUs) have gained popularity in scientific computing in recent years because of the massive computing power they can provide for parallel tasks. While GPUs are powerful, it is also hard to fully utilize their power. A part of this difficulty comes from the many parameters available, and tuning of […]
Feb, 19

Memory-efficient Adaptive Subdivision for Software Rendering on the GPU

The adaptive subdivision step for surface tessellation is a key component of the Reyes rendering pipeline. While this operation has been successfully parallelized for execution on the GPU using a breadth-first traversal, the resulting implementations are limited by their high worst-case memory consumption and high global memory bandwidth utilization. This report proposes an alternate strategy […]
Feb, 19

NMF-mGPU: non-negative matrix factorization on multi-GPU systems

BACKGROUND: In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. […]
Feb, 13

NUPAR: A Benchmark Suite for Modern GPU Architectures

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all […]
Feb, 13

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Emerging hybrid accelerator architectures for high performance computing are often suited for the use of a data-parallel programming model. Unfortunately, programmers of these architectures face a steep learning curve that frequently requires learning a new language (e.g., OpenCL). Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides […]
Feb, 13

Quadratic Pseudo-Boolean Optimization for Scene Analysis using CUDA

Many problems in early computer vision, like image segmentation, image reconstruction, 3D vision or object labeling, can be modeled by Markov Random Fields (MRF). General algorithms to optimize an MRF, like Simulated Annealing, Belief Propagation or Iterated Conditional Modes, are either slow or produce low-quality results [Rother 07]. On the other hand, in the […]
Feb, 13

Large-Scale Deep Learning on the YFCC100M Dataset

We present a work-in-progress snapshot of learning with a 15-billion-parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements […]
Feb, 13

Primal Dual Affine Scaling on GPUs

Here we present an implementation of the primal-dual affine scaling method to solve linear optimization problems on GPU-based systems. Strategies to convert the system generated by the complementary slackness theorem into a symmetric system are given. A new CUDA-friendly technique to solve the resulting symmetric positive definite subsystem is also developed. Various strategies to reduce […]
Feb, 12

A Real-time GPU Implementation of the SIFT Algorithm for Large-Scale Video Analysis Tasks

The SIFT algorithm is one of the most popular feature extraction methods and is therefore widely used in all sorts of video analysis tasks, like instance search and duplicate/near-duplicate detection. We present an efficient GPU implementation of the SIFT descriptor extraction algorithm using CUDA. The major steps of the algorithm are presented and for each step […]
