
Posts

Feb, 22

Comparison of SpMV performance on matrices with different matrix formats using CUSP, cuSPARSE and ViennaCL

ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP. In addition to core functionality and many other features including BLAS level 1-3 support and iterative solvers, the latest release family ViennaCL 1.6.x provides fast […]
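As a point of reference for the formats being compared, the sketch below is a minimal scalar CSR SpMV kernel in plain CUDA, one thread per row. It is an illustrative baseline only, not one of the tuned kernels shipped by CUSP, cuSPARSE or ViennaCL.

__global__ void spmv_csr_scalar(int num_rows,
                                const int*    row_ptr,   // size num_rows + 1
                                const int*    col_idx,   // size nnz
                                const double* values,    // size nnz
                                const double* x,
                                double*       y)
{
    // One thread per row: each thread accumulates its row's dot product.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
            sum += values[jj] * x[col_idx[jj]];
        y[row] = sum;
    }
}

// Launch example:
// spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(num_rows, row_ptr, col_idx, values, x, y);

Vector (one warp per row), ELL and HYB variants trade memory layout for coalescing, which is exactly the format dependence such a comparison measures.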
Feb, 22

QPACE 2 and Domain Decomposition on the Intel Xeon Phi

We give an overview of QPACE 2, which is a custom-designed supercomputer based on Intel Xeon Phi processors, developed in a collaboration of Regensburg University and Eurotech. We give some general recommendations for how to write high-performance code for the Xeon Phi and then discuss our implementation of a domain-decomposition-based solver and present a number […]
Feb, 22

RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures

This document describes an implementation in C of a set of randomized algorithms for computing partial Singular Value Decompositions (SVDs). The techniques largely follow the prescriptions in the article "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," N. Halko, P.G. Martinsson, J. Tropp, SIAM Review, 53(2), 2011, pp. 217-288, but with some […]
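The randomized scheme of Halko, Martinsson and Tropp computes a partial SVD of A ∈ R^{m×n} from a sketch of its range. The standard formulation is shown below for orientation, without RSVDPACK's oversampling and power-iteration details:

\begin{aligned}
&\Omega \in \mathbb{R}^{n \times (k+p)} \ \text{Gaussian random}, \qquad Y = A\,\Omega,\\
&Y = QR \quad\Rightarrow\quad A \approx Q Q^{\mathsf T} A,\\
&B = Q^{\mathsf T} A, \qquad B = \hat{U}\,\Sigma\,V^{\mathsf T} \ \text{(small dense SVD)},\\
&U = Q\hat{U} \quad\Rightarrow\quad A \approx U\,\Sigma\,V^{\mathsf T}.
\end{aligned}

Here k is the target rank and p a small oversampling parameter; the heavy work is in the two dense products A\Omega and Q^{\mathsf T}A, which is why single core, multi core and GPU back ends map naturally onto BLAS-3 calls.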
Feb, 22

Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool (open-source code)

To enable the design of large caches, novel memory technologies (such as non-volatile memory) and novel fabrication approaches (e.g. 3D stacking) have been explored. The existing modeling tools, however, cover only a few memory technologies, CMOS technology nodes and fabrication approaches. We present DESTINY, a tool for modeling 3D (and 2D) cache designs using SRAM, […]
Feb, 19

Reproducible Triangular Solvers for High-Performance Computing

On modern parallel architectures, floating-point computations may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by leveraging the standard parallel triangular solver and our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and […]
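To make the reproducibility issue concrete, the sketch below is a plain sequential forward substitution for a dense lower-triangular system L x = b. In a parallel solver the inner accumulation becomes a reduction whose summation order can change from run to run, which is where the non-reproducibility enters; the multi-level exact summation mentioned above replaces that reduction and is not implemented here.

// Sequential forward substitution for L x = b (L lower triangular, dense).
void forward_substitution(int n, const double* L, const double* b, double* x)
{
    for (int i = 0; i < n; ++i) {
        // In a parallel solver this accumulation is a reduction; because
        // floating-point addition is not associative, a different combination
        // order can give a slightly different x[i] on every run.
        double sum = 0.0;
        for (int j = 0; j < i; ++j)
            sum += L[i * n + j] * x[j];
        x[i] = (b[i] - sum) / L[i * n + i];
    }
}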
Feb, 19

Fast, Memory-Efficient Construction of Voxelized Shadows

We present a fast and memory-efficient algorithm for generating Compact Precomputed Voxelized Shadows. By performing much of the common sub-tree merging before identical nodes are ever created, we improve construction times by several orders of magnitude for large data structures and require much less working memory. We also propose a new set of rules […]
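The sub-tree merging described above amounts to hash consing: before a new node is allocated, it is looked up by the identifiers of its children, so identical sub-trees are shared rather than duplicated. A minimal host-side sketch of that idea follows; the node layout and hashing are illustrative assumptions, not the paper's data structures.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Node { uint32_t child[8]; };              // e.g. an octree-style node

// Return the id of an existing identical node, or create a new one.
uint32_t get_or_create(const Node& n,
                       std::unordered_map<uint64_t, uint32_t>& cache,
                       std::vector<Node>& nodes)
{
    uint64_t key = 1469598103934665603ull;        // FNV-style hash of the child ids
    for (uint32_t c : n.child) { key ^= c; key *= 1099511628211ull; }

    auto it = cache.find(key);                    // a real implementation would also
    if (it != cache.end()) return it->second;     // compare the full node to guard
                                                  // against hash collisions
    uint32_t id = static_cast<uint32_t>(nodes.size());
    nodes.push_back(n);
    cache.emplace(key, id);
    return id;
}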
Feb, 19

Auto-tuning Shallow Water Simulations on GPUs

Graphics processing units (GPUs) have gained popularity in scientific computing in recent years because of the massive computing power they provide for parallel tasks. While GPUs are powerful, it is also hard to utilize their power fully. Part of this difficulty comes from the many parameters available, and tuning of […]
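One of the parameters referred to above is the kernel launch configuration. A minimal way to tune it is an exhaustive timing loop over candidate block sizes, as sketched below; the kernel name step_kernel, its arguments and the candidate set are placeholders, not taken from the paper.

// Time a (hypothetical) simulation kernel for several block sizes, keep the fastest.
float best_ms    = 1e30f;
int   best_block = 0;
for (int block : {64, 128, 192, 256, 512}) {
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int rep = 0; rep < 10; ++rep)             // average over repetitions
        step_kernel<<<grid, block>>>(u, v, h, n);  // placeholder kernel and arguments
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_block = block; }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}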
Feb, 19

Memory-efficient Adaptive Subdivision for Software Rendering on the GPU

The adaptive subdivision step for surface tessellation is a key component of the Reyes rendering pipeline. While this operation has been successfully parallelized for execution on the GPU using a breadth-first traversal, the resulting implementations are limited by their high worst-case memory consumption and high global memory bandwidth utilization. This report proposes an alternate strategy […]
Feb, 19

NMF-mGPU: non-negative matrix factorization on multi-GPU systems

BACKGROUND: In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. […]
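NMF approximates a non-negative data matrix V ∈ R_{≥0}^{m×n} by the product of two smaller non-negative factors, V ≈ W H with W ∈ R_{≥0}^{m×k} and H ∈ R_{≥0}^{k×n}. The classical Lee-Seung multiplicative updates for the Frobenius-norm objective are shown below for orientation; whether NMF-mGPU uses exactly this variant is not stated in the excerpt.

\begin{aligned}
H_{kj} &\leftarrow H_{kj}\,\frac{(W^{\mathsf T} V)_{kj}}{(W^{\mathsf T} W H)_{kj}},\\[2pt]
W_{ik} &\leftarrow W_{ik}\,\frac{(V H^{\mathsf T})_{ik}}{(W H H^{\mathsf T})_{ik}}.
\end{aligned}

Each update is built from dense matrix products, which is what makes the method amenable to GPU acceleration.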
Feb, 13

NUPAR: A Benchmark Suite for Modern GPU Architectures

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all […]
Feb, 13

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Emerging hybrid accelerator architectures for high performance computing are often suited for the use of a data-parallel programming model. Unfortunately, programmers of these architectures face a steep learning curve that frequently requires learning a new language (e.g., OpenCL). Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides […]
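The data-parallel pattern such models target is the stencil update. For orientation, the fragment below is a plain CUDA 3-point stencil over a 1D grid, i.e. the kind of kernel (together with halo exchanges across the distributed memory levels) that a higher-level stencil programming model is meant to generate or hide; it does not illustrate the proposed model itself.

// Plain CUDA 3-point stencil over the interior of a 1D grid.
__global__ void stencil_1d(int n, const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}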
Feb, 13

Quadratic Pseudo-Boolean Optimization for Scene Analysis using CUDA

Many problems in early computer vision, like image segmentation, image reconstruction, 3D vision or object labeling, can be modeled by Markov Random Fields (MRFs). General algorithms to optimize an MRF, like Simulated Annealing, Belief Propagation or Iterated Conditional Modes, are either slow or produce low-quality results [Rother 07]. On the other hand, in the […]
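The labeling problems listed above are typically posed as minimizing an energy over per-pixel labels; for binary labels this energy is a quadratic pseudo-Boolean function, which is the class of objective QPBO addresses:

E(\mathbf{x}) \;=\; \sum_{i \in \mathcal{V}} \theta_i(x_i) \;+\; \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(x_i, x_j), \qquad x_i \in \{0, 1\},

where \mathcal{V} is the set of pixels (or sites), \mathcal{E} the set of neighboring pairs, \theta_i the unary (data) terms and \theta_{ij} the pairwise (smoothness) terms.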
