Apr, 26

GPU-Aware Non-contiguous Data Movement In Open MPI

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data […]
Apr, 26

Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This makes it possible to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a […]
Apr, 26

Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging

Many graphics and vision problems are naturally expressed as optimizations with either linear or non-linear least squares objective functions over visual data, such as images and meshes. The mathematical descriptions of these functions are extremely concise, but their implementation in real code is tedious, especially when optimized for real-time performance in interactive applications. We propose […]
Apr, 26

CMA-ES for Hyperparameter Optimization of Deep Neural Networks

Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We […]
Apr, 24

GPL: A GPU-based Pipelined Query Processing Engine

Graphics Processing Units (GPUs) have evolved as a powerful query co-processor for main memory On-Line Analytical Processing (OLAP) databases. However, existing GPU-based query processors adopt a kernel-based execution approach which optimizes individual kernels for resource utilization and executes the GPU kernels involved in the query plan one by one. Such a kernel-based approach cannot utilize […]
Apr, 22

OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges

Benchmarking general-purpose computing on graphics processing unit (GPGPU) aims to profile and compare performance across different devices. Due to the low-level nature of most GPGPU APIs, GPGPU benchmarks are also useful for architectural exploration and program optimization. This can be challenging in mobile devices due to lack of underlying hardware details and limited profiling capabilities […]
Apr, 19

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, standalone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK […]
Apr, 19

A Parallel Solution to Finding Nodal Neighbors in Generic Meshes

In this paper we present a parallel solution for finding the one-ring neighboring nodes and elements of each vertex in generic meshes. Finding nodal neighbors is computationally straightforward but expensive for large meshes. To improve efficiency, parallelism is exploited by utilizing the modern Graphics Processing Unit (GPU). The presented parallel […]
Apr, 19

MIML Learning with CNNs: Yelp Restaurant Photo Classification

We present the conditions of a data science challenge from Kaggle, which can be viewed as a multi-instance multilabel learning problem in the image domain, and describe the official training dataset provided. We discuss our technical approach, and address the challenges of using transfer learning and fine-tuning, trying out different strategies to tackle the […]
Apr, 19

A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms (‘kernels’) expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of ‘performance-instructive’ kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware […]
Apr, 19

LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors

Scan (or prefix sum) is a fundamental and widely used primitive in parallel computing. In this paper, we present LightScan, a faster parallel scan primitive for CUDA-enabled GPUs, which investigates a hybrid model combining intra-block computation and inter-block communication to perform a scan. Our algorithm employs warp shuffle functions to implement fast intra-block computation and […]
Apr, 16

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

Large-scale deep learning requires huge computational resources to train a multi-layer neural network. Recent systems propose using 100s to 1000s of machines to train networks with tens of layers and billions of connections. While the computation involved can be done more efficiently on GPUs than on more traditional CPU cores, training such networks on a […]
* * *


HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors
