high performance computing on graphics processing units: hgpu.org

Posts

Dec, 13

C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++

Capitalize on the faster GPU processors in today’s computers with the C++ AMP code library—and bring massive parallelism to your project. With this practical book, experienced C++ developers will learn parallel programming fundamentals with C++ AMP through detailed examples, code snippets, and case studies. Learn the advantages of parallelism and get best practices for harnessing […]

Dec, 12

Real-Time Grasp Detection Using Convolutional Neural Networks

We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can […]

CUDA

Dec, 12

A Survey Paper on Solving TSP using Ant Colony Optimization on GPU

Ant Colony Optimization (ACO) is meta-heuristic algorithm inspired from nature to solve many combinatorial optimization problem such as Travelling Salesman Problem (TSP). There are many versions of ACO used to solve TSP like, Ant System, Elitist Ant System, Max-Min Ant System, Rank based Ant System algorithm. For improved performance, these methods can be implemented in […]

CUDA

Dec, 12

cuLGT: Lattice Gauge Fixing on GPUs

We adopt CUDA-capable Graphic Processing Units (GPUs) for Landau, Coulomb and maximally Abelian gauge fixing in 3+1 dimensional SU(3) and SU(2) lattice gauge field theories. A combination of simulated annealing and overrelaxation is used to aim for the global maximum of the gauge functional. We use a fine grained degree of parallelism to achieve the […]

CUDA

Dec, 12

Compiler-Level Explicit Cache for a GPGPU Programming Framework

GPU is widely used for high-performance computing. However, standard programming framework such as CUDA and OpenCL requires low-level specifications, thus programming is difficult and the performance is not portable. Therefore, we are developing a new framework named MESI-CUDA. Providing virtual shared variables accessible from both CPU and GPU, MESI-CUDA hides complex memory architecture and eliminates […]

CUDA

Dec, 12

Strong scaling of general-purpose molecular dynamics simulations on GPUs

We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson […]

CUDA

Dec, 9

Theano-based Large-Scale Visual Recognition with Multiple GPUs

In this report, we describe a Theano-based AlexNet (Krizhevsky et al., 2012) implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-art Caffe library (Jia et al., 2014) run on 1 GPU. To the best of our knowledge, this is the first open-source Python-based AlexNet implementation […]

CUDA

Dec, 9

Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors

The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers on extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such […]

Dec, 9

MLitB: Machine Learning in the Browser

With few exceptions, the field of Machine Learning (ML) research has largely ignored the browser as a computational engine. Beyond an educational resource for ML, the browser has vast potential to not only improve the state-of-the-art in ML research, but also, inexpensively and on a massive scale, to bring sophisticated ML learning and prediction to […]

Dec, 9

Risk Estimation Without Using Stein’s Lemma — Application to Image Denoising

Image denoising is a classical problem in image processing and has applications in areas ranging from photography to medical imaging. In this paper, we examine the denoising performance of an optimized spatially-varying Gaussian filter. The parameters of the Gaussian filter are tuned by optimizing a mean squared error estimate which is similar Stein’s Unbiased Risk […]

OpenCL

Dec, 9

Portable OpenCL Out-of-Order Execution Framework for Heterogeneous Platforms

Heterogeneous computing has become a viable option in seeking computing performance, to the side of conventional homogeneous multi-/single-processor approaches. The advantage of heterogeneity is the possibility to choose the best device on the platform for different distinct workloads in the application to gain performance and/or to lower power consumption. The drawback of heterogeneity is the […]

OpenCL

Dec, 8

XIII International Conference on Parallel Processing, ICPP 2015

The ICPP 2015 : XIII International Conference on Parallel Processing is the premier interdisciplinary forum for the presentation of new advances and research results in the fields of Parallel Processing. The conference will bring together leading academic scientists, researchers and scholars in the domain of interest from around the world. Topics of interest for submission […]