
Posts

Nov, 24

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Although the latest high-end smartphones have powerful CPUs and GPUs, running deeper convolutional neural networks (CNNs) for complex tasks such as ImageNet classification on mobile devices is challenging. To deploy deep CNNs on mobile devices, we present a simple and effective scheme to compress the entire CNN, which we call one-shot whole network compression. The […]
Nov, 24

Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed […]
Nov, 20

Recurrent Neural Networks Hardware Implementation on FPGA

Recurrent Neural Networks (RNNs) have the ability to retain memory and learn data sequences, and are a recent breakthrough in machine learning. Due to the recurrent nature of RNNs, it is sometimes hard to parallelize all their computations on conventional hardware. CPUs do not currently offer large parallelism, while GPUs offer limited parallelism due to […]
Nov, 20

Supervised Hashing with Deep Neural Networks

In this paper, we propose training very deep neural networks (DNNs) for supervised learning of hash codes. Existing methods in this context train relatively "shallow" networks limited by the issues arising in back propagation (vanishing gradients) as well as computational efficiency. We propose a novel and efficient training algorithm inspired by alternating direction method of […]
Nov, 20

Large Scale Artificial Neural Network Training Using Multi-GPUs

This paper describes a method for accelerating large-scale Artificial Neural Network (ANN) training on multiple GPUs by reducing the forward and backward passes to matrix multiplication. We propose an out-of-core multi-GPU matrix multiplication and integrate the algorithm with ANN training. The experiments demonstrate that our matrix multiplication algorithm achieves linear speedup on multiple inhomogeneous […]
Nov, 20

GPU-accelerated adjoint algorithmic differentiation

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store […]
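As a rough illustration of the adjoint (reverse-mode) idea in the abstract above, here is a minimal sketch in plain Python. The `Var` class and the `sin` and `backward` helpers are hypothetical names invented for this example and are not taken from the paper. Every intermediate node is kept alive until the backward sweep, which is the memory cost the abstract alludes to.

```python
import math


class Var:
    """Scalar node that records its parents and local partial derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.adjoint = 0.0      # d(output)/d(this node), filled by backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(x):
    """Sine of a Var; the local derivative is cos(x)."""
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Propagate adjoints from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Post-order depth-first traversal: parents are appended before children.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.adjoint = 1.0
    # Reversed post-order processes each node before any of its parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += local * node.adjoint


# Example cost function f(x, y) = x * y + sin(x)
x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)
# x.adjoint now holds df/dx = y + cos(x); y.adjoint holds df/dy = x
```

One backward sweep yields the gradient with respect to all inputs, which is why the approach suits cost functions with many parameters and a single scalar output.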
Nov, 20

GPU-Based Inverse Rendering With Multi-Objective Particle Swarm Optimization

We present a novel, GPU-accelerated per-pixel inverse rendering (IR) optimization algorithm based on Particle Swarm Optimization (PSO), IRPSO. IRPSO estimates the per-pixel scene attributes including reflectance properties of a 3D model, and is fast enough to do in situ visualization of the optimization in real-time. We utilize the GPU framebuffer as a computational domain, where […]
Nov, 13

Fast Neuromimetic Object Recognition using FPGA Outperforms GPU Implementations

Recognition of objects in still images has traditionally been regarded as a difficult computational problem. Although modern automated methods for visual object recognition have achieved steadily increasing recognition accuracy, even the most advanced computational vision approaches are unable to obtain performance equal to that of humans. This has led to the creation of many biologically-inspired […]
Nov, 13

GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication

The generic matrix-matrix multiplication (GEMM) is arguably the most popular computational kernel of the 20th century. Yet, surprisingly, no common methodology for evaluating GEMM performance has been established over the many decades of using GEMM for comparing architectures, compilers and ninja-class programmers. We introduce GEMMbench, a framework and methodology for evaluating performance of GEMM implementations. […]
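For readers unfamiliar with the kernel, the operation usually meant by GEMM is C ← αAB + βC. The following is a minimal pure-Python reference sketch, not code from GEMMbench; the function names `gemm` and `gemm_flops` are made up for this example. The operation count of 2·m·n·k is a common convention when reporting GEMM throughput, stated here as an assumption about typical practice rather than as the framework's methodology.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if (any(len(row) != k for row in a)
            or len(c) != m
            or any(len(row) != n for row in c)):
        raise ValueError("incompatible matrix shapes")
    # Start from the scaled copy of C, then accumulate the product into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


def gemm_flops(m, n, k):
    """Operation count commonly attributed to one GEMM call (one multiply and one add per term)."""
    return 2 * m * n * k


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 1.0, c)  # 2 * (a @ b) + c
```

A reference version like this is useful mainly as a correctness baseline against which optimized implementations can be checked.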
Nov, 13

Accelerating Recommender Systems using GPUs

We describe GPU implementations of the matrix recommender algorithms CCD++ and ALS. We compare the processing time and predictive ability of the GPU implementations with existing multi-core versions of the same algorithms. Results on the GPU are better than the results of the multi-core versions (maximum speedup of 14.8).
Nov, 13

Accelerating Adaptive IDW Interpolation Algorithm on a Single GPU

This paper focuses on the design and implementation of a GPU-accelerated Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm. AIDW is an improved version of standard IDW that adaptively determines the power parameter according to the spatial distribution pattern of the points and achieves more accurate predictions than IDW. In this paper, we first […]
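To make the role of the power parameter concrete, here is a minimal sketch of standard IDW in plain Python, where each known value is weighted by 1/d^p. The function name `idw_interpolate` and its arguments are assumptions for illustration. The power is simply passed in as a fixed value; the adaptive selection that distinguishes AIDW is not reproduced, since the abstract does not give its details.

```python
import math


def idw_interpolate(points, values, query, power=2.0):
    """Standard IDW: weighted mean of known values with weights 1 / distance**power."""
    num = 0.0
    den = 0.0
    for (x, y), v in zip(points, values):
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return v  # query coincides with a data point, so return its value
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den


points = [(0.0, 0.0), (2.0, 0.0)]
values = [1.0, 3.0]
midpoint = idw_interpolate(points, values, (1.0, 0.0))        # equal weights
near_first = idw_interpolate(points, values, (0.5, 0.0))      # pulled toward 1.0
```

A larger power makes the estimate follow the nearest points more closely, which is why choosing it per query point according to the local point pattern can matter.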
Nov, 13

A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors

To meet the needs of a diverse range of workloads, asymmetric multicore processors (AMPs) have been proposed, which feature cores with different microarchitectures or ISAs. However, given the diversity inherent in their design and application scenarios, several challenges need to be addressed to effectively architect AMPs and leverage their potential in optimizing both sequential and parallel […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
