Posts

Apr, 25

A Survey of Techniques for Dynamic Branch Prediction

The branch predictor (BP) is an essential component in modern processors, since high BP accuracy can improve performance and reduce energy consumption by decreasing the number of instructions executed on the wrong path. However, reducing the latency and storage overhead of the BP while maintaining high accuracy presents significant challenges. In this paper, we present a survey of dynamic branch prediction […]
Apr, 22

Fast inference of deep neural networks in FPGAs for particle physics

Recent results at the Large Hadron Collider (LHC) have pointed to enhanced physics capabilities through the improvement of the real-time event processing techniques. Machine learning methods are ubiquitous and have proven to be very powerful in LHC physics, and particle physics as a whole. However, exploration of the use of such techniques in low-latency, low-power […]
Apr, 22

CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU

Neural networks have been successfully used in many applications. Due to their computational complexity, it is difficult to implement them on embedded devices. Neural networks are inherently approximate and thus can be simplified. In this paper, CANNA proposes a gradual training approximation which adaptively sets the level of hardware approximation depending on the neural network’s […]
Apr, 22

CytonRL: an Efficient Reinforcement Learning Open-source Toolkit Implemented in C++

This paper presents an open-source reinforcement learning toolkit named CytonRL. The toolkit implements four recent advanced deep Q-learning algorithms from scratch using C++ and NVIDIA’s GPU-accelerated libraries. The code is simple and elegant, owing to an open-source general-purpose neural network library named CytonLib. Benchmarks show that the toolkit achieves competitive performance on the popular Atari […]
Apr, 22

mu-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably, depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often […]
Apr, 22

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on […]
Apr, 15

DLL: A Blazing Fast Deep Neural Network Library

Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). It also has very comprehensive support for Restricted Boltzmann Machines (RBMs) and Convolutional RBMs. Our main motivation for this […]
Apr, 15

Automatic Optimization of OpenCL-Based Stencil Codes for FPGAs and Its Evaluation

Recently, a C-based OpenCL design environment has been proposed for designing FPGA (field-programmable gate array) accelerators. Although many C programs can be executed on FPGAs, the best C code for a CPU may not be the most appropriate for an FPGA. Users must have some knowledge of computer architecture in order to write good OpenCL code. […]
Apr, 15

Implementing Push-Pull Efficiently in GraphBLAS

We factor Beamer’s push-pull, also known as direction-optimized breadth-first search (DOBFS), into three separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We […]
Apr, 15

G-NET: Effective GPU Sharing in NFV Systems

Network Function Virtualization (NFV) virtualizes software network functions to offer flexibility in their design, management and deployment. Although GPUs have demonstrated their power in significantly accelerating network functions, they have not been effectively integrated into NFV systems for the following reasons. First, GPUs are severely underutilized in NFV systems with existing GPU virtualization approaches. Second, […]
Apr, 15

Towards a Unified CPU-GPU code hybridization: A GPU Based Optimization Strategy Efficient on Other Modern Architectures

In this paper, we suggest a different methodology to shorten the code optimization development time while getting a unified code with good performance on different targeted devices. In the scope of this study, experiments are illustrated on a Discontinuous Galerkin code applied to Computational Fluid Dynamics. Tests are performed on CPUs, KNL Xeon-Phi and GPUs […]
Apr, 7

Evaluating Performance Tradeoffs on the Radeon Open Compute Platform

GPUs have been shown to deliver impressive computing performance, while also providing high energy efficiency, across a wide range of high-performance and embedded system workloads. However, limited support for efficient communication and synchronization between the CPU and the GPU impacts our ability to fully exploit the benefits of heterogeneous systems. Recently, the Heterogeneous System Architecture […]

HGPU group © 2010-2018 hgpu.org

All rights belong to the respective authors
