high performance computing on graphics processing units: hgpu.org

As process-scaling trends make the memory system an even more crucial bottleneck, the importance of latency-hiding techniques such as prefetching grows further. However, naive use of prefetching can harm performance and energy efficiency; hence, several factors and parameters need to be taken into account to fully realize its potential. In this paper, we survey several recent techniques that aim to improve the implementation and effectiveness of prefetching. We characterize the techniques on several parameters to highlight their similarities and differences. The aim of this survey is to provide researchers with insights into the working of prefetching techniques and to spark interesting future work for improving the performance advantages of prefetching even further.

March 25, 2016 by sparsh0mittal

Recurrent neural networks for language modeling

Emil Sauer Lynge

Tags: Computer science, CUDA, Deep learning, LSTM, Neural networks, NLP, nVidia, nVidia GeForce GTX Titan X, Package, Python, RNN, Thesis

March 22, 2016 by hgpu

DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

* * *

Applications

A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

A generalized GPU-based connected component labeling algorithm

Generic Inverted Index on the GPU

A Stencil DSEL for Single Code Accelerated Computing with SYCL

GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

Wanted: Floating-Point Add Round-off Error instruction

Accelerating Deep Neural Network Training with Inconsistent Stochastic Gradient Descent

An Efficient Implementation of the Longest Common Subsequence Algorithm with Bit-Parallelism on GPUs

A mixed precision semi-Lagrangian algorithm and its performance on accelerators

A Survey of Recent Prefetching Techniques for Processor Caches

Recurrent neural networks for language modeling

Recent source codes

DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective

HPX: a C++ Standard Library for Concurrency and Parallelism

TorchQC: Quantum Dynamics and Machine Learning

Matrix multiplication using Tensor Cores in CUDA

CPP Joules: Energy Measurement tool for CPP/CUDA programs

HPC-Coder-v2

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

SW#SYCL

tdg-benchs: benchmarks used to test the performance of taskgraph

LLOR: Automatic Repair of OpenMP Programs

Most viewed papers (last 30 days)