Posts

Aug, 19

Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computationally expensive nature of the convolution operation has led to a proliferation of implementations, including the matrix-matrix multiplication formulation and direct convolution primarily targeting […]
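
To make the matrix-matrix multiplication formulation mentioned above concrete, here is a minimal single-channel NumPy sketch in which patches are gathered with an im2col step and the convolution reduces to one GEMM. The function names and shapes are illustrative; this is not the paper's implementation.

import numpy as np

def im2col(x, kh, kw):
    """Gather every kh x kw patch of a single-channel image into a column."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, oh, ow

def conv2d_as_gemm(x, kernels):
    """kernels: (num_filters, kh, kw); returns (num_filters, oh, ow)."""
    nf, kh, kw = kernels.shape
    cols, oh, ow = im2col(x, kh, kw)          # (kh*kw, oh*ow)
    w = kernels.reshape(nf, kh * kw)          # (nf, kh*kw)
    return (w @ cols).reshape(nf, oh, ow)     # one GEMM performs the convolution

x = np.random.rand(8, 8).astype(np.float32)
k = np.random.rand(4, 3, 3).astype(np.float32)
print(conv2d_as_gemm(x, k).shape)             # (4, 6, 6)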
Aug, 19

CosmoFlow: Using Deep Learning to Learn the Universe at Scale

Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading […]
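
As a rough illustration of what a 3D pooling primitive does (the paper's optimized, threaded kernels are not shown here), a non-overlapping max pool can be written in a few lines of NumPy; the helper name and pool size are made up for this sketch.

import numpy as np

def max_pool3d(vol, p=2):
    """Non-overlapping p x p x p max pooling; volume dims must be divisible by p."""
    d, h, w = vol.shape
    blocks = vol.reshape(d // p, p, h // p, p, w // p, p)
    return blocks.max(axis=(1, 3, 5))

vol = np.random.rand(8, 8, 8).astype(np.float32)
print(max_pool3d(vol).shape)   # (4, 4, 4)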
Aug, 19

Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Matrix factorization (MF) discovers latent features from observations and has shown great promise in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability, e.g. ease of handling positive-unlabeled inputs, fast convergence, and parallelization […]
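
For context, one ALS sweep alternately solves two ridge-regularized least-squares problems with the other factor held fixed. The NumPy sketch below assumes a dense, fully observed matrix R ≈ U @ V.T and omits the paper's GPU, memory-optimization, and approximate-computing techniques; all names are illustrative.

import numpy as np

def als_sweep(R, U, V, lam=0.1):
    """Update U with V fixed, then V with U fixed (ridge-regularized)."""
    k = U.shape[1]
    I = lam * np.eye(k)
    # Solve (V.T V + lam I) u_i = V.T r_i for every row r_i of R, and symmetrically for V.
    U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + I, U.T @ R).T
    return U, V

rng = np.random.default_rng(0)
R = rng.random((100, 80))
U = rng.random((100, 8))
V = rng.random((80, 8))
for _ in range(20):
    U, V = als_sweep(R, U, V)
print(np.linalg.norm(R - U @ V.T) / np.linalg.norm(R))   # relative error after the sweeps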
Aug, 19

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and big data platforms to facilitate execution of workloads that demand high energy efficiency. They present unique interfaces and programming models and therefore pose several limitations, which must […]
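
The out-of-core pattern such a library supports can be caricatured as streaming chunks that fit a fixed device buffer through the accelerator. In the sketch below, NumPy stands in for the device, and the buffer size, function name, and kernel are invented for illustration; the library's actual API is not shown.

import numpy as np

DEVICE_BUFFER_ELEMS = 1 << 16            # pretend the accelerator holds 64K floats

def out_of_core_scale(host_data, factor):
    """Scale a large host array chunk by chunk through a small 'device' buffer."""
    out = np.empty_like(host_data)
    for start in range(0, host_data.size, DEVICE_BUFFER_ELEMS):
        end = min(start + DEVICE_BUFFER_ELEMS, host_data.size)
        device_chunk = host_data[start:end].copy()   # host -> device transfer
        device_chunk *= factor                       # kernel runs on the resident chunk
        out[start:end] = device_chunk                # device -> host transfer
    return out

x = np.arange(1_000_000, dtype=np.float32)
print(np.allclose(out_of_core_scale(x, 2.0), 2.0 * x))   # True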
Aug, 11

A Cross-platform Evaluation of Graphics Shader Compiler Optimization

For real-time graphics applications such as games and virtual reality, performance is crucial to provide a smooth user experience. Central to this is the performance of shader programs which render images on the GPU. The rise of low-level graphics APIs such as Vulkan means compilation tools play an increasingly important role in the graphics ecosystem. […]
Aug, 11

GPU parallelization of a hybrid pseudospectral fluid turbulence framework using CUDA

An existing hybrid MPI-OpenMP scheme is augmented with a CUDA-based fine-grained parallelization approach for multidimensional distributed Fourier transforms, in a well-characterized pseudospectral fluid turbulence code. Basics of the hybrid scheme are reviewed, and heuristics are provided to show a potential benefit of the CUDA implementation. The method draws heavily on the CUDA runtime library to […]
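
The pseudospectral idea underlying the code is that spatial derivatives become multiplications by ik in Fourier space. The single-core NumPy sketch below shows that step on one periodic field; the paper's MPI/OpenMP/CUDA distribution of the transforms is not reproduced, and the grid size is arbitrary.

import numpy as np

n = 256
L = 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
u = np.sin(3 * x)                                  # test field with known derivative

k = np.fft.fftfreq(n, d=L / n) * 2 * np.pi         # angular wavenumbers
du = np.fft.ifft(1j * k * np.fft.fft(u)).real      # derivative via the spectral rule

print(np.max(np.abs(du - 3 * np.cos(3 * x))))      # ~1e-13: spectral accuracy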
Aug, 11

A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software

In recent years, OpenCL has been increasingly adopted as it enables software programmers to harness the performance and power efficiency of FPGAs. Although OpenCL simplifies the FPGA programming challenge, achieving high performance and energy efficiency with it is still a difficult task. To further contribute to advancing the use of OpenCL for FPGAs, […]
Aug, 11

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets, then transfer the knowledge gained from these models to a variety of tasks. Following [Radford 2017], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic […]
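
Mixed precision training of this kind is usually described as FP16 compute with a loss-scaling factor and FP32 master weights. The toy sketch below applies that recipe to a linear model in NumPy purely for illustration; the model, scale value, and learning rate are assumptions, not the paper's configuration.

import numpy as np

# Illustrative only: a toy linear model stands in for the paper's RNN, and
# NumPy float16 stands in for GPU half-precision tensors.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16)).astype(np.float16)
y = (X.astype(np.float32) @ rng.standard_normal(16)).astype(np.float16)

w_master = np.zeros(16, dtype=np.float32)             # FP32 master copy of the weights
loss_scale, lr = 1024.0, 0.1

for step in range(200):
    w16 = w_master.astype(np.float16)                 # cast weights down for compute
    err = X @ w16 - y                                 # FP16 forward pass
    scale16 = np.float16(loss_scale / len(X))         # loss scale folded with the 1/N factor
    grad16 = X.T @ (err * scale16)                    # scaled FP16 gradients
    grad = grad16.astype(np.float32) / loss_scale     # unscale in FP32
    w_master -= lr * grad                             # FP32 update preserves tiny steps

mse = np.mean((X.astype(np.float32) @ w_master - y.astype(np.float32)) ** 2)
print(float(mse))                                     # small; limited by FP16 rounding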
Aug, 11

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in machine learning (ML). ML frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to help ML researchers train their models in a distributed fashion. However, correctly and efficiently utilizing multiple machines and GPUs is […]
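
The synchronous data parallelism that such frameworks automate boils down to: each replica computes gradients on its own shard, the gradients are averaged (an all-reduce), and every replica applies the same update. The single-process NumPy loop below only mimics that communication pattern; the shard count and model are made up.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1024, 8))
y = X @ rng.standard_normal(8)

num_replicas, lr = 4, 0.1
shards = np.array_split(np.arange(len(X)), num_replicas)
w = np.zeros(8)                                  # identical weights on every replica

for step in range(100):
    grads = []
    for shard in shards:                         # each replica works on its own shard
        Xi, yi = X[shard], y[shard]
        grads.append(Xi.T @ (Xi @ w - yi) / len(shard))
    w -= lr * np.mean(grads, axis=0)             # "all-reduce": average, then update

print(float(np.mean((X @ w - y) ** 2)))          # loss approaches zero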
Aug, 5

Energy-based Tuning of Convolutional Neural Networks on Multi-GPUs

Deep Learning (DL) applications are gaining momentum in the realm of Artificial Intelligence, particularly since GPUs demonstrated a remarkable ability to accelerate their computationally challenging workloads. Within this context, Convolutional Neural Network (CNN) models constitute a representative example of success on a wide set of complex applications, particularly on datasets where the target can be […]
Aug, 5

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance of individual nodes of the computation graph, while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to […]
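
Inter-op parallelism of the kind described here can be sketched as dispatching every node whose inputs are ready to a thread pool in waves. The toy graph, node names, and scheduler below are illustrative, not the paper's system.

from concurrent.futures import ThreadPoolExecutor

# node -> (list of input nodes, function computing the node from its inputs)
graph = {
    "a": ([], lambda: 1),
    "b": ([], lambda: 2),
    "c": (["a", "b"], lambda a, b: a + b),   # c and d are independent of each other
    "d": (["a"], lambda a: a * 10),
    "e": (["c", "d"], lambda c, d: c + d),
}

def run(graph, num_threads=4):
    done, results = set(), {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while len(done) < len(graph):
            # Every node whose inputs are ready forms one parallel wave.
            ready = [n for n, (deps, _) in graph.items()
                     if n not in done and all(d in done for d in deps)]
            futures = {n: pool.submit(graph[n][1], *[results[d] for d in graph[n][0]])
                       for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                done.add(n)
    return results

print(run(graph))   # {'a': 1, 'b': 2, 'c': 3, 'd': 10, 'e': 13}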
Aug, 5

GPU schedulers: how fair is fair enough?

Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do […]
