Posts

Oct, 20

Characterizing Deep Learning Training Workloads on Alibaba-PAI

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing […]
Oct, 20

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency […]
Oct, 13

Accelerated Approximate Nearest Neighbors Search Through Hierarchical Product Quantization

A fundamental recurring task in many machine learning applications is the search for the Nearest Neighbor in high dimensional metric spaces. Towards answering queries in large scale problems, state-of-the-art methods employ Approximate Nearest Neighbors (ANN) search, a search that returns the nearest neighbor with high probability, as well as techniques that compress the dataset. Product-Quantization […]
Oct, 13

Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations […]
Oct, 13

Performance Evaluation of Blocking and Non-Blocking Concurrent Queues on GPUs

The efficiency of concurrent data structures is crucial to the performance of multithreaded programs in shared-memory systems. The arbitrary execution of concurrent threads, however, can result in incorrect behavior of these data structures. Graphics Processing Units (GPUs) have appeared as a powerful platform for high-performance computing. As regular data-parallel computations are straightforward to implement […]
Oct, 13

hlslib: Software Engineering for Hardware Design

High-level synthesis (HLS) tools have brought FPGA development into the mainstream, by allowing programmers to design architectures using familiar languages such as C, C++, and OpenCL. While the move to these languages has brought significant benefits, many aspects of traditional software engineering are still unsupported, or not exploited by developers in practice. Furthermore, designing reconfigurable […]
Oct, 13

Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessarily the raw […]
Oct, 6

Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures

3D visual computing data are often spatially sparse. To exploit such sparsity, people have developed hierarchical sparse data structures, such as multilevel sparse voxel grids, particles, and 3D hash tables. However, developing and using these high-performance sparse data structures is challenging, due to their intrinsic complexity and overhead. We propose Taichi, a new data-oriented programming […]
Oct, 6

Verification of GPU Program Optimizations in Lean

Graphics processing units (GPUs) have become of major importance for high-performance computing due to their high throughput. To get the best possible performance, GPU programs are frequently optimized. However, every optimization carries the risk of introducing bugs. In this thesis, we present a framework for the theorem prover Lean to formally verify transformations of GPU […]
Oct, 6

waLBerla: A block-structured high-performance framework for multiphysics simulations

Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building […]
Oct, 6

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

Trending applications such as AI and data analytics have mandated the use of GPUs in modern datacenters for performance reasons. Current practice is to dedicate GPUs to applications, which limits the number of concurrent users to the number of available GPUs. That use of GPUs contradicts the policy of datacenters to oversubscribe resources and accommodate as […]
Oct, 6

MIOpen: An Open Source Library For Deep Learning Primitives

Deep learning has established itself as a common occurrence in the business lexicon. The unprecedented success of deep learning in recent years can be attributed to the abundance of data, the availability of gargantuan compute capabilities offered by GPUs, and the adoption of an open-source philosophy by researchers and industry. Deep neural networks can be decomposed into […]

* * *

HGPU group © 2010-2019 hgpu.org

All rights belong to the respective authors
