high performance computing on graphics processing units: hgpu.org

Posts

Apr, 7

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video […]

CUDA

Mar, 31

Methods for Accelerating Machine Learning in High Performance Computing

Driven by massive dataset corpuses and advances and programmability in accelerator architectures, such as GPUs and FPGAs, machine learning (ML) has delivered remarkable, human-like accuracy in tasks such as image recognition, machine translation and speech processing. Although ML has improved accuracy in selected human tasks, the time to train models can range from hours to […]

CUDA

Mar, 31

Dynamic Application Autotuning for Self-Aware Approximate Computing

In the autonomic computing context, we perceive the system as an ensemble of autonomous elements capable of self-managing, where endusers define high-level goals and the system shall adapt to achieve the desired behaviour. This runtime adaptation creates several optimisation opportunities, especially if we consider approximate computing applications, where it is possible to trade off the […]

OpenCL

Mar, 31

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

The combined impact of new computing resources and techniques with an increasing avalanche of large datasets, is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In the recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these […]

CUDA

•

OpenCL

Mar, 31

Hybrid CPU-GPU execution support in the skeleton programming framework SkePU

In this paper, we present a hybrid execution backend for the skeleton programming framework SkePU. The backend is capable of automatically dividing the workload and simultaneously executing the computation on a multi-core CPU and any number of accelerators, such as GPUs. We show how to efficiently partition the workload of skeletons such as Map, MapReduce, […]

CUDA

•

OpenCL

Mar, 31

HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators

With the ever-increasing amount of data and input variations, portable performance is becoming harder to exploit on today’s architectures. Computational setups utilize single-chip processors, such as GPUs or large-scale multicores for graph analytics. Some algorithm-input combinations perform more efficiently when utilizing a GPU’s higher concurrency and bandwidth, while others perform better with a multicore’s stronger […]

OpenCL

Mar, 24

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural networks (DNNs) training on Sunway TaihuLight, the current fastest supercomputer in the world that adopts a unique many-core heterogeneous architecture, with 40,960 SW26010 processors connected through a customized communication network. First, we point out some insightful principles to fully […]

CUDA

Mar, 24

Surface Compression Using Dynamic Color Palettes

Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer surfaces are compressed to reduce the bandwidth consumed on surface manipulation when rendering or displaying. In […]

OpenGL

Mar, 24

The ANTAREX Domain Specific Language for High Performance Computing

The ANTAREX project relies on a Domain Specific Language (DSL) based on Aspect Oriented Programming (AOP) concepts to allow applications to enforce extra functional properties such as energy-efficiency and performance and to optimize Quality of Service (QoS) in an adaptive way. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well […]

OpenCL

Mar, 24

Accelerating ternary quantized convolutional neural networks using OpenCL for FPGA

FPGAs balance the reprogammability of CPUs and the performance of ASICs. They seem the perfect solution to increase the throughput of neural networks. However they must also prove to be highly competitive in a market dominated by GPUs. To achieve this, we focus on the strength of FPGAs that cannot be taken advantage of on […]

OpenCL

Mar, 24

Dissecting the NVidia Turing T4 GPU via Microbenchmarking

In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August […]

CUDA

Mar, 17

Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms

Heterogeneity has turned into one of the most profound and challenging characteristics of today’s HPC environments. Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and […]

CUDA

•

OpenCL