high performance computing on graphics processing units: hgpu.org

Posts

Apr, 7

High Performance Monte Carlo Simulation of Ising Model on TPU Clusters

Large-scale deep neural networks have profited from an emerging class of AI accelerators. Although the accelerators are specialized for machine learning, some of their designs are general enough for other compute-intensive applications. Cloud TPU, as one of them, offers tremendous computing resources and is easily accessible through TensorFlow by expressing the computation in a […]

CUDA

Apr, 7

The Study of the OpenCL Processing Models for the FPGA Devices

In our study, we present the results of implementing the SHA-512 algorithm on FPGAs. The distinguishing element of our work is that we used OpenCL for FPGA, a relatively new development method for reconfigurable logic. We examine loop unrolling as an OpenCL performance optimization method and compare the […]

OpenCL

Apr, 7

Full-System Simulation of Mobile CPU/GPU Platforms

Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is […]

OpenCL

Apr, 7

TonY: An Orchestrator for Distributed Machine Learning Jobs

Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, […]

Apr, 7

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video […]

CUDA

Mar, 31

Methods for Accelerating Machine Learning in High Performance Computing

Driven by massive dataset corpora and by advances in, and the programmability of, accelerator architectures such as GPUs and FPGAs, machine learning (ML) has delivered remarkable, human-like accuracy in tasks such as image recognition, machine translation and speech processing. Although ML has improved accuracy in selected human tasks, the time to train models can range from hours to […]

CUDA

Mar, 31

Dynamic Application Autotuning for Self-Aware Approximate Computing

In the autonomic computing context, we perceive the system as an ensemble of autonomous elements capable of self-managing, where end users define high-level goals and the system adapts to achieve the desired behaviour. This runtime adaptation creates several optimisation opportunities, especially if we consider approximate computing applications, where it is possible to trade off the […]

OpenCL

Mar, 31

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these […]

CUDA • OpenCL

Mar, 31

Hybrid CPU-GPU execution support in the skeleton programming framework SkePU

In this paper, we present a hybrid execution backend for the skeleton programming framework SkePU. The backend is capable of automatically dividing the workload and simultaneously executing the computation on a multi-core CPU and any number of accelerators, such as GPUs. We show how to efficiently partition the workload of skeletons such as Map, MapReduce, […]

CUDA • OpenCL

Mar, 31

HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators

With the ever-increasing amount of data and input variations, portable performance is becoming harder to exploit on today’s architectures. Computational setups utilize single-chip processors, such as GPUs or large-scale multicores for graph analytics. Some algorithm-input combinations perform more efficiently when utilizing a GPU’s higher concurrency and bandwidth, while others perform better with a multicore’s stronger […]

OpenCL

Mar, 24

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, the current fastest supercomputer in the world, which adopts a unique many-core heterogeneous architecture with 40,960 SW26010 processors connected through a customized communication network. First, we point out some insightful principles to fully […]

CUDA

Mar, 24

Surface Compression Using Dynamic Color Palettes

Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer surfaces are compressed to reduce the bandwidth consumed on surface manipulation when rendering or displaying. In […]

OpenGL

* * *

Recent source codes

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management
