high performance computing on graphics processing units: hgpu.org

Posts

Sep, 6

Performance portability through machine learning guided kernel selection in SYCL libraries

Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware, however these techniques are typically focused on finding optimal kernel parameters for particular input sizes and parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so […]

Sep, 6

Hash-Based Authentication Revisited in the Age of High-Performance Computers

Hash-based authentication is a widespread technique for protecting passwords in many modern software systems including databases. A hashing function is a one-way mathematical function that is used in various security contexts in this domain. In this paper, we revisit three popular hashing algorithms (MD5, SHA-1, and NTLM), that are considered weak or insecure. More specifically, […]

CUDA

•

OpenCL

Sep, 6

Array-Oriented Languages and Polyhedral Compilation

Collection-oriented languages are characterised by their provision of a set of primitives which work on data in aggregate, and are intrinsically suited to parallelisation. Polyhedral compilation systems aim at optimising aggregate computations using an abstract, mathematical representation. However, they are typically run as optimisation passes not over high-level programmes representing aggregate computations, but over the […]

Sep, 6

A Distributed Architecture for Smart Recycling Using Machine Learning

Recycling is vital for a sustainable and clean environment. Developed and developing countries are both facing the problem of solid management waste and recycling issues. Waste classification is a good solution to separate the waste from the recycle materials. In this work, we propose a cloud based classification algorithm for automated machines in recycling factories […]

CUDA

Sep, 6

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA

Decision trees are machine learning models commonly used in various application scenarios. In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets due to their stringent data storage requirement. Online decision tree learning algorithms have been devised to tackle this problem by concurrently training with incoming samples […]

Aug, 30

Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer

JavaScript (JS) engine vulnerabilities pose significant security threats affecting billions of web browsers. While fuzzing is a prevalent technique for finding such vulnerabilities, there have been few studies that leverage the recent advances in neural network language models (NNLMs). In this paper, we present Montage, the first NNLM-guided fuzzer for finding JS engine vulnerabilities. The […]

Aug, 30

Machine Learning in Compilers: Past, Present and Future

Writing optimising compilers is difficult. The range of programs that may be presented to the compiler is huge and the systems on which they run are complex, heterogeneous, non-deterministic, and constantly changing. The space of possible optimisations is also vast, making it very hard for compiler writers to design heuristics that take all of these […]

Aug, 30

Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user […]

CUDA

Aug, 30

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is new checkpoint-restart solution for fault tolerance that […]

CUDA

Aug, 30

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

Performance optimization can be a daunting task especially as the hardware architecture becomes more and more complex. This paper takes a kernel from the Materials Science code BerkeleyGW, and demonstrates a few performance analysis and optimization techniques. Despite challenges such as high register usage, low occupancy, complex data access patterns, and the existence of several […]

Aug, 27

The 18th International Conference on High Performance Computing & Simulation (HPCS), 2020

The 2020 International Conference on High Performance Computing & Simulation (HPCS 2020) will be held on December 10-14, 2020 in Barcelona, Spain (virtually). Under the theme of “HPC and Modeling & Simulation for the 21st Century,” HPCS 2020 will focus on a wide range of the state-of-the-art as well as emerging topics pertaining to high […]

Aug, 23

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs

Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Performance portability through machine learning guided kernel selection in SYCL libraries

Hash-Based Authentication Revisited in the Age of High-Performance Computers

Array-Oriented Languages and Polyhedral Compilation

A Distributed Architecture for Smart Recycling Using Machine Learning

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA

Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer

Machine Learning in Compilers: Past, Present and Future

Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

The 18th International Conference on High Performance Computing & Simulation (HPCS), 2020

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)