high performance computing on graphics processing units: hgpu.org

Posts

Sep, 18

International Joint Conference on Computer Vision and Pattern Recognition (CCVPR), 2018

CCVPR 2018 welcomes researchers, engineers, scientists and industry professionals to an open forum where advances in the field of Computer Vision and Pattern Recognition can be shared and examined. The conference is an ideal platform for keeping up with advances and changes in a constantly evolving field. Publication and Indexing: All accepted papers will be published […]

Sep, 16

Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices. Since such systems are where some of their most useful applications lie (e.g. obstacle detection for mobile robots, vision-based medical assistive technology), significant bodies of work from both machine learning and systems communities have attempted to provide […]

OpenCL

Sep, 16

A deep learning approach to autonomous lunar landing

Over the past few years, new Machine Learning techniques have come to play a central role in the huge field of Artificial Intelligence (AI), proving to be very powerful and versatile. For this reason, they are expected to become protagonists of space applications, and they are already under study. Thanks to the large availability of […]

Sep, 16

Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications

Medical applications challenge today’s text categorization techniques by demanding both high accuracy and ease-of-interpretation. Although deep learning has provided a leap ahead in accuracy, this leap comes at the sacrifice of interpretability. To address this accuracy-interpretability challenge, we here introduce, for the first time, a text categorization approach that leverages the recently introduced Tsetlin Machine. […]

CUDA

Sep, 16

Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms

Gradient boosted decision trees (GBDTs) have seen widespread adoption in academia, industry and competitive data science due to their state-of-the-art performance in a wide variety of machine learning tasks. In this paper, we present an extensive empirical comparison of XGBoost, LightGBM and CatBoost, three popular GBDT algorithms, to aid the data science practitioner in the […]

CUDA

Sep, 16

ZUCL: A ZYNQ UltraScale+ Framework for OpenCL HLS Applications

In this work, we propose the ZUCL framework for implementing and running OpenCL applications on the latest Xilinx ZYNQ UltraScale+ platform. ZUCL is a holistic framework addressing the FPGA OS infrastructure, high-level synthesis (HLS) module implementation, and runtime management. ZUCL enables partial reconfiguration (PR) on this platform by providing an […]

OpenCL

Sep, 9

Efficient and Scalable k-Means on GPUs

k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We identify two main shortcomings of this approach. First, it requires expensive data exchange between […]

OpenCL

Sep, 9

Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU

Sparse matrix-vector multiplication (SpMV) can be used to solve linear systems and eigenvalue problems of diverse scale that arise in numerous and varied scientific applications. One such application is Configuration Interaction (CI). CI is a linear method for solving the non-relativistic Schroedinger equation for quantum chemical multi-electron systems, […]

CUDA

Sep, 9

Doctor AI: Interpretable Deep Learning for Modeling Electronic Health Records

Deep learning has recently shown superior performance in complex domains such as computer vision, audio processing and natural language processing compared to traditional statistical methods. Naturally, deep learning techniques, combined with the large volumes of electronic health record (EHR) data generated by healthcare organizations, have the potential to bring dramatic changes to the healthcare industry. However, typical deep […]

CUDA

Sep, 9

Using SIMD and SIMT vectorization to evaluate sparse chemical kinetic Jacobian matrices and thermochemical source terms

Accurately predicting key combustion phenomena in reactive-flow simulations, e.g., lean blow-out, extinction/ignition limits and pollutant formation, necessitates the use of detailed chemical kinetics. The large size and high levels of numerical stiffness typically present in chemical kinetic models relevant to transportation/power-generation applications make the efficient evaluation/factorization of the chemical kinetic Jacobian and thermochemical source-terms critical […]

OpenCL

Sep, 9

Cracks in the Sky: Abelian-Higgs Cosmic String Evolution with CUDA

Topological defects form at cosmological phase transitions by the Kibble mechanism, with cosmic strings and superstrings having the most interesting phenomenology. A rigorous analysis of their astrophysical consequences is limited by the availability of accurate numerical simulations, and therefore by hardware resources and computation time. Improving the speed and efficiency of existing codes is therefore […]

CUDA

Sep, 2

Optimizing Communication for Clusters of GPUs

GPUs are frequently used to accelerate data-parallel workloads across a wide variety of application domains. While GPUs offer a large amount of computational throughput within a single node, the largest problems require a cluster of such devices communicating with different compute nodes across a network. These clusters can range in size from a small handful […]
