high performance computing on graphics processing units: hgpu.org

Posts

May, 15

GPU-based JSON data processing using structural indexes

In recent years, the amount of data generated and stored every day has grown rapidly. Big data is often processed by different software systems, which require a common data interchange format. JavaScript Object Notation, or JSON, is one of the most popular data exchange formats and is widely used in web and data-intensive applications. Unfortunately, […]

CUDA

May, 15

SYCLops: A SYCL Specific LLVM to MLIR Converter

There is a growing need for higher level abstractions for device kernels in heterogeneous environments, and the multi-level nature of the MLIR infrastructure perfectly addresses this requirement. As SYCL begins to gain industry adoption for heterogeneous applications and MLIR continues to develop, we present SYCLops: a converter capable of translating SYCL specific LLVM IR to […]

May, 15

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM

With the advent of multi-core systems, GPUs and FPGAs, loop parallelization has become a promising way to speed up program execution. To keep up with the times, various performance-oriented programming languages provide a multitude of constructs to allow programmers to write parallelizable loops. Correspondingly, researchers have developed techniques to automatically parallelize loops that do not […]

OpenCL

May, 15

Productive Performance Engineering for Weather and Climate Modeling with Python

Earth system models are developed with a tight coupling to target hardware, often containing highly-specialized code predicated on processor characteristics. This coupling stems from using imperative languages that hard-code computation schedules and layout. In this work, we present a detailed account of optimizing the Finite Volume Cubed-Sphere (FV3) weather model, improving productivity and performance. By […]

CUDA

May, 15

NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge

Nowadays a wide range of applications is constrained by low-latency requirements that cloud infrastructures cannot meet. Multi-access Edge Computing (MEC) has been proposed as the reference architecture for executing applications closer to users and reducing latency, but new challenges arise: edge nodes are resource-constrained, the workload can vary significantly since users are nomadic, and task […]

May, 8

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

We explore the design and development of structured-mesh-based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multidimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a wide-range […]

May, 8

Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL

We describe the experience of converting a CUDA implementation of a parallel graph coloring algorithm to SYCL. Our goal is for this work to be useful to application and compiler developers by providing a detailed description of migration paths between CUDA and SYCL. We will describe how CUDA functions are mapped to SYCL functions. Evaluating […]

CUDA • OpenCL

May, 8

cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms

Particle Swarm Optimization (PSO) is a stochastic technique for solving optimization problems. Attempts have been made to shorten the computation times of PSO-based algorithms with massive threads on GPUs (graphics processing units), where thread groups are formed to calculate the information of particles and the computed outputs for the particles are aggregated and […]

CUDA

May, 8

GPUNet: Searching the Deployable Convolution Neural Networks for GPUs

Customizing Convolution Neural Networks (CNN) for production use has been a challenging task for DL practitioners. This paper intends to expedite the model customization with a model hub that contains the optimized models tiered by their inference latency using Neural Architecture Search (NAS). To achieve this goal, we build a distributed NAS system to search […]

CUDA

May, 8

Analytical Performance Estimation during Code Generation on Modern GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This […]

CUDA • OpenCL

May, 1

The Celerity High-level API: C++20 for Accelerator Clusters

Providing convenient APIs and notations for data parallelism which remain accessible for programmers while still providing good performance has been a long-term goal of researchers as well as language and library designers. C++20 introduces ranges and views, as well as the composition of operations on them using a concise syntax, but the efficient implementation of […]

May, 1

End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning

To enable heterogeneous computing systems with autonomous programming and optimization capabilities, we propose a unified, end-to-end, programmable graph representation learning (PGL) framework that is capable of mining the complexity of high-level programs down to the universal intermediate representation, extracting the specific computational patterns and predicting which code segments would run best on a specific core […]

OpenCL

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
