26769

Posts

May, 15

SYCLops: A SYCL Specific LLVM to MLIR Converter

There is a growing need for higher level abstractions for device kernels in heterogeneous environments, and the multi-level nature of the MLIR infrastructure perfectly addresses this requirement. As SYCL begins to gain industry adoption for heterogeneous applications and MLIR continues to develop, we present SYCLops: a converter capable of translating SYCL specific LLVM IR to […]
May, 15

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM

With the advent of multi-core systems, GPUs and FPGAs, loop parallelization has become a promising way to speed-up program execution. In order to stay up with time, various performance-oriented programming languages provide a multitude of constructs to allow programmers to write parallelizable loops. Correspondingly, researchers have developed techniques to automatically parallelize loops that do not […]
May, 15

Productive Performance Engineering for Weather and Climate Modeling with Python

Earth system models are developed with a tight coupling to target hardware, often containing highly-specialized code predicated on processor characteristics. This coupling stems from using imperative languages that hard-code computation schedules and layout. In this work, we present a detailed account of optimizing the Finite Volume Cubed-Sphere (FV3) weather model, improving productivity and performance. By […]
May, 15

NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge

Nowadays a wide range of applications is constrained by low-latency requirements that cloud infrastructures cannot meet. Multi-access Edge Computing (MEC) has been proposed as the reference architecture for executing applications closer to users and reduce latency, but new challenges arise: edge nodes are resource-constrained, the workload can vary significantly since users are nomadic, and task […]
May, 8

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

We explore the design and development of structured-mesh based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted : (1) stencil applications based on explicit numerical methods and (2) multidimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a wide-range […]
May, 8

Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL

We describe the experience of converting a CUDA implementation of a parallel graph coloring algorithm to SYCL. The goals are for our work to be useful to application and compiler developers by providing a detailed description of migration paths between CUDA and SYCL. We will describe how CUDA functions are mapped to SYCL functions. Evaluating […]
May, 8

cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms

Particle Swarm Optimization (PSO) is a stochastic technique for solving the optimization problem. Attempts have been made to shorten the computation times of PSO based algorithms with massive threads on GPUs (graphic processing units), where thread groups are formed to calculate the information of particles and the computed outputs for the particles are aggregated and […]
May, 8

GPUNet: Searching the Deployable Convolution Neural Networks for GPUs

Customizing Convolution Neural Networks (CNN) for production use has been a challenging task for DL practitioners. This paper intends to expedite the model customization with a model hub that contains the optimized models tiered by their inference latency using Neural Architecture Search (NAS). To achieve this goal, we build a distributed NAS system to search […]
May, 8

Analytical Performance Estimation during Code Generation on Modern GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This […]
May, 1

The Celerity High-level API: C++20 for Accelerator Clusters

Providing convenient APIs and notations for data parallelism which remain accessible for programmers while still providing good performance has been a long-term goal of researchers as well as language and library designers. C++20 introduces ranges and views, as well as the composition of operations on them using a concise syntax, but the efficient implementation of […]
May, 1

CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems

Modern computing platforms tend to deploy multiple GPUs on a single node to boost performance. GPUs have large computing capacities and are an expensive resource. Increasing their utilization without causing performance degradation of individual workloads is an important and challenging problem. Although services such as NVIDIA’s MPS allow multiple cooperative kernels to simultaneously run on […]
May, 1

End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning

To enable heterogeneous computing systems with autonomous programming and optimization capabilities, we propose a unified, end-to-end, programmable graph representation learning (PGL) framework that is capable of mining the complexity of high-level programs down to the universal intermediate representation, extracting the specific computational patterns and predicting which code segments would run best on a specific core […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us: