22270

Posts

Jul, 19

Compyle: a Python package for parallel computing

Compyle allows users to execute a restricted subset of Python on a variety of HPC platforms. It is an embedded domain-specific language (eDSL) for parallel computing. It currently supports multi-core execution using Cython, and OpenCL and CUDA for GPU devices. Users write code in a restricted subset of Python that is automatically transpiled to high-performance […]
Jul, 19

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization in different levels of the memory hierarchy. To achieve the balance between computation and […]
Jul, 19

A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster

Nowadays, embedded systems are comprised of heterogeneous multi-core architectures, i.e., CPUs and GPUs. If the application is mapped to an appropriate processing core, then these architectures provide many performance benefits to applications. Typically, programmers map sequential applications to CPU and parallel applications to GPU. The task mapping becomes challenging because of the usage of evolving […]
Jul, 19

Deep Graph Library Optimizations for Intel(R) x86 Architecture

The Deep Graph Library (DGL) was designed as a tool to enable structure learning from graphs, by supporting a core abstraction for graphs, including the popular Graph Neural Networks (GNN). DGL contains implementations of all core graph operations for both the CPU and GPU. In this paper, we focus specifically on CPU implementations and present […]
Jul, 12

GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research

Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time consuming, and it is desirable to carry out large scale docking calculations in a high-throughput manner to narrow the experimental search […]
Jul, 12

Investigating Input Representations and Representation Models of Source Code for Machine Learning

Machine Learning methods are actively used to solve various tasks on source code, such as in Compilers to improve performance of executable code, or IDEs to boost developer productivity. While the use cases are manifold, most of these methods rely on manually-defined features that require substantial engineering efforts, while not necessarily being optimal. In this […]
Jul, 12

Bayesian inference for artificial perception using OpenCL on FPGAs and GPUs

This dissertation project addresses the implementation of Bayesian inference on FPGAs and GPUs, following a top-down approach and using OpenCL. The target application of this Bayesian inference algorithms is artificial perception in robotics. The aim is to improve the power efficiency of Bayesian inference computations. Previous work at our university in the scope of an […]
Jul, 12

Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation

GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent […]
Jul, 12

GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks

Graph Neural Networks (GNNs) have achieved significant improvements in various domains. Sparse Matrix-Matrix multiplication (SpMM) is a fundamental operator in GNNs, which performs a multiplication between a sparse matrix and a dense matrix. Accelerating SpMM on parallel hardware like GPUs can face the following challenges: From the GNN application perspective, the compatibility needs to be […]
Jul, 5

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

The SYCL standard promises to enable high productivity in heterogeneous programming of a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs. Its modern and expressive C++ API design, as well as flexible task graph execution model give rise to ample optimization opportunities at run-time, such as the overlapping of data transfers and […]
Jul, 5

Studies on CUDA Offloading for Real-Time Simulation and Visualization

The Graphics Processing Unit (GPU) is a co-processor designed to aid the Central Processing Unit (CPU) for rendering 3D graphics. The prompt development of these graphics chips due to the popularity of games and media design helped the GPU to evolve its ubiquitous parallel architecture. The programmability of these devices increased with the introduction of […]
Jul, 5

Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems

Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method. Heterogeneous systems with multi-core CPUs and GPUs have become more and more promising recently due to the prevalence of GPUs […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: