
Posts

May, 16

NPBench: A Benchmarking Suite for High-Performance NumPy

Python, already one of the most popular languages for scientific computing, has made significant inroads in High Performance Computing (HPC). At the center of Python’s ecosystem is NumPy, an efficient implementation of the multi-dimensional array (tensor) structure, together with basic arithmetic and linear algebra. Compared to traditional HPC languages, the relatively low performance of Python […]
Jan, 20

Automatic acceleration of Numpy applications on GPUs and multicore CPUs

Frameworks like NumPy are a popular choice for application developers from varied fields, ranging from image processing to bio-informatics to machine learning. NumPy is often used both for prototyping and for deployment since it provides efficient implementations of array operations. However, this approach requires every operation to be executed eagerly. The result of each […]
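
As an illustration of the eager-execution issue the abstract alludes to, a minimal NumPy sketch (not taken from the paper) shows how each operation is evaluated immediately and materializes a full temporary array:

    import numpy as np

    x = np.random.rand(10_000_000)
    y = np.random.rand(10_000_000)

    t1 = x * 2.0        # temporary array 1, allocated eagerly
    t2 = t1 + y         # temporary array 2, allocated eagerly
    z = np.sin(t2)      # final result; t1 and t2 were fully materialized

    # A fusing or JIT-compiling backend could instead compute
    # z = sin(2*x + y) in a single pass without the intermediates.
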
Nov, 16

CUDArray: CUDA-based NumPy

This technical report introduces CUDArray – a CUDA-accelerated subset of the NumPy library. The goal of CUDArray is to combine the ease of development from NumPy with the computational power of Nvidia GPUs in a lightweight and extensible framework. Since the motivation behind CUDArray is to facilitate neural network programming, CUDArray extends NumPy with a […]
Nov, 22

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

In this paper, we introduce Bohrium, a runtime system for mapping array operations onto a number of different hardware platforms, from multi-core systems to clusters and GPU-enabled systems. As a result, the Bohrium runtime system enables NumPy code to utilize CPUs, GPUs, and clusters. Bohrium integrates seamlessly into NumPy through the implicit data parallelization of array […]
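
Since Bohrium is described as running unmodified NumPy code, its usage can be sketched as a drop-in module swap. This is a hedged sketch, assuming a NumPy-compatible bohrium module as described in the project's documentation; the exact import and backend selection may differ by version:

    # Assumption: Bohrium exposes a NumPy-compatible module.
    import bohrium as np

    a = np.ones((1000, 1000))
    b = np.ones((1000, 1000))
    c = a * b + 2.0      # array operations are intercepted and scheduled
                         # on a CPU, GPU, or cluster backend
    print(c.sum())
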
Nov, 26

A New Compilation Path: From Python/NumPy to OpenCL

Jit4OpenCL is a new compiler that converts scientific applications written in Python/NumPy into OpenCL code. This compiler is based on unPython, an ahead-of-time compiler from Python/NumPy to an intermediate form and OpenMP code, and on jit4GPU, a just-in-time compiler that converts that intermediate code into AMD CAL code specific to AMD GPUs. The […]
Sep, 17

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work […]
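
For reference, the BLAS-style kernels the study asks the models to generate are small enough to state in a few lines of NumPy. This sketch is illustrative and not taken from the paper:

    import numpy as np

    def axpy(alpha, x, y):
        # AXPY (BLAS level 1): y = alpha * x + y
        return alpha * x + y

    def gemv(alpha, A, x, beta, y):
        # GEMV (BLAS level 2): y = alpha * A @ x + beta * y
        return alpha * (A @ x) + beta * y

    def gemm(alpha, A, B, beta, C):
        # GEMM (BLAS level 3): C = alpha * A @ B + beta * C
        return alpha * (A @ B) + beta * C
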
Nov, 14

Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py

Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as Numpy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a requisite loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing […]
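
A minimal mpi4py example (independent of the paper's benchmarks) shows the kind of distributed NumPy communication these programming models provide:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank contributes a NumPy buffer; the uppercase Allreduce
    # uses the buffer-based interface and avoids pickling overhead.
    local = np.full(4, rank, dtype='d')
    total = np.empty(4, dtype='d')
    comm.Allreduce(local, total, op=MPI.SUM)

    if rank == 0:
        print(total)   # element-wise sum over all ranks

    # Launch with, e.g.: mpirun -n 4 python script.py
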
Oct, 24

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. […]
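
The style of point-to-point latency measurement that micro-benchmark suites like OMB-Py perform can be sketched with mpi4py. This is a hedged illustration of the measurement pattern, not OMB-Py's actual code:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters = 1000
    msg = np.zeros(1024, dtype='b')   # 1 KiB message buffer

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(msg, dest=1)
            comm.Recv(msg, source=1)
        elif rank == 1:
            comm.Recv(msg, source=0)
            comm.Send(msg, dest=0)
    t1 = MPI.Wtime()

    if rank == 0:
        # One-way latency: half the round-trip time per iteration, in microseconds.
        print((t1 - t0) / (2 * iters) * 1e6, "us")
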
Jul, 4

Productivity, Portability, Performance: Data-Centric Python

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. […]
Jul, 19

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt […]
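
The "mirrored API" pattern the abstract refers to can be illustrated with NumPy and CuPy (CuPy is chosen here only as an example of a GPU library that mirrors a CPU-only API; the sketch is not from the paper). The burden on developers is keeping two nearly identical code paths in sync:

    import numpy as np
    import cupy as cp   # GPU library mirroring the NumPy API

    def normalize(xp, data):
        # xp is either numpy or cupy; the array interface is largely identical.
        return (data - xp.mean(data)) / xp.std(data)

    cpu_out = normalize(np, np.random.rand(1_000_000))
    gpu_out = normalize(cp, cp.random.rand(1_000_000))
    print(cpu_out[:3], cp.asnumpy(gpu_out[:3]))
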
Sep, 3

DeepPy: Pythonic deep learning

This technical report introduces DeepPy – a deep learning framework built on top of NumPy with GPU acceleration. DeepPy bridges the gap between high-performance neural networks and the ease of development from Python/NumPy. Users with a background in scientific computing in Python will quickly be able to understand and change the DeepPy codebase as it […]
Oct, 6

A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs

Array-based languages such as MATLAB and Python (with NumPy) have become very popular for scientific computing. However, the performance of the implementations of these languages is often lacking. For example, some of the implementations are interpreted. Further, these languages were not designed with multi-core CPUs and GPUs in mind and thus don’t take full advantage […]
