high performance computing on graphics processing units: hgpu.org

Posts

Jun, 17

2nd International Conference on Mechanical, Aeronautical and Automotive Engineering (ICMAA), 2015

Topics: Mechanical Engineering, Applied Mechanics, Automation, Biomechanics, Computational Fluid Dynamics, Design and Manufacturing, Energy Management, Fluid Dynamics, Fuels and Combustion, Green Manufacturing, Heat and Mass Transfer, Industrial Tribology, Instrumentation and Control, Internal Combustion Engines, Mechatronics, Micro-Machining, Modeling of Processes, Nano-Technology, Optimization of Systems, Renewable and Non-Renewable Energies, Reverse Engineering, Robotics, Solid Mechanics, Oil and […]

Jun, 17

RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPUs). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to a hundred thousand particles). It has a performance that is comparable to […]

CUDA

Jun, 17

Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA

OpenACC is an application programming interface (API) that aims to unleash the power of heterogeneous systems composed of CPUs and accelerators such as graphic processing units (GPUs) or Intel Xeon Phi coprocessors. This directive-based programming model is intended to enable developers to accelerate their application’s execution with much less effort. Coprocessors offer significant computing power […]

CUDA

Jun, 17

GPU-Enabled Particle-Particle Particle-Tree Scheme for Simulating Dense Stellar Cluster System

We describe the implementation and performance of the P^3T (Particle-Particle Particle-Tree) scheme for simulating dense stellar systems. In P^3T, the force experienced by a particle is split into short-range and long-range contributions. Short-range forces are evaluated by direct summation and integrated with the fourth-order Hermite predictor-corrector method with block timesteps. For long-range forces, […]

CUDA

Jun, 17

Automatic Data Layout Optimizations for GPUs

Memory optimizations have become increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, […]

OpenCL

Jun, 17

Layered Interpretation of Street View Images

We propose a layered street view model to encode both depth and semantic information on street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation over the recently proposed stix-mantics model. Our layers […]

CUDA

Jun, 16

Perfect Hashing Structures for Parallel Similarity Searches

Seed-based heuristics have proved to be efficient for studying similarity between genetic databases with billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase in seed-based heuristics, with an emphasis on efficient parallel GPU/manycores implementation. We propose a 2-stage index structure which is based on neighborhood indexing and perfect […]

OpenCL

Jun, 16

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Graph algorithms are used in several domains such as social networking, biological sciences, computational geometry, and compilers, to name a few. It has been shown that they possess enough parallelism to keep several computing resources busy – even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware […]

CUDA

Jun, 16

Characterizing Dataset Dependence for Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix-vector multiplication (SpMV) is a widely used kernel in scientific applications as well as data analytics. Many GPU implementations of SpMV have been proposed, each using a different sparse matrix representation. However, no sparse matrix representation is consistently superior, and the best representation varies for sparse matrices with different sparsity patterns. In this paper we study […]

CUDA
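
To make the point about representations concrete, here is a minimal CPU-side sketch in Python of SpMV over two common formats, CSR (compressed sparse row) and COO (coordinate list). It is an illustration only and is not taken from the paper; the function names `spmv_csr` and `spmv_coo` and the small test matrix are assumptions chosen for this example.

```python
# Illustrative sketch only: plain-Python SpMV for two common sparse formats.
# Function names and the example matrix are hypothetical, not from the paper.

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR: row i owns entries row_ptr[i]..row_ptr[i+1]-1."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def spmv_coo(rows, cols, values, x, n_rows):
    """y = A @ x with A stored in COO: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, values):
        y[r] += v * x[c]
    return y


# The same 3x3 matrix in both formats:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]        # CSR row offsets
coo_rows = [0, 0, 1, 2, 2]    # COO row index per nonzero
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(values, col_idx, row_ptr, x)
y_coo = spmv_coo(coo_rows, col_idx, values, x, 3)
```

Both functions compute the same product; what differs is how the work is indexed. CSR iterates row by row, so rows with very different nonzero counts give uneven work per row, while COO iterates over nonzeros uniformly but has to accumulate into shared output entries. That trade-off is one reason the best format can depend on the sparsity pattern.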

Jun, 16

GPU Predictor-Corrector Interior Point Method for Large-Scale Linear Programming

This master’s thesis concerns the implementation of a GPU-accelerated version of Mehrotra’s predictor-corrector interior point algorithm for large-scale linear programming (LP). The implementations are tested on LP problems arising in the financial industry, where there is high demand for faster LP solvers. The algorithm was implemented in C++, MATLAB and CUDA, using double precision for […]

CUDA

Jun, 16

Parallelization of DIRA and CTmod using OpenMP and OpenCL

Parallelization answers the ever-growing demand for computing power by taking advantage of multi-core processor technology and modern many-core graphics compute units. Multi-core CPUs and many-core GPUs have the potential to substantially reduce the execution time of a program, but it is often a challenging task to ensure that all available hardware is […]

OpenCL

Jun, 14

Automatic Selection of Sparse Matrix Representation on GPUs

Sparse matrix-vector multiplication (SpMV) is a core kernel in numerous applications, ranging from physics simulation and large-scale solvers to data analytics. Many GPU implementations of SpMV have been proposed, targeting several sparse representations and aiming at maximizing overall performance. No single sparse matrix representation is uniformly superior, and the best performing representation varies for sparse […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)