high performance computing on graphics processing units: hgpu.org

Posts

May, 11

Large scale parallel state space search utilizing graphics processing units and solid state disks

The evolution of science is a double-track process composed of theoretical insights on the one hand and practical inventions on the other. While in most cases new theoretical insights motivate hardware developers to produce systems following the theory, in some cases the shown hardware solutions force theoretical research to forecast the results to expect. […]

CUDA

May, 11

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures

Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked […]

CUDA

May, 10

Enhancing data parallelism for Ant Colony Optimization on GPUs

Graphics Processing Units (GPUs) have evolved into highly parallel and fully programmable architectures over the past five years, and the advent of CUDA has facilitated their application to many real-world applications. In this paper, we deal with a GPU implementation of Ant Colony Optimisation (ACO), a population-based optimisation method which comprises two major stages: Tour […]

CUDA

May, 10

A GPU-Accelerated Algorithm for Self-Organizing Maps in a Distributed Environment

In this paper we introduce a MapReduce-based implementation of self-organizing maps that performs compute-bound operations on distributed GPUs. The kernels are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050 attached to each, […]

CUDA

May, 10

An Efficient Common Substrings Algorithm for On-the-Fly Behavior-Based Malware Detection and Analysis

It is well known that malware (worms, botnets, etc.) thrives on communication systems. The process of detecting and analyzing malware is very latent and not well-suited for real-time application, which is critical especially for propagating malware. For this reason, recent methods identify similarities among malware dynamic trace logs to extract malicious behavior snippets. These snippets […]

CUDA

May, 10

Constructing Natural Neighbor Interpolation Based Grid DEM Using CUDA

Constructing a digital elevation model (DEM) from dense LiDAR points is becoming increasingly important. Natural Neighbor Interpolation (NNI) is a popular approach to DEM construction from point datasets but is computationally intensive. In this study, we present a set of General Purpose computing on Graphics Processing Units (GPGPU) based algorithms that can significantly speed up the process. Evaluating three real […]

CUDA

May, 10

GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics

BACKGROUND: A large number of sensitive homology searches are required for mapping DNA sequence fragments to known protein sequences in public and private databases during metagenomic analysis. BLAST is currently used for this purpose, but its calculation speed is insufficient, especially for analyzing the large quantities of sequence data obtained from a next-generation sequencer. However, […]

CUDA

May, 9

Exploration of Optimization Options for Increasing Performance of a GPU Implementation of a Three-Dimensional Bilateral Filter

This report explores using GPUs as a platform for performing high performance medical image data processing, specifically smoothing using a 3D bilateral filter, which performs anisotropic, edge-preserving smoothing. The algorithm consists of running a specialized 3D convolution kernel over a source volume to produce an output volume. Overall, our objective is to understand what […]

CUDA

May, 9

An Overview of Selected Hybrid and Reconfigurable Architectures

Node level heterogeneous architectures have become attractive in recent years for several reasons: Compared to traditional symmetric CPUs, they offer high performance for real applications, and can be energy and/or cost efficient. In this paper, we give an overview of the state-of-the-art in heterogeneous computing, focusing on some common architectures: The NVidia and the ATI […]

OpenCL

May, 9

Automatic Discovery of Algorithms for Multi-Agent Systems

Automatic algorithm generation for large-scale distributed systems is one of the holy grails of artificial intelligence and agent-based modeling. It has direct applicability in future engineered (embedded) systems, such as mesh networks of sensors and actuators where there is a high need to harness their capabilities via algorithms that have good scalability characteristics. NetLogo has […]

CUDA

May, 9

Enabling task-level scheduling on heterogeneous platforms

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. […]

OpenCL

May, 9

Divide-and-Conquer 3D Convex Hulls on the GPU

