high performance computing on graphics processing units: hgpu.org

Posts

Jan, 23

Deployment of CPU and GPU-based genetic programming on heterogeneous devices

A widely available and economic means of increasing the computing power applied to a problem is to use modern graphics processing units (GPUs) for parallel processing. We present a new, optimized general methodology for deploying genetic programming (GP) to the PC, Xbox 360 video game console, and Zune portable media device. This work describes, for […]

Jan, 23

Implementation of Parallel Genetic Algorithms on Graphics Processing Units

In this paper, we propose to parallelize a Hybrid Genetic Algorithm (HGA) on Graphics Processing Units (GPUs) which are available and installed on ubiquitous personal computers. HGA extends the classical genetic algorithm by incorporating the Cauchy mutation operator from evolutionary programming. In our parallel HGA, all steps except the random number generation procedure are performed […]

Jan, 23

High performance genetic programming on GPU

The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. We compare two parallelization schemes that evaluate several GP programs in parallel. We show that the […]

CUDA

Jan, 23

An Improved MAGMA GEMM for Fermi Graphics Processing Units

We present an improved matrix-matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi’s new architectural features, […]

CUDA

Jan, 22

Implementing molecular dynamics on hybrid high performance computers – short range forces

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In […]

CUDA

Jan, 22

Swan: A tool for porting CUDA programs to OpenCL

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence […]

CUDA • OpenCL

Jan, 22

A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems

For the last 30 years, several dynamic memory managers (DMMs) have been proposed. Such DMMs include first fit, best fit, segregated fit and buddy systems. Since the performance, memory usage and energy consumption of each DMM differs, software engineers often face difficult choices in selecting the most suitable approach for their applications. This issue has […]

Jan, 22

Evolving Soft Robotic Locomotion in PhysX

Given the complexity of the problem, genetic algorithms are one of the more promising methods of discovering control schemes for soft robotics. Since physically embodied evolution is time consuming and expensive, an outstanding challenge lies in developing fast and suitably realistic simulations in which to evolve soft robot gaits. We describe two parallel methods of […]

CUDA

Jan, 22

Porting estimation of distribution algorithms to the cell broadband engine

Current consumer-grade computers and game devices incorporate very powerful processors that can be used to accelerate many classes of scientific codes. In this paper we explore the ability of the Cell Broadband Engine to run two similar Estimation of Distribution Algorithms, one for the discrete domain and the other for the continuous domain. Starting from […]

Jan, 22

Evaluating the cell broadband engine as a platform to run estimation of distribution algorithms

Current consumer-grade computers and game devices incorporate very powerful processors that can be used to accelerate many classes of scientific codes. However, programming multi-core chips, hybrid multi-processors or graphical processing units is not an easy task for programmers who deal mainly with sequential codes. In this paper, we explore the ability of the Cell […]

Jan, 22

Strategies to minimise the total run time of cyclic graph based genetic programming with GPUs

In this paper, we describe our work to investigate how much cyclic graph based Genetic Programming (GP) can be accelerated on one machine using currently available mid-range Graphics Processing Units (GPUs). Cyclic graphs pose different problems for evaluation than do trees and we describe how our CUDA based, “population parallel” evaluator tackles these problems. Previous […]

CUDA

Jan, 22

Distributed genetic programming on GPUs using CUDA

Using a cluster of Graphics Processing Unit (GPU) equipped computers, it is possible to accelerate the evaluation of individuals in Genetic Programming. Program compilation, fitness case data and fitness execution are spread over the cluster of computers, allowing for the efficient processing of very large datasets. Here, the implementation is demonstrated on datasets containing […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
