high performance computing on graphics processing units: hgpu.org

Posts

Jan, 5

Parallel evolutionary algorithms on graphics processing unit

Evolutionary algorithms (EAs) are effective and robust methods for solving many practical problems such as feature selection, electrical circuit synthesis, and data mining. However, they can take a long time to run on difficult problems, because many fitness evaluations must be performed. A promising approach to overcome this limitation is to parallelize these algorithms. In […]

Jan, 5

Data-parallel computing

Data parallelism is a key concept in leveraging the power of today’s manycore GPUs.

Jan, 5

Significantly Improved Performances Of The Cryptographically Generated Addresses Thanks To ECC And GPGPU

Cryptographically Generated Addresses (CGA) are today mainly used with the Secure Neighbor Discovery Protocol (SEND). Despite CGA generalization, current standards only show how to construct CGA with the RSA algorithm and SHA-1 hash function. This limitation may prevent new usages of CGA and SEND in mobile environments where nodes are energy and storage limited. In […]

CUDA

Jan, 5

A design case study: CPU vs. GPGPU vs. FPGA

This paper describes our winning submission for the Absolute Performance category of the MEMOCODE 2009 Design Contest. We show that our GPGPU-based design achieves performance within a factor of four of the theoretical maximum performance for the implemented algorithm. This result was reached after a short design cycle of 2 man-days, which indicates that the NVIDIA CUDA […]

CUDA

Jan, 5

Revisiting sorting for GPGPU stream architectures

This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value […]

CUDA
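To make the idea of radix sorting fixed-length keys with attached values concrete, here is a minimal sequential sketch in Python. It is not the poster's GPU implementation: the function name `radix_sort_pairs` and the `key_bits` and `radix_bits` parameters are assumptions chosen for illustration. Each pass builds a digit histogram, turns it into offsets with an exclusive prefix sum, and scatters the pairs stably. These are the three steps that GPU radix sorts typically parallelize.

```python
def radix_sort_pairs(pairs, key_bits=32, radix_bits=4):
    """Sort (key, value) pairs by non-negative integer key.

    Least-significant-digit radix sort; illustrative only, with
    hypothetical parameter names (key_bits, radix_bits).
    """
    buckets = 1 << radix_bits
    mask = buckets - 1
    for shift in range(0, key_bits, radix_bits):
        # Step 1: histogram of the current digit.
        counts = [0] * buckets
        for key, _ in pairs:
            counts[(key >> shift) & mask] += 1

        # Step 2: exclusive prefix sum gives each bucket's start offset.
        offsets = [0] * buckets
        total = 0
        for digit in range(buckets):
            offsets[digit] = total
            total += counts[digit]

        # Step 3: stable scatter into the output buffer.
        out = [None] * len(pairs)
        for key, value in pairs:
            digit = (key >> shift) & mask
            out[offsets[digit]] = (key, value)
            offsets[digit] += 1
        pairs = out
    return pairs
```

Because each scatter pass is stable, pairs with equal keys keep their original relative order, which is what lets the digit passes be chained from least to most significant.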

Jan, 5

A complete modular resultant algorithm targeted for realization on graphics hardware

This paper presents a complete modular approach to computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials, the algorithm first maps them to a prime field for sufficiently many primes, and then processes each modular image individually. We evaluate each polynomial at several points and compute a set of univariate resultants for […]

CUDA

Jan, 5

Modular Resultant Algorithm for Graphics Processors

In this paper we report on the recent progress in computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials in Z[x,y], our algorithm first maps the polynomials to a prime field. Then, each modular image is processed individually. The GPU evaluates the polynomials at a number of points and computes univariate modular […]

CUDA

Jan, 5

Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford

The ParLab at Berkeley, UPCRC-Illinois, and the Pervasive Parallel Laboratory at Stanford are studying how to make parallel programming succeed given industry’s recent shift to multicore computing. All three centers assume that future microprocessors will have hundreds of cores and are working on applications, programming environments, and architectures that will meet this challenge. This article […]

Jan, 4

Practical Random Linear Network Coding on GPUs

Recently, random linear network coding has been widely applied in peer-to-peer network applications. Instead of sharing the raw data with each other, peers in the network produce and send encoded data to each other. As a result, the communication protocols have been greatly simplified, and the applications experience higher end-to-end throughput and better robustness to […]

CUDA
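The encode-and-exchange idea behind random linear network coding can be sketched in a few lines of Python. This is a simplified illustration, not the paper's method: it assumes coefficients in GF(2), so combining blocks is plain XOR, whereas practical systems commonly use a larger field. The names `encode`, `decode`, and `roundtrip` are hypothetical. A receiver collects coded packets until their coefficient vectors have full rank, then recovers the original blocks by Gaussian elimination.

```python
import random


def encode(blocks, rng):
    """Return (coeffs, payload): a random XOR combination of the blocks."""
    coeffs = [rng.randint(0, 1) for _ in blocks]
    payload = 0
    for c, block in zip(coeffs, blocks):
        if c:
            payload ^= block
    return coeffs, payload


def decode(packets, n):
    """Gauss-Jordan elimination over GF(2).

    Returns the n original blocks, or None if the packets received
    so far do not have rank n.
    """
    rows = [(list(c), p) for c, p in packets]
    for col in range(n):
        # Find a row at or below position `col` with a 1 in this column.
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Clear this column in every other row.
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                mixed = [a ^ b for a, b in zip(rows[r][0], rows[col][0])]
                rows[r] = (mixed, rows[r][1] ^ rows[col][1])
    return [rows[i][1] for i in range(n)]


def roundtrip(blocks, seed=0):
    """Keep receiving coded packets until the blocks can be decoded."""
    rng = random.Random(seed)
    packets = []
    decoded = None
    while decoded is None:
        packets.append(encode(blocks, rng))
        decoded = decode(packets, len(blocks))
    return decoded
```

Here each block is modelled as a Python integer, so a single XOR combines whole blocks at once. The elimination step is the computationally heavy part of decoding.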

Jan, 4

Nuclei: GPU-Accelerated Many-Core Network Coding

While it is a well-known result that network coding achieves optimal flow rates in multicast sessions, its potential for practical use has remained in question due to its high computational complexity. Our previous work has attempted to design a hardware-accelerated and multi-threaded implementation of network coding to fully utilize multi-core CPUs, as […]

CUDA

Jan, 4

A parameterisable and scalable Smith-Waterman algorithm implementation on CUDA-compatible GPUs

This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on Compute Unified Device Architecture (CUDA)-compatible graphics processing units (GPUs). A novel technique is put forward to remove the restriction on the length of the query sequence found in previous GPU implementations of the Smith-Waterman algorithm. The main reasons behind this […]

CUDA
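For readers unfamiliar with the recurrence that such GPU implementations parallelize, the following is a minimal sequential sketch of Smith-Waterman local alignment scoring in Python. It is not the paper's design: it assumes a linear gap penalty, and the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative choices. Only two rows of the score matrix are kept, so memory grows with the target length rather than with the product of both lengths.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty.

    Scoring parameters are illustrative assumptions, not taken
    from the paper.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            substitution = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                           # start a new local alignment
                prev[j - 1] + substitution,  # align query[i-1] with target[j-1]
                prev[j] + gap,               # gap in the target
                curr[j - 1] + gap,           # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on the same anti-diagonal are independent of one another. That independence is the usual source of fine-grained parallelism in this algorithm.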

Jan, 4

Approximate Belief Propagation by Hierarchical Averaging of Outgoing Messages

This paper presents an approximate belief propagation algorithm that replaces outgoing messages from a node with the averaged outgoing message and propagates messages from a low resolution graph to the original graph hierarchically. The proposed method reduces the computational time by half or two-thirds and reduces the required amount of memory by 60% compared with […]

CUDA

