high performance computing on graphics processing units: hgpu.org

Posts

Sep, 7

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC (High-Performance Computing)), a similarity search system for ultra-high dimensional datasets on a single machine, which does not require similarity computation. Our system is an auspicious illustration of the power of randomized algorithms carefully tailored for high-performance computing platforms. We leverage LSH style randomized […]

OpenCL

Sep, 7

Multi-Tasking Scheduling for Heterogeneous Systems

Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The best-known heterogeneous systems, CPU+GPU platforms, have been widely used in recent years. As they become more mainstream, serving multiple […]

OpenCL

Sep, 7

Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization

An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where the length of the input sequence may vary significantly. The proposed approach is based on optimal batch bucketing by input sequence length and data parallelization on multiple graphics processing units. The baseline training performance without […]

Sep, 3

Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters

Data analytics is undergoing a revolution in many scientific domains, demanding cost-effective parallel data analysis techniques. Traditional Java-based Big Data processing tools like Hadoop MapReduce are designed for commodity CPUs. In contrast, emerging manycore processors like Xeon Phi have an order of magnitude more computation power and memory bandwidth. To harness the computing capabilities, we […]

Sep, 3

Integer sorting on multicores: some (experiments and) observations

There have been many proposals for sorting integers on multicores/GPUs that include radix-sort and its variants or other approaches that exploit specialized hardware features of a particular multicore architecture. Comparison-based algorithms have also been used, as have network-based algorithms, the primary example being Batcher’s bitonic sorting algorithm. Although the latter approach is theoretically […]

CUDA

Sep, 3

Real-Time Rendering of Molecular Dynamics Simulation Data: A Tutorial

Achieving real-time molecular dynamics rendering is a challenge, especially when the rendering requires intensive computation involving a large simulation data set. The task becomes even more challenging when the data is too large to fit into random access memory (RAM) and the final imagery depends on input/output (I/O) performance. The […]

OpenCL • OpenGL

Sep, 3

Towards On-Chip Optical FFTs for Convolutional Neural Networks

Convolutional neural networks have become an essential element of spatial deep learning systems. In the prevailing architecture, the convolution operation is performed with Fast Fourier Transforms (FFT) electronically in GPUs. The parallelism of GPUs provides an efficiency advantage over CPUs; however, both approaches, being electronic, are bound by the speed and power limits of the interconnect […]

Sep, 3

Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes

The basic features of some of the most versatile and popular open source frameworks for machine learning (TensorFlow, Deeplearning4j, and H2O) are considered and compared. A comparative analysis was performed and conclusions were drawn as to the advantages and disadvantages of these platforms. The performance tests for the de facto standard MNIST data set […]

CUDA

Aug, 26

Accelerate Local Tone Mapping for High Dynamic Range Images Using OpenCL with GPU

Tone mapping has been used to convert HDR (high dynamic range) images to low dynamic range. This paper describes an algorithm to display high dynamic range images. Although local tone-mapping operators reproduce images with better detail and contrast than global operators, local tone-mapping algorithms usually require a huge amount of […]

OpenCL

Aug, 26

Vulnerability Analysis and Attacks on Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor is a PCIe-based add-in card. Though it is prone to simple attacks, many high performance computing systems are constructed by combining CPUs and coprocessors. This paper describes two attacks that exploit vulnerabilities related to the boot process of the coprocessor and the ownership of the offload user. Proof-of-concept codes are […]

Aug, 26

Large Integer Arithmetic in GPU for Cryptography

Most computers nowadays support 32-bit or 64-bit data types across various programming languages, and these are sufficient for most use cases. However, in cryptography, the required range and precision exceed 64 bits, which is computationally expensive on CPUs. In this report, we present our design and implementation of […]

CUDA

Aug, 26

Dynamic Parallelism in GPU Optimized Barnes Hut Trees for Molecular Dynamics Simulations

Since the beginning of the modern computing era, high performance computing has been pushing the boundaries of the types of problems that can be solved in many different disciplines. One of the leading fields is computational biophysics where molecular dynamics (MD) simulations provide microscopic resolution details of how biomolecules move, fold, and assemble into intricate […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
