high performance computing on graphics processing units: hgpu.org

Posts

Jul, 15

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Kokkos [1], [2] is a C++ programming model that offers the ability to write portable code that targets a wide degree of parallelism found in current HPC systems. It works by providing abstractions for parallel execution and data layouts that are mapped to different hardware resources during compilation. Some parameters, such as the size of […]

Jul, 15

Multicore architecture and cache optimization techniques for solving graph problems

With the advent of the era of Big Data and the Internet of Things, there has been an exponential increase in the availability of large data sets. These data sets require in-depth analysis that provides intelligence for improvements in methods for academia and industry. The majority of the data sets are represented and available in the form of […]

CUDA

Jul, 15

CloudCL: Single-Paradigm Distributed Heterogeneous Computing for Cloud Infrastructures

The ever-growing demand for compute resources has reached a wide range of application domains, and with that has created a larger audience for compute-intensive tasks. In this paper, we present the CloudCL framework, which empowers users to run compute-intensive tasks without having to face the total cost of ownership of operating an extensive high-performance compute […]

OpenCL

Jul, 15

Data-Parallel Hashing Techniques for GPU Architectures

Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are […]

CUDA

Jul, 7

Application of Deep-Learning to Compiler-Based Graphs

Graph-structured data is used in many domains to represent complex objects, such as the molecular structure of chemicals or interactions between members of a social network. However, extracting meaningful information from these graphs is a difficult task, which is often undertaken on a case by case basis. Devising automated methods to mine information from graphs […]

OpenCL

Jul, 7

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Especially historical prints require book specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort for manually annotating ground truth (GT) various techniques such as voting and […]

Jul, 7

Energy Consumption of Algorithms for Solving the Compressible Navier-Stokes Equations on CPU’s, GPU’s and KNL’s

In addition to the hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU […]

CUDA

Jul, 7

FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries

The Python package fluidfft provides a common Python API for performing Fast Fourier Transforms (FFT) in sequential, in parallel and on GPU with different FFT libraries (FFTW, P3DFFT, PFFT, cuFFT). fluidfft is a comprehensive FFT framework which allows Python users to easily and efficiently perform FFT and the associated tasks, such as computing linear […]

CUDA

Jul, 7

An Efficient Dispatcher for Large Scale GraphProcessing on OpenCL-based FPGAs

Highly parallel frameworks have proved to be very suitable for graph processing. Various works optimize the implementation on FPGAs, which are pipeline-parallel devices. The key to making use of the parallel performance of FPGAs is to process graph data in a pipelined model and take advantage of on-chip memory to realize necessary […]

OpenCL

Jul, 5

Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes

Data as well as hardware characteristics are two key aspects for efficient data management. This holds in particular for the field of in-memory data processing. Aside from increasing main memory capacities, efficient in-memory processing benefits from novel processing concepts based on lightweight compressed data. Thus, an active research field deals with the adaptation of new […]

Jul, 5

Exploration of Low Numeric Precision Deep Learning Inference Using Intel FPGAs

CNNs have been shown to maintain reasonable classification accuracy when quantized to lower precisions. Quantizing to sub 8-bit activations and weights can result in accuracy falling below an acceptable threshold. Techniques exist for closing the accuracy gap of limited numeric precision typically by increasing computation. This results in a trade-off between throughput and accuracy and […]

OpenCL

Jul, 5

Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL

Discovering identical or near-identical items is urgently important in many applications such as Web crawling since it drastically reduces the text processing costs. Simhash is a widely used technique, able to attribute a bit-string identity to a text, such that similar texts have similar identities. In this study, a real-time solution for a simhash calculation […]

OpenCL
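
The simhash idea mentioned in the last excerpt (a bit-string identity per text, with similar texts receiving similar identities) can be illustrated with a minimal CPU sketch. This is not the study's OpenCL implementation: the function names `simhash` and `hamming_distance`, the whitespace tokenization, the 64-bit width, and the use of MD5 as the per-token hash are all assumptions made for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash fingerprint of `text` (hypothetical helper).

    Assumes whitespace tokenization and unit token weights; `bits`
    must not exceed 128, the width of the MD5 digest used here.
    """
    counts = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash; MD5 is chosen only for determinism.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # positions with a positive vote total become 1-bits
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when `hamming_distance(simhash(a), simhash(b))` falls below a chosen threshold; the threshold itself depends on the application and is not given in the excerpt.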

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
