Posts
Feb, 18
MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU
In this paper, we present a MapReduce-based framework for evaluating SPARQL queries on the GPU (named MapSQ), which handles large-scale RDF datasets efficiently by combining the high performance of the GPU with MapReduce-style parallelism. Firstly, we develop a MapReduce-based join algorithm to handle SPARQL queries in a parallel way. Secondly, we present a coprocessing strategy to manage the process of evaluating queries where […]
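The excerpt stops before the join itself, but the core idea of a MapReduce-style join over triple-pattern bindings can be sketched in a few lines. The sketch below is a hypothetical, sequential stand-in for the map and reduce phases (the Binding type, the joinVar parameter, and all names are illustrative assumptions, not the MapSQ implementation): the map phase groups the bindings of one triple pattern by the shared join variable, and the reduce phase merges bindings from the other pattern that agree on that variable.

// Hypothetical sketch of a MapReduce-style join on one shared SPARQL variable.
// Sequential stand-in for the map and reduce phases; not the MapSQ implementation.
#include <string>
#include <unordered_map>
#include <vector>

using Binding = std::unordered_map<std::string, std::string>;  // variable -> RDF term

// "Map" phase: group the bindings of one triple pattern by the value of the join variable.
std::unordered_multimap<std::string, Binding>
mapByKey(const std::vector<Binding>& bindings, const std::string& joinVar) {
    std::unordered_multimap<std::string, Binding> grouped;
    for (const auto& b : bindings) grouped.emplace(b.at(joinVar), b);
    return grouped;
}

// "Reduce" phase: merge each binding of the second pattern with every left-side
// binding that carries the same value for the join variable.
std::vector<Binding> reduceJoin(const std::unordered_multimap<std::string, Binding>& left,
                                const std::vector<Binding>& right,
                                const std::string& joinVar) {
    std::vector<Binding> out;
    for (const auto& r : right) {
        auto range = left.equal_range(r.at(joinVar));
        for (auto it = range.first; it != range.second; ++it) {
            Binding merged = it->second;         // bindings from the left pattern
            merged.insert(r.begin(), r.end());   // add bindings from the right pattern
            out.push_back(merged);
        }
    }
    return out;
}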
Feb, 14
Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose
Fully connected networks have been widely used in deep learning, and their computational efficiency benefits greatly from the matrix multiplication routines of cuBLAS on the GPU. However, we found that cuBLAS has some drawbacks when computing the product of matrix $\textbf{A}$ and the transpose of matrix $\textbf{B}$ (i.e., the NT operation). To reduce the impact of NT […]
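As context for the NT operation the excerpt refers to, here is a minimal host-side sketch of two common ways to compute such a product with cuBLAS: directly as an NT GEMM, or (as the paper's title suggests) by first materializing $\textbf{B}^T$ out of place and then running a plain NN GEMM. The matrix sizes, handle setup, and omitted error handling are simplifying assumptions, not the paper's code.

// Sketch: C = A * B^T with cuBLAS (column-major). A is m x k, B is n x k, C is m x n.
// Hypothetical sizes and no error checks; not the paper's implementation.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_nt_vs_transpose(cublasHandle_t handle,
                          const float* dA, const float* dB,  // device inputs
                          float* dBt, float* dC,             // scratch for B^T and output
                          int m, int n, int k) {
    const float one = 1.0f, zero = 0.0f;

    // Variant 1: the "NT operation" -- let cuBLAS transpose B on the fly.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, n, k, &one, dA, m, dB, n, &zero, dC, m);

    // Variant 2: out-of-place transpose of B into dBt (k x n), then an NN GEMM.
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                k, n, &one, dB, n, &zero, dBt, k, dBt, k);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &one, dA, m, dBt, k, &zero, dC, m);
}

Which variant is faster depends on the matrix shapes and the cuBLAS version, which is the kind of trade-off the excerpt alludes to.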
Feb, 14
Best Practice Guide Intel Xeon Phi v2.0
This Best Practice Guide provides information about Intel's Many Integrated Core (MIC) architecture and programming models for the first-generation Intel Xeon Phi coprocessor, named Knights Corner (KNC), so that programmers can achieve good performance with their applications. The guide covers a wide range of topics from the description of the hardware […]
Feb, 14
Improved Lossless Image Compression Model Using Coefficient Based Discrete Wavelet Transform
Compression is used in storage-related applications to compress audio/video, executable programs, text, source code, and so on. When compressing images into as small a space as possible, the constraint lies in the multispectral form of the data with continuous images. In such a scenario, efficient lossless image compression is required such that the compression ratio […]
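For readers unfamiliar with wavelet-based lossless coding, the building block is an integer-to-integer wavelet step that can be inverted exactly. The sketch below shows a single-level integer Haar (S-) transform; it is a generic illustration of that idea, not the coefficient-based DWT scheme proposed in the paper.

// One level of the integer Haar (S-) transform: exactly invertible, hence lossless.
#include <cstdint>
#include <vector>

static int32_t floorHalf(int32_t v) { return (v >= 0) ? v / 2 : -((-v + 1) / 2); }  // floor(v/2)

// Forward: pairs (a, b) -> (s, d) with d = a - b and s = b + floor(d/2) = floor((a+b)/2).
void haarForward(const std::vector<int32_t>& x,
                 std::vector<int32_t>& s, std::vector<int32_t>& d) {
    s.clear(); d.clear();
    for (size_t i = 0; i + 1 < x.size(); i += 2) {
        int32_t diff = x[i] - x[i + 1];
        d.push_back(diff);
        s.push_back(x[i + 1] + floorHalf(diff));
    }
}

// Inverse: b = s - floor(d/2), a = b + d. Recovers the input bit-exactly.
void haarInverse(const std::vector<int32_t>& s, const std::vector<int32_t>& d,
                 std::vector<int32_t>& x) {
    x.clear();
    for (size_t i = 0; i < s.size(); ++i) {
        int32_t b = s[i] - floorHalf(d[i]);
        x.push_back(b + d[i]);   // a
        x.push_back(b);          // b
    }
}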
Feb, 14
cellGPU: massively parallel simulations of dynamic vertex models
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and […]
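As background on the force laws mentioned here (a common choice in the two-dimensional vertex-model literature, stated generically rather than as the specific functional used in cellGPU), the forces are typically derived from an energy of the form

$E = \sum_i \left[ K_A (A_i - A_0)^2 + K_P (P_i - P_0)^2 \right]$,

where $A_i$ and $P_i$ are the area and perimeter of cell $i$, $A_0$ and $P_0$ are preferred values, and $K_A$, $K_P$ are moduli. The force on each vertex is the negative gradient of $E$ with respect to that vertex's position, which is where the dependence on both cell geometry and network topology enters.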
Feb, 14
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed, optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or -1, which can be represented in a single bit. This leads to a dramatic improvement in algorithmic efficiency, due to a reduction in […]
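The efficiency gain comes from replacing floating-point multiply-accumulates with bitwise operations once weights and activations are packed one value per bit. The following sketch is a generic XNOR/popcount dot product (not code from the paper; packing layout and sizes are assumed) illustrating the idea:

// Binarized dot product: +1/-1 values are packed one per bit (1 = +1, 0 = -1).
// Each bit pair contributes +1 on a match and -1 on a mismatch, so
//   dot = n_bits - 2 * popcount(a XOR b).
// Assumes any padding bits beyond n_bits are zero in both operands.
#include <cstddef>
#include <cstdint>

int binarizedDot(const uint64_t* a, const uint64_t* b, std::size_t words, std::size_t n_bits) {
    std::size_t mismatches = 0;
    for (std::size_t w = 0; w < words; ++w)
        mismatches += __builtin_popcountll(a[w] ^ b[w]);   // GCC/Clang builtin
    return static_cast<int>(n_bits) - 2 * static_cast<int>(mismatches);
}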
Feb, 12
BIG Data Business Intelligence Peer Group Meeting, 2017
A single CPU reached its limit of computational throughput over a decade ago, and in response the technology industry was forced to shift to parallel processing. Today's processors are increasingly parallel, with growing core counts, wider SIMD lanes, and more hardware threads. Systems are also heterogeneous, so that a single workstation, server, or smartphone […]
Feb, 10
Development of JavaScript-based deep learning platform and application to distributed training
Deep learning is increasingly attracting attention for processing big data. Existing frameworks for deep learning must be set up on specialized computer systems. Gaining sufficient computing resources therefore entails high costs of deployment and maintenance. In this work, we implement a matrix library and deep learning framework that uses JavaScript. It can run on web […]
Feb, 10
GPU-Accelerated SVM Training Algorithm Based on PC and Mobile Device
This work designs an accelerated SVM (Support Vector Machine) suitable for the Android operating system. SVMs are widely used in health-related applications. The SVM provides a promising classification technique based on pattern recognition methods and statistical learning theory. This paper proposes a parallel SVM algorithm based on a GPU accelerator. GPU […]
Feb, 10
gearshifft – The FFT Benchmark Suite for Heterogeneous Platforms
Fast Fourier Transforms (FFTs) are exploited in a wide variety of fields, ranging from computer science to the natural sciences and engineering. With the rising data production bandwidths of modern FFT applications, judging which algorithmic tool is best to apply can be vital to any scientific endeavor. As tailored FFT implementations exist for an ever-increasing variety […]
Feb, 10
Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs
Low-latency detections of gravitational waves (GWs) are crucial for enabling prompt follow-up observations of astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized on a Graphics Processing Unit (GPU). In this paper, we exploit the new Maxwell memory access architecture in […]
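As background on the filtering technique named in the excerpt: SPIIR approximates a matched filter by summing the outputs of a bank of first-order IIR filters applied to delayed copies of the input. The sketch below is a generic, single-threaded illustration of that structure (the coefficients, delays, and complex-valued formulation are assumptions for illustration, not the pipeline's GPU code):

// Summed parallel IIR filtering: each channel k applies a first-order IIR
//   y_k[n] = a1_k * y_k[n-1] + b0_k * x[n - d_k]
// and the output is the sum over channels at each sample.
#include <complex>
#include <cstddef>
#include <vector>

struct SpiirChannel {
    std::complex<double> a1;   // feedback coefficient (|a1| < 1 for stability)
    std::complex<double> b0;   // input gain
    std::size_t delay;         // per-channel input delay d_k, in samples
};

std::vector<std::complex<double>>
spiirFilter(const std::vector<double>& x, const std::vector<SpiirChannel>& bank) {
    std::vector<std::complex<double>> out(x.size(), {0.0, 0.0});
    for (const auto& ch : bank) {
        std::complex<double> y = 0.0;                      // y_k[n-1], initially zero
        for (std::size_t n = 0; n < x.size(); ++n) {
            double xin = (n >= ch.delay) ? x[n - ch.delay] : 0.0;
            y = ch.a1 * y + ch.b0 * xin;
            out[n] += y;                                   // sum over the parallel bank
        }
    }
    return out;
}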
Feb, 10
Backpropagation Training for Fisher Vectors within Neural Networks
Fisher-Vectors (FV) encode higher-order statistics of a set of local descriptors, such as SIFT features. They already show good performance in combination with shallow learning architectures on visual recognition tasks. Current methods using FV as a feature descriptor in deep architectures assume that all original input features are static. We propose a framework to jointly […]
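As a reminder of what Fisher Vectors encode (standard background, not a result of the paper): given a GMM with weights $w_k$, means $\mu_k$, and diagonal deviations $\sigma_k$ fitted to local descriptors $x_1, \dots, x_N$, the FV stacks normalized gradients of the log-likelihood with respect to the GMM parameters; for example, the component with respect to the means is

$\mathcal{G}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\, \frac{x_n - \mu_k}{\sigma_k}$,

where $\gamma_n(k)$ is the posterior (soft assignment) of descriptor $x_n$ to component $k$. Treating these statistics as differentiable functions of the inputs, rather than as static features, is what allows gradients to be backpropagated through an FV layer.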