Posts

Jun, 1

A performance spectrum for parallel computational frameworks that solve PDEs

Important computational physics problems are often large-scale in nature, and it is highly desirable to have robust and high-performing computational frameworks that can quickly address these problems. However, it is no trivial task to determine whether a computational framework is performing efficiently or is scalable. The aim of this paper is to present various […]
Jun, 1

SparkJNI: A Reference Design for a Heterogeneous Apache Spark Framework

The digital era’s requirements pose many challenges related to deployment, implementation and efficient resource utilization in modern hybrid computing infrastructures. In light of the recent improvements in computing units, the de facto structure of a high-performance computing cluster, ordinarily consisting of CPUs only, is superseded by heterogeneous architectures (comprising GPUs, FPGAs and DSPs) which offer […]
Jun, 1

Fast MPEG-CDVS Encoder with GPU-CPU Hybrid Computing

The Compact Descriptors for Visual Search (CDVS) standard from the ISO/IEC Moving Picture Experts Group (MPEG) has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of the CDVS encoder unfortunately hinders its wide deployment in industry for large-scale visual search. In […]
Jun, 1

Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, image processing, etc. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating […]
Jun, 1

A Unified Optimization Approach for Sparse Tensor Operations on GPUs

Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make […]
May, 24

Accelerating Discrete Wavelet Transforms on GPUs

The two-dimensional discrete wavelet transform has a huge number of applications in image-processing techniques. Until now, several papers have compared the performance of this transform on graphics processing units (GPUs). However, all of them only dealt with lifting and convolution computation schemes. In this paper, we show that corresponding horizontal and vertical lifting parts of the […]
May, 24

Implementing Efficient, Portable Computations for Machine Learning

Computers are powerful tools which perform fast, accurate calculations over huge sets of data. However, many layers of abstraction are required to use computers for any given task. Recent advances in machine learning employ compute-intensive operations embedded in complex overall flows. Further, deployment of these systems must balance many concerns: accuracy, speed, energy, portability, and […]
May, 24

Parallel and in-process compilation of individuals for genetic programming on GPU

Three approaches to implementing genetic programming on GPU hardware are compilation, interpretation and direct generation of machine code. The compiled approach is known to have a prohibitive overhead compared to the other two. This paper investigates methods to accelerate compilation of individuals for genetic programming on GPU hardware. We apply in-process compilation to minimize the compilation […]
May, 24

Intel Xeon Phi acceleration of Hybrid Total FETI solver

This paper describes an approach for accelerating the Hybrid Total FETI (HTFETI) domain decomposition method using Intel Xeon Phi coprocessors. The HTFETI method is a memory-bound algorithm which uses sparse linear BLAS operations with an irregular memory access pattern. The presented local Schur complement (LSC) method has a regular memory access pattern, which allows […]
May, 24

Espresso: Efficient Forward Propagation for BCNNs

There are many application scenarios for which the computational performance and memory footprint of the prediction phase of Deep Neural Networks (DNNs) need to be optimized. Binary Deep Neural Networks (BDNNs) have been shown to be an effective way of achieving this objective. In this paper, we show how Convolutional Neural Networks (CNNs) can be implemented […]
May, 22

GPU System Call

GPUs are becoming first-class compute citizens and are being tasked to perform increasingly complex work. Modern GPUs increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence, enabling them to run a wider variety of programs. But a key aspect of general-purpose programming where GPUs are still found lacking is the ability […]
May, 22

GPUMap: A Transparently GPU-Accelerated Map Function

As GPGPU computing becomes more popular, it will be used to tackle a wider range of problems. However, due to the current state of GPGPU programming, programmers are typically required to be familiar with the architecture of the GPU in order to effectively program it. Fortunately, there are software packages that attempt to simplify GPGPU […]

HGPU group © 2010-2018 hgpu.org

All rights belong to the respective authors

Contact us: