
Posts

Apr 3

Merge or Separate? Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms

Computer systems are increasingly heterogeneous, with nodes consisting of CPUs and GPU accelerators. As such systems become mainstream, they are moving away from specialized high-performance single-application platforms towards a more general setting with multiple concurrent application jobs. Determining how jobs should best be scheduled dynamically to heterogeneous devices is non-trivial. In certain cases, performance is […]
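The abstract is truncated here, but the "merge" option can be illustrated with a sketch. The paper targets OpenCL; the kernels, sizes, and stream setup below are hypothetical CUDA stand-ins showing two independent jobs co-running in separate streams rather than being serialized:

```cuda
#include <cuda_runtime.h>

// Two independent, hypothetical jobs: each is a trivial element-wise kernel.
__global__ void jobA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

__global__ void jobB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // "Merged" scheduling: issue both jobs into separate streams so the
    // GPU may overlap them. "Separate" scheduling would serialize them
    // in one queue, or dispatch one of the jobs to the CPU instead.
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);
    jobA<<<(n + 255) / 256, 256, 0, sA>>>(x, n);
    jobB<<<(n + 255) / 256, 256, 0, sB>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```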
Apr 3

Evaluation of Libraries for Parallel Computing in Haskell – A Case Study with a Super-resolution Application

Haskell is a functional language featuring lazy evaluation and referential transparency. On one hand, referential transparency is useful for parallel computing because results do not depend on the evaluation order; on the other hand, parallel computing requires an evaluation order different from that of lazy evaluation. There are some parallel programming […]
Apr 3

Sparse Matrix-Vector Multiplication on GPGPUs

The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of sparse matrix-vector multiplication is therefore crucial and has been the subject of an […]
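As context for the kernels such work builds on, the textbook CSR baseline (scalar SpMV, one thread per row) looks like this; it is a generic sketch, not the paper's implementation:

```cuda
// Scalar CSR SpMV: y = A*x, one thread per row. This is the textbook
// baseline, not one of the optimized variants the literature surveys.
__global__ void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                         const float *vals, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        // Accumulate the nonzeros of this row against the dense vector.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;
    }
}
```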
Apr 3

Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Heterogeneous system architectures are evolving towards tighter integration among devices, with emerging features such as shared virtual memory, memory coherence, and system-wide atomics. Languages, device architectures, system specifications, and applications are rapidly adapting to the challenges and opportunities of tightly integrated heterogeneous platforms. Programming languages such as OpenCL 2.0, CUDA 8.0, and C++ AMP allow […]
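As a minimal sketch of what these integration features enable (not Chai's own code, and with hypothetical names), the following CUDA fragment shares a counter between CPU and GPU through managed memory and a system-scope atomic:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A counter placed in managed (shared virtual) memory, visible to both
// the CPU and the GPU. atomicAdd_system gives system-wide atomicity on
// Pascal-or-newer GPUs (compile with -arch=sm_60 or later).
__global__ void gpu_work(int *counter, int iters) {
    for (int i = 0; i < iters; ++i)
        atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));  // one pointer, both devices
    *counter = 0;

    gpu_work<<<1, 32>>>(counter, 100);
    cudaDeviceSynchronize();

    // On hardware with full system-wide coherence the CPU could update
    // the same location concurrently with the kernel; we update it after
    // synchronization here to stay portable.
    *counter += 1;

    printf("counter = %d\n", *counter);  // 32 * 100 + 1 = 3201
    cudaFree(counter);
    return 0;
}
```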
Apr 3

FPGA-based Tsunami Simulation: Performance Comparison with GPUs, and Roofline Model for Scalability Analysis

MOST (Method Of Splitting Tsunami) is widely used to solve the shallow water equations (SWEs) for tsunami simulation. This paper presents high-performance and power-efficient computation of MOST for practical tsunami simulation on FPGAs. The custom simulation hardware is based on a deeply pipelined stream computing architecture that increases performance under limited memory bandwidth. […]
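The roofline model invoked in the title bounds attainable performance by the lesser of the machine's peak compute rate and the data traffic the memory system can sustain:

```latex
P_{\text{attainable}} = \min\left( P_{\text{peak}},\; B \times I \right)
```

where P_peak is the peak floating-point rate (FLOP/s), B is the memory bandwidth (bytes/s), and I is the kernel's arithmetic intensity (FLOPs per byte moved). A kernel with I below P_peak/B is bandwidth-bound, which is why the stream architecture above emphasizes deep pipelining under limited bandwidth.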
Mar 28

Pipelined MapReduce: A Decoupled MapReduce RunTime for Shared Memory Multi-Processors

Modern multi-processors embody up to hundreds of cores in a single chip in an attempt to attain TFlops/sec performance. Many sophisticated programming frameworks have emerged to facilitate the development of parallel, efficient, and scalable applications. The MapReduce programming model, after having indisputably demonstrated its usability and effectiveness in the area of Large-Scale Distributed […]
Mar 28

LookNN: Neural Network with No Multiplication

Neural networks are machine learning models that have been successfully used in many applications. Due to the high computational complexity of neural networks, deploying such models on embedded devices with severe power/resource constraints is troublesome. Neural networks are inherently approximate and can be simplified. We propose LookNN, a methodology to replace floating-point multiplications with lookup […]
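The core idea of replacing multiplications with lookups can be sketched as follows; the quantization to 64 levels and the table layout are assumptions for illustration, not LookNN's actual scheme:

```cuda
// Hypothetical sketch of the lookup idea: quantize both operands to a
// small set of levels and precompute all pairwise products in a table,
// so each "multiply" becomes a single memory load.
#define LEVELS 64  // 64 x 64 floats = 16 KB, fits in constant memory

__constant__ float lut[LEVELS * LEVELS];  // host fills with q[i] * q[j]

__device__ float lookup_mul(unsigned char a, unsigned char b) {
    // a and b are quantization indices in [0, LEVELS)
    return lut[a * LEVELS + b];  // one load instead of one FP multiply
}
```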
Mar 28

Energy conservation techniques for GPU computing

Emerging general-purpose graphics processing unit (GPGPU) computing has tremendously sped up a great variety of commercial and scientific applications. GPUs have become prevalent accelerators in current high-performance clusters. Though the computational capacity per watt of GPUs is much higher than that of CPUs, hybrid GPU clusters still consume […]
Mar 28

Parallelized Vlasov-Fokker-Planck solver for desktop personal computers

The numerical solution of the Vlasov-Fokker-Planck equation is a well-established method to simulate the dynamics of an electron bunch in a storage ring, including its self-interaction with its own wake field. In this paper we present Inovesa, a modularly extensible program that uses OpenCL to massively parallelize the computation. It allows a standard desktop […]
Mar 28

APUNet: Revitalizing GPU as Packet Processing Accelerator

Many recent research works have experimented with GPUs to accelerate packet processing in network applications. Most have shown that GPUs bring a significant performance boost over CPU-only approaches, thanks to their highly parallel computation capacity and large memory bandwidth. However, a recent work argues that for many applications, the key […]
Mar 20

Comparing Programmer Productivity in OpenACC and CUDA: an Empirical Investigation

OpenACC has been touted as a "high productivity" API designed to make GPGPU programming accessible to scientific programmers, but to date, no studies have attempted to verify this quantitatively. In this paper, we conduct an empirical investigation comparing programmer productivity between OpenACC and CUDA in terms of programming time, execution time, and the analysis […]
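To make the productivity contrast concrete, here is a minimal saxpy in CUDA, where the programmer writes the kernel, launch configuration, and memory transfers by hand; in OpenACC the same computation is an ordinary C loop preceded by a single `#pragma acc parallel loop`. The helper name run_saxpy is illustrative:

```cuda
#include <cuda_runtime.h>

// saxpy in CUDA: explicit kernel, launch geometry, and device memory
// management. In OpenACC the equivalent is the plain C loop
//   for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];
// annotated with a single directive.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run_saxpy(int n, float a, const float *hx, float *hy) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```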
Mar 20

A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers

Current deep learning approaches have been very successful using convolutional neural networks (CNNs) trained on large graphics processing unit (GPU)-based computers. Three limitations of this approach are: 1) they are based on a simple layered network topology, i.e., highly connected layers, without intra-layer connections; 2) the networks are manually configured to achieve optimal results, and […]
