
Posts

Dec, 18

The 5th International Workshop on OpenCL (IWOCL), 2017

The International Workshop on OpenCL (IWOCL) is an annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practice and to promote the evolution and advancement of the OpenCL standard. The meeting is open to anyone who is interested in contributing to and participating in the OpenCL community. IWOCL is the premier […]
Dec, 18

International Conference on Biomacromolecules and Biomimetic Materials (ICBBM), 2017

The 2017 International Conference on Biomacromolecules and Biomimetic Materials (ICBBM 2017) will be held in Boracay, Philippines, during March 6-12, 2017. The objective of ICBBM 2017 is to present the latest research and results of scientists working on Biomacromolecules and Biomimetic Materials topics. This conference provides opportunities for delegates from different areas to exchange new ideas and […]
Dec, 17

GPU-Based Nonlocal Filtering for Large Scale SAR Processing

In the past few years nonlocal filters have emerged as a serious contender for denoising synthetic aperture radar (SAR) images, offering superior noise reduction and detail preservation compared to many other filters. In this manuscript we analyze how nonlocal filters, whose computational costs were so far prohibitive for large scale processing, can be implemented efficiently […]
Dec, 17

Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher […]
Dec, 17

Toward Automatic Translation: From OpenACC to OpenMP 4

For the past few years, OpenACC has been the primary directive-based API for programming accelerator devices like GPUs. OpenMP 4.0 is now a competitor in this space, with support from different vendors. In our work, we analyse the feasibility of automatic conversion from OpenACC to OpenMP 4. We describe an algorithm to convert OpenACC device […]
Dec, 17

Speedup for quantum optimal control from GPU-based automatic differentiation

We implement a quantum optimal control algorithm based on automatic differentiation and harness the acceleration afforded by graphics processing units (GPUs). Automatic differentiation allows us to specify advanced optimization criteria and incorporate them in the optimization process with ease. We demonstrate that the use of GPUs can speed up calculations by more than an order […]
Dec, 17

Parallel Level set algorithm with MPI and accelerated on GPU

The level set method is used to capture interface motion. The narrow band algorithm localizes the solution of the level-set PDE from the global domain to a tube around the interface. Because the evolving interface is not known in advance, the narrow band algorithm creates a load-balancing problem for parallel computing. This work presents a tool for evenly distributing work […]
Dec, 14

Automating the Last-Mile for High Performance Dense Linear Algebra

High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex […]
Dec, 14

Translating OpenMP Device Constructs to OpenCL using Unnecessary Data Transfer Elimination

In this paper, we propose a framework that translates OpenMP 4.0 accelerator directives to OpenCL. By translating an OpenMP program to an OpenCL program, the program can be executed on any hardware platform that supports OpenCL. We also propose a run-time optimization technique that automatically eliminates unnecessary data transfers between the host and the target […]
Dec, 14

Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural […]
Dec, 14

nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware

In this work, a novel package called nmfgpu4R is presented, which offers the computation of Non-negative Matrix Factorization (NMF) on Compute Unified Device Architecture (CUDA) platforms within the R environment. Benchmarks show a remarkable speed-up in terms of time per iteration by utilizing the parallelization capabilities of modern graphics cards. Therefore the application of NMF […]
Dec, 14

GaDei: On Scale-up Training As A Service For Deep Learning

Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. The unique challenge of TaaS is that it must satisfy a wide range of customers who have neither the experience nor the resources to tune DL hyper-parameters, and meticulous tuning for each user’s dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with values that […]

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors
