high performance computing on graphics processing units: hgpu.org

Posts

Mar, 5

Multi-kernel Data Partitioning with Channel on OpenCL-based FPGAs

FPGAs have been widely used to accelerate relational database applications, due to their high throughput and high energy efficiency. However, hardware programmers need to use hardware description languages (HDLs) to program FPGAs. Since HDL is cycle-sensitive and error-prone, deep knowledge of hardware design and hands-on experience are required to guarantee a successful design on FPGA, […]

OpenCL

Mar, 5

Improving the Neural GPU Architecture for Algorithm Learning

Algorithm learning is a core problem in artificial intelligence with significant implications for the level of automation that machines can achieve. Recently, deep learning methods have emerged for synthesizing an algorithm from its input-output examples, the most successful being the Neural GPU, capable of learning multiplication. We present several improvements to the Neural GPU that […]

Mar, 5

Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as accelerators could usually be programmed using specific programming languages threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. […]

CUDA • OpenCL

Mar, 5

Billion-scale similarity search with GPUs

Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less […]

CUDA

Feb, 28

Speckle Reduction with Trained Nonlinear Diffusion Filtering

Speckle reduction is a prerequisite for many image processing tasks in synthetic aperture radar (SAR) images, as well as all coherent images. In recent years, predominant state-of-the-art approaches for despeckling are usually based on nonlocal methods which mainly concentrate on achieving utmost image restoration quality, with relatively low computational efficiency. Therefore, in this study we […]

Feb, 28

An Efficient Multiway Mergesort for GPU Architectures

Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provide massive parallelism and computing power. However, the intricacies of their compute and […]

CUDA

Feb, 28

Deep Voice: Real-time Neural Text-to-Speech

We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an […]

CUDA

Feb, 28

Key Reconciliation with Low-Density Parity-Check Codes for Long-Distance Quantum Cryptography

The speed at which two remote parties can exchange secret keys over a fixed-length fiber-optic cable in continuous-variable quantum key distribution (CV-QKD) is currently limited by the computational complexity of post-processing algorithms for key reconciliation. Multi-edge low-density parity-check (LDPC) codes with low code rates and long block lengths were proposed for CV-QKD, in order to […]

CUDA

Feb, 28

CHAOS: A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi

Deep learning is an important component of big-data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive, and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time […]

Feb, 27

International Conference on Bioinformatics and Computational Intelligence (ICBCI), 2017

Publication: ICBCI 2017 will be published in Proceedings.

Submission Methods: Electronic Submission System (.pdf) http://www.easychair.org/conferences/?conf=icbci2017

Contacts: Ms. Ada R. L. Wei, Email: icbci@zhconf.ac.cn, Tel: +86-28-8625-6789, 10 am–12 am, 2 pm-6 pm, Monday to Friday

Feb, 27

The 2nd International Conference on Network Security (ICNS), 2017

2017 II International Conference on Network Security (ICNS 2017) will be held in Kunming, China, during December 8-10, 2017. ICNS 2017 will be a remarkable event which brings together professors, researchers and students in the field of Network Security, making the conference a perfect platform to share experience, foster collaborations across industry and academia, and […]

Feb, 26

Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network

OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to […]

OpenCL

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
