
Posts

Feb, 22

A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks

FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention because of their higher energy efficiency compared with GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy […]
Feb, 22

Efficient Large-scale Approximate Nearest Neighbor Search on the GPU

We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel highly parallelizable re-ranking method for candidate vectors […]
Feb, 22

Blocking Self-avoiding Walks Stops Cyber-epidemics: A Scalable GPU-based Approach

Cyber-epidemics, the wide spread of fake news or propaganda through social media, can cause devastating economic and political consequences. A common countermeasure against cyber-epidemics is to disable a small subset of suspected social connections or accounts to effectively contain the epidemics. An example is the recent shutdown of 125,000 ISIS-related Twitter accounts. Despite many proposed methods […]
Feb, 19

A Survey of Soft-Error Mitigation Techniques for Non-Volatile Memories

Non-volatile memories (NVMs) offer superior density and energy characteristics compared to conventional memories; however, NVMs suffer from severe reliability issues that can easily eclipse their energy efficiency advantages. In this paper, we survey architectural techniques for improving the soft-error reliability of NVMs, specifically PCM (phase change memory) and STT-RAM (spin transfer torque RAM). We […]
Feb, 18

Profiling High Level Heterogeneous Programs: Using the SPOC GPGPU framework for OCaml

Heterogeneous systems are widespread. When used well, they enable an impressive performance increase. However, they typically require developers to combine multiple programming models, languages and tools into very complex programs that are hard to design and debug. Writing correct heterogeneous programs is difficult; achieving good performance is even harder. To help developers, many high-level solutions […]
Feb, 18

LAMMPS’ PPPM Long-Range Solver for the Second Generation Xeon Phi

Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly […]
Feb, 18

An Efficient Parallel Data Clustering Algorithm Using Isoperimetric Number of Trees

We propose a parallel graph-based data clustering algorithm for CUDA GPUs, based on exact clustering of the minimum spanning tree in terms of a minimum isoperimetric criterion. We also provide a comparative performance analysis of our algorithm against related ones, which demonstrates the general superiority of this parallel algorithm over other competing algorithms in […]
Feb, 18

Trie Compression for GPU Accelerated Multi-Pattern Matching

Graphics Processing Units allow for running massively parallel applications, offloading computationally intensive work from the CPU; however, GPUs have a limited amount of memory. In this paper, a trie compression algorithm for massively parallel pattern matching is presented, demonstrating 85% lower space requirements than the original highly efficient parallel failure-less Aho-Corasick, whilst demonstrating over 22 […]
Feb, 18

MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU

In this paper, we present a MapReduce-based framework for evaluating SPARQL queries on GPU (named MapSQ) to large-scale RDF datasets efficiently by applying both high performance. Firstly, we develop a MapReduce-based Join algorithm to handle SPARQL queries in a parallel way. Secondly, we present a coprocessing strategy to manage the process of evaluating queries where […]
Feb, 14

Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose

Fully connected networks have been widely used in deep learning, and their computational efficiency benefits greatly from the matrix multiplication algorithm in cuBLAS on GPUs. However, we found that cuBLAS has some drawbacks when calculating matrix $\textbf{A}$ multiplied by the transpose of matrix $\textbf{B}$ (i.e., the NT operation). To reduce the impact of NT […]
Feb, 14

Best Practice Guide Intel Xeon Phi v2.0

This Best Practice Guide provides information about Intel’s Many Integrated Core (MIC) architecture and programming models for the first generation Intel Xeon Phi coprocessor named Knights Corner (KNC) in order to enable programmers to achieve good performance out of their applications. The guide covers a wide range of topics from the description of the hardware […]
Feb, 14

Improved Lossless Image Compression Model Using Coefficient Based Discrete Wavelet Transform

Compression is used in storage-related applications that compress audio/video, executable programs, text, source code, and so on. When compressing images into the smallest space possible, the constraint lies in the multispectral form of data with continuous images. In such a scenario, efficient lossless image compression is required such that the compression ratio […]

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors
