high performance computing on graphics processing units: hgpu.org

Posts

May, 30

Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs

For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA […]

OpenCL

May, 30

A GPU Accelerated Continuous and Discontinuous Galerkin Non-hydrostatic Atmospheric Model

We present a GPU accelerated nodal discontinuous Galerkin method for the solution of the three dimensional Euler equations, which are nonlinear hyperbolic equations that govern the motion and thermodynamic state of the atmosphere. The part of the solution process that solves the governing equations of motion with no moist processes is called the dynamical core. […]

CUDA

•

OpenCL

May, 30

Deep API Learning

Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and […]

CUDA

May, 28

Data Remanence and Digital Forensic Investigation for CUDA Graphics Processing Units

This paper investigates the practicality of memory attacks on commercial Graphics Processing Units (GPUs). With recent advances in the performance and viability of using GPUs for various highly-parallelised data processing tasks, a number of security challenges are raised. Unscrupulous software running subsequently on the same GPU, either by the same user, or another user, in […]

May, 28

Efficient High-Speed WPA2 Brute Force Attacks using Scalable Low-Cost FPGA Clustering

WPA2-Personal is widely used to protect Wi-Fi networks against illicit access. While attackers typically use GPUs to speed up the discovery of weak network passwords, attacking random passwords is considered to quickly become infeasible with increasing password length. Professional attackers may thus turn to commercial high-end FPGA-based cluster solutions to significantly increase the speed of […]

CUDA

May, 28

An OpenMP Programming Environment on Mobile Devices

Recently, the computational speed and battery capability of mobile devices were greatly prompted. With an enormous number of APPs, users can do many things in mobile devices as well as in computers. Consequently, more and more scientific researchers are encouraged to move their working environment from computers to mobile devices for increasing their work efficiency […]

OpenCL

May, 28

EPEM: A General and Validated Energy Complexity Model for Multithreaded Algorithms

Like time complexity models that have significantly contributed to the analysis and development of fast algorithms, energy complexity models for parallel algorithms are desired as crucial means to develop energy efficient algorithms for ubiquitous multicore platforms. Ideal energy complexity models should be validated on real multicore platforms and applicable to a wide range of parallel […]

May, 28

Multi-threaded Geant4 on the Xeon-Phi with Complex High-Energy Physics Geometry

To study the performance of multi-threaded Geant4 for high-energy physics experiments, an application has been developed which generalizes and extends previous work. A highly-complex detector geometry is used for benchmarking on an Intel Xeon Phi coprocessor. In addition, an implementation of parallel I/O based on Intel SCIF and ROOT technologies is incorporated and studied.

May, 28

Theano-MPI: a Theano-based Distributed Training Framework

We develop a scalable and extendable training framework that can utilize GPUs across nodes in a cluster and accelerate the training of deep learning models based on data parallelism. Both synchronous and asynchronous training are implemented in our framework, where parameter exchange among GPUs is based on CUDA-aware MPI. In this report, we analyze the […]

CUDA

May, 26

Faster GPU-based convolutional gridding via thread coarsening

Convolutional gridding is a processor-intensive step in interferometric imaging. While it is possible to use graphics processing units (GPUs) to accelerate this operation, existing methods use only a fraction of the available flops. We apply thread coarsening to improve the efficiency of an existing algorithm, and observe performance gains of up to 3.2x for single-polarization […]

OpenCL

May, 26

Learning a Metric Embedding for Face Recognition using the Multibatch Method

This work is motivated by the engineering task of achieving a near state-of-the-art face recognition on a minimal computing budget running on an embedded system. Our main technical contribution centers around a novel training method, called Multibatch, for similarity learning, i.e., for the task of generating an invariant "face signature" through training pairs of "same" […]

May, 26

Implementing Deep Neural Networks for Financial Market Prediction on the Intel Xeon Phi

Deep neural networks (DNNs) are powerful types of artificial neural networks (ANNs) that use several hidden layers. They have recently gained considerable attention in the speech transcription and image recognition community (Krizhevsky et al., 2012) for their superior predictive properties including robustness to overfitting. However their application to financial market prediction has not been previously […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs

A GPU Accelerated Continuous and Discontinuous Galerkin Non-hydrostatic Atmospheric Model

Deep API Learning

Data Remanence and Digital Forensic Investigation for CUDA Graphics Processing Units

Efficient High-Speed WPA2 Brute Force Attacks using Scalable Low-Cost FPGA Clustering

An OpenMP Programming Environment on Mobile Devices

EPEM: A General and Validated Energy Complexity Model for Multithreaded Algorithms

Multi-threaded Geant4 on the Xeon-Phi with Complex High-Energy Physics Geometry

Theano-MPI: a Theano-based Distributed Training Framework

Faster GPU-based convolutional gridding via thread coarsening

Learning a Metric Embedding for Face Recognition using the Multibatch Method

Implementing Deep Neural Networks for Financial Market Prediction on the Intel Xeon Phi

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)