Posts
Jun, 2
Weighted Residuals for Very Deep Networks
Deep residual networks have recently shown appealing performance on many challenging computer vision tasks. However, the original residual structure still has some defects making it difficult to converge on very deep networks. In this paper, we introduce a weighted residual network to address the incompatibility between ReLU and element-wise addition and the deep network initialization […]
May, 31
Computer Vision on the GPU — Tools, Algorithms and Frameworks
In recent years, graphics processing units (GPUs) have emerged as an attractive alternative to CPUs for implementing algorithms in a wide range of applications. The focus of this work is to give an overview of the current state of GPU use in computer vision. We briefly describe tools like CUDA, OpenCL and OpenACC used for […]
May, 30
clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library
Sparse linear algebra is a cornerstone of modern computational science. These algorithms ignore the zero-valued entries found in many domains in order to work on much larger problems at much faster rates than dense algorithms. Nonetheless, optimizing these algorithms is not straightforward. Highly optimized algorithms for multiplying a sparse matrix by a dense vector, for […]
May, 30
TensorFlow: A system for large-scale machine learning
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore […]
May, 30
Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs
For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA […]
May, 30
A GPU Accelerated Continuous and Discontinuous Galerkin Non-hydrostatic Atmospheric Model
We present a GPU accelerated nodal discontinuous Galerkin method for the solution of the three dimensional Euler equations, which are nonlinear hyperbolic equations that govern the motion and thermodynamic state of the atmosphere. The part of the solution process that solves the governing equations of motion with no moist processes is called the dynamical core. […]
May, 30
Deep API Learning
Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and […]
May, 28
Data Remanence and Digital Forensic Investigation for CUDA Graphics Processing Units
This paper investigates the practicality of memory attacks on commercial Graphics Processing Units (GPUs). With recent advances in the performance and viability of using GPUs for various highly-parallelised data processing tasks, a number of security challenges are raised. Unscrupulous software running subsequently on the same GPU, either by the same user, or another user, in […]
May, 28
Efficient High-Speed WPA2 Brute Force Attacks using Scalable Low-Cost FPGA Clustering
WPA2-Personal is widely used to protect Wi-Fi networks against illicit access. While attackers typically use GPUs to speed up the discovery of weak network passwords, attacking random passwords is considered to quickly become infeasible with increasing password length. Professional attackers may thus turn to commercial high-end FPGA-based cluster solutions to significantly increase the speed of […]
May, 28
An OpenMP Programming Environment on Mobile Devices
Recently, the computational speed and battery capability of mobile devices have greatly improved. With an enormous number of apps, users can do many things on mobile devices as well as on computers. Consequently, more and more scientific researchers are encouraged to move their working environment from computers to mobile devices to increase their work efficiency […]
May, 28
Multi-threaded Geant4 on the Xeon-Phi with Complex High-Energy Physics Geometry
To study the performance of multi-threaded Geant4 for high-energy physics experiments, an application has been developed which generalizes and extends previous work. A highly-complex detector geometry is used for benchmarking on an Intel Xeon Phi coprocessor. In addition, an implementation of parallel I/O based on Intel SCIF and ROOT technologies is incorporated and studied.
May, 28
Theano-MPI: a Theano-based Distributed Training Framework
We develop a scalable and extendable training framework that can utilize GPUs across nodes in a cluster and accelerate the training of deep learning models based on data parallelism. Both synchronous and asynchronous training are implemented in our framework, where parameter exchange among GPUs is based on CUDA-aware MPI. In this report, we analyze the […]
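As a rough illustration of the synchronous data-parallel idea mentioned in the excerpt above, here is a minimal pure-Python sketch. It is not Theano-MPI's code or API: the function names (`local_gradient`, `allreduce_mean`, `sync_step`), the toy model `y = w * x`, and the choice to average gradients rather than weights are all assumptions made for illustration, and the "allreduce" is simulated in-process instead of going through MPI or GPUs.

```python
# Hypothetical simulation of synchronous data-parallel SGD; no MPI or GPU involved.

def local_gradient(w, shard):
    """Gradient of mean squared error for the toy model y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an MPI allreduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_step(weights, shards, lr):
    """One synchronous step: each worker computes a gradient, then all apply the average."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    avg = allreduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Two simulated workers, each holding a shard of data generated by y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0, 0.0]  # model replicas start identical
for _ in range(200):
    weights = sync_step(weights, shards, lr=0.01)
```

Because every replica starts from the same value and applies the same averaged gradient, the replicas stay in lockstep; an asynchronous variant would drop that guarantee in exchange for not waiting on slow workers.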