Posts
Feb 17
The performances of R GPU implementations of the GMRES method
Although the performance of commodity computers has improved drastically with the introduction of multicore processors and GPU computing, the standard R distribution is still based on a single-threaded model of computation, using only a small fraction of the computational power now available in most desktops and laptops. Modern statistical software packages rely on high performance implementations […]
Feb 17
Input-Aware Auto-Tuning of Compute-Bound HPC Kernels
Efficient implementations of HPC applications for parallel architectures generally rely on external software packages (e.g., BLAS, LAPACK, cuDNN). While these libraries provide highly optimized routines for inputs with certain characteristics (e.g., square matrices), they generally do not retain optimal performance across the wide range of problems encountered in practice. In this paper, we present an […]
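As a loose illustration of what input-aware tuning means in practice (this is not code from the paper; the kernel, thresholds, and block shapes below are invented for the example), the CUDA sketch picks a launch configuration from the input's aspect ratio instead of shipping a single tuning:

```
#include <cstdio>
#include <cuda_runtime.h>

// Toy element-wise kernel standing in for a tuned compute-bound routine.
__global__ void scale(float* a, int rows, int cols, float s) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) a[r * cols + c] *= s;
}

// Input-aware selection: choose a block shape from the problem's aspect
// ratio rather than one configuration tuned only for square inputs.
// The thresholds here are invented for illustration.
dim3 pickBlockDim(int rows, int cols) {
    if (rows >= 8 * cols) return dim3(8, 32);   // tall-skinny: more rows per block
    if (cols >= 8 * rows) return dim3(32, 8);   // wide-flat: more columns per block
    return dim3(16, 16);                        // near-square default
}

int main() {
    int rows = 1 << 16, cols = 128;             // a tall-skinny input
    float* a;
    cudaMalloc(&a, (size_t)rows * cols * sizeof(float));
    dim3 block = pickBlockDim(rows, cols);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    scale<<<grid, block>>>(a, rows, cols, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```

A real auto-tuner would search over many such variants and remember the winner per input class; the point here is only that the choice is made per input, not per library build.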
Feb 17
Transforming and Optimizing Irregular Applications for Parallel Architectures
Parallel architectures, including multi-core processors, many-core processors, and multi-node systems, have become commonplace, as it is no longer feasible to improve single-core performance by increasing the operating clock frequency. Furthermore, to keep up with the exponentially growing demand for computational power, the number of cores/nodes in parallel architectures has continued to dramatically […]
Feb 15
A Survey of Techniques for Improving Security of Non-volatile Memories
Due to their high density and near-zero leakage power consumption, non-volatile memories (NVMs) are promising candidates for designing future memory systems. However, compared to conventional memories, NVMs also face more severe security threats; e.g., the limited write endurance of NVMs makes them vulnerable to write attacks. Also, the non-volatility of NVMs allows the data to persist even […]
Feb 15
TVM: End-to-End Optimization Stack for Deep Learning
Scalable frameworks such as TensorFlow, MXNet, Caffe, and PyTorch drive the current popularity and utility of deep learning. However, these frameworks are optimized for a narrow range of server-class GPUs, and deploying workloads to other platforms such as mobile phones, embedded devices, and specialized accelerators (e.g., FPGAs, ASICs) requires laborious manual effort. We propose TVM, […]
Feb 15
Accelerating Interpreted Programming Languages on GPUs with Just-In-Time Compilation and Runtime Optimisations
Nowadays, most computer systems are equipped with powerful parallel devices such as Graphics Processing Units (GPUs). They are present in almost every computer system, including mobile devices, tablets, desktop computers and servers. These parallel systems have unlocked the possibility for many scientists and companies to process significant amounts of data in less time. But the […]
Feb 15
Improving Locality of Unstructured Mesh Algorithms on GPUs
To most efficiently utilize modern parallel architectures, the memory access patterns of algorithms must make heavy use of the cache architecture: successively accessed data must be close in memory (spatial locality) and one piece of data must be reused as many times as possible (temporal locality). In this work we analyse the performance of unstructured […]
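To make the spatial-locality point concrete on GPUs (a minimal sketch, not taken from the paper), the two CUDA kernels below copy the same matrix; in the first, consecutive threads in a warp touch consecutive addresses (coalesced, cache-friendly), while in the second they stride by a full row:

```
#include <cuda_runtime.h>

#define N 4096  // square matrix dimension for the example

// Coalesced: consecutive threads read consecutive addresses, so each
// warp's accesses fall in a handful of cache lines (spatial locality).
__global__ void copyCoalesced(const float* in, float* out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < N && row < N) out[row * N + col] = in[row * N + col];
}

// Strided: consecutive threads read addresses N floats apart, so each
// access touches a different cache line and effective bandwidth collapses.
__global__ void copyStrided(const float* in, float* out) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y;
    if (col < N && row < N) out[row * N + col] = in[row * N + col];
}

int main() {
    float *in, *out;
    cudaMalloc(&in,  (size_t)N * N * sizeof(float));
    cudaMalloc(&out, (size_t)N * N * sizeof(float));
    dim3 block(256), grid((N + 255) / 256, N);
    copyCoalesced<<<grid, block>>>(in, out);  // fast: unit stride per warp
    copyStrided<<<grid, block>>>(in, out);    // slow: N-element stride per warp
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Unstructured meshes are harder than this dense example precisely because the access pattern is given by mesh connectivity rather than a stride, which is why reordering for locality matters there.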
Feb 15
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high performance scientific computing on graphics processing units (GPUs) has gained widespread acceptance. These devices are designed to offer massively parallel threads for running general-purpose code. Much research has focused on the finite element method on GPUs; however, most of that work is specific to certain problems and applications. Some […]
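For readers unfamiliar with runtime compilation on NVIDIA GPUs, the sketch below shows the general NVRTC pattern (compile a kernel held as a string to PTX, then load it with the driver API). It is a generic example, not the paper's assembly code; a real FEM code would generate the source string at runtime, specialized to the element type, and error checking is omitted for brevity:

```
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// Toy kernel source held as a string; runtime compilation lets the host
// program build this text on the fly before compiling it.
const char* kSrc = R"(
extern "C" __global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
})";

int main() {
    // Phase 1: compile CUDA C++ source to PTX at runtime with NVRTC.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSrc, "axpy.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Phase 2: load the PTX with the driver API and fetch the kernel handle,
    // ready to launch with cuLaunchKernel.
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadData(&mod, ptx.data());
    CUfunction fn; cuModuleGetFunction(&fn, mod, "axpy");
    printf("runtime-compiled kernel loaded: %p\n", (void*)fn);
    cuCtxDestroy(ctx);
    return 0;
}
```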
Feb 10
Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review
While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile-time and run-time. Determining the optimal set of parameters in a […]
Feb 10
Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management
The application resource specification, a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block, forms a critical component of existing GPU programming models. This specification determines the performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely […]
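As a concrete reminder of what that static specification looks like in CUDA (a minimal sketch, not from the paper), the kernel below fixes both the thread count and the scratchpad (shared memory) claim per block at compile time:

```
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 128  // threads per block, fixed at compile time

// The scratchpad usage per block is also baked in statically: this kernel
// always claims TILE floats of shared memory, whatever the input looks like.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[TILE];                 // static scratchpad claim
    int i = blockIdx.x * TILE + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction within the block.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1 << 20, blocks = (n + TILE - 1) / TILE;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    // The launch repeats the static specification (TILE threads per block);
    // the hardware allocates registers and shared memory per block from
    // these numbers, which is exactly the coupling Zorua argues against.
    blockSum<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```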
Feb 10
Running Financial Risk Management Applications on FPGA in the Amazon Cloud
Nowadays, risk analysis and management is a core part of daily operations in the financial industry, and is strictly enforced by regulatory agencies. At the same time, large financial corporations have started migrating their operations to cloud services. Since the latter use a pay-per-use business model, there is a real need for implementations with high […]
Feb 10
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated […]
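To ground the two-phase terminology (a generic illustration using standard NVIDIA tooling, not the paper's experimental setup), the comments below trace a trivial kernel through both compilation phases:

```
// Two-phase pipeline for a CUDA kernel; the commands in the comments are
// standard nvcc/ptxas/cuobjdump usage, not taken from the paper:
//
//   nvcc -ptx saxpy.cu -o saxpy.ptx            # phase 1: CUDA C++ -> PTX (IL)
//   ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin # phase 2: PTX -> machine ISA
//   cuobjdump --dump-sass saxpy.cubin          # inspect the finalized SASS
//
// A simulator that executes the PTX from phase 1 never sees the SASS the
// hardware actually runs, which is the gap the paper measures.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```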