Posts
Feb 15
A Survey of Techniques for Improving Security of Non-volatile Memories
Due to their high density and near-zero leakage power consumption, non-volatile memories (NVMs) are promising candidates for designing future memory systems. However, compared to conventional memories, NVMs also face more severe security threats; for example, the limited write endurance of NVMs makes them vulnerable to write attacks. Also, the non-volatility of NVMs allows the data to persist even […]
Feb 15
Accelerating Interpreted Programming Languages on GPUs with Just-In-Time Compilation and Runtime Optimisations
Nowadays, most computer systems are equipped with powerful parallel devices such as Graphics Processing Units (GPUs). They are present in almost every computer system, including mobile devices, tablets, desktop computers, and servers. These parallel systems have made it possible for many scientists and companies to process significant amounts of data in less time. But the […]
Feb 15
TVM: End-to-End Optimization Stack for Deep Learning
Scalable frameworks such as TensorFlow, MXNet, Caffe, and PyTorch drive the current popularity and utility of deep learning. However, these frameworks are optimized for a narrow range of server-class GPUs, and deploying workloads to other platforms such as mobile phones, embedded devices, and specialized accelerators (e.g., FPGAs, ASICs) requires laborious manual effort. We propose TVM, […]
Feb 15
Improving Locality of Unstructured Mesh Algorithms on GPUs
To utilize modern parallel architectures most efficiently, the memory access patterns of algorithms must exploit the cache hierarchy: successively accessed data must be close in memory (spatial locality), and each piece of data must be reused as many times as possible (temporal locality). In this work, we analyse the performance of unstructured […]
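As a concrete illustration of the access pattern in question (a minimal CUDA sketch of my own, not code from the paper), consider a kernel that gathers node data through a mesh's connectivity array; locality then depends on how cells and nodes are numbered, not just on the loop order:

// One thread per cell; each cell reads its three nodes indirectly.
// If nearby cells reference nodes stored close together, the reads hit
// in cache (spatial locality); if a node is shared by many nearby
// cells, it is served from cache on reuse (temporal locality).
// Renumbering cells and nodes, e.g. along a space-filling curve,
// improves both.
__global__ void cell_sum(const int *cell_to_node,  // 3 node ids per cell
                         const float *node_data,
                         float *cell_result,
                         int num_cells) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= num_cells) return;
    float s = 0.0f;
    for (int k = 0; k < 3; ++k)
        s += node_data[cell_to_node[3 * c + k]];   // indirect gather
    cell_result[c] = s;
}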
Feb 15
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high-performance scientific computing on graphics processing units (GPUs) has gained widespread acceptance. These devices are designed to offer massively parallel threads for running general-purpose code. Much research has focused on the finite element method on GPUs. However, most of this work is specific to certain problems and applications. Some […]
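Runtime compilation on GPUs, as in the title, can be realized with NVRTC; the sketch below is one common approach (an assumption on my part, not the paper's actual code): a kernel specialized at run time, e.g. to a particular element type, is compiled from a source string and loaded through the CUDA driver API.

#include <nvrtc.h>
#include <cuda.h>
#include <string>
#include <vector>

// Assumes cuInit(0) has been called and a current context exists.
// Error checks omitted for brevity.
CUfunction compile_kernel(const std::string &src, const char *name) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "assembly.cu", 0, nullptr, nullptr);
    const char *opts[] = {"--gpu-architecture=compute_70"};
    nvrtcCompileProgram(prog, 1, opts);        // source string -> PTX
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, ptx.data());        // JIT the PTX for this GPU
    cuModuleGetFunction(&fn, mod, name);
    return fn;
}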
Feb 10
Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review
While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the fullest extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile time and run time. Determining the optimal set of parameters in a […]
Feb 10
Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management
The application resource specification, a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block, forms a critical component of the existing GPU programming models. This specification determines the performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely […]
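For readers unfamiliar with the specification in question, here is a minimal CUDA sketch (my own, not from the paper) of the two statically declared resources it names: threads per block and scratchpad (shared) memory per block.

#define TILE 128   // threads per block: fixed in source, part of the
                   // static resource specification the abstract describes

__global__ void scale(const float *in, float *out, int n, float a) {
    __shared__ float tile[TILE];  // scratchpad use per block, also static
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = a * tile[threadIdx.x];
}

// Launch: <<<grid, block>>> binds these parameters before execution;
// the hardware then allocates shared memory and scheduling slots
// accordingly, which is exactly the coupling Zorua sets out to relax.
// scale<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, 2.0f);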
Feb 10
Running Financial Risk Management Applications on FPGA in the Amazon Cloud
Nowadays, risk analysis and management is a core part of daily operations in the financial industry and is strictly enforced by regulatory agencies. At the same time, large financial corporations have started migrating their operations to cloud services. Since the latter use a pay-per-use business model, there is a real need for implementations with high […]
Feb 10
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated […]
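The CUDA toolchain makes the two phases easy to see; a small hedged example of my own (not taken from the paper):

// Phase 1 compiles the kernel below to PTX, NVIDIA's implementation-
// agnostic IL; phase 2 finalizes the PTX to machine code (SASS) only
// once the target GPU is known:
//
//   nvcc -ptx saxpy.cu -o saxpy.ptx             # source -> IL
//   ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin  # IL -> machine ISA
//
// A simulator that executes the PTX never sees what ptxas does
// (register allocation, instruction scheduling), which is the sort of
// gap the paper examines.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}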
Feb 10
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Deep learning applications can recognise images and speech with great accuracy, and their use is now pervasive in our daily lives. However, developing deep learning architectures such as deep neural networks for embedded systems is challenging because of their demanding computational resource and power requirements. Hence, sophisticated algorithms and methods that […]
Feb 9
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We […]
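To make the contrast concrete, a minimal C++ sketch of the two layouts for a hypothetical particle type (illustrative only; this is not Ikra-Cpp's actual API):

#include <cstddef>

constexpr std::size_t N = 1 << 16;

// AOS: the natural object-oriented style, but a loop touching only .x
// drags the whole struct through the cache and memory bus.
struct ParticleAOS { float x, y, z; };
ParticleAOS particles_aos[N];

// SOA: each field is a separate array, so SIMD lanes (or GPU threads)
// reading field x get contiguous, coalesced accesses.
struct ParticlesSOA {
    float x[N];
    float y[N];
    float z[N];
};
ParticlesSOA particles_soa;

// The same update in each layout:
void move_aos(float dx) {
    for (std::size_t i = 0; i < N; ++i) particles_aos[i].x += dx;
}
void move_soa(float dx) {
    for (std::size_t i = 0; i < N; ++i) particles_soa.x[i] += dx;
}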
Feb 9
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High-Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip […]
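As a plain C++ sketch of the combined scheme (my own overlapped-tiling version under fixed-boundary assumptions, not the paper's OpenCL code): spatial blocking processes the domain in tiles that fit in fast memory, and temporal blocking advances each tile several time steps per pass, at the cost of a halo that grows with the number of steps.

#include <vector>
#include <algorithm>

// 1-D 3-point stencil, advanced steps_T time steps per tile pass.
// Domain boundary points are held fixed (a Dirichlet condition), which
// matches the fact that buf's end points are never rewritten below.
void stencil_blocked(std::vector<float> &grid, int steps_T, int tile_W) {
    const int n = static_cast<int>(grid.size());
    std::vector<float> next(grid);
    for (int start = 0; start < n; start += tile_W) {
        int end = std::min(start + tile_W, n);
        // Spatial blocking: copy the tile plus a halo of steps_T points
        // per side into a small buffer (on an FPGA: on-chip BRAM).
        int lo = std::max(start - steps_T, 0);
        int hi = std::min(end + steps_T, n);
        std::vector<float> buf(grid.begin() + lo, grid.begin() + hi);
        std::vector<float> tmp(buf);
        // Temporal blocking: each step shrinks the region of valid
        // points by one per side, which is what the halo pays for.
        for (int t = 0; t < steps_T; ++t) {
            for (int i = 1; i + 1 < static_cast<int>(buf.size()); ++i)
                tmp[i] = 0.25f * buf[i - 1] + 0.5f * buf[i]
                       + 0.25f * buf[i + 1];
            std::swap(buf, tmp);
        }
        // Only the tile interior [start, end) is valid after steps_T
        // steps; write it back. Tiles read from the old grid and write
        // to next, so they stay independent of one another.
        std::copy(buf.begin() + (start - lo),
                  buf.begin() + (start - lo) + (end - start),
                  next.begin() + start);
    }
    grid.swap(next);
}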

