Posts
Feb 15
A Survey of Techniques for Improving Security of Non-volatile Memories
Due to their high density and near-zero leakage power consumption, non-volatile memories (NVMs) are promising candidates for designing future memory systems. However, compared to conventional memories, NVMs also face more severe security threats; for example, the limited write endurance of NVMs makes them vulnerable to write attacks. Also, the non-volatility of NVMs allows the data to persist even […]
Feb 15
Accelerating Interpreted Programming Languages on GPUs with Just-In-Time Compilation and Runtime Optimisations
Nowadays, most computer systems are equipped with powerful parallel devices such as Graphics Processing Units (GPUs). They are present in almost every computer system, including mobile devices, tablets, desktop computers, and servers. These parallel systems have made it possible for many scientists and companies to process significant amounts of data in less time. But the […]
Feb 15
TVM: End-to-End Optimization Stack for Deep Learning
Scalable frameworks such as TensorFlow, MXNet, Caffe, and PyTorch drive the current popularity and utility of deep learning. However, these frameworks are optimized for a narrow range of server-class GPUs, and deploying workloads to other platforms such as mobile phones, embedded devices, and specialized accelerators (e.g., FPGAs, ASICs) requires laborious manual effort. We propose TVM, […]
Feb 15
Improving Locality of Unstructured Mesh Algorithms on GPUs
To utilize modern parallel architectures most efficiently, the memory access patterns of algorithms must exploit the cache hierarchy: successively accessed data must be close in memory (spatial locality), and each piece of data must be reused as many times as possible (temporal locality). In this work, we analyse the performance of unstructured […]
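As a concrete illustration of the access pattern in question (a minimal CUDA sketch of my own, not code from the paper), consider a kernel that gathers node data through a mesh's connectivity array; locality then depends on how cells and nodes are numbered, not just on the loop order:

// One thread per cell; each cell reads its three nodes indirectly.
// If nearby cells reference nodes stored close together, the reads hit
// in cache (spatial locality); if a node is shared by many nearby
// cells, it is served from cache on reuse (temporal locality).
// Renumbering cells and nodes, e.g. along a space-filling curve,
// improves both.
__global__ void cell_sum(const int *cell_to_node,  // 3 node ids per cell
                         const float *node_data,
                         float *cell_result,
                         int num_cells) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= num_cells) return;
    float s = 0.0f;
    for (int k = 0; k < 3; ++k)
        s += node_data[cell_to_node[3 * c + k]];   // indirect gather
    cell_result[c] = s;
}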
Feb 15
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high-performance scientific computing on graphics processing units (GPUs) has gained widespread acceptance. These devices are designed to offer massively parallel threads for running general-purpose code. Much research has focused on the finite element method on GPUs. However, most of this work is specific to certain problems and applications. Some […]
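Runtime compilation on GPUs, as in the title, can be realized with NVRTC; the sketch below is one common approach (an assumption on my part, not the paper's actual code): a kernel specialized at run time, e.g. to a particular element type, is compiled from a source string and loaded through the CUDA driver API.

#include <nvrtc.h>
#include <cuda.h>
#include <string>
#include <vector>

// Assumes cuInit(0) has been called and a current context exists.
// Error checks omitted for brevity.
CUfunction compile_kernel(const std::string &src, const char *name) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "assembly.cu", 0, nullptr, nullptr);
    const char *opts[] = {"--gpu-architecture=compute_70"};
    nvrtcCompileProgram(prog, 1, opts);        // source string -> PTX
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, ptx.data());        // JIT the PTX for this GPU
    cuModuleGetFunction(&fn, mod, name);
    return fn;
}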
Feb 10
Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review
While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the fullest extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile time and run time. Determining the optimal set of parameters in a […]
Feb 10
Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management
The application resource specification, a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block, forms a critical component of the existing GPU programming models. This specification determines the performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely […]
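For readers unfamiliar with the specification in question, here is a minimal CUDA sketch (my own, not from the paper) of the two statically declared resources it names: threads per block and scratchpad (shared) memory per block.

#define TILE 128   // threads per block: fixed in source, part of the
                   // static resource specification the abstract describes

__global__ void scale(const float *in, float *out, int n, float a) {
    __shared__ float tile[TILE];  // scratchpad use per block, also static
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = a * tile[threadIdx.x];
}

// Launch: <<<grid, block>>> binds these parameters before execution;
// the hardware then allocates shared memory and scheduling slots
// accordingly, which is exactly the coupling Zorua sets out to relax.
// scale<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, 2.0f);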
Feb 10
Running Financial Risk Management Applications on FPGA in the Amazon Cloud
Nowadays, risk analysis and management is a core part of daily operations in the financial industry and is strictly enforced by regulatory agencies. At the same time, large financial corporations have started migrating their operations to cloud services. Since the latter use a pay-per-use business model, there is a real need for implementations with high […]
Feb 10
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated […]
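The CUDA toolchain makes the two phases easy to see; a small hedged example of my own (not taken from the paper):

// Phase 1 compiles the kernel below to PTX, NVIDIA's implementation-
// agnostic IL; phase 2 finalizes the PTX to machine code (SASS) only
// once the target GPU is known:
//
//   nvcc -ptx saxpy.cu -o saxpy.ptx             # source -> IL
//   ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin  # IL -> machine ISA
//
// A simulator that executes the PTX never sees what ptxas does
// (register allocation, instruction scheduling), which is the sort of
// gap the paper examines.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}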
Feb 10
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Deep learning applications can recognise images and speech with great accuracy, and their use is now pervasive in our daily lives. However, developing deep learning architectures such as deep neural networks for embedded systems is challenging because of their demanding computational resource and power requirements. Hence, sophisticated algorithms and methods that […]
Feb 9
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We […]
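To make the contrast concrete, a minimal C++ sketch of the two layouts for a hypothetical particle type (illustrative only; this is not Ikra-Cpp's actual API):

#include <cstddef>

constexpr std::size_t N = 1 << 16;

// AOS: the natural object-oriented style, but a loop touching only .x
// drags the whole struct through the cache and memory bus.
struct ParticleAOS { float x, y, z; };
ParticleAOS particles_aos[N];

// SOA: each field is a separate array, so SIMD lanes (or GPU threads)
// reading field x get contiguous, coalesced accesses.
struct ParticlesSOA {
    float x[N];
    float y[N];
    float z[N];
};
ParticlesSOA particles_soa;

// The same update in each layout:
void move_aos(float dx) {
    for (std::size_t i = 0; i < N; ++i) particles_aos[i].x += dx;
}
void move_soa(float dx) {
    for (std::size_t i = 0; i < N; ++i) particles_soa.x[i] += dx;
}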
Feb 9
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High-Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip […]
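As a plain C++ sketch of the combined scheme (my own overlapped-tiling version under fixed-boundary assumptions, not the paper's OpenCL code): spatial blocking processes the domain in tiles that fit in fast memory, and temporal blocking advances each tile several time steps per pass, at the cost of a halo that grows with the number of steps.

#include <vector>
#include <algorithm>

// 1-D 3-point stencil, advanced steps_T time steps per tile pass.
// Domain boundary points are held fixed (a Dirichlet condition), which
// matches the fact that buf's end points are never rewritten below.
void stencil_blocked(std::vector<float> &grid, int steps_T, int tile_W) {
    const int n = static_cast<int>(grid.size());
    std::vector<float> next(grid);
    for (int start = 0; start < n; start += tile_W) {
        int end = std::min(start + tile_W, n);
        // Spatial blocking: copy the tile plus a halo of steps_T points
        // per side into a small buffer (on an FPGA: on-chip BRAM).
        int lo = std::max(start - steps_T, 0);
        int hi = std::min(end + steps_T, n);
        std::vector<float> buf(grid.begin() + lo, grid.begin() + hi);
        std::vector<float> tmp(buf);
        // Temporal blocking: each step shrinks the region of valid
        // points by one per side, which is what the halo pays for.
        for (int t = 0; t < steps_T; ++t) {
            for (int i = 1; i + 1 < static_cast<int>(buf.size()); ++i)
                tmp[i] = 0.25f * buf[i - 1] + 0.5f * buf[i]
                       + 0.25f * buf[i + 1];
            std::swap(buf, tmp);
        }
        // Only the tile interior [start, end) is valid after steps_T
        // steps; write it back. Tiles read from the old grid and write
        // to next, so they stay independent of one another.
        std::copy(buf.begin() + (start - lo),
                  buf.begin() + (start - lo) + (end - start),
                  next.begin() + start);
    }
    grid.swap(next);
}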

