high performance computing on graphics processing units: hgpu.org

Posts

Jul, 2

Speeding up lattice sieve with Xeon Phi coprocessor

A major substep of a lattice sieve algorithm for solving the Euclidean shortest vector problem (SVP) is the computation of sums and Euclidean norms of many vector pairs. Finding a solution to the SVP is the foundation of an attack against many lattice-based cryptosystems. We optimize the main subfunction of a sieve for the […]
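As a rough illustration of the substep the abstract names (sums and Euclidean norms of many vector pairs), here is a minimal Python sketch. It is not the paper's optimized routine: the function names `sq_norm` and `shortest_pair_combination` are hypothetical, the vectors are plain integer tuples, and the brute-force loop over all pairs is an assumption made for clarity.

```python
from itertools import combinations


def sq_norm(v):
    """Squared Euclidean norm of an integer vector."""
    return sum(x * x for x in v)


def shortest_pair_combination(vectors):
    """Hypothetical helper: for every pair (v, w), form v + w and v - w
    and return (squared_norm, vector) for the shortest nonzero result."""
    best = None
    for v, w in combinations(vectors, 2):
        for sign in (1, -1):
            cand = tuple(a + sign * b for a, b in zip(v, w))
            n = sq_norm(cand)
            if n == 0:
                continue  # the zero vector is not a valid SVP candidate
            if best is None or n < best[0]:
                best = (n, cand)
    return best
```

The inner work per pair is one vector addition and one squared-norm reduction, which is why this step dominates the running time when the number of pairs is large.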

Jul, 2

Snowflake: A Lightweight Portable Stencil DSL

Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a "micro-compiler" approach, i.e., small, focused, domain-specific code generators. The approach is similar […]

OpenCL

Jul, 2

Synthesis of Embedded Software using Dataflow Schedule Graphs

In the design and implementation of digital signal processing (DSP) systems, dataflow is recognized as a natural model for specifying applications, and dataflow enables useful model-based methodologies for analysis, synthesis, and optimization of implementations. A wide range of embedded signal processing applications can be designed efficiently using the high level abstractions that are provided by […]

OpenCL

Jul, 2

Deep neural networks for direct, featureless learning through observation: the case of 2d spin models

We train a deep convolutional neural network to accurately predict the energies and magnetizations of Ising model configurations, using both the traditional nearest-neighbour Hamiltonian and a long-range screened Coulomb Hamiltonian. We demonstrate the capability of a convolutional deep neural network in predicting the nearest-neighbour energy of the 4×4 Ising model. Using its success […]
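For orientation, the quantities such a network would be trained to predict can be computed directly on a small lattice. The sketch below assumes the standard nearest-neighbour form E = -J Σ s_i s_j over neighbouring pairs, with J = 1 and periodic boundary conditions; both are assumptions, since the excerpt does not state the coupling or the boundary treatment. The function names are hypothetical.

```python
def ising_energy(spins, J=1.0):
    """Nearest-neighbour Ising energy of a 2D grid of +1/-1 spins.

    Assumes periodic boundaries; each bond is counted once by looking
    only at the right and down neighbours of every site.
    """
    n_rows = len(spins)
    n_cols = len(spins[0])
    energy = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            s = spins[i][j]
            right = spins[i][(j + 1) % n_cols]
            down = spins[(i + 1) % n_rows][j]
            energy -= J * s * (right + down)
    return energy


def magnetization(spins):
    """Total magnetization: the sum of all spins."""
    return sum(sum(row) for row in spins)


# Two 4x4 reference configurations.
all_up = [[1] * 4 for _ in range(4)]
checkerboard = [[1 if (i + j) % 2 == 0 else -1 for j in range(4)]
                for i in range(4)]
```

Under these assumptions the fully aligned 4×4 lattice has 32 bonds that all contribute -J, and the checkerboard has all 32 bonds anti-aligned, so the two configurations bracket the possible energy range.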

CUDA

Jun, 25

DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications

The rapid emergence of head-mounted devices such as the Microsoft HoloLens enables a wide variety of continuous vision applications. Such applications often adopt deep-learning algorithms such as CNN and RNN to extract rich contextual information from first-person-view video streams. Despite the high accuracy, the use of deep learning algorithms on mobile devices raises critical challenges, […]

OpenCL

Jun, 25

Scalar collapse in AdS with an OpenCL open source code

We study the spherically symmetric collapse of a scalar field in anti-de Sitter spacetime using a newly constructed, open-source code which parallelizes over heterogeneous architectures using the open standard OpenCL. An open question for this scenario concerns how to tell, a priori, whether some form of initial data will be stable or will instead develop […]

OpenCL

Jun, 25

ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android

The Android ecosystem contains three major platforms for execution suitable for different purposes. Android applications are normally written in the Java programming language, but computationally intensive parts of Android applications can be sped up by choosing to use a native language or by utilising the parallel architecture found in graphics processing units (GPUs). The experiments […]

OpenCL

Jun, 25

An Analysis of Variation Between Cores For Intel Xeon Phi Knights Corner And Xeon Phi Knights Landing

As we move towards exascale computing, the efficiency of application performance and energy utilization must be optimized by redefining architectural features and application performance analysis. This research analyzes the per-core performance of 8 applications on Intel Xeon Phi Knights Corner (KNC) and Knights Landing (KNL) to determine if performance variation within cores can lead […]

Jun, 25

High-Performance Out-of-core Block Randomized Singular Value Decomposition on GPU

Fast computation of singular value decomposition (SVD) is of great interest in various machine learning tasks. Recently, SVD methods based on randomized linear algebra have shown significant speedup in this regime. This paper attempts to further accelerate the computation by harnessing a modern computing architecture, namely graphics processing unit (GPU), with the goal of processing […]

CUDA

Jun, 21

Multi-level Parallelism with MPI and OpenACC for CFD Applications

High-level parallel programming approaches, such as OpenACC, have recently become popular in complex fluid dynamics research since they are cross-platform and easy to implement. OpenACC is a directive-based programming model that, unlike low-level programming models, abstracts the details of implementation on the GPU. Although OpenACC generally limits the performance of the GPU, this model significantly […]

Jun, 21

Panda: A Compiler Framework for Concurrent CPU-GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

This paper describes a new compiler framework for heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil codes originally written in C can be automatically parallelized for large-scale GPU clusters. The most distinctive […]

CUDA

Jun, 21

On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization

The growing interest in incorporating new features into mobile devices has increased the number of signal processing applications running on processors designed for mobile computing. A challenging signal processing field is acoustic source localization, which is attractive for applications such as automatic camera steering systems, human-machine interfaces, video gaming, or audio surveillance. In this context, […]

OpenCL


