high performance computing on graphics processing units: hgpu.org

Posts

Jul, 14

DeepProf: Performance Analysis for Deep Learning Applications via Mining GPU Execution Patterns

Deep learning applications are computation-intensive and often employ GPUs as the underlying computing devices. Deep learning frameworks provide powerful programming interfaces, but the gap between source code and practical GPU operations makes it difficult to analyze the performance of deep learning applications. In this paper, through examining the features of GPU traces and deep learning […]

CUDA

Jul, 14

A Similarity Measure for GPU Kernel Subgraph Matching

Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures […]

CUDA

Jul, 5

OpenCL-Based Implementation of an FPGA Accelerator for Molecular Dynamics Simulation

Molecular dynamics (MD) simulations are very important for studying the physical properties of atoms and molecules. However, a huge amount of processing time is required to simulate a few nanoseconds of an actual experiment. Although hardware acceleration using FPGAs provides promising results, huge design time and hardware design skills are required to implement an accelerator successfully. […]

OpenCL

Jul, 5

Real-time colouring and filtering with graphics shaders

Despite the popularity of the Graphics Processing Unit (GPU) for general purpose computing, one should not forget about the practicality of the GPU for fast scientific visualisation. As astronomers have increasing access to three dimensional (3D) data from instruments and facilities like integral field units and radio interferometers, visualisation techniques such as volume rendering offer […]

Jul, 5

A Fast Method For Computing Principal Curvatures From Range Images

Estimation of surface curvature from range data is important for a range of tasks in computer vision and robotics, such as object segmentation, object recognition and robotic grasping estimation. This work presents a fast method of robustly computing accurate metric principal curvature values from noisy point clouds, which was implemented on a GPU. In contrast to existing readily […]

CUDA

Jul, 5

A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs

Analysis of DNA samples is an important step in forensics, and the speed of analysis can impact investigations. Comparison of DNA sequences is based on the analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5 base pairs. Current forensics approaches use 20 STR loci for analysis. The use of single nucleotide […]

CUDA

Jul, 5

Parle: parallelizing stochastic gradient descent

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been […]

CUDA

Jul, 2

GPU-acceleration for Large-scale Tree Boosting

In this paper, we present a novel massively parallel algorithm for accelerating the decision tree building procedure on GPUs (Graphics Processing Units), which is a crucial step in Gradient Boosted Decision Tree (GBDT) and random forests training. Previous GPU based tree building algorithms are based on parallel multi-scan or radix sort to find the exact […]

OpenCL

Jul, 2

Snowflake: A Lightweight Portable Stencil DSL

Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a "micro-compiler" approach, i.e., small, focused, domain-specific code generators. The approach is similar […]

OpenCL

Jul, 2

Speeding up lattice sieve with Xeon Phi coprocessor

A major substep in a lattice sieve algorithm, which solves the Euclidean shortest vector problem (SVP), is the computation of sums and Euclidean norms of many vector pairs. Finding a solution to the SVP is the foundation of an attack against many lattice-based cryptosystems. We optimize the main subfunction of a sieve for the […]
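To make the substep named in this abstract concrete, here is a minimal Python sketch that forms the sum of every vector pair and its Euclidean norm. The function name `pair_sums_and_norms` and the all-pairs enumeration are illustrative assumptions, not the paper's optimized Xeon Phi kernel.

```python
import math
from itertools import combinations


def pair_sums_and_norms(vectors):
    """Return (i, j, v_i + v_j, ||v_i + v_j||) for every unordered pair.

    Hypothetical helper for illustration only; a real sieve would
    vectorize this loop and filter the pairs it keeps.
    """
    results = []
    for (i, v), (j, w) in combinations(enumerate(vectors), 2):
        pair_sum = [a + b for a, b in zip(v, w)]
        norm = math.sqrt(sum(x * x for x in pair_sum))
        results.append((i, j, pair_sum, norm))
    return results


if __name__ == "__main__":
    for i, j, s, n in pair_sums_and_norms([[3, 0], [0, 4], [-3, 0]]):
        print(i, j, s, n)
```

The work grows quadratically with the number of vectors, which is presumably why this step is the target of the optimization described in the abstract.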

Jul, 2

Synthesis of Embedded Software using Dataflow Schedule Graphs

In the design and implementation of digital signal processing (DSP) systems, dataflow is recognized as a natural model for specifying applications, and dataflow enables useful model-based methodologies for analysis, synthesis, and optimization of implementations. A wide range of embedded signal processing applications can be designed efficiently using the high level abstractions that are provided by […]

OpenCL

Jul, 2

Deep neural networks for direct, featureless learning through observation: the case of 2d spin models

We train a deep convolutional neural network to accurately predict the energies and magnetizations of Ising model configurations, using both the traditional nearest-neighbour Hamiltonian, as well as a long-range screened Coulomb Hamiltonian. We demonstrate the capability of a convolutional deep neural network in predicting the nearest-neighbour energy of the 4×4 Ising model. Using its success […]

CUDA
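For the Ising-model entry above, the quantities the network is trained to predict can be computed directly. The sketch below evaluates the nearest-neighbour energy E = -J Σ s_i s_j and the magnetization of a spin lattice. The coupling `J = 1`, the periodic boundary conditions, and the function names are assumptions made for illustration; the truncated abstract does not state them.

```python
def ising_energy(spins, J=1.0):
    """Nearest-neighbour Ising energy with periodic boundaries (assumed).

    Each bond is counted once by pairing every site with its right
    and lower neighbour.
    """
    rows, cols = len(spins), len(spins[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            right = spins[i][(j + 1) % cols]
            down = spins[(i + 1) % rows][j]
            energy -= J * spins[i][j] * (right + down)
    return energy


def magnetization(spins):
    """Total magnetization: the sum of all spins."""
    return sum(sum(row) for row in spins)


if __name__ == "__main__":
    all_up = [[1] * 4 for _ in range(4)]
    print(ising_energy(all_up), magnetization(all_up))
```

On a 4×4 lattice with these assumptions there are 32 bonds, so the fully aligned configuration has the minimum energy of -32 J.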

* * *

Recent source codes

XaaS containers

microSYCL: SYCL micro-benchmarks repository

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management
