high performance computing on graphics processing units: hgpu.org

Posts

Aug, 11

OP2: An Active Library Framework for Solving Unstructured Mesh-based Applications on Multi-Core and Many-Core Architectures

OP2 is an "active" library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 […]

CUDA

Aug, 11

Large Scale Monte Carlo Tree Search on GPU

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically for move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. Research interest in MCTS has risen sharply due to its spectacular success with computer Go and its potential […]

CUDA

Aug, 10

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems

In this paper, we analyze the potential of using weights for block-asynchronous relaxation methods on GPUs. For this purpose, we introduce different weighting techniques similar to those applied in block-smoothers for multigrid methods. For test matrices taken from the University of Florida Matrix Collection we report the convergence behavior and the total runtime for the […]

CUDA

Aug, 10

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming languages (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product […]

CUDA

Aug, 10

Fine-Granular Parallel EBCOT and Optimization with CUDA for Digital Cinema Image Compression

JPEG2000 has been accepted by The Society of Motion Picture and Television Engineers (SMPTE) as the image compression standard for the digital distribution of motion pictures. In JPEG2000, the biggest contribution to the coding performance comes from the Embedded Block Coding with Optimized Truncation (EBCOT), which is also the most time-consuming module by occupying almost […]

CUDA

Aug, 10

Development of a GPU-accelerated MIKE 21 Solver for Water Wave Dynamics

With encouragement from the company DHI, the aim of this B.Sc. thesis is to investigate whether it is possible to accelerate the simulation speed of DHI's commercial product MIKE 21 HD by formulating a parallel solution scheme and implementing it for execution on a CUDA-enabled GPU (massively parallel hardware). MIKE 21 HD is […]

CUDA

Aug, 10

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems. Developing robust and efficient algorithms suitable for current and evolving GPU and multicore CPU systems is a significant challenge. We address this issue in the case of constant-coefficient stencils arising in […]

CUDA

Aug, 9

Massive parallelization of serial inference algorithms for a complex generalized linear model

Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we […]

CUDA

Aug, 9

CPU-GPU Algorithms for Triangular Surface Mesh Simplification

Mesh simplification and mesh compression are important processes in computer graphics and scientific computing, as such contexts allow for a mesh which takes up less memory than the original mesh. Current simplification and compression algorithms do not take advantage of both the central processing unit (CPU) and the graphics processing unit (GPU). We propose three […]

CUDA

Aug, 9

OpenCL-based Algorithm for Heat Load Modelling of District Heating System

This paper presents a parallel approach to estimate the parameters in the heat loading of a district heating system by use of the traditional particle swarm optimisation (TPSO) on the Graphics Processing Unit (GPU) using OpenCL. The running time of the algorithm is greatly reduced compared to running on a CPU. The heat load is approximated […]

OpenCL

Aug, 9

Kargus: a Highly-scalable Software-based Intrusion Detection System

As high-speed networks are becoming commonplace, it is increasingly challenging to prevent attack attempts at the edge of the Internet. While many high-performance intrusion detection systems (IDSes) employ dedicated network processors or special memory to meet the demanding performance requirements, this often increases the cost and limits functional flexibility. In contrast, existing software-based IDS […]

CUDA

Aug, 9

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Data warehousing applications represent an emergent application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high core count architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement […]

CUDA

* * *

Recent source codes

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management
