high performance computing on graphics processing units: hgpu.org

Posts

Aug, 19

Parallel Graph Mining with GPUs

Frequent graph mining is an important though computationally hard problem because it requires enumerating a possibly exponential number of candidate subgraph patterns and checking their presence in a database of graphs. In this paper, we propose a novel approach for parallel graph mining on GPUs, which have emerged as a relatively cheap but powerful architecture […]

CUDA

Aug, 19

Parallel Outlier Detection on Uncertain Data for GPUs

Outlier detection, also known as anomaly detection, is a common data mining task that identifies data points falling outside the expected patterns in a given dataset. It has useful applications such as detecting network intrusions, system faults, and fraudulent activity. In addition, real-world data are uncertain in nature and they may be represented as uncertain […]

OpenCL

Aug, 19

Practical Symbolic Race Checking of GPU Programs

Even the careful GPU programmer can inadvertently introduce data races while writing and optimizing code. Currently available GPU race checking methods fall short either in terms of their formal guarantees, ease of use, or practicality. Existing symbolic methods: (1) do not fully support existing CUDA kernels; (2) may require user-specified assertions or invariants; (3) often […]

CUDA

Aug, 19

An Efficient Cell List Implementation for Monte Carlo Simulation on GPUs

Maximizing the performance potential of the modern day GPU architecture requires judicious utilization of available parallel resources. Although dramatic reductions can often be obtained through straightforward mappings, further performance improvements often require algorithmic redesigns to more closely exploit the target architecture. In this paper, we focus on efficient molecular simulations for the GPU and propose […]

CUDA

Aug, 19

Accelerated composite distribution function methods for computational fluid dynamics using GPU

The Kinetic Theory of Gases has long been established as a useful tool for the solution of modern Computational Fluid Dynamics (CFD) problems. Together with the Finite Volume Method, such approaches have been popular in CFD for over 30 years, with techniques such as the Equilibrium Flux Method (EFM) or Kinetic Flux Vector Splitting (KFVS), […]

CUDA

Aug, 18

An OpenCL implementation of a forward sampling algorithm for CP-logic

We present an approximate query answering algorithm for the Probabilistic Logic Programming language CP-logic. It complements existing sampling algorithms by using the rules from body to head instead of in the other direction. We present an implementation in OpenCL, which is able to exploit the multicore architecture of modern GPUs to compute a large number […]

OpenCL

Aug, 18

Ocelot/HyPE: Optimized Data Processing on Heterogeneous Hardware

The past years saw the emergence of highly heterogeneous server architectures that feature multiple accelerators in addition to the main processor. Efficiently exploiting these systems for data processing is a challenging research problem that comprises many facets, including how to find an optimal operator placement strategy, how to estimate runtime costs across different hardware architectures, […]

OpenCL

Aug, 18

Quantum Boolean Image Denoising

A quantum Boolean image processing methodology is presented in this work, with special emphasis on image denoising. A new approach for internal image representation is outlined together with two new interfaces: classical-to-quantum and quantum-to-classical. The new quantum Boolean image denoising method, called the quantum Boolean mean filter (QBMF), works exclusively with computational basis states (CBS). To achieve this, […]

OpenCL

Aug, 18

High Level High Performance Computing for Multitask Learning of Time-varying Models

We propose an approach suitable to learn multiple time-varying models jointly and discuss an application in data-driven weather forecasting. The methodology relies on spectral regularization and encodes the typical multi-task learning assumption that models lie near a common low dimensional subspace. The arising optimization problem amounts to estimating a matrix from noisy linear measurements within […]

CUDA

Aug, 15

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters

Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly impacted by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We […]

Aug, 15

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly-efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as […]

OpenCL

Aug, 15

Energy Evaluation for Applications with Different Thread Affinities on the Intel Xeon Phi

The Intel Xeon Phi coprocessor offers high parallelism on energy-efficient hardware to minimize energy consumption while maintaining performance. Dynamic frequency and voltage scaling is not accessible on the Intel Xeon Phi. Hence, saving energy relies mainly on tuning application performance. One general optimization technique is thread affinity, which is an important factor in multi-core architectures. […]
