high performance computing on graphics processing units: hgpu.org

Posts

Jan, 22

Heterogeneous (CPU+GPU) Working-set Hash Tables

In this paper, we propose heterogeneous (CPU+GPU) hash tables, that optimize operations for frequently accessed keys. The idea is to maintain a dynamic set of most frequently accessed keys in the GPU memory and the rest of the keys in the CPU main memory. Further, queries are processed in batches of fixed size. We measured […]

CUDA

Jan, 22

Exploring LLVM Infrastructure for Simplified Multi-GPU Programming

GPUs have established themselves in the computing landscape, convincing users and designers by their excellent performance and energy efficiency. They differ in many aspects from general-purpose CPUs, for instance their highly parallel architecture, their thread-collective bulk-synchronous execution model, and their programming model. In particular, languages like CUDA or OpenCL require users to express parallelism very […]

CUDA

•

OpenCL

Jan, 22

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical […]

Jan, 22

GPU Multisplit

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more […]

CUDA

Jan, 22

The 9th International Conference on Machine Vision (SPIE-ICMV), 2016

All accepted of ICMV 2016 will be published by SPIE, which will be indexed by [EI&Scopus]. Previous proceedings from 2007 to 2015 (http://www.icmv.org/pro.html), all have get published and indexed by EI&Scopus. Conference photos, please check 2015 (http://www.icmv.org/photo2015.html); 2014 (http://www.icmv.org/photo2014.html); 2013 (http://www.icmv.org/photo2013.html). Keynote &Plenary Speakers Prof. AntanasVerikas, Halmstad University, Sweden; Prof. PetiaRadeva, University of Barcelona, Spain; […]

Jan, 19

Adaptive and Hybrid Machine Learning Approaches Utilizing General Purpose Computing on Graphical Processing Units

Unlike machines, humans and animals have very complex reasoning capability that allows them to adapt to changes in the naturally world, while computers tend to be very limited in that same aspect. What limits machines from becoming adaptable can span many topics, but of these attributes which would limit a machines ability to adapt is […]

CUDA

Jan, 19

A framework for efficient execution on GPU and CPU+GPU systems

Technological limitations faced by the semi-conductor manufacturers in the early 2000’s restricted the increase in performance of the sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use the GPU cards for highly parallel computations. Complexity of the recent architectures makes it difficult to statically […]

CUDA

Jan, 18

The 2nd International Conference on Control, Automation and Robotics (ICCAR), 2016

ICCAR 2016 conference proceedings will be published by IEEE Conference Publication, which would be indexed by. ★ ICCAR 2016 is in the IEEE conference list. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=38085 ★Publication and Indexing History of ICCAR: ICCAR 2015, Singapore, May 20-22, 2015. Publication: IEEE Conference Proceedings Online: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?reload=true&punumber=7153096 ★Keynote &Plenary Speakers Prof. Wei-Hsin Liao, The Chinese University of Hong […]

Jan, 16

Performance Analysis of Roberts Edge Detection Using CUDA and OpenGL

The evolution of high-performance and programmable graphics processing units (GPUs) has generated considerable advancements in graphics and parallel computing. In this paper we present a Roberts filter based on edge detection algorithm using CUDA and OpenGL architectures. The basic idea is to use the Pixel Buffer Object (PBO) to create images with CUDA on a […]

CUDA

•

OpenGL

Jan, 16

LHCb GPU acceleration project

The LHCb detector is due to be upgraded for processing high-luminosity collisions, which will increase data bandwidth to the event filter farm from 100 GB/s to 4 TB/s, encouraging us to look for new ways of accelerating Online reconstruction. The Coprocessor Manager is a new framework for integrating LHCb’s existing computation pipelines with massively parallel […]

CUDA

Jan, 16

Homomorphic Autocomplete

With the rapid progress in fully homomorpic encryption (FHE) and somewhat homomorphic encryption (SHE) schemes, we are witnessing renewed efforts to revisit privacy preserving protocols. Several works have already appeared in the literature that provide solutions to these problems by employing FHE or SHE techniques. These applications range from cloud computing to computation over confidential […]

CUDA

Jan, 16

Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study

In an exploration context, Oil and Gas (O&G) companies rely on HPC to accelerate depth imaging algorithms. Solutions based on CPU clusters and hardware accelerators are widely embraced by the industry. The Graphics Processing Units (GPUs), with a huge compute power and a high memory bandwidth, had attracted significant interest. However, deploying heavy imaging workflows, […]

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

Heterogeneous (CPU+GPU) Working-set Hash Tables

Exploring LLVM Infrastructure for Simplified Multi-GPU Programming

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

GPU Multisplit

The 9th International Conference on Machine Vision (SPIE-ICMV), 2016

Adaptive and Hybrid Machine Learning Approaches Utilizing General Purpose Computing on Graphical Processing Units

A framework for efficient execution on GPU and CPU+GPU systems

The 2nd International Conference on Control, Automation and Robotics (ICCAR), 2016

Performance Analysis of Roberts Edge Detection Using CUDA and OpenGL

LHCb GPU acceleration project

Homomorphic Autocomplete

Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)