high performance computing on graphics processing units: hgpu.org

Posts

Jan, 12

Importance-Driven Isosurface Decimation for Visualization of Large Simulation Data Based on OpenCL

For large simulation data, the Parallel Marching Cubes algorithm is efficient and commonly used to extract isosurfaces from 3D scalar fields. However, the isosurface meshes are sometimes too dense, and it is difficult for scientists to specify the areas they are interested in. In this paper, we provide them with a new way to define mesh importance […]

OpenCL

Jan, 12

A tool for mapping Single Nucleotide Polymorphisms using Graphics Processing Units

BACKGROUND: Single Nucleotide Polymorphism (SNP) genotyping analysis is very susceptible to errors in SNP chromosomal positions. SNP mapping data are known to be provided along with the SNP arrays without the information needed to assess their accuracy in advance. Moreover, these mapping data are related to a given build of a genome and need to be […]

CUDA

Jan, 12

Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation

High-throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs), and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated/released when it […]

CUDA

Jan, 12

GPU-Accelerated parallel FDTD on Distributed Heterogeneous Platform

This paper introduces a Finite-Difference Time-Domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, parallelized with the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure […]

CUDA

Jan, 11

Implementations of the Hough Transform on the Embedded Multicore Processors

Embedded multicore processors represented by FPGAs and GPUs have lately attracted considerable attention for their potential computation ability and power consumption. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 Family FPGAs have a DSP48E1 slice, which is a configurable logic block equipped with fast multipliers, adders, pipeline registers, […]

CUDA

Jan, 11

Maximal Information Coefficient Analysis

In the domain of Side Channel Attacks, various statistical tools, such as the Pearson coefficient or Mutual Information, have succeeded in retrieving a secret key. In this paper we propose to study the Maximal Information Coefficient (MIC), a non-parametric method introduced by Reshef et al. [13] to compare two random variables. The […]

OpenCL

Jan, 11

Mining Rare Features in Fingerprints Using Core Points and Triplet-based Features

A fingerprint matching algorithm with a novel set of matching parameters based on core points and triangular descriptors is proposed to discover rarity in fingerprints. The algorithm uses a mathematical and statistical approach to discover rare features in fingerprints which provides scientific validation for both ten-print and latent fingerprint evidence. A feature is considered rare […]

CUDA

Jan, 11

Framework for utilizing computational devices within simulation

Several frameworks now exist for utilizing the computational power of graphics cards and other computational devices such as FPGAs, ARM processors, and multi-core processors. The best known are either low-level and need a lot of controlling code, or are bound to specific graphics cards. Furthermore, there exist more specialized frameworks, mainly aimed at the […]

OpenCL

Jan, 11

Toward a Generic Hybrid CPU-GPU Parallelization of Divide-and-Conquer Algorithms

In the last few years, the development of programming languages for general-purpose computing on Graphics Processing Units (GPUs) has led to the design and implementation of fast parallel algorithms for this architecture across a large spectrum of applications. Given the streaming-processing characteristics of GPUs, most practical applications consist of tasks that admit highly data-parallel […]

OpenCL

Jan, 10

An octree-based proxy for collision detection in large-scale particle systems

Particle systems are an important building block for simulating vivid and detail-rich effects in virtual worlds. One of the most difficult aspects of particle systems has been detecting collisions between particles and mesh surfaces. Due to the huge computation involved, a variety of proxy-based approaches have been proposed recently to perform visually correct simulation. However, all either limit […]

CUDA

Jan, 10

Integrating Occlusion Culling with Parallel LOD for Rendering Complex 3D Environments on GPU

Real-time rendering of complex 3D models is still a very challenging task. Recently, many GPU-based level-of-detail (LOD) algorithms have been proposed to decrease the complexity of 3D models in a parallel fashion. However, LOD approaches alone are not sufficient to reduce the amount of geometry data for interactive rendering of massive-scale models. Visibility-based culling, […]

CUDA, OpenGL

Jan, 10

Saddle Vertex Graph (SVG): A Novel Solution to the Discrete Geodesic Problem

This paper presents the Saddle Vertex Graph (SVG), a novel solution to the discrete geodesic problem. The SVG is a sparse undirected graph that encodes complete geodesic distance information: a geodesic path on the mesh is equivalent to a shortest path on the SVG, which can be solved efficiently using the shortest path algorithm (e.g., […]

CUDA


Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
