high performance computing on graphics processing units: hgpu.org

Posts

Mar, 12

Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models

The move from single core and processor systems to multi-core and many-processors systems comes with the requirement of implementing computations in a way that can utilize these multiple units efficiently. This task of writing efficient multi-threaded algorithms will not be possible without improving programming languages and compilers to provide the mechanisms to do so. Computer aided mathematical modeling […]

OpenCL

Mar, 10

Accelerating Wright-Fisher Forward Simulations on the Graphics Processing Unit

Forward Wright-Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the CPU, thus limiting their usefulness. The single-locus Wright-Fisher forward algorithm is, however, exceedingly parallelizable, with many steps which are so-called embarrassingly parallel, consisting of a vast number of individual computations that are all […]

CUDA

Mar, 10

Automatic Data Layout Generation and Kernel Mapping for CPU+GPU Architectures

The ubiquity of hybrid CPU+GPU architectures has led to renewed interest in automatic data layout generation owing to the fact that data layouts have a large impact on performance, and that different data layouts yield the best performance on CPUs vs. GPUs. Unfortunately, current programming models still fail to provide an effective solution to the […]

OpenCL

Mar, 10

Pragma Directed Shared Memory Centric Optimizations on GPUs

GPUs have become a ubiquitous choice as coprocessors since they have excellent ability in concurrent processing. In GPU architecture, shared memory plays a very important role in system performance as it can largely improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain regular access patterns, optimizing for shared memory is […]

CUDA

OpenCL

Mar, 10

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures

One area of computing applications that poses a significant challenge to performance scalability on Chip Multiprocessors (CMPs) is irregular applications. Such applications have very little computation and unpredictable memory access patterns, making them memory-bound in contrast to compute-bound applications. Since the gap between processor and memory performance continues to exist, difficulty to hide and decrease this gap […]

CUDA

Mar, 10

Testing fine-grained parallelism for the ADMM on a factor-graph

There is an ongoing effort to develop tools that apply distributed computational resources to tackle large problems or reduce the time to solve them. In this context, the Alternating Direction Method of Multipliers (ADMM) arises as a method that can exploit distributed resources like the dual ascent method and has the robustness and improved convergence […]

CUDA

Mar, 8

D-face: Parallel Implementation of CNN Based Face Classifier using Drone Data On K40 & Jetson TK1

Convolutional Neural Networks (CNNs) are shown to perform very well in the areas such as video surveillance, object classification and face classification. Face classification has become pertinent to numerous applications, especially in this big data era of social platforms and social media. With the usage of unmanned air-borne vehicles like drones, the problem of face […]

CUDA

Mar, 8

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, […]

OpenCL

Mar, 8

Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures

The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices. However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times […]

OpenCL

Mar, 8

A Novel Mapping of Arbitrary Precision Integer Operations to the GPU

With modern processing hardware converging on the physical barrier in terms of transistor size and speed per single core, hardware manufacturers have shifted their focus to improve performance from raw clock power towards parallelization. Solutions to utilize the computation power of GPUs are published and supported by graphics card manufacturers. While there exist solutions for […]

OpenCL

Mar, 7

Topology optimization design of 3D electrothermomechanical actuators by using GPU as a co-processor

The topology optimization method (TOM) requires high computational resources to be solved, especially in multiphysics problems. The high number of computational requirements is because TOM is an iterative technique, in which the iterations go from tens to thousands. Furthermore, at each TOM iteration, it is necessary to execute several routines such as the finite element […]

CUDA

Mar, 5

Performance Analysis of kNN on large datasets using CUDA & Pthreads

Several organizations have large databases which are growing at a rapid rate day by day, which need to be regularly maintained. Content based searches are similar searched based on certain features that are obtained from various multi media data. For various applications like multimedia content retrieval, data mining, pattern recognition, etc., performing the nearest neighbor […]

CUDA
