Posts
Sep, 11
Software-based branch predication for AMD GPUs
Branch predication is a program transformation technique that combines the instructions of the branches of an if statement into a single straight-line sequence and associates each instruction of the sequence with a predicate. Branch predication improves the execution of branch statements on processors that support predicated execution of instructions, e.g., Intel IA-64, because such a transformation improves […]
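A minimal sketch (not taken from the paper, with illustrative names) of what this if-conversion looks like in plain C++: both arms of the if statement are executed unconditionally, and a predicate-driven select keeps the result of the arm that would have been taken.

```cpp
#include <cstdio>

// Original, branchy form: only one arm executes, chosen by a branch.
int branchy(int cond, int a, int b) {
    int r;
    if (cond) {
        r = a + b;   // then-arm
    } else {
        r = a - b;   // else-arm
    }
    return r;
}

// Predicated form: both arms are computed as straight-line code and the
// predicate selects the result. On hardware with predicated or select
// instructions this typically compiles without a branch.
int predicated(int cond, int a, int b) {
    int p = (cond != 0);   // predicate
    int t = a + b;         // then-arm, always executed
    int e = a - b;         // else-arm, always executed
    return p ? t : e;      // select (e.g. cmov/csel), not a branch
}

int main() {
    printf("%d %d\n", branchy(1, 3, 4), predicated(1, 3, 4));  // 7 7
    printf("%d %d\n", branchy(0, 3, 4), predicated(0, 3, 4));  // -1 -1
    return 0;
}
```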
Sep, 11
Solving diffractive optics problems using graphics processing units
Techniques for applying graphics processing units (GPUs) to general-purpose, non-graphics computations, proposed in recent years by ATI (AMD FireStream, 2006) and NVIDIA (CUDA: Compute Unified Device Architecture, 2007), have given an impetus to developing algorithms and software packages for solving problems of diffractive optics with the aid of the GPU. The computations […]
Sep, 9
Enabling multiple accelerator acceleration for Java/OpenMP
While using a single GPU is fairly easy, using multiple CPUs and GPUs, potentially distributed over multiple machines, is hard because data needs to be kept consistent via message exchange and the load needs to be balanced. We propose (1) an array package that provides partitioned and replicated arrays and (2) a compute-device library to […]
Sep, 9
Heterogeneous multicore parallel programming for graphics processing units
Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware displays a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a […]
Sep, 9
Beyond programmable shading (parts I and II)
There are strong indications that the future of interactive graphics programming is a more flexible model than today’s OpenGL/Direct3D pipelines. Graphics developers need a basic understanding of how to combine emerging parallel programming techniques and more flexible graphics processors with the traditional interactive rendering pipeline. As the first in a series, this course introduces the […]
Sep, 9
Data classification for artificial intelligence construct training to aid in network incident identification using network telescope data
This paper considers the complexities involved in obtaining training data for use by artificial intelligence constructs to identify potential network incidents using passive network telescope data. While a large amount of data obtained from network telescopes exists, this data is not currently marked for known incidents. Problems related to this marking process include the accuracy […]
Sep, 9
A stream-computing extension to OpenMP
This paper introduces an extension to OpenMP 3.0 enabling stream programming with minimal, incremental additions that integrate seamlessly into the current specification. The stream programming model decomposes programs into tasks and makes the flow of data among them explicit, thus exposing data, task and pipeline parallelism. It helps programmers express concurrency and data locality properties, […]
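The paper's proposed syntax is not shown in this excerpt; as a rough illustration only, the sketch below uses the standard OpenMP depend clauses (added later, in OpenMP 4.0) to show how making each task's inputs and outputs explicit exposes pipeline parallelism between stages.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 8;
    int buf[N];

    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; ++i) {
        // Stage 1: produce buf[i].
        #pragma omp task depend(out: buf[i]) firstprivate(i) shared(buf)
        {
            buf[i] = i * i;
        }

        // Stage 2: consume buf[i]; it runs once its input is ready,
        // while stage 1 of later iterations can proceed concurrently.
        #pragma omp task depend(in: buf[i]) firstprivate(i) shared(buf)
        {
            printf("item %d -> %d\n", i, buf[i]);
        }
    }
    return 0;
}
```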
Sep, 9
CUDACS: securing the cloud with CUDA-enabled secure virtualization
While on the one hand unresolved security issues pose a barrier to the widespread adoption of cloud computing technologies, on the other hand the computing capabilities of even commodity hardware are increasing rapidly, in particular thanks to the adoption of *-core technologies. For instance, the Nvidia Compute Unified Device Architecture (CUDA) technology is increasingly available on […]
Sep, 9
KAdvice: inferring synchronization patterns from an existing codebase
Operating system kernels are complex software systems. The kernels of today's mainstream OSs, such as Linux or Windows, are composed of a number of modules, which contain code and data. Even when providing synchronous interfaces (APIs) to the programmer, large portions of the OS kernel operate in an asynchronous manner. Synchronizing access to kernel data […]
Sep, 9
Attaining system performance points: revisiting the end-to-end argument in system design for heterogeneous many-core systems
Trends indicate a rapid increase in the number of cores per chip, with various types of performance and functional asymmetries present in hardware to gain scalability while balancing power and performance requirements. This poses new challenges in platform resource management, which are further exacerbated by the need for runtime power budgeting and by the increased […]
Sep, 9
The architecture of the DecentVM: towards a decentralized virtual machine for many-core computing
Fully decentralized systems avoid bottlenecks and single points of failure. Thus, they can provide excellent scalability and very robust operation. The DecentVM is a fully decentralized, distributed virtual machine. Its simplified instruction set allows for a small VM code footprint. Its partitioned global address space (PGAS) memory model helps to easily create a single system […]
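As a toy illustration (not taken from the paper, names are assumptions) of the PGAS idea mentioned above: a single global address space is statically partitioned across nodes, so any thread can name any element while the owning partition remains computable from the index.

```cpp
#include <cstdio>

struct GlobalAddr {
    int node;     // owning partition
    long offset;  // offset within that partition's local memory
};

// Block-cyclically distribute a global index over 'nodes' partitions
// with blocks of 'block' elements each.
GlobalAddr locate(long global_index, int nodes, long block) {
    GlobalAddr a;
    long block_num = global_index / block;
    a.node   = static_cast<int>(block_num % nodes);
    a.offset = (block_num / nodes) * block + global_index % block;
    return a;
}

int main() {
    // 4 nodes, blocks of 8 elements: global element 13 lives on node 1, offset 5.
    GlobalAddr a = locate(13, 4, 8);
    printf("node=%d offset=%ld\n", a.node, a.offset);
    return 0;
}
```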
Sep, 9
Piccolo: building fast, distributed programs with partitioned tables
Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Piccolo enables efficient application implementations. In particular, applications can specify locality policies to exploit the locality of shared state […]
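A hypothetical, single-process mock of the kind of key-value table interface the abstract describes, where shared mutable state is updated through a user-supplied accumulator. The names (KVTable, update, get) and the merge-by-accumulator design are illustrative assumptions, not Piccolo's actual API.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

template <typename K, typename V>
class KVTable {
public:
    // The accumulator resolves repeated updates to the same key
    // (e.g. summation), mirroring user-defined accumulation of shared state.
    explicit KVTable(std::function<V(const V&, const V&)> accum)
        : accum_(std::move(accum)) {}

    void update(const K& k, const V& v) {
        auto it = data_.find(k);
        if (it == data_.end()) data_.emplace(k, v);
        else it->second = accum_(it->second, v);
    }

    V get(const K& k) const { return data_.at(k); }

private:
    std::unordered_map<K, V> data_;
    std::function<V(const V&, const V&)> accum_;
};

int main() {
    // Word-count-style usage: updates to the same key are merged by '+'.
    KVTable<std::string, int> counts(
        [](const int& a, const int& b) { return a + b; });
    counts.update("gpu", 1);
    counts.update("gpu", 1);
    counts.update("cuda", 1);
    printf("gpu=%d cuda=%d\n", counts.get("gpu"), counts.get("cuda"));  // gpu=2 cuda=1
    return 0;
}
```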