
Posts

Aug, 23

Implementing the PGI Accelerator model

The PGI Accelerator model is a high-level programming model for accelerators, such as GPUs, similar in design and scope to the widely-used OpenMP directives. This paper presents some details of the design of the compiler that implements the model, focusing on the Planner, the element that maps the program parallelism onto the hardware parallelism.
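
As a rough illustration of what the Planner has to decide (a hypothetical sketch, not code from the paper), the CUDA program below shows a doubly nested data-parallel loop lowered by hand onto the grid/block hierarchy; a directive-based compiler derives exactly this kind of mapping, and the schedule (block shape, grid size) chosen manually here is what the Planner selects automatically.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical illustration: the loop nest
//   for (i = 0; i < n; ++i)
//     for (j = 0; j < m; ++j)
//       c[i*m + j] = a[i*m + j] + b[i*m + j];
// mapped onto the GPU's grid/block hierarchy, roughly what a planner-style
// compiler pass generates for an accelerator region.
__global__ void add2d(const float *a, const float *b, float *c, int n, int m)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // outer loop -> block rows
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // inner loop -> threads
    if (i < n && j < m)
        c[i * m + j] = a[i * m + j] + b[i * m + j];
}

int main()
{
    const int n = 256, m = 256;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * m * sizeof(float));
    cudaMallocManaged(&b, n * m * sizeof(float));
    cudaMallocManaged(&c, n * m * sizeof(float));
    for (int k = 0; k < n * m; ++k) { a[k] = 1.0f; b[k] = 2.0f; }

    dim3 block(16, 16);                      // schedule chosen by hand here;
    dim3 grid((m + 15) / 16, (n + 15) / 16); // the Planner picks this automatically
    add2d<<<grid, block>>>(a, b, c, n, m);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);             // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```
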
Aug, 23

MDR: performance model driven runtime for heterogeneous parallel platforms

We present a runtime framework for the execution of workloads represented as parallel-operator directed acyclic graphs (PO-DAGs) on heterogeneous multi-core platforms. PO-DAGs combine coarse-grained parallelism at the graph level with fine-grained parallelism within each node, lending themselves naturally to exploiting the intra- and inter-processing-element parallelism present in heterogeneous platforms. We identify four important criteria […]
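
To make the PO-DAG idea concrete, here is a minimal, hypothetical sketch (names and structure invented for illustration, not taken from the paper) of a DAG of coarse-grained operators and a list scheduler that dispatches a node once its predecessors finish; a real runtime such as MDR would additionally choose the processing element for each node using a performance model.

```cuda
#include <cstdio>
#include <queue>
#include <vector>

// Hypothetical sketch of a parallel-operator DAG (PO-DAG): each node is a
// coarse-grained operator that may itself run data-parallel work on a CPU or
// GPU; edges express dependencies. Names are illustrative only.
struct Node {
    const char *name;
    std::vector<int> succ;   // indices of dependent nodes
    int pending;             // unsatisfied predecessor count
};

int main()
{
    // A small diamond-shaped PO-DAG: 0 -> {1, 2} -> 3
    std::vector<Node> dag = {
        {"load",   {1, 2}, 0},
        {"filter", {3},    1},
        {"fft",    {3},    1},
        {"reduce", {},     2},
    };

    // Simple list scheduler: dispatch any node whose predecessors are done.
    std::queue<int> ready;
    ready.push(0);
    while (!ready.empty()) {
        int n = ready.front(); ready.pop();
        printf("dispatch %s\n", dag[n].name);   // launch kernel / CPU task here
        for (int s : dag[n].succ)
            if (--dag[s].pending == 0) ready.push(s);
    }
    return 0;
}
```
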
Aug, 23

Bounding the effect of partition camping in GPU kernels

Current GPU tools and performance models provide some common architectural insights that guide programmers toward writing optimal code. We challenge and complement these performance models and tools by modeling and analyzing a lesser-known but very severe performance pitfall, called Partition Camping, in NVIDIA GPUs. Partition Camping is caused by memory accesses that are […]
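
A kernel-only sketch of the kind of access pattern involved is shown below (illustrative only, not code from the paper; it assumes GT200-style global memory interleaved across 8 partitions in 256-byte chunks).

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: if every block copies a column of a column-major
// matrix whose leading dimension `ld` makes ld * sizeof(float) a multiple of
// the full partition span (assumed 8 partitions x 256 bytes), nearly all
// accesses from all concurrently active blocks land in the same memory
// partition and serialize behind it, while the other partitions sit idle.
__global__ void column_copy(float *dst, const float *src, int ld, int rows)
{
    int col = blockIdx.x;                              // one block per column
    for (int row = threadIdx.x; row < rows; row += blockDim.x)
        dst[row * ld + col] = src[row * ld + col];     // stride of ld floats
}
```

Known workarounds in the literature reorder or skew block indices (for example, diagonal block reordering in matrix transpose) so that concurrently active blocks start in different partitions.
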
Aug, 23

Ypnos: declarative, parallel structured grid programming

A fully automatic, compiler-driven approach to parallelisation can result in unpredictable time and space costs for compiled code. On the other hand, a fully manual approach to parallelisation can be long, tedious, prone to errors, hard to debug, and often architecture-specific. We present a declarative domain-specific language, Ypnos, for expressing structured grid computations which encourages […]
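
The snippet below is not Ypnos syntax; it is a plain CUDA five-point stencil, included only as an example of the kind of structured grid computation such a DSL lets the programmer state declaratively while the compiler handles decomposition, boundaries and parallelisation.

```cuda
#include <cuda_runtime.h>

// A five-point Jacobi relaxation step on a 2-D grid: the archetypal
// structured grid computation (kernel-only sketch for illustration).
__global__ void jacobi_step(const float *in, float *out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1) {
        out[y * nx + x] = 0.25f * (in[y * nx + x - 1] + in[y * nx + x + 1] +
                                   in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
    }
}
```
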
Aug, 23

Fast parallel surface and solid voxelization on GPUs

This paper presents data-parallel algorithms for surface and solid voxelization on graphics hardware. First, a novel conservative surface voxelization technique, setting all voxels overlapped by a mesh’s triangles, is introduced, which is up to one order of magnitude faster than previous solutions leveraging the standard rasterization pipeline. We then show how the involved new triangle/box […]
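
As a deliberately simplified, hypothetical sketch of the overall structure (not the paper's algorithm): one thread per triangle, setting bits in a voxel grid. The crude bounding-box step below is where a real conservative voxelizer would instead apply an exact triangle/box overlap test, so that only voxels actually touched by the triangle are set.

```cuda
#include <cuda_runtime.h>

// Kernel-only sketch: rasterize each triangle's axis-aligned bounding box
// into a bit-per-voxel grid (trivially conservative, deliberately crude).
__global__ void voxelize_aabb(const float3 *verts, const int3 *tris, int numTris,
                              unsigned int *grid, int nx, int ny, int nz,
                              float3 origin, float voxelSize)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    float3 a = verts[tris[t].x], b = verts[tris[t].y], c = verts[tris[t].z];
    // Triangle bounding box in voxel coordinates, clamped to the grid.
    int x0 = max(0, (int)floorf((fminf(a.x, fminf(b.x, c.x)) - origin.x) / voxelSize));
    int y0 = max(0, (int)floorf((fminf(a.y, fminf(b.y, c.y)) - origin.y) / voxelSize));
    int z0 = max(0, (int)floorf((fminf(a.z, fminf(b.z, c.z)) - origin.z) / voxelSize));
    int x1 = min(nx - 1, (int)floorf((fmaxf(a.x, fmaxf(b.x, c.x)) - origin.x) / voxelSize));
    int y1 = min(ny - 1, (int)floorf((fmaxf(a.y, fmaxf(b.y, c.y)) - origin.y) / voxelSize));
    int z1 = min(nz - 1, (int)floorf((fmaxf(a.z, fmaxf(b.z, c.z)) - origin.z) / voxelSize));

    for (int z = z0; z <= z1; ++z)
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                int v = (z * ny + y) * nx + x;             // linear voxel index
                atomicOr(&grid[v >> 5], 1u << (v & 31));   // set one bit per voxel
            }
}
```
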
Aug, 23

High performance content-based matching using GPUs

Matching incoming event notifications against received subscriptions is a fundamental part of every publish-subscribe infrastructure. In the case of content-based systems this is a fairly complex and time-consuming task, whose performance impacts that of the entire system. In the past, several algorithms have been proposed for efficient content-based event matching. While they differ in […]
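
A deliberately naive, kernel-only sketch of GPU content-based matching is shown below (one thread per subscription, a single invented constraint type); real matchers, including the one evaluated here, use much richer constraint encodings and data structures.

```cuda
#include <cuda_runtime.h>

// Illustrative only: each subscription is reduced to one "attribute < threshold"
// constraint, and one thread evaluates one subscription against the event.
struct Constraint { int attr; float threshold; };

__global__ void match(const float *eventAttrs, const Constraint *subs,
                      int numSubs, unsigned char *matched)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < numSubs)
        matched[s] = eventAttrs[subs[s].attr] < subs[s].threshold ? 1 : 0;
}
```
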
Aug, 23

Workload and network-optimized computing systems

This paper describes a recent system-level trend toward the use of massive on-chip parallelism combined with efficient hardware accelerators and integrated networking to enable new classes of applications and computing-systems functionality. This system transition is driven by semiconductor physics and emerging network-application requirements. In contrast to general-purpose approaches, workload and network-optimized computing provides significant cost, […]
Aug, 23

Efficient implementation of GPGPU synchronization primitives on CPUs

The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves to target both GPUs and CPUs as acceleration platforms, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware […]
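
A minimal sketch of the kind of software primitive involved (a centralized sense-reversing barrier built from C++ atomics; illustrative only, not the paper's implementation) is shown below: on a CPU, something like this has to stand in for the GPU's hardware-supported __syncthreads().

```cuda
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Centralized sense-reversing barrier: the last thread to arrive resets the
// counter and flips the global sense; everyone else spins on the sense flag.
struct SenseBarrier {
    std::atomic<int>  count;
    std::atomic<bool> sense;
    int nthreads;
    explicit SenseBarrier(int n) : count(n), sense(false), nthreads(n) {}

    void wait(bool &localSense) {
        localSense = !localSense;              // flip this thread's sense
        if (count.fetch_sub(1) == 1) {         // last thread to arrive
            count.store(nthreads);             // reset for the next phase
            sense.store(localSense);           // release the waiters
        } else {
            while (sense.load() != localSense) { }   // spin (no backoff here)
        }
    }
};

int main() {
    const int N = 4;
    SenseBarrier bar(N);
    std::vector<std::thread> ts;
    for (int t = 0; t < N; ++t)
        ts.emplace_back([&, t] {
            bool localSense = false;
            for (int phase = 0; phase < 3; ++phase) {
                printf("thread %d, phase %d\n", t, phase);
                bar.wait(localSense);          // emulated __syncthreads()
            }
        });
    for (auto &th : ts) th.join();
    return 0;
}
```
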
Aug, 23

Performance Modelling and Traffic Characterisation of Optical Networks

A review is carried out on the traffic characteristics of an optical carrier’s OC-192 link, based on the IP packet size distribution, traffic burstiness and self-similarity. The generalised exponential (GE) distribution is employed to model the interarrival times of bursty traffic flows of IP packets whilst self-similar traffic is generated for each wavelength of each […]
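
For reference, the GE distribution mentioned here is usually written as a two-phase mixture of a unit mass at zero and an exponential branch; with mean arrival rate $\nu$ and squared coefficient of variation $C^2$ (parameter names assumed for this note):

```latex
F(t) = P(T \le t) = 1 - \tau e^{-\tau \nu t}, \qquad t \ge 0,
\qquad \text{where } \tau = \frac{2}{C^2 + 1}.
```

This gives $E[T] = 1/\nu$ and a squared coefficient of variation of exactly $C^2$, and reduces to the ordinary exponential distribution when $C^2 = 1$.
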
Aug, 22

Auto-tuning 3-D FFT library for CUDA GPUs

Existing implementations of FFTs on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable and peaky performance, i.e., they do not perform as well for other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high-performance CUDA kernels for FFTs of varying transform sizes, alleviating this […]
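
For context, the fixed-recipe baseline such an auto-tuner is typically measured against is a single cuFFT plan; a minimal host-side sketch (transform size chosen arbitrarily for illustration, link with -lcufft):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Single-precision, complex-to-complex, in-place 3-D FFT via cuFFT.
int main()
{
    const int nx = 96, ny = 96, nz = 96;   // a non-power-of-two size
    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * nx * ny * nz);
    cudaMemset(data, 0, sizeof(cufftComplex) * nx * ny * nz);

    cufftHandle plan;
    if (cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C) != CUFFT_SUCCESS) {
        printf("plan creation failed\n");
        return 1;
    }
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);   // in-place forward transform
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```
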
Aug, 22

Evaluation of streaming aggregation on parallel hardware architectures

We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware holds promise as an accelerator for streaming aggregation. However, how streaming aggregation can map to the different parallel architectures […]
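
As a reminder of what the operation itself looks like, here is a small, sequential, purely illustrative sketch of a per-key tumbling-window sum (tuple fields and window size invented for the example); the paper's subject is how to map this kind of computation onto parallel architectures, not the aggregation logic itself.

```cuda
#include <cstdio>
#include <map>
#include <vector>

// One stream tuple: arrival time, grouping key, value to aggregate.
struct Tuple { long timestamp; int key; double value; };

int main()
{
    const long windowSize = 10;                 // tumbling window of 10 time units
    std::vector<Tuple> stream = {
        {1, 7, 1.5}, {4, 7, 2.0}, {6, 3, 4.0},  // window [0, 10)
        {12, 7, 1.0}, {15, 3, 3.0},             // window [10, 20)
    };

    long windowEnd = windowSize;
    std::map<int, double> sums;                 // per-key running sums
    for (const Tuple &t : stream) {
        while (t.timestamp >= windowEnd) {      // window boundary: emit and reset
            for (auto &kv : sums)
                printf("window ending %ld: key %d -> %g\n", windowEnd, kv.first, kv.second);
            sums.clear();
            windowEnd += windowSize;
        }
        sums[t.key] += t.value;                 // the aggregation itself
    }
    for (auto &kv : sums)
        printf("window ending %ld: key %d -> %g\n", windowEnd, kv.first, kv.second);
    return 0;
}
```
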
Aug, 22

A taxonomy of accelerator architectures and their programming models

As the clock frequency of silicon chips is leveling off, the computer architecture community is looking for different solutions to continue application performance scaling. One such solution is the multicore approach, i.e., using multiple simple cores that enable higher performance than wide superscalar processors, provided that the workload can exploit the parallelism. Another emerging alternative […]

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
