5337

Posts

Aug, 23

Performance Modelling and Traffic Characterisation of Optical Networks

A review is carried out on the traffic characteristics of an optical carrier’s OC-192 link, based on the IP packet size distribution, traffic burstiness and self-similarity. The generalised exponential (GE) distribution is employed to model the interarrival times of bursty traffic flows of IP packets whilst self-similar traffic is generated for each wavelength of each […]
Aug, 22

Auto-tuning 3-D FFT library for CUDA GPUs

Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this […]
Aug, 22

Evaluation of streaming aggregation on parallel hardware architectures

We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware provides promise as accelerators for streaming aggregation. However, how streaming aggregation can map to the different parallel architectures […]
Aug, 22

A taxonomy of accelerator architectures and their programming models

As the clock frequency of silicon chips is leveling off, the computer architecture community is looking for different solutions to continue application performance scaling. One such solution is the multicore approach, i.e., using multiple simple cores that enable higher performance than wide superscalar processors, provided that the workload can exploit the parallelism. Another emerging alternative […]
Aug, 22

Floating-point data compression at 75 Gb/s on a GPU

Numeric simulations often generate large amounts of data that need to be stored or sent to other compute nodes. This paper investigates whether GPUs are powerful enough to make real-time data compression and decompression possible in such environments, that is, whether they can operate at the 32- or 40-Gb/s throughput of emerging network cards. The […]
Aug, 22

A fast GEMM implementation on the cypress GPU

We present benchmark results of optimized dense matrix multiplication kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ~ 2 Top/s and ~ 470 Glop/s, respectively. These results for SP and DP correspond to 73% and 87% of […]
Aug, 22

Cost-aware function migration in heterogeneous systems

Today’s approaches towards heterogeneous computing rely on either the programmer or dedicated programming models to efficiently integrate heterogeneous components. In this work, we propose an adaptive cost-aware function-migration mechanism built on top of a light-weight hardware abstraction layer. With this mechanism, the highly dynamic task of choosing the most beneficial processing unit will be hidden […]
Aug, 22

Towards metaprogramming for parallel systems on a chip

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata […]
Aug, 22

Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study

Programs developed under the Compute Unified Device Architecture obtain the highest performance rate, when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. In order to achieve this purpose, load balancing among threads and a high value of processor occupancy, i.e. the ratio of active threads, are indispensable. However, in certain […]
Aug, 22

Top ten ways to make formal methods for HPC practical

Almost all fundamental advances in science and engineering crucially depend on the availability of extremely capable high performance computing (HPC) systems. Future HPC systems will increasingly be based on heterogeneous multi-core CPUs, and their programming will involve multiple concurrency models, with the message passing interface (MPI) serving as the dominant model for many years. These […]
Aug, 22

The VRE volume rendering engine

We present the extendable volume rendering engine VRE which provides an open and flexible environment for both experimental and production level implementation of a wide range of volume visualisation algorithms, including various CPU and GPU based ones. We identify parts of renderer functionality suitable for isolation in logical units and propose various types of plugins. […]
Aug, 22

Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations

A trend that has materialized, and has given rise to much attention, is of the increasingly heterogeneous computing platforms. Presently, it has become very common for a desktop or a notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the maximum computational power of such architectures (i.e., by simultaneously […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: