Posts
Aug, 22
Floating-point data compression at 75 Gb/s on a GPU
Numeric simulations often generate large amounts of data that need to be stored or sent to other compute nodes. This paper investigates whether GPUs are powerful enough to make real-time data compression and decompression possible in such environments, that is, whether they can operate at the 32- or 40-Gb/s throughput of emerging network cards. The […]
Aug, 22
A fast GEMM implementation on the cypress GPU
We present benchmark results of optimized dense matrix multiplication kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ~ 2 Top/s and ~ 470 Glop/s, respectively. These results for SP and DP correspond to 73% and 87% of […]
Aug, 22
Cost-aware function migration in heterogeneous systems
Today’s approaches towards heterogeneous computing rely on either the programmer or dedicated programming models to efficiently integrate heterogeneous components. In this work, we propose an adaptive cost-aware function-migration mechanism built on top of a light-weight hardware abstraction layer. With this mechanism, the highly dynamic task of choosing the most beneficial processing unit will be hidden […]
Aug, 22
Towards metaprogramming for parallel systems on a chip
We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata […]
Aug, 22
Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study
Programs developed under the Compute Unified Device Architecture obtain the highest performance rate, when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. In order to achieve this purpose, load balancing among threads and a high value of processor occupancy, i.e. the ratio of active threads, are indispensable. However, in certain […]
Aug, 22
Top ten ways to make formal methods for HPC practical
Almost all fundamental advances in science and engineering crucially depend on the availability of extremely capable high performance computing (HPC) systems. Future HPC systems will increasingly be based on heterogeneous multi-core CPUs, and their programming will involve multiple concurrency models, with the message passing interface (MPI) serving as the dominant model for many years. These […]
Aug, 22
The VRE volume rendering engine
We present the extendable volume rendering engine VRE which provides an open and flexible environment for both experimental and production level implementation of a wide range of volume visualisation algorithms, including various CPU and GPU based ones. We identify parts of renderer functionality suitable for isolation in logical units and propose various types of plugins. […]
Aug, 22
Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
A trend that has materialized, and has given rise to much attention, is of the increasingly heterogeneous computing platforms. Presently, it has become very common for a desktop or a notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the maximum computational power of such architectures (i.e., by simultaneously […]
Aug, 22
Reusable software components for accelerator-based clusters
The emerging accelerator-based heterogeneous clusters, comprising specialized processors such as the IBM Cell and GPUs, have exhibited excellent price to performance ratio as well as high energy-efficiency. However, developing and maintaining software for such systems is fraught with challenges, especially for modern high-performance computing (HPC) applications that can benefit the most from leveraging accelerators. If […]
Aug, 22
Improving programmability of heterogeneous many-core systems via explicit platform descriptions
In this paper we present ongoing work towards a programming framework for heterogeneous hardware- and software environments. Our framework aims at improving programmability and portability for heterogeneous many-core systems via a Platform Description Language (PDL) for expressing architectural patterns and platform information. We developed a prototypical code generator that takes as input an annotated serial […]
Aug, 22
Accelerating Haskell array codes with multicore GPUs
Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of […]
Aug, 22
A programming model for GPU-based parallel computing with scalability and abstraction
In this paper, we present a multi-level programming model for recent GPU-based high performance computing systems. Involving cooperative stream threads and symmetric multiprocessing threads our model gives a computational framework that scales through multi-GPU environments to GPU-cluster systems. Instead of hiding the execution environment from the programmer using compiler extensions or metaprogramming techniques we aim […]