Posts
Mar, 16
Multi-platform Linear Algebra
HiFlow3 is a multi-purpose finite element software providing powerful tools for efficient and accurate solution of a wide range of problems modeled by partial differential equations (PDEs). Based on object-oriented concepts and the full capabilities of C++ the HiFlow3 project follows a modular and generic approach for building efficient parallel numerical solvers. It provides highly […]
Mar, 15
On the Use of Small 2D Convolutions on GPUs
Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. […]
Mar, 15
Iterative Statistical Kernels on Contemporary GPUs
We present a study of three important kernels that occur frequently in iterative statistical applications: Multi-Dimensional Scaling (MDS), PageRank, and K-Means. We implemented each kernel using OpenCL and evaluated their performance on NVIDIA Tesla and NVIDIA Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. […]
Mar, 15
Performance analysis and optimization of the OP2 framework on many-core architectures
This paper presents a benchmarking, performance analysis and optimization study of the OP2 ‘active’ library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to […]
Mar, 15
Compressed Multiple-Row Storage Format
A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is determined […]
Mar, 15
A Spiking Neural P system simulator based on CUDA
In this paper we present a Spiking Neural P system (SNP system) simulator based on graphics processing units (GPUs). In particular we implement the simulator using NVIDIA CUDA enabled GPUs. The massively parallel architecture of current GPUs is very suitable for the maximally parallel computations of SNP systems. We simulate a wider variety of SNP […]
Mar, 13
Targeting heterogeneous architectures via macro data flow
We propose a data-flow-based runtime system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed runtime system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more "domain […]
Mar, 13
Expressive Array Constructs in an Embedded GPU Kernel Programming Language
Graphics Processing Units (GPUs) are powerful computing devices that, with the advent of CUDA/OpenCL, are becoming useful for general-purpose computations. Obsidian is an embedded domain-specific language that generates CUDA kernels from functional descriptions. A symbolic array construction allows us to guarantee that intermediate arrays are fused away. However, the current array construction has […]
Mar, 13
Parallel Branch and Bound on a CPU-GPU System
A hybrid CUDA implementation of a branch and bound method for knapsack problems is proposed. Branch and bound computations can be carried out either on the CPU or on the GPU according to the size of the branch and bound list, i.e. the number of nodes. Tests are carried out on a Tesla C2050 GPU. […]
Mar, 13
Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries
With GPU architectures becoming increasingly important due to their large number of parallel processors, NVIDIA’s CUDA environment is becoming widely used to support general-purpose applications. To use the parallel processing power efficiently, programmers need to parallelize and map their algorithms efficiently. The difficulty of this task motivates an investigation of CUDA’s compiler. […]
Mar, 13
Real-time execution of image change detection
State-of-the-art video analysis systems feature multiple complex processing steps and operate on high-resolution images. Intensive computation power is needed for real-time execution. In this project an image change detection application is mapped to a heterogeneous multicore CPU/GPU platform. The project investigates which hardware configuration is required to execute the application in real time. For optimal […]
Mar, 12
Dynamic Compilation of Data-Parallel Kernels for Vector Processors
Modern processors enjoy augmented throughput and power efficiency through specialized functional units leveraged via instruction set extensions. These functional units accelerate performance for specific types of operations but must be programmed explicitly. Moreover, applications targeting these specialized units will not take advantage of future ISA extensions and tend not to be portable across multiple ISAs. […]