Posts
Oct, 11
Meta-programming and Multi-stage Programming for GPGPUs
GPGPUs and other accelerators are becoming a mainstream asset for high-performance computing. Improving the programmability of such hardware is essential to enable users to discover, master and subsequently use accelerators in day-to-day simulations. Furthermore, tools for high-level programming of parallel architectures are becoming an effective way to simplify the exploitation of such systems. For this […]
Oct, 11
GPU Accelerated Multi-Block Lattice Boltzmann Solver for Viscous Flow Problems
We developed a lattice Boltzmann solver that can be used to solve low Reynolds number flow problems. We then modified it to run on a Graphics Processing Unit using the Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model created by NVIDIA. Comparison of the results that we obtained on Graphics […]
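As a rough, assumed illustration (not code from the solver in this post), the sketch below shows what a CUDA collision kernel for a simple D2Q9 BGK lattice Boltzmann model can look like; the kernel name, array layout, and parameters are placeholders.

    // Minimal sketch: BGK collision step for a D2Q9 lattice, one thread per node.
    __constant__ float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                                1.f/36, 1.f/36, 1.f/36, 1.f/36};
    __constant__ int   cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    __constant__ int   cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    __global__ void collide(float* f, int nx, int ny, float tau)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;
        int idx = y * nx + x;

        // Macroscopic density and velocity from the distributions.
        float rho = 0.f, ux = 0.f, uy = 0.f;
        for (int i = 0; i < 9; ++i) {
            float fi = f[i * nx * ny + idx];   // structure-of-arrays layout (assumed)
            rho += fi;
            ux  += fi * cx[i];
            uy  += fi * cy[i];
        }
        ux /= rho;  uy /= rho;

        // BGK relaxation towards the local equilibrium distribution.
        float usq = ux * ux + uy * uy;
        for (int i = 0; i < 9; ++i) {
            float cu  = cx[i] * ux + cy[i] * uy;
            float feq = w[i] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
            f[i * nx * ny + idx] -= (f[i * nx * ny + idx] - feq) / tau;
        }
    }

A real solver pairs this with a streaming step, boundary conditions and careful memory-layout tuning, which is where most of the GPU-specific effort goes.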
Oct, 11
Performance Analysis of an Astrophysical Simulation Code on the Intel Xeon Phi Architecture
We have developed the astrophysical simulation code XFLAT to study neutrino oscillations in supernovae. XFLAT is designed to utilize multiple levels of parallelism through MPI, OpenMP, and SIMD instructions (vectorization). It can run on both CPUs and Xeon Phi co-processors based on the Intel Many Integrated Core (MIC) architecture. We analyze the performance of XFLAT […]
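To make the three levels of parallelism concrete, here is a minimal, assumed sketch (not XFLAT code): MPI distributes work across ranks, OpenMP threads share each rank's loop, and the simd clause asks the compiler to vectorize the innermost iterations.

    #include <cstdio>
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);                 // level 1: MPI ranks across nodes / co-processors
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                  // local chunk owned by this rank (placeholder size)
        std::vector<double> a(n, 1.0), b(n, 2.0);
        double local_sum = 0.0;

        // level 2: OpenMP threads share the loop;
        // level 3: the simd clause vectorizes the innermost iterations.
        #pragma omp parallel for simd reduction(+:local_sum)
        for (int i = 0; i < n; ++i)
            local_sum += a[i] * b[i];

        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("dot product = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }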
Oct, 11
Accelerating the D3Q19 Lattice Boltzmann Model with OpenACC and MPI
Multi-GPU implementations of the lattice Boltzmann method are of practical interest as they allow the study of turbulent flows in large-scale simulations at high Reynolds numbers. Although programming GPUs, and power-efficient accelerators in general, typically delivers high performance, the lack of portability in their low-level programming models implies significant effort for maintainability and porting of […]
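The portability argument typically amounts to annotating existing loop nests instead of rewriting them in a low-level model. The sketch below is an assumed, generic stencil update rather than the paper's D3Q19 propagation: one OpenACC directive offloads the loop, while MPI exchanges halo layers between the sub-domains owned by different GPUs. The arrays are assumed to live in an enclosing acc data region, and filling the halo buffers is omitted.

    #include <mpi.h>

    // left/right are the neighbouring MPI ranks of this sub-domain (placeholders).
    void step(const double* f_old, double* f_new,
              double* halo_send, double* halo_recv,
              int nx, int ny, int left, int right)
    {
        // The directive is the only accelerator-specific line: the compiler
        // generates the GPU kernel for the bulk update.
        #pragma acc parallel loop collapse(2) present(f_old, f_new)
        for (int y = 1; y < ny - 1; ++y)
            for (int x = 1; x < nx - 1; ++x)
                f_new[y * nx + x] = 0.25 * (f_old[y * nx + x - 1] + f_old[y * nx + x + 1] +
                                            f_old[(y - 1) * nx + x] + f_old[(y + 1) * nx + x]);

        // Exchange one halo row with the neighbouring ranks/GPUs.
        #pragma acc update self(halo_send[0:nx])
        MPI_Sendrecv(halo_send, nx, MPI_DOUBLE, right, 0,
                     halo_recv, nx, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma acc update device(halo_recv[0:nx])
    }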
Oct, 11
GPU acceleration of preconditioned solvers for ill-conditioned linear systems
In this work we study the implementations of deflation and preconditioning techniques for solving ill-conditioned linear systems using iterative methods. Solving such systems can be a time-consuming process because of the jumps in the coefficients caused by large differences in material properties. We have developed implementations of the iterative methods with these preconditioning techniques on […]
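For context, the sketch below shows where a preconditioner enters a conjugate gradient iteration, using a simple Jacobi (diagonal) preconditioner on a dense symmetric positive definite system. It is an assumed illustration only: the deflation and preconditioning variants studied in the paper, and their GPU implementations, are considerably more elaborate, and all names here are placeholders.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Jacobi-preconditioned CG: M = diag(A), applied as z = M^{-1} r.
    Vec pcg(const Mat& A, const Vec& b, int max_it, double tol)
    {
        std::size_t n = b.size();
        Vec x(n, 0.0), r = b, z(n), p(n), Ap(n);

        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];
        p = z;
        double rz = dot(r, z);

        for (int k = 0; k < max_it && std::sqrt(dot(r, r)) > tol; ++k) {
            for (std::size_t i = 0; i < n; ++i) {                 // Ap = A * p
                Ap[i] = 0.0;
                for (std::size_t j = 0; j < n; ++j) Ap[i] += A[i][j] * p[j];
            }
            double alpha = rz / dot(p, Ap);
            for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];   // apply preconditioner
            double rz_new = dot(r, z);
            for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
            rz = rz_new;
        }
        return x;
    }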
Oct, 8
Introducing CURRENNT: The Munich Open-Source CUDA RecurREnt Neural Network Toolkit
In this article, we introduce CURRENNT, an open-source parallel implementation of deep recurrent neural networks (RNNs) supporting graphics processing units (GPUs) through NVIDIA’s Compute Unified Device Architecture (CUDA). CURRENNT supports uni- and bidirectional RNNs with Long Short-Term Memory (LSTM) cells, which overcome the vanishing gradient problem. To our knowledge, CURRENNT is the first publicly […]
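For reference, a standard LSTM cell without peephole connections computes the following at each time step $t$ (the exact variant implemented in CURRENNT may differ in such details); $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product:

    $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
    $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
    $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
    $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
    $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
    $h_t = o_t \odot \tanh(c_t)$

The additive update of the cell state $c_t$ is what allows gradients to propagate over many time steps, which is how LSTM cells mitigate the vanishing gradient problem.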
Oct, 8
GPU-Based Computation of 2D Least Median of Squares with Applications to Fast and Robust Line Detection
The 2D Least Median of Squares (LMS) is a popular tool in robust regression because of its high breakdown point: up to half of the input data can be contaminated with outliers without affecting the accuracy of the LMS estimator. The complexity of 2D LMS estimation has been shown to be $\Omega(n^2)$ where $n$ is […]
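As a point of reference for the objective being minimised, the sketch below is an assumed, brute-force approximation of the 2D LMS line: it evaluates candidate lines through every pair of points and keeps the one with the smallest median squared residual. The exact estimator and the GPU algorithm of the paper are considerably more sophisticated; all names here are illustrative.

    #include <algorithm>
    #include <limits>
    #include <utility>
    #include <vector>

    struct Line { double a, b; };   // y = a * x + b

    double median_sq_residual(const std::vector<std::pair<double, double>>& pts,
                              double a, double b)
    {
        std::vector<double> r2;
        r2.reserve(pts.size());
        for (const auto& p : pts) {
            double r = p.second - (a * p.first + b);
            r2.push_back(r * r);
        }
        std::nth_element(r2.begin(), r2.begin() + r2.size() / 2, r2.end());
        return r2[r2.size() / 2];
    }

    Line approx_lms(const std::vector<std::pair<double, double>>& pts)
    {
        Line best{0.0, 0.0};
        double best_med = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < pts.size(); ++i)
            for (std::size_t j = i + 1; j < pts.size(); ++j) {
                if (pts[i].first == pts[j].first) continue;   // skip vertical candidates
                double a = (pts[j].second - pts[i].second) /
                           (pts[j].first  - pts[i].first);
                double b = pts[i].second - a * pts[i].first;
                double m = median_sq_residual(pts, a, b);
                if (m < best_med) { best_med = m; best = {a, b}; }
            }
        return best;
    }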
Oct, 8
Kinematic Modelling of Disc Galaxies using Graphics Processing Units
With large-scale Integral Field Spectroscopy (IFS) surveys of thousands of galaxies currently underway or planned, the astronomical community is in need of methods, techniques and tools that will allow the analysis of huge amounts of data. We focus on the kinematic modelling of disc galaxies and investigate the potential use of massively parallel architectures, such […]
Oct, 8
Solving the Quadratic Assignment Problem on a heterogeneous environment (CPUs and GPUs) with the application of the Level 2 Reformulation and Linearization Technique
The Quadratic Assignment Problem (QAP) is a classic combinatorial optimization problem, classified as NP-hard and widely studied. The problem consists of assigning N facilities to N locations in a one-to-one fashion, with the aim of minimizing the displacement costs between the facilities. The application of the Reformulation and Linearization Technique (RLT) to the QAP […]
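To make the objective concrete, the sketch below (an assumed illustration, not the paper's method) evaluates the QAP cost of an assignment and searches all N! permutations exhaustively, which is feasible only for very small N; the paper instead solves a level-2 RLT relaxation on CPUs and GPUs. The flow/dist names are placeholders.

    #include <algorithm>
    #include <limits>
    #include <numeric>
    #include <vector>

    // Cost of one assignment: sum over facility pairs of flow * distance
    // between their assigned locations; perm[i] = location of facility i.
    double qap_cost(const std::vector<std::vector<double>>& flow,
                    const std::vector<std::vector<double>>& dist,
                    const std::vector<int>& perm)
    {
        double cost = 0.0;
        for (std::size_t i = 0; i < perm.size(); ++i)
            for (std::size_t j = 0; j < perm.size(); ++j)
                cost += flow[i][j] * dist[perm[i]][perm[j]];
        return cost;
    }

    std::vector<int> qap_brute_force(const std::vector<std::vector<double>>& flow,
                                     const std::vector<std::vector<double>>& dist)
    {
        std::vector<int> perm(flow.size()), best;
        std::iota(perm.begin(), perm.end(), 0);
        double best_cost = std::numeric_limits<double>::infinity();
        do {
            double c = qap_cost(flow, dist, perm);
            if (c < best_cost) { best_cost = c; best = perm; }
        } while (std::next_permutation(perm.begin(), perm.end()));
        return best;
    }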
Oct, 8
Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization
OmpSs is a task-parallel programming model consisting of a reduced collection of OpenMP-like directives, a front-end compiler, and a runtime system. This directive-based programming interface helps developers accelerate the execution of their applications, e.g., in a cluster equipped with graphics processing units (GPUs), with low programming effort. On the other hand, the virtualization package rCUDA provides […]
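As an assumed illustration of this directive-based, task-parallel style, the sketch below uses standard OpenMP task dependences; OmpSs offers a very similar syntax (in()/out()/inout() clauses on tasks) on top of its own compiler and runtime. The function names and data are placeholders.

    void factor(double* blk)                    { blk[0] *= 0.5; }   // placeholder work
    void update(const double* blk, double* out) { out[0] += blk[0]; }

    void blocked_step(double* A, double* B, double* C)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(inout: A[0])
            factor(A);

            // Both updates depend on A but not on each other,
            // so the runtime may execute them concurrently.
            #pragma omp task depend(in: A[0]) depend(inout: B[0])
            update(A, B);
            #pragma omp task depend(in: A[0]) depend(inout: C[0])
            update(A, C);

            #pragma omp taskwait
        }
    }

In the setting described above, rCUDA's role is to let the CUDA work launched from such tasks execute on GPUs located in other nodes of the cluster, without changes to the application code.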
Oct, 6
CVC: The Contourlet Video Compression algorithm for real-time applications
Nowadays, real-time video communication over the internet through video conferencing applications has become an invaluable tool in everyone’s professional and personal life. This trend underlines the need for video coding algorithms that provide acceptable quality at low bitrates and can support various resolutions within the same stream in order to cope with limitations on computational […]
Oct, 6
MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
Embedded computing, not only in large systems like drones and hybrid vehicles but also in small portable devices like smart phones and watches, is becoming more extreme in order to meet ever-increasing demands for extended and improved functionality. This, combined with the typical constraints of low power consumption and small size, makes the design of numerical libraries […]