Posts
Oct, 8
Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization
OmpSs is a task-parallel programming model consisting of a reduced collection of OpenMP-like directives, a front-end compiler, and a runtime system. This directive-based programming interface helps developers accelerate their application’s execution, e.g. in a cluster equipped with graphics processing units (GPUs), with low programming effort. On the other hand, the virtualization package rCUDA provides […]
Oct, 6
CVC: The Contourlet Video Compression algorithm for real-time applications
Nowadays, real-time video communication over the internet through video conferencing applications has become an invaluable tool in everyone’s professional and personal life. This trend underlines the need for video coding algorithms that provide acceptable quality at low bitrates and can support various resolutions inside the same stream in order to cope with limitations on computational […]
Oct, 6
MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smartphones and watches, is becoming more extreme to meet ever-increasing demands for extended and improved functionality. This, combined with the typical constraints of low power consumption and small size, makes the design of numerical libraries […]
Oct, 6
Parallel Graph Algorithms on the Xeon Phi Coprocessor
Complex networks have received interest in a wide range of applications, from road networks and hyperlink connections in the World Wide Web to interactions between people. Advanced algorithms are required for the generation as well as visualization of such graphs. In this work, two graph algorithms, one for graph generation, the other for graph […]
Oct, 6
Optimizing GPU-accelerated Group-By and Aggregation
The massive parallelism and faster random memory access of Graphics Processing Units (GPUs) promise to further accelerate complex analytics operations such as joins and grouping, but also provide additional challenges to optimizing their performance. There are more implementation alternatives to consider on the GPU, such as exploiting different types of memory on the device and […]
Oct, 6
A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs
Array-based languages such as MATLAB and Python (with NumPy) have become very popular for scientific computing. However, the performance of the implementations of these languages is often lacking. For example, some of the implementations are interpreted. Further, these languages were not designed with multi-core CPUs and GPUs in mind and thus don’t take full advantage […]
Oct, 3
Tuned and GPU-accelerated parallel data mining from comparable corpora
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on […]
Oct, 3
Fast Algorithms for Convolutional Neural Networks
We derive a new class of fast algorithms for convolutional neural networks using Winograd’s minimal filtering algorithms. Specifically, we derive algorithms for network layers with 3×3 kernels, which are the preferred kernel size for image recognition tasks. The best of our algorithms reduces arithmetic complexity by up to 4X compared with direct convolution, while using small […]
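As a rough illustration of what "minimal filtering" means, the sketch below implements the standard one-dimensional Winograd F(2,3) case in plain Python: two outputs of a 3-tap filter computed with 4 multiplications instead of the 6 used by direct evaluation. The function names `winograd_f23` and `direct_f23` are hypothetical and chosen for this example only. This is the textbook 1D case, not the 2D tiled algorithm the post derives, whose details are elided in the excerpt.

```python
def direct_f23(d, g):
    """Direct evaluation: 2 outputs of a 3-tap filter, 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 multiplications.

    d is a tile of 4 input samples, g is a 3-tap filter.
    The filter-side terms (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2
    depend only on g, so they can be precomputed once per filter.
    """
    g_plus = (g[0] + g[1] + g[2]) / 2
    g_minus = (g[0] - g[1] + g[2]) / 2

    # The four multiplications of the minimal algorithm.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_plus
    m3 = (d[2] - d[1]) * g_minus
    m4 = (d[1] - d[3]) * g[2]

    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 0.0, -1.0]
    print(direct_f23(tile, taps))
    print(winograd_f23(tile, taps))
```

The saving comes from trading multiplications for additions on the input and filter sides; nesting the 1D transform in two dimensions is the usual route to larger reductions for 3×3 kernels.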
Oct, 3
Brute-Force k-Nearest Neighbors Search on the GPU
We present a brute-force approach for finding k-nearest neighbors on the GPU for many queries in parallel. Our program takes advantage of recent advances in fundamental GPU computing primitives. We modify a matrix multiplication subroutine in the MAGMA library [6] to calculate the squared Euclidean distances between queries and references. The nearest neighbors selection is accomplished […]
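The link between matrix multiplication and squared Euclidean distances can be sketched on the CPU in plain Python. The sketch relies on the identity ||q - r||² = ||q||² + ||r||² - 2 q·r, where the q·r terms for all query/reference pairs are exactly the entries of a matrix product. The function name `knn_brute_force` is hypothetical, and the selection step uses `heapq.nsmallest` purely as a stand-in, since the excerpt is cut off before describing how the post performs selection.

```python
import heapq


def knn_brute_force(queries, refs, k):
    """Return, for each query, the indices of its k nearest references.

    Distances are squared Euclidean, computed via the expansion
    ||q||^2 + ||r||^2 - 2 * (q . r). The dot products form the
    queries x refs^T matrix product, which is the part a GPU
    matrix multiplication routine would compute.
    """
    q_norms = [sum(x * x for x in q) for q in queries]
    r_norms = [sum(x * x for x in r) for r in refs]

    neighbors = []
    for i, q in enumerate(queries):
        # One row of the queries x refs^T product.
        dots = [sum(a * b for a, b in zip(q, r)) for r in refs]
        dists = [q_norms[i] + r_norms[j] - 2 * dots[j] for j in range(len(refs))]
        # Stand-in selection: indices of the k smallest distances, nearest first.
        nearest = heapq.nsmallest(k, range(len(refs)), key=lambda j: dists[j])
        neighbors.append(nearest)
    return neighbors


if __name__ == "__main__":
    queries = [[0, 0], [5, 5]]
    refs = [[1, 0], [4, 4], [0, 2], [6, 5]]
    print(knn_brute_force(queries, refs, k=2))
```

Precomputing the norms once and leaving only the dot products inside the inner loop mirrors why the distance computation maps well onto a matrix multiplication kernel.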
Sep, 30
Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards
We discuss an approach for solving sparse or dense banded linear systems ${\bf A} {\bf x} = {\bf b}$ on a Graphics Processing Unit (GPU) card. The matrix ${\bf A} \in {\mathbb{R}}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The ${\it split\ and\ parallelize}$ (${\tt SaP}$) approach seeks […]
Sep, 29
Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic
Insider threat is one of the risks that both government and private organizations have to deal with in protecting their important information. Data exfiltration and data leakage resulting from insiders’ activities can be very difficult to identify and quantify. Unfortunately, existing solutions that efficiently check whether data moving across a network is known to be sensitive […]
Sep, 29
The Dynamical Kernel Scheduler – Part 1
Emerging processor architectures such as GPUs and Intel MICs provide huge performance potential for high-performance computing. However, developing software for these hardware accelerators introduces additional challenges for the developer, such as exposing additional parallelism, dealing with different hardware designs, and using multiple development frameworks in order to use devices from different vendors. The […]