Fast sorting is an important step in many parallel algorithms that require data ranking, ordering or partitioning. Parallel sorting is a widely researched subject, and many algorithms have been developed in the past. In this paper, the focus is on implementing highly efficient sorting routines for sparse linear algebra operations, such as parallel sparse matrix […]

March 18, 2015 by hgpu

BACKGROUND: In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. […]

February 19, 2015 by hgpu

Scientific applications require solvers that work on many small problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems that we call […]

January 15, 2015 by hgpu

Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate […]

January 5, 2015 by hgpu

We present a block structured orthogonal factorization (BSOF) algorithm and its parallelization for computing the inversion of block p-cyclic matrices. We aim at high performance on multicores with GPU accelerators. We provide a quantitative performance model for optimal host-device load balance, and validate the model through numerical tests. Benchmarking results show that the parallel BSOF […]

August 23, 2014 by hgpu

Aiming to fully exploit the computing power of all CPUs and all GPUs on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, as well as to accommodate the heterogeneity between CPUs and GPUs. The new […]

August 1, 2014 by hgpu

Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General-purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel […]

July 29, 2014 by hgpu

Parallel computing is a topic that has become very popular in the last few decades. Parallel computers are being used in many different areas of science such as astrophysics, climate modelling, quantum chemistry, fluid dynamics and medicine. Parallel programming is a type of programming where computations can be performed concurrently on different processors or devices. There […]

July 7, 2014 by hgpu

In this paper we present new hybrid CPU-GPU routines to accelerate the solution of linear systems with a banded coefficient matrix, by off-loading the major part of the computations to the GPU and leveraging highly tuned implementations of the BLAS for the graphics processor. Our experiments with an nVidia S2070 GPU report speed-ups of up to 6x […]

June 23, 2014 by hgpu

Currently, state-of-the-art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, usually referred to as batched problems, is not given adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms – nonblocked, blocked, and recursive blocked – […]

June 12, 2014 by hgpu

We show how the cofactorization step, a compute-intensive part of the relation collection phase of the number field sieve (NFS), can be farmed out to a graphics processing unit. Our implementation on a GTX 580 GPU, which is integrated with a state-of-the-art NFS implementation, can serve as a cryptanalytic co-processor for several Intel i7-3770K quad-core […]

June 3, 2014 by hgpu

This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros in the incomplete factors can be computed in parallel and asynchronously, using one or more sweeps that iteratively improve the accuracy of the factorization. Unlike existing parallel algorithms, the new algorithm does not depend on reordering the matrix. Numerical […]

May 17, 2014 by hgpu
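
As a rough illustration of the sweep idea in the incomplete LU abstract above, here is a minimal Python sketch. It is not the paper's implementation: the update formulas, the function name `ilu_sweeps`, the dense list-of-lists storage, and the initial guess (the triangles of A) are assumptions made for illustration. Each sweep recomputes every entry in the sparsity pattern from the previous sweep's values only, so the updates within a sweep do not depend on one another and could be run in parallel.

```python
def ilu_sweeps(A, pattern, sweeps):
    """Approximate an incomplete LU factorization with Jacobi-style sweeps.

    A       : square matrix as a list of lists (hypothetical dense storage)
    pattern : set of (i, j) positions allowed to be nonzero; must contain
              every diagonal position (i, i)
    sweeps  : number of fixed-point sweeps to perform
    Returns (L, U) with L unit lower triangular and U upper triangular.
    """
    n = len(A)
    # Assumed initial guess: L is the strict lower triangle of A plus a
    # unit diagonal, U is the upper triangle of A (both limited to pattern).
    L = [[1.0 if i == j else
          (float(A[i][j]) if i > j and (i, j) in pattern else 0.0)
          for j in range(n)] for i in range(n)]
    U = [[float(A[i][j]) if i <= j and (i, j) in pattern else 0.0
          for j in range(n)] for i in range(n)]

    for _ in range(sweeps):
        # Freeze the previous iterate: every update below reads only these
        # copies, which is what makes the updates mutually independent.
        L_old = [row[:] for row in L]
        U_old = [row[:] for row in U]
        for (i, j) in pattern:
            s = sum(L_old[i][k] * U_old[k][j] for k in range(min(i, j)))
            if i > j:
                L[i][j] = (A[i][j] - s) / U_old[j][j]
            else:
                U[i][j] = A[i][j] - s
    return L, U


# Usage: with a full pattern the sweeps approach the ordinary LU factors.
A = [[4.0, 1.0, 1.0],
     [1.0, 4.0, 1.0],
     [1.0, 1.0, 4.0]]
full_pattern = {(i, j) for i in range(3) for j in range(3)}
L, U = ilu_sweeps(A, full_pattern, sweeps=5)
```

The sequential loop over `pattern` stands in for what would be one thread per nonzero on a parallel device; no reordering of the matrix is involved.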