Posts
Jan, 30
Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips
Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed […]
Jan, 30
GPU-Accelerated BWT Construction for Large Collection of Short Reads
Advances in DNA sequencing technology have stimulated the development of algorithms and tools for processing very large collections of short strings (reads). Short-read alignment and assembly are among the most well-studied problems. Many state-of-the-art aligners, at their core, have used the Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome (typical example, NCBI […]
Jan, 30
A GPU accelerated algorithm for 3D Delaunay triangulation
We propose the first algorithm to compute the 3D Delaunay triangulation (DT) on the GPU. Our algorithm uses massively parallel point insertion followed by bilateral flipping, a powerful local operation in computational geometry. Although a flipping algorithm is very amenable to parallel processing and has been employed to construct the 2D DT and the 3D […]
Jan, 30
A CUDA Monte Carlo simulator for radiation therapy dosimetry based on Geant4
Geant4 is a large-scale particle physics package that facilitates every aspect of particle transport simulation. This includes, but is not limited to, geometry description, material definition, tracking of particles passing through and interacting with matter, storage of event data, and visualization. As more detailed and complex simulations are required in different application domains, there is […]
Jan, 30
A QUDA-branch to compute disconnected diagrams in GPUs
Although QUDA allows for an efficient computation of many QCD quantities, it surprisingly lacks tools to evaluate disconnected diagrams, for which GPUs are especially well suited. We aim to fill this gap by creating our own branch of QUDA, which includes new kernels and functions required to calculate fermion loops using several methods and […]
Jan, 29
A Detailed GPU Cache Model Based on Reuse Distance Theory
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means […]
Jan, 29
Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators
The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse and […]
Jan, 29
Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems
By scheduling multiple applications with complementary resource requirements on a smaller number of compute nodes, we aim to improve performance, resource utilization, energy consumption, and energy efficiency simultaneously. In addition to our naive consolidation approach, which already achieves the aforementioned goals, we propose a new energy efficiency-aware (EEA) scheduling policy and compare its performance with […]
Jan, 29
Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors
Wideband channelization is a computationally intensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy these requirements. While FPGAs allow for flexible implementations, realizing an FPGA implementation is a difficult and time-consuming process. On the […]
Jan, 29
On the Programmability and Performance of Heterogeneous Platforms
General-purpose computing on an ever-broadening array of parallel devices has led to an increasingly complex and multi-dimensional landscape with respect to programmability and performance optimization. The growing diversity of parallel architectures presents many challenges to the domain scientist, including device selection, programming model, and level of investment in optimization. All of these choices influence the […]
Jan, 29
A Performance Criteria for parallel Computation on basis of block size using CUDA Architecture
A GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device. Multiplication of large matrices can be computed in a few seconds using such a GPU. A modern GPU consists of 16 highly threaded streaming multiprocessors (SMs). The GPU named Fermi consists of 32 SMs. These are compute-intensive devices. […]
Jan, 29
Impact of communication times on mixed CPU/GPU applications scheduling using KAAPI
High-performance computing machines increasingly use graphics processing units (GPUs), as they are very efficient for homogeneous computation such as matrix operations. However, before using these accelerators, one has to transfer data from the processor to them. Such a transfer can be slow. In this report, our aim is to study the impact of […]