Posts
Mar, 12
Fast hydrodynamics on heterogenous many-core hardware
In this chapter, we present details of a heterogeneous and massively parallel GPU library implementation in CUDA C/C++ of a nonlinear free surface water wave model [15]. We describe how flexible-order finite difference approximations to the partial differential equations of the model can be prototyped using library components provided in an in-house library. In […]
Mar, 12
Development of High-Performance Software Components for Emerging Architectures
Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast number of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are […]
Mar, 12
2014 7th International Conference on Advanced Computer Theory and Engineering, ICACTE 2014
Submission Deadline: 2014-06-05 Publication: All accepted papers of ICACTE 2014 will be published in the conference proceedings, under an ISBN reference by ASME Press, which will be included in the ASME Digital Library, and the publisher will send the proceedings to be reviewed by Ei Compendex, ISI Proceedings and other major indexing services. Call […]
Mar, 12
Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors
Intel Xeon Phi coprocessors allow symmetric heterogeneous clustering models, in which MPI processes are run fully on coprocessors, as opposed to offload-based clustering. These symmetric models are attractive, because they allow effortless porting of CPU-based applications to clusters with manycore computing accelerators. However, with the default software configuration and without specialized networking hardware, peer-to-peer communication […]
Mar, 12
Locality optimization on a NUMA architecture for hybrid LU factorization
We study the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We […]
Mar, 12
Reduced Vlasov-Maxwell simulations
In this paper we review two different numerical methods for Vlasov-Maxwell simulations. The first method is based on a coupling between a Discontinuous Galerkin (DG) Maxwell solver and a Particle-In-Cell (PIC) Vlasov solver. The second method only uses a DG approach for the Vlasov and Maxwell equations. The Vlasov equation is first reduced to a […]
Mar, 12
Genetically Improved CUDA kernels for StereoCamera
Genetic Programming (GP) may dramatically increase the performance of software written by domain experts. GP and autotuning are used to optimise and refactor legacy GPGPU C code for modern parallel graphics hardware and software. Speed ups of more than six times on recent nVidia GPU cards are reported compared to the original kernel on the […]
Mar, 12
Efficient Preconditioned Conjugate Gradient Parallelization on GPU
We present a performance analysis of a parallel implementation of both conjugate gradient and preconditioned conjugate gradient solvers using graphics processing units with the CUDA parallel programming model. The solvers were optimized for a fast solution of sparse systems of equations arising from Finite Element Analysis (FEA) of electromagnetic phenomena. The preconditioners were Incomplete Cholesky factorization […]
Mar, 12
MaxSSmap: A GPU program for short read mapping with the maximum scoring subsequence
Exact short read mapping to whole genomes with the Smith-Waterman algorithm is computationally expensive yet highly accurate when aligning reads with mismatches and gaps. We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to mainstream approaches, MaxSSmap identifies a local region of the […]
Mar, 10
OpenCL-Accelerated Simplified General Perturbations 4 Algorithm
The number of space objects such as satellites, spacecraft, and debris is increasing significantly, and so is the need for tracking them for security and collision avoidance purposes. In this context, as parallelism is becoming a new paradigm, the need for high-performance propagators remains unmet. For this, we implemented Simplified General Perturbations No. […]
Mar, 10
GPU-EvR: Run-time Event Based Real-time Scheduling Framework on GPGPU Platform
GPU architectures have traditionally been used in graphics applications because of their enormous computing capability. More recently, they have also been used for general-purpose computing. Most of the current scheduling frameworks that are developed to handle GPGPU workloads operate sequentially. This is problematic since this sequential approach may not be scalable […]
Mar, 10
Massively parallel read mapping on GPUs with PEANUT
We present PEANUT (ParallEl AligNment UTility), a highly parallel GPU-based read mapper with several distinguishing features, including a novel q-gram index (called the q-group index) with a small memory footprint built on-the-fly over the reads and the possibility to output either the best hits or all hits of a read. Designing the algorithm particularly for the […]