Posts
Jan, 14
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of GPU’s memory hierarchy. The algorithm may outperform MAGMA’s dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies […]
Jan, 14
Towards Portable Performance for Explicit Hydrodynamics Codes
Significantly increasing intra-node parallelism is widely recognised as being a key prerequisite for reaching exascale levels of computational performance. In future exascale systems it is likely that this performance improvement will be realised by increasing the parallelism available in traditional CPU devices and using massively-parallel hardware accelerators. The MPI programming model is starting to reach […]
Jan, 14
Parallelization and Optimization of Feature Detection Algorithms on Embedded GPU
In this paper, we parallelize and optimize the popular feature detection algorithms, i.e. SIFT and SURF, on the latest embedded GPU. Using conventional OpenGL shading language and recently developed OpenCL as the GPGPU software platforms, we compare the implementation efficiency and speed performance between each other as well as between GPU and CPU. Experimental result […]
Jan, 14
k+-buffer: Fragment Synchronized k-buffer
k-buffer facilitates novel approaches to multi-fragment rendering and visualization for developing interactive applications on the GPU. Various alternatives have been proposed to alleviate its memory hazards and to avoid completely or partially the necessity of geometry pre-sorting. However, that came with the burden of excessive memory allocation and depth precision artifacts. We introduce k+-buffer, a […]
Jan, 14
High Performance Programming for Soft Computing
This book examines the present and future of soft computer techniques. It explains how to use the latest technological tools, such as multicore processors and graphics processing units, to implement highly efficient intelligent system methods using a general purpose computer.
Jan, 14
GPUs for real-time processing in HEP trigger systems
We describe a pilot project (GAP – GPU Application Project) for the use of GPUs (Graphics processing units) in online triggering applications for High Energy Physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems such as commercial […]
Jan, 12
A Framework for Productive, Efficient and Portable Parallel Computing
Developing efficient parallel implementations and fully utilizing the available resources of parallel platforms is now required for software applications to scale to new generations of processors. Yet, parallel programming remains challenging to programmers due to the requisite low-level knowledge of the underlying hardware and parallel computing constructs. These restrictions in turn impede experimentation with various […]
Jan, 12
Importance-Driven Isosurface Decimation for Visualization of Large Simulation Data Based on OpenCL
For large simulation data, Parallel Marching Cubes algorithm is efficient and commonly used to extract isosurfaces in 3D scalar field. However, the isosurface meshes are sometimes too dense and it is difficult for scientists to specify the areas they are interested in. In this paper, we provide them a new way to define mesh importance […]
Jan, 12
A tool for mapping Single Nucleotide Polymorphisms using Graphics Processing Units
BACKGROUND: Single Nucleotide Polymorphism (SNP) genotyping analysis is very susceptible to SNPs chromosomal position errors. As it is known, SNPs mapping data are provided along the SNP arrays without any necessary information to assess in advance their accuracy. Moreover, these mapping data are related to a given build of a genome and need to be […]
Jan, 12
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
High throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs) and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resource required by a TB is allocated/released when it […]
Jan, 12
GPU-Accelerated parallel FDTD on Distributed Heterogeneous Platform
This paper introduces a (Finite-Difference Time-Domain) FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure […]
Jan, 11
Implementations of the Hough Transform on the Embedded Multicore Processors
Embedded multicore processors represented by FPGAs and GPUs have lately attracted considerable attention for their potential computation ability and power consumption. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 Family FPGAs have a DSP48E1 slice, which is a configurable logic block equipped with fast multipliers, adders, pipeline registers, […]