Posts

Jun, 16

An in-depth performance analysis of irregular workloads on VLIW APU

Heterogeneous multi-core architectures have a higher performance/power ratio than traditional homogeneous architectures. Due to their heterogeneity, these architectures support diverse applications, but developing parallel algorithms on them can be difficult. Implementing algorithms for heterogeneous systems often requires proprietary languages, limiting portability. Although general-purpose graphics processing units (GPUs) have shown great promise […]
Jun, 16

Improved Distance Weighted GPU-based 3D Ultrasound Reconstruction Methods

Ultrasound is a flexible medical imaging modality with many uses, one of them being intra-operative imaging for use in navigation. In order to obtain the highest possible spatial resolution and avoid big, clunky 3D ultrasound probes, reconstruction of many 2D ultrasound images obtained by a conventional 2D ultrasound probe with a tracking system attached has […]
Jun, 16

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages. Although GPUs typically show deterministic behaviour in terms of latency in executing computational kernels as soon as data is available in their internal memories, assessment of real-time features […]
Jun, 16

Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels

Due to the diversity of processor architectures and application memory access patterns, the performance impact of using local memory in OpenCL kernels has become unpredictable. For example, enabling the use of local memory for an OpenCL kernel can be beneficial for the execution on a GPU, but can lead to performance losses when running on […]
Jun, 15

Toward OpenCL Automatic Multi-Device Support

To fully tap into the potential of today's heterogeneous machines, offloading parts of an application onto accelerators is no longer sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. […]
Jun, 15

Evaluating CP2K on Exascale Hardware: Intel Xeon Phi

CP2K, a popular open-source European atomistic simulation package, has been ported to the Intel Xeon Phi architecture, requiring no code modifications except minor bug fixes. Benchmarking of a small molecular dynamics simulation has been carried out using CP2K’s existing MPI, OpenMP and mixed-mode MPI/OpenMP implementations to achieve full utilisation of the Xeon Phi’s 240 virtual […]
Jun, 15

Airborne radar clutter simulation using GPU (CUDA)

Radar is an object detection system. Airborne radar is meant to search for, detect and track aerial objects. Clutter is an unwanted echo that interferes with the observation of the signal on the radar screen. This paper discusses the use of GPU and CUDA. Graphics processing unit (GPU) computing is the use of a GPU together with a CPU […]
Jun, 15

A Parallel Algorithm of PCA-SIFT Based on CUDA

PCA-SIFT is an algorithm for extracting invariant features from images; it has been widely applied in many fields, including image processing, computer vision and pattern recognition. However, the execution of PCA-SIFT is time-consuming. A parallel algorithm of PCA-SIFT based on Compute Unified Device Architecture (CUDA) is proposed in this paper, in which each step […]
Jun, 15

Scalable Lattice Boltzmann Solvers for CUDA GPU Clusters

The lattice Boltzmann method (LBM) is an innovative and promising approach in computational fluid dynamics. From an algorithmic standpoint it reduces to a regular data parallel procedure and is therefore well-suited to high performance computations. Numerous works report efficient implementations of the LBM for the GPU, but very few mention multi-GPU versions and even fewer […]
Jun, 14

Image Denoising Using Wavelet Transform and CUDA

The discrete wavelet transform has a huge number of applications in science, engineering, mathematics and computer science. Most notably, it is used for signal coding to represent a discrete signal in a more redundant form, often as a preconditioning for data compression. Beginning in the 1990s, wavelets have been found to be a powerful tool […]
Jun, 14

Dynamic loop vectorization for executing OpenCL kernels on CPUs

Heterogeneous computing platforms are becoming increasingly important in supercomputing. Many systems now integrate CPUs and GPUs cooperating on a single node. Much effort is invested in tuning GPU kernels. However, some systems may not have GPUs, or the GPUs may be busy. Maintaining two versions of the same code for […]
Jun, 14

A GPU Implementation of Large Neighborhood Search for Solving Constraint Optimization Problems

Constraint programming has gained prominence as an effective and declarative paradigm for modeling and solving complex combinatorial problems. In particular, techniques based on local search have proved practical to solve real-world problems, providing a good compromise between optimality and efficiency. In spite of the natural presence of concurrency, there has been relatively limited effort to […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors