Posts

Sep, 29

Decoupling algorithms from the organization of computation for high performance image processing

Future graphics and imaging applications, from self-driving cars to 4D light field cameras to pervasive sensing, demand orders of magnitude more computation than we currently have. This thesis argues that the efficiency and performance of an application are determined not only by the algorithm and the hardware architecture on which it runs, but critically also by the […]
Sep, 29

XBOOLE-CUDA: Fast Boolean Operations on the GPU

The Boolean domain confronts us with the exponential complexity of Boolean functions, and technological progress in micro- and nano-electronics allows increasing numbers of Boolean variables. This requires very powerful Boolean computations. The progress in the performance of Graphics Processing Units (GPUs) and the possibility to utilize the GPU to solve tasks of many application […]
Sep, 28

An hybrid AES-256-GCM implementation for NEON CPU & CUDA GPU

This paper describes and evaluates a fast, hybrid implementation of the Advanced Encryption Standard with 256-bit keys (AES-256) block encryption in Galois/Counter Mode (GCM). The implementation is bit-compatible with the implemented standard in both the OpenSSL and Crypto++ libraries, while being significantly (up to three times) faster for large amounts of data. In this implementation, […]
Sep, 28

A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or […]
Sep, 28

High-performance Implementations and Large-scale Validation of the Link-wise Artificial Compressibility Method

The link-wise artificial compressibility method (LW-ACM) is a recent formulation of the artificial compressibility method for solving the incompressible Navier-Stokes equations. Two implementations of the LW-ACM in three dimensions on CUDA-enabled GPUs are described. The first one is a modified version of a state-of-the-art CUDA implementation of the lattice Boltzmann method (LBM), showing that […]
Sep, 28

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

The broad adoption of accelerators boosts the interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly but it is proprietary and only applicable to GPUs, and the difficulty in writing efficient CUDA code has kindled the necessity to create […]
Sep, 28

Accelerating Phylogenetic Inference on GPUs: an OpenACC and CUDA comparison

Phylogenetic inference is used to derive a "tree of life" for a collection of species whose DNA sequences are known. While several software packages have already been developed to take advantage of GPUs to accelerate phylogenetic inference, they typically require significant changes to the original code, constraining code maintenance. Recently, the OpenACC API was proposed […]
Sep, 25

An open source finite-difference time-domain solver for room acoustics using graphics processing units

Wave-based simulation methods have been utilized to numerically estimate wave propagation in domains where low-frequency wave effects dominate the response. Finite-difference time-domain (FDTD) methods are increasingly useful for such problems, but they require massive spatial oversampling to increase the bandwidth of the simulation, which leads to significant computational expense. The advantage of explicit time-stepping […]
Sep, 25

Study on semi-global matching algorithm extended for multi baseline matching and parallel processing method based on GPU

This paper extends the semi-global matching algorithm to multi-baseline matching to improve matching reliability. It studies kernel function optimization strategies and the GPU thread execution scheme for computing and aggregating the matching cost cube, and realizes fine-grained parallel processing on the GPU. The experimental results using three UCD aerial images on a Tesla C2050 GPU […]
Sep, 25

Performance Evaluation of Edge Detection Techniques on GPU Using OpenCL

GPUs (Graphics Processing Units) enhance performance in the computing field thanks to their hundreds of cores operating in parallel. CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) are programming models for GPUs. The advantage of these two programming models is that developers don’t have to understand any […]
Sep, 25

MASCOT: Fast and Highly Scalable SVM Cross-validation using GPUs and SSDs

Cross-validation is a commonly used method for evaluating the effectiveness of Support Vector Machines (SVMs). However, existing SVM cross-validation algorithms are not scalable to large datasets because they have to (i) hold the whole dataset in memory and/or (ii) perform a very large number of kernel value computations. In this paper, we propose a scheme […]
Sep, 25

Scalability Analysis of Parallel Algorithms on GPU Clusters

Scalability is an important concept in the domain of parallel computing. Since Graphics Processing Unit (GPU) clusters are and will be widely utilized in high performance computing platforms, we investigate the factors influencing the scalability for combinations of parallel algorithms (PA) and GPU clusters (GC). We present a scalability model for the PA-GC combination and then validate […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
