12906

Posts

Sep, 30

Harnessing GPU Computing in System-Level Software

As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What’s more, today’s application workloads have […]
Sep, 30

Implementation of a Multi-User Detector for Satellite Return Links on a GPU Platform

Due to the scarcity and high cost of satellite frequency spectrum, it is very important to utilize the available spectrum as efficiently as possible. The efficient usage of the spectrum in the satellite return link is a challenging task, especially if multiple users are present. In previous works Multi-User Detection (MUD) techniques have been widely […]
Sep, 30

A (Somewhat Dated) Comparative Study of Betweenness Centrality Algorithms on GPU

The problem of computing the Betweenness Centrality (BC) is important in analyzing graphs in many practical applications like social networks, biological networks, transportation networks, electrical circuits, etc. Since this problem is computation intensive, researchers have been developing algorithms using high performance computing resources like supercomputers, clusters, and Graphics Processing Units (GPUs). Current GPU algorithms for […]
Sep, 29

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it […]
Sep, 29

A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction

Background subtraction is an essential first stage in many vision applications differentiating foreground pixels from the background scene, with Mixture of Gaussians (MoG) being a widely used implementation choice. MoG’s high computation demand renders a real-time single threaded realization infeasible. With it’s pixel level parallelism, deploying MoG on top of parallel architectures such as a […]
Sep, 29

Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects

Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that have high compute density and high performance per watt. However, these many-core architectures cause systems to be heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers also […]
Sep, 29

Decoupling algorithms from the organization of computation for high performance image processing

Future graphics and imaging applications-from self-driving cards, to 4D light field cameras, to pervasive sensing-demand orders of magnitude more computation than we currently have. This thesis argues that the efficiency and performance of an application are determined not only by the algorithm and the hardware architecture on which it runs, but critically also by the […]
Sep, 29

XBOOLE-CUDA: Fast Boolean Operations on the GPU

The Boolean domain faces us with the exponential complexity of Boolean functions and the technological progress in micro- and nano-electronics allows increasing numbers of Boolean variables. This requires very powerful Boolean computations. The progress in the performance of Graphics Processing Units (GPUs) and the possibility to utilize the GPU to solve tasks of many application […]
Sep, 28

An hybrid AES-256-GCM implementation for NEON CPU & CUDA GPU

This paper describes & evaluates a fast, hybrid implementation of the Advanced Encryption Standard with 256 bit keys (AES-256) block encryption in Galois/Counter Mode (GCM). The implementation is bit-compatible with the implemented standard in both the OpenSSL and Crypto++ libraries, while significantly (up to three times) faster for large amount of data. In this implementation, […]
Sep, 28

A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or […]
Sep, 28

High-performance Implementations and Large-scale Validation of the Link-wise Artificial Compressibility Method

The link-wise artificial compressibility method (LW-ACM) is a recent formulation of the artificial compressibility method for solving the incompressible Navier-Stokes equations. Two implementations of the LW-ACM in three dimensions on CUDA enabled GPUs are described. The first one is a modified version of a state-of-the-art CUDA implementation of the lattice Boltzmann method (LBM), showing that […]
Sep, 28

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

The broad adoption of accelerators boosts the interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly but it is proprietary and only applicable to GPUs, and the difficulty in writing efficient CUDA code has kindled the necessity to create […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org