Posts
Sep, 30
A (Somewhat Dated) Comparative Study of Betweenness Centrality Algorithms on GPU
The problem of computing the Betweenness Centrality (BC) is important in analyzing graphs in many practical applications like social networks, biological networks, transportation networks, electrical circuits, etc. Since this problem is computation intensive, researchers have been developing algorithms using high performance computing resources like supercomputers, clusters, and Graphics Processing Units (GPUs). Current GPU algorithms for […]
Sep, 29
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it […]
Sep, 29
A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction
Background subtraction is an essential first stage in many vision applications, differentiating foreground pixels from the background scene, with Mixture of Gaussians (MoG) being a widely used implementation choice. MoG’s high computation demand renders a real-time single-threaded realization infeasible. With its pixel-level parallelism, deploying MoG on top of parallel architectures such as a […]
Sep, 29
Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that have high compute density and high performance per watt. However, these many-core architectures cause systems to be heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers also […]
Sep, 29
Decoupling algorithms from the organization of computation for high performance image processing
Future graphics and imaging applications (from self-driving cars, to 4D light field cameras, to pervasive sensing) demand orders of magnitude more computation than we currently have. This thesis argues that the efficiency and performance of an application are determined not only by the algorithm and the hardware architecture on which it runs, but critically also by the […]
Sep, 29
XBOOLE-CUDA: Fast Boolean Operations on the GPU
The Boolean domain confronts us with the exponential complexity of Boolean functions, and technological progress in micro- and nano-electronics allows increasing numbers of Boolean variables. This requires very powerful Boolean computations. The progress in the performance of Graphics Processing Units (GPUs) and the possibility to utilize the GPU to solve tasks of many application […]
Sep, 28
An hybrid AES-256-GCM implementation for NEON CPU & CUDA GPU
This paper describes and evaluates a fast, hybrid implementation of the Advanced Encryption Standard with 256-bit keys (AES-256) block encryption in Galois/Counter Mode (GCM). The implementation is bit-compatible with the implemented standard in both the OpenSSL and Crypto++ libraries, while being significantly (up to three times) faster for large amounts of data. In this implementation, […]
Sep, 28
A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or […]
Sep, 28
High-performance Implementations and Large-scale Validation of the Link-wise Artificial Compressibility Method
The link-wise artificial compressibility method (LW-ACM) is a recent formulation of the artificial compressibility method for solving the incompressible Navier-Stokes equations. Two implementations of the LW-ACM in three dimensions on CUDA-enabled GPUs are described. The first one is a modified version of a state-of-the-art CUDA implementation of the lattice Boltzmann method (LBM), showing that […]
Sep, 28
NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model
The broad adoption of accelerators boosts the interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly but it is proprietary and only applicable to GPUs, and the difficulty in writing efficient CUDA code has kindled the necessity to create […]
Sep, 28
Accelerating Phylogenetic Inference on GPUs: an OpenACC and CUDA comparison
Phylogenetic inference is used to derive a "tree of life" for a collection of species whose DNA sequences are known. While several software packages have already been developed to take advantage of GPUs to accelerate phylogenetic inference, they typically require significant changes to the original code, constraining code maintenance. Recently, the OpenACC API was proposed […]
Sep, 25
An open source finite-difference time-domain solver for room acoustics using graphics processing units
Wave-based simulation methods have been utilized to numerically estimate wave propagation in domains where low-frequency wave effects dominate the response. Finite-difference time-domain (FDTD) methods are increasingly useful for such problems, but they require massive spatial oversampling to increase the bandwidth of the simulation, which leads to significant computational expense. The advantage of explicit time-stepping […]