
Posts

Apr, 15

A Domain Specific Language for Performance Portable Molecular Dynamics Algorithms

Developers of Molecular Dynamics (MD) codes face significant challenges when adapting existing simulation packages to new hardware. In a continuously diversifying hardware landscape it becomes increasingly difficult for scientists to be both experts in their own domain (physics/chemistry/biology) and specialists in the low-level parallelisation and optimisation of their codes. To address this challenge, we […]
Apr, 15

Parallelized Kendall’s Tau Coefficient Computation via SIMD Vectorized Sorting On Many-Integrated-Core Processors

Measuring pairwise association is an important operation in data analytics. Kendall’s tau coefficient is a widely used correlation coefficient for identifying non-linear relationships between ordinal variables. In this paper, we investigated a parallel algorithm accelerating all-pairs Kendall’s tau coefficient computation via single instruction multiple data (SIMD) vectorized sorting on Intel Xeon Phis by taking advantage of […]
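
For readers unfamiliar with the coefficient, here is a minimal sketch of Kendall's tau for one pair of variables. It is the naive O(n²) pair-counting definition (the tau-a variant, which does not correct for ties), not the SIMD sorting-based method the paper describes; the function name `kendall_tau_a` is chosen only for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Naive O(n^2) Kendall's tau-a: (concordant - discordant) / total pairs.

    Illustrative only: tied pairs count as neither concordant nor
    discordant, and no tie correction is applied.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered
        # the same way in both variables.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs
```

An all-pairs computation over m variables would call such a function for each of the m(m-1)/2 variable pairs, which is the workload that makes a faster sorting-based formulation attractive.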
Apr, 11

Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units

Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Poisson-Boltzmann equation (PBE)-based models have become widely used for modeling these important processes. Though great efforts have been put into developing efficient PBE numerical models, challenges remain due to the high dimensionality of typical biomolecular systems. In this […]
Apr, 11

Machine Learning from Streaming Data in Heterogeneous Computing Environments

With the advent of many-core general-purpose processors (CPUs), the use of an increased number of cores has provided a certain speedup for algorithms that can be parallelized. Nowadays, there are distributed and parallel data processing platforms, such as Apache Flink, which inherently make use of parallel computing. On the other hand, graphics processors (GPUs) offer high […]
Apr, 11

Performance and energy optimization of the iterative solution of sparse linear systems on multicore processors

Large sparse systems of linear equations are ubiquitous in diverse scientific and engineering applications and in big-data analytics. The interest of these applications, and the fact that solving the linear system is usually a significantly time-consuming stage, have promoted the design and high-performance implementation of numerous matrix storage formats, algorithms, and libraries to […]
Apr, 11

A modular GPU raytracer using OpenCL for non-interactive graphics

We describe the development of a modular, plugin-based raytracing renderer called RenderGirl, suitable for running inside the OpenCL framework. We aim to take advantage of heterogeneous computing devices such as GPUs and many-core CPUs, focusing on parallelism. We implemented the traditional partitioning scheme called bounding volume hierarchies, where each scene is hierarchically subdivided into […]
Apr, 11

Vectorization of Hybrid Breadth First Search on the Intel Xeon Phi

The Breadth-First Search (BFS) algorithm is an important building block for graph analysis of large datasets. The BFS parallelisation has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. We investigate the optimisation and vectorisation of the hybrid BFS (a […]
Apr, 7

Design Exploration of AES Accelerators on FPGAs and GPUs

Embedded systems are increasingly becoming a key technological component of all kinds of complex technical systems, and an exhaustive analysis of the current state of the art in performance with respect to architectures, design methodologies, test and applications could be very interesting. The Advanced Encryption Standard (AES), based on the well-known algorithm Rijndael, […]
Apr, 7

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), that has been deployed in datacenters since 2015 and accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers […]
Apr, 7

Cascaded Segmentation-Detection Networks for Word-Level Text Spotting

We introduce an algorithm for word-level text spotting that is able to accurately and reliably determine the bounding regions of individual words of text "in the wild". Our system is formed by the cascade of two convolutional neural networks. The first network is fully convolutional and is in charge of detecting areas containing text. This […]
Apr, 7

Load Balancing for Constraint Solving with GPUs

Solving a complex Constraint Satisfaction Problem (CSP) is a computationally hard task which may require a considerable amount of time. Parallelism has been applied successfully to the job and there are already many applications capable of harnessing the parallel power of modern CPUs to speed up the solving process. Current Graphics Processing Units (GPUs), containing […]
Apr, 7

Optimizing Communication by Compression for Multi-GPU Scalable Breadth-First Searches

The Breadth First Search (BFS) algorithm is the foundation and building block of many higher-level graph-based operations such as spanning trees, shortest paths and betweenness centrality. The importance of this algorithm increases each day because it is a key requirement for many data structures that are becoming popular. When the BFS algorithm is […]

HGPU group © 2010-2026 hgpu.org

All rights belong to the respective authors
