
Posts

Feb, 15

Optimizing exact computation of Betweenness Centrality for CUDA

Betweenness centrality (BC) is an important metric in network analysis. This report discusses the exact computation of the betweenness centrality index. BC is an important metric in small-world network analysis, but it is expensive to compute. A new strategy is presented to parallelize the best known serial algorithm for […]
Feb, 15

Accelerator Aware MPI Micro-benchmarking using CUDA, OpenACC and OpenCL

Recently, MPI implementations have been extended to support accelerator devices, namely Intel Many Integrated Core (MIC) and NVIDIA GPUs. This has been accomplished by changes at different levels of the software stack and in the MPI implementations. In order to evaluate the performance and scalability of accelerator-aware MPI libraries, we developed portable micro-benchmarks to identify factors that influence […]
Feb, 14

Effective Multi-Modal Retrieval based on Stacked Auto-Encoders

Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. To solve the problem, a set of mapping functions is learned to project high-dimensional features extracted from data of different media […]
Feb, 14

High-Performance Zonal Histogramming on Large-Scale Geospatial Rasters Using GPUs and GPU-Accelerated Clusters

Hardware accelerators are playing increasingly important roles in achieving desired performance from desktop to cluster computing. While General Purpose computing on Graphics Processing Units (GPGPU) technologies have been widely applied to compute-intensive applications, there is relatively little work on using GPUs and GPU-accelerated clusters for data-intensive computing that typically involves significant irregular data […]
Feb, 14

High-Performance Spatial Query Processing on Big Taxi Trip Data using GPGPUs

City-wide GPS recorded taxi trip data contains rich information for traffic and travel analysis to facilitate transportation planning and urban studies. However, traditional data management techniques are largely incapable of processing big taxi trip data at the scale of hundreds of millions. In this study, we aim at utilizing the General Purpose computing on Graphics […]
Feb, 14

Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned

In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models including OpenCL, POSIX threads and OpenMP and typical optimization strategies like parallelization and vectorization. Since the straightforward porting process of the already existing OpenCL version of the […]
Feb, 14

Multi-Kepler GPU vs. Multi-Intel MIC for spin systems simulations

We present and compare the performance of two many-core architectures, the Nvidia Kepler and the Intel MIC, both in a single system and in a cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the […]
Feb, 12

Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Developing high performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPUs do not fully exploit optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that […]
Feb, 12

Increasing precision of uniform pseudorandom number generators

A general method to produce uniformly distributed pseudorandom numbers with extended precision by combining two pseudorandom numbers with lower precision is proposed. In particular, this method can be used for pseudorandom number generation with extended precision on graphics processing units (GPU), where the performance of single and double precision operations can vary significantly.
Feb, 12

Designing Bit-Reproducible Portable High-Performance Applications

Bit-reproducibility has many advantages in the context of high-performance computing. Besides simplifying the process of debugging and testing the code and making it more accurate, it allows applications to be deployed on heterogeneous systems while maintaining the consistency of the computations. In this work we analyze the basic operations performed by scientific applications and identify the […]
Feb, 12

GROMACS on Hybrid CPU-GPU and CPU-MIC Clusters: Preliminary Porting Experiences, Results and Next Steps

This report introduces a hybrid implementation of the GROMACS application and provides instructions on building and executing it on PRACE prototype platforms with Graphical Processing Units (GPU) and Many Integrated Cores (MIC) accelerator technologies. GROMACS currently employs message-passing MPI parallelism and multi-threading using OpenMP, and contains kernels for non-bonded interactions that are accelerated using the CUDA programming language. […]
Feb, 12

Transparent use of Java objects on the GPU in the JaMP/OpenMP framework

Many computationally intensive applications profit from parallel execution based on multiple CPU cores, data-parallel GPGPU processing, or even several machines, as in clusters. However, changing a program to run in parallel requires considerable effort and is therefore a time-consuming step during development. During the implementation, it is necessary to consider many steps […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
