
Posts

Nov 5

A Dynamic Hash Table for the GPU

We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional per-thread (or per-warp) work assignment and processing. By using this […]
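A minimal CUDA sketch of the warp-cooperative work sharing idea, assuming a hypothetical cooperative operation `process_with_warp()` in place of the paper's actual hash-table probes; names and sizes are illustrative only:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a cooperative hash-table operation: all 32
// lanes of a warp work on a single query together.
__device__ int d_processed;   // zero-initialized device global

__device__ void process_with_warp(int item, int lane) {
    if (lane == 0) atomicAdd(&d_processed, 1);   // placeholder for real work
}

// Warp-cooperative work sharing: instead of each thread processing its own
// query (a divergent pattern), the warp drains its queries one at a time,
// with all lanes cooperating on each.
__global__ void wcws_kernel(const int* queries, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    bool has_work = (tid < n);

    unsigned pending = __ballot_sync(0xffffffffu, has_work);
    while (pending) {
        int src  = __ffs(pending) - 1;   // elect the lowest lane with work
        int item = __shfl_sync(0xffffffffu, has_work ? queries[tid] : 0, src);
        process_with_warp(item, lane);   // the whole warp handles one item
        if (lane == src) has_work = false;
        pending = __ballot_sync(0xffffffffu, has_work);
    }
}

int main() {
    const int n = 1000;
    int* d_q;
    cudaMalloc(&d_q, n * sizeof(int));
    cudaMemset(d_q, 0, n * sizeof(int));
    wcws_kernel<<<(n + 255) / 256, 256>>>(d_q, n);
    int processed = 0;
    cudaMemcpyFromSymbol(&processed, d_processed, sizeof(int));
    printf("processed %d queries\n", processed);   // expect 1000
    cudaFree(d_q);
    return 0;
}
```

Because every lane executes the same number of loop iterations and the election is done with ballots and shuffles, the divergent "each thread follows its own probe path" pattern disappears.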
Nov 5

Data Coherence Analysis and Optimization for Heterogeneous Computing

Although heterogeneous computing has enabled impressive program speed-ups, knowledge of the target device's architecture is still critical to reaping the full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving data and keeping host and device copies coherent may easily eliminate any performance […]
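A minimal CUDA C++ sketch of the two coherence regimes the excerpt alludes to; the kernel and problem size are illustrative, not taken from the paper:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // (a) Explicit coherence: the programmer moves data and pays for every
    // transfer across PCIe in both directions.
    float* h = new float[n]();
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // (b) Unified memory: the runtime migrates pages on demand. The code is
    // simpler, but the coherence traffic still exists and still costs time.
    float* u;
    cudaMallocManaged(&u, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();   // required before the host touches u again
    cudaFree(u);

    delete[] h;
    return 0;
}
```

The traffic exists in both versions; unified memory only changes who issues it, which is why analyzing which transfers are actually needed can pay off.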
Nov 5

ChainerMN: Scalable Distributed Deep Learning Framework

One of the keys to deep learning's breakthroughs in various fields has been the use of the high computing power of GPUs. Harnessing even more computing power through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the […]
Nov 5

Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems

Deep and shallow convection calculations account for a significant share of the runtime of atmosphere models. These calculations also exhibit significant load imbalance due to varying cloud cover over different regions of the grid. In this work, we accelerate these calculations on Intel Xeon Phi Coprocessor systems. By employing dynamic scheduling in OpenMP, we demonstrate large reductions in load imbalance […]
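A minimal C++/OpenMP sketch of that scheduling change, with a hypothetical `convection_cost()` standing in for the per-column physics:

```cpp
#include <cmath>
#include <omp.h>

// Hypothetical per-column work whose cost varies (mimicking uneven cloud cover).
double convection_cost(int col) {
    double s = 0.0;
    for (int k = 0; k < 1000 * (1 + col % 7); ++k) s += std::sin((double)k);
    return s;
}

int main() {
    const int ncols = 10000;
    double total = 0.0;
    // Idle threads grab the next 16 columns instead of a fixed block.
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int col = 0; col < ncols; ++col)
        total += convection_cost(col);
    return total > 0.0 ? 0 : 1;
}
```

With `schedule(static)` each thread receives a fixed block of columns up front, so a thread that draws many cloudy (expensive) columns becomes the straggler; `schedule(dynamic, 16)` lets threads that finish early keep pulling work.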
Oct 31

PCIeHLS: an OpenCL HLS framework

One of the goals of high-level synthesis (HLS) is to make designing hardware accelerators that run on FPGAs accessible to developers from a software background (typically those with little grounding in hardware design). While high-level synthesis generates accelerator kernels, it generally does not assist with integrating the generated kernels into a system. In […]
Oct 31

Automatic Scan Parallelization in OpenMP

Prefix scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of all elements of the original vector up to and including the corresponding position. Scan is a key operation in many relevant problems such as sorting, lexical […]
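Concretely, for input (3, 1, 4, 1, 5) the inclusive scan is (3, 4, 8, 9, 14). OpenMP 5.0 expresses the pattern directly with the `scan` directive, as in this C++ sketch (whether the paper's automatic transformation emits this form is an assumption):

```cpp
#include <cstdio>

int main() {
    const int n = 8;
    int a[n] = {3, 1, 4, 1, 5, 9, 2, 6};
    int b[n];

    int sum = 0;
    // OpenMP 5.0 inclusive scan: the statement before the scan directive
    // feeds the reduction, the statement after reads the running prefix.
    #pragma omp parallel for reduction(inscan, + : sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i];
        #pragma omp scan inclusive(sum)
        b[i] = sum;                 // b[i] = a[0] + a[1] + ... + a[i]
    }

    for (int i = 0; i < n; ++i) printf("%d ", b[i]);  // 3 4 8 9 14 23 25 31
    printf("\n");
    return 0;
}
```

This needs a compiler with OpenMP 5.0 support (e.g. GCC 10 or newer with -fopenmp).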
Oct 31

An efficient GPU algorithm for tetrahedron-based Brillouin-zone integration

We report an efficient algorithm for calculating momentum-space integrals in solid state systems on modern graphics processing units (GPUs). We extend the tetrahedron method by Blöchl et al. to the more general case of the integration of a momentum- as well as energy-dependent quantity and implement the algorithm using the CUDA programming framework. We […]
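For context, the tetrahedron method targets Brillouin-zone integrals of this general shape; the energy-dependent integrand F(k, ε) reflects the stated extension (the paper's exact formulation is not reproduced here):

```latex
I(\varepsilon) = \frac{1}{V_{\mathrm{BZ}}}
  \int_{\mathrm{BZ}} \mathrm{d}^3 k \;
  F(\mathbf{k}, \varepsilon)\,
  \delta\bigl(\varepsilon - \varepsilon(\mathbf{k})\bigr)
```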
Oct 31

On Pre-Trained Image Features and Synthetic Images for Deep Learning

Deep learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images to train object detectors is therefore very attractive, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this […]
Oct 31

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

For computational fluid dynamics (CFD) applications with large numbers of grid points/cells, parallel computing is a common and efficient strategy for reducing computation time. How to achieve the best performance on a modern supercomputer system, especially one with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) co-processors, is […]
Oct 29

A Study of Time and Energy Efficient Algorithms for Parallel and Heterogeneous Computing

This PhD project is motivated by the need to achieve better, more energy-efficient computing through the use of parallelism and heterogeneous systems. Our contribution consists of theoretical aspects as well as in-depth and comprehensive empirical studies that aim to provide more insight into parallel and heterogeneous computing. Our first problem is […]
Oct 29

Early Results of Deep Learning on the Stampede2 Supercomputer

We present early results of the deep learning work on the Stampede2 supercomputer. Our goal is to enable scalable and efficient deep learning model training and serving to expedite scientific discovery. We build three popular deep learning frameworks, namely, IntelCaffe, MXNet, and TensorFlow. With the built-in applications of these frameworks (CaffeNet, AlexNet, GoogLeNet, and Cifar10), […]
Oct 29

Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model

In this work we use the GPU porting task for the operational Japanese weather prediction model "ASUCA" as an opportunity to examine productivity issues with OpenACC when applied to structured grid problems. We then propose "Hybrid Fortran", an approach that combines the advantages of directive-based methods (no rewrite of existing code necessary) with that […]
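For reference, the directive-based style being evaluated looks like this OpenACC sketch (shown in C for brevity; Hybrid Fortran itself targets Fortran sources):

```cpp
#include <stdio.h>

#define NX 512
#define NY 512

// A stencil loop nest ported by annotation alone: the loop body is left
// untouched, and the pragma moves execution and data to the GPU.
static float t[NY][NX], tnew[NY][NX];

int main(void) {
    #pragma acc parallel loop collapse(2) copyin(t) copy(tnew)
    for (int j = 1; j < NY - 1; ++j)
        for (int i = 1; i < NX - 1; ++i)
            tnew[j][i] = 0.25f * (t[j][i - 1] + t[j][i + 1]
                                + t[j - 1][i] + t[j + 1][i]);
    printf("%f\n", tnew[1][1]);
    return 0;
}
```

The appeal is exactly what the excerpt names: no rewrite of the existing loop nest; the productivity issues the authors examine arise when such directives meet more complicated structured-grid code.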

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
