high performance computing on graphics processing units: hgpu.org

Posts

Sep, 21

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks […]

CUDA

Sep, 16

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Cloud environments today increasingly feature hybrid nodes containing multicore CPUs and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to ease the migration of HPC workloads to them. While virtualization of accelerators in clouds is a leading research challenge, we address […]

CUDA • OpenCL

Sep, 16

Monte Carlo methods for massively parallel computers

Applications that require substantial computational resources today cannot avoid the use of heavily parallel machines. Embracing the opportunities of parallel computing and especially the possibilities provided by a new generation of massively parallel accelerator devices such as GPUs, Intel’s Xeon Phi or even FPGAs enables applications and studies that are inaccessible to serial programs. Here […]

CUDA

Sep, 16

Meta Networks for Neural Style Transfer

In this paper we propose a new method to get the specified network parameters through one time feed-forward propagation of the meta networks and explore the application to neural style transfer. Recent works on style transfer typically need to train image transformation networks for every new style, and the style is encoded in the network […]

CUDA

Sep, 16

Empower Sequence Labeling with Task-Aware Neural Language Model

Linguistic sequence labeling is a general modeling approach that encompasses a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted features. However, in many cases, it is hard to obtain sufficient annotations to train these models. In this […]

Sep, 16

End-to-end Deep Learning of Optimization Heuristics

Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge […]

OpenCL

Sep, 12

GPU-Accelerated Parallel Finite-Difference Time-Domain Method for Electromagnetic Waves Propagation in Unmagnetized Plasma Media

The finite-difference time-domain (FDTD) method has been commonly used for the numerical solution of electromagnetic (EM) wave propagation through plasma media. However, the FDTD method may incur significantly longer run times for computationally large and complicated EM problems. Graphics Processing Unit (GPU) computing based on Compute Unified Device Architecture (CUDA) […]

CUDA

Sep, 12

Sorting with GPUs: A Survey

Sorting is a fundamental operation in computer science and is a bottleneck in many important fields. Sorting is critical to database applications, online search and indexing, biomedical computing, and many other applications. The explosive growth in computational power and availability of GPU coprocessors has allowed sort operations on GPUs to be done much faster than any […]

CUDA

Sep, 12

Optimization of the Brillouin operator on the KNL architecture

Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with N_c=3 colors, N_v=12 right-hand-sides, N_{thr}=256 threads, on lattices of size 32^3*64, using exclusively OMP […]

Sep, 12

Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions

In this report, some cosmological correlation functions are used to evaluate the differential performance between C2075 and P100 GPU cards. In the past, the correlation functions used in this work have been widely studied and exploited on some previous GPU architectures. The analysis of the performance indicates that a speedup in the range from 13 […]

CUDA

Sep, 12

A Comparative Study of 2D Numerical Methods with GPU Computing

Graphics Processing Unit (GPU) computing is becoming an alternate computing platform for numerical simulations. However, it is not clear which numerical scheme will provide the highest computational efficiency for different types of problems. To this end, numerical accuracies and computational work of several numerical methods are compared using a GPU computing implementation. The Correction Procedure […]

CUDA

Sep, 10

The 2nd International Conference on Machine Learning and Soft Computing (ICMLSC), 2018

ICMLSC 2018, The 2nd International Conference on Machine Learning and Soft Computing, will take place in Phu Quoc Island, Vietnam, from February 2-4, 2018. ICMLSC 2018 is co-organized by the University of Science, Vietnam and Industrial University of Ho Chi Minh City. ICMLSC 2018 is a not-to-be-missed opportunity that distills the most current knowledge on […]

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)