high performance computing on graphics processing units: hgpu.org

Posts

Jun, 24

Research on the simulation of PF-LBM model based on MPI+CUDA mixed granularity parallel

A microstructure numerical model is an intensive computational problem, for which the simulation time is too long and the simulation scale is too small. To solve these two problems, in this article, we use MPI+CUDA hybrid particle heterogeneous parallel computing to implement the dendrite growth simulation of a PF-LBM phase-field 3D model. Message Passing Interface […]

CUDA

Jun, 24

Go game move prediction using convolutional neural network

The purpose of this paper is to introduce the use of convolutional neural network for prediction of the next appropriate move in the Go game. The paper contains description of the crucial Go game rules, neural networks theory, description of implemented programs and final evaluation of the trained neural networks. The programs were implemented with […]

CUDA

Jun, 24

Synthesis of GPU Programs from High-Level Models

Modern graphics processing units (GPUs) provide high-performance general purpose computation abilities. They have massive parallel architectures that are suitable for executing parallel algorithms and operations. They are also throughput-oriented devices that are optimized to achieve high throughput for stream processing. Designing efficient GPU programs is a notoriously difficult task. The ForSyDe methodology is suitable to […]

OpenCL

Jun, 24

Strategies for the Heterogeneous Execution of Large-Scale Simulations on Hybrid Supercomputers

Massively-parallel devices of various architectures are being adopted by the newest supercomputers to overcome the actual power constraint in the context of the exascale challenge. This progress leads to an increasing hybridisation of HPC systems and makes the design of computing applications a rather complex problem. Therefore, the software efficiency and portability are of crucial […]

OpenCL

Jun, 24

DeepSmith: Compiler Fuzzing through Deep Learning

Random program generation – fuzzing – is an effective technique for discovering bugs in compilers but successful fuzzers require extensive development effort for every language supported by the compiler, and often leave parts of the language space untested. We introduce DeepSmith, a novel machine learning approach to accelerating compiler validation through the inference of generative […]

OpenCL

Jun, 22

The 3rd International Conference on Computer Systems and Communication Technology (ICCSCT), 2018

It will bring together the researchers to exchange their research results and address open issues in: Computer Systems and Engineering Artificial Intelligence Computer Modeling Internet of Things (IoT) Decision Support System and Models Software Engineering Computer Graphics and Multimedia Image Processing System software support for multimedia Communication Technology Network and Wireless Communication VLSI Circuits and […]

Jun, 22

The 4th International Conference on Communication and Information Processing (ICCIP), 2018

Keynote speakers: Prof. Jalel Ben-Othman – University of Paris 13, France. Prof. Herwig Unger – University of Hagen, Germany. Published by: The accepted paper of ICCIP 2018 will be published into ICCIP 2018 Conference Proceedings, which will be published in the International Conference Proceedings Series by ACM and archived in the ACM Digital Library. The […]

Jun, 20

Energy Efficient Computing on Multi-core Processors: Vectorization and Compression Techniques

Over the past few years, energy consumption has become the main limiting factor for computing in general. This has led CPU vendors to aggressively promote parallel computing using multiple cores without significantly increasing the thermal design power of the processor. However, achieving maximum performance and energy efficiency from the available resources on the multi-core and […]

Jun, 20

RAPIDNN: In-Memory Deep Neural Network Acceleration Framework

Deep neural networks (DNN) have demonstrated effectiveness for various applications such as image processing, video segmentation, and speech recognition. Running state-of-the-art DNNs on current systems mostly relies on either general purpose processors, ASIC designs, or FPGA accelerators, all of which suffer from data movements due to the limited on chip memory and data transfer bandwidth. […]

OpenCL

Jun, 20

Comparing Two Generations of Embedded GPUs Running a Feature Detection Algorithm

Graphics processing units (GPUs) in embedded mobile platforms are reaching performance levels where they may be useful for computer vision applications. We compare two generations of embedded GPUs for mobile devices when running a state-of-the-art feature detection algorithm, i.e., Harris-Hessian/FREAK. We compare architectural differences, execution time, temperature, and frequency on Sony Xperia Z3 and Sony […]

OpenCL

Jun, 20

AVX-512 extension to OpenQCD 1.6

We publish an extension of openQCD-1.6 with AVX-512 vector instructions using Intel intrinsics. Recent Intel processors support extended instruction sets with operations on 512-bit wide vectors, increasing both the capacity for floating point operations and register memory. Optimal use of the new capabilities requires reorganising data and floating point operations into these wider vector units. […]

Jun, 20

Neural Code Comprehension: A Learnable Representation of Code Semantics

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program […]

CUDA

•

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

Research on the simulation of PF-LBM model based on MPI+CUDA mixed granularity parallel

Go game move prediction using convolutional neural network

Synthesis of GPU Programs from High-Level Models

Strategies for the Heterogeneous Execution of Large-Scale Simulations on Hybrid Supercomputers

DeepSmith: Compiler Fuzzing through Deep Learning

The 3rd International Conference on Computer Systems and Communication Technology (ICCSCT), 2018

The 4th International Conference on Communication and Information Processing (ICCIP), 2018

Energy Efficient Computing on Multi-core Processors: Vectorization and Compression Techniques

RAPIDNN: In-Memory Deep Neural Network Acceleration Framework

Comparing Two Generations of Embedded GPUs Running a Feature Detection Algorithm

AVX-512 extension to OpenQCD 1.6

Neural Code Comprehension: A Learnable Representation of Code Semantics

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)