Views of posts on hgpu.org

Generating GPU Compiler Heuristics using Reinforcement Learning  1,028 views

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations  1,025 views

FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis  1,025 views

Lightning: Scaling the GPU Programming Model Beyond a Single GPU  1,024 views

Optimization of Heterogeneous Parallel Computing Systems using Machine Learning  1,023 views

From English To Foreign Languages: Transferring Pre-trained Language Models  1,022 views

Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA  1,019 views

Improving the Performance, Portability, and Productivity of Hardware Accelerators  1,013 views

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library  1,008 views

Simulating flows of incompressible and weakly compressible fluids on multicore hybrid computer systems  1,008 views

StreamBlocks: A compiler for heterogeneous dataflow computing  1,007 views

Mixed precision in Graphics Processing Unit  1,003 views

Apple Silicon Performance in Scientific Computing  1,002 views

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs  1,001 views

Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms  999 views

ScaleHLS: Scalable High-Level Synthesis through MLIR  998 views

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors  997 views

Novel Computing Architectures  996 views

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product  994 views

Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System  993 views

LS-CAT: A Large-Scale CUDA AutoTuning Dataset  992 views

CUDA implementation of Wagener’s 2D convex hull PRAM algorithm  991 views

Improving performance for emergent environments parameter tuning and simulation in games using GPU  988 views

CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs  987 views

Optimal program variant generation for hybrid manycore systems  987 views

GPU-based JSON data processing using structural indexes  983 views

Character-level Transformer-based Neural Machine Translation  982 views

Advanced Joins on GPUs  978 views

GPTPU: Accelerating Applications using Edge Tensor Processing Units  978 views

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems  976 views

INSTA-YOLO: Real-Time Instance Segmentation  974 views

It’s all about data movement: Optimising FPGA data access to boost performance  973 views

Productivity, Portability, Performance: Data-Centric Python  972 views

TorchBench: Benchmarking PyTorch with High API Surface Coverage  970 views

A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks  962 views

Integrating Accelerators in Heterogeneous Systems  961 views

Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators  959 views

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities  957 views

Performance prediction of deep learning applications training in GPU as a service systems  956 views

Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling  956 views

Block Conjugate Gradient Solver in OpenCL  954 views

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration  954 views

Parallel Approaches for SWAMP Sequence Alignment  951 views

Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout  950 views

General purpose lattice QCD code set Bridge++ 2.0 for high performance computing  949 views

Implementation of Parallel Simplified Swarm Optimization in CUDA  945 views

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems  944 views

Open SYCL on heterogeneous GPU systems: A case of study  943 views

Artificial Intelligence in Electric Machine Drives: Advances and Trends  942 views

Measurement and Analysis of GPU-accelerated Applications with HPCToolkit  940 views

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs  940 views

One-shot tuner for deep learning compilers  938 views

On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors  936 views

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA  933 views

Heuristic Adaptability to Input Dynamics for SpMM on GPUs  928 views

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores  925 views

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments  922 views

Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments  921 views

Joint Forces: From Multithreaded Programming to GPU Computing  920 views

Dynamic Adaptation Techniques and Opportunities to Improve HPC Runtimes  920 views

NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics  919 views

Better GPU Hash Tables  918 views

Fancier: A Unified Framework for Java, C, and OpenCL Integration  917 views

A method for decompilation of AMD GCN kernels to OpenCL  916 views

94% on CIFAR-10 in 3.29 Seconds on a Single GPU  913 views

Extending SYCL’s Programming Paradigm with Tensor-based SIMD Abstractions  908 views

EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems  905 views

Exploring the acceleration of Nekbone on reconfigurable architectures  905 views

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale  904 views

Managing Extreme Heterogeneity in Next Generation HPC Systems  902 views

BAT: A Benchmark suite for AutoTuners  899 views

A ML-based resource utilization OpenCL GPU-kernel fusion model  895 views

NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems  894 views

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language  893 views

Performance Portability and Evaluation of Heterogeneous Components of SeisSol Targeted to Upcoming Intel HPC GPUs  893 views

An Auto-Programming Approach to Vulkan  891 views

CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research  891 views

Optimization of GPU workloads using natural language processing based on deep learning techniques  890 views

Efficient heterogeneous matrix profile on a CPU + High Performance FPGA with integrated HBM  887 views

AnySeq/GPU: A Novel Approach for Faster Sequence Alignment on GPUs  882 views

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning  882 views

User’s needs influencing HPC technologies  877 views

Research and Development of Porting SYCL on QNX Operating System for High Parallelism  876 views

Data-Oriented Language Implementation of Lattice-Boltzmann Method for Dense and Sparse Geometries  875 views

Predictive Data Race Detection for GPUs  873 views

Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment  873 views

Large-eddy simulations with ClimateMachine: a new open-source code for atmospheric simulations on GPUs and CPUs  872 views

Concurrent CPU-GPU Task Programming using Modern C++  869 views

Efficacy of Images Versus Data Buffers: Optimizing Interactive Applications Utilizing OpenCL for Scientific Visualization  867 views

PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations  865 views

On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats  865 views

Programming Heterogeneous Systems with General and Domain-Specific Frameworks  865 views

Towards a Benchmarking Suite for Kernel Tuners  862 views

Enabling On-Device Smartphone GPU based Training: Lessons Learned  861 views

Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures  859 views

TCUDB: Accelerating Database with Tensor Processors  858 views

DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware  858 views

Data transfer optimizations for heterogeneous managed runtime systems  858 views

Performance assessment of CUDA and OpenACC in large scale combustion simulations  855 views

LeXInt: GPU-accelerated Exponential Integrators package  854 views

 

Brief statistics for this page

Titles: 100

Total views: 93768

 

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors