23849

Posts

Oct, 25

FlowPM: Distributed TensorFlow Implementation of the FastPM Cosmological N-body Solver

We present FlowPM, a Particle-Mesh (PM) cosmological N-body code implemented in Mesh-TensorFlow for GPU-accelerated, distributed, and differentiable simulations. We implement and validate the accuracy of a novel multi-grid scheme based on multiresolution pyramids to compute large scale forces efficiently on distributed platforms. We explore the scaling of the simulation on large-scale supercomputers and compare it […]
Oct, 18

When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is […]
Oct, 18

Portable high-order finite element kernels I: Streaming Operations

This paper is devoted to the development of highly efficient kernels performing vector operations relevant in linear system solvers. In particular, we focus on the low arithmetic intensity operations (i.e., streaming operations) performed within the conjugate gradient iterative method, using the parameters specified in the CEED benchmark problems for high-order hexahedral finite elements. We propose […]
Oct, 18

Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)

Graphics processing units (GPUs) have delivered a remarkable performance for a variety of high performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector (SpMV) computations, which is central to many scientific, engineering, and other applications including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance […]
Oct, 18

A Tensor Compiler for Unified Machine Learning Prediction Serving

Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure—the bespoke solutions typical in large web companies are simply untenable. Model scoring, the process of obtaining predictions from a trained model over new data, is a primary contributor to infrastructure complexity and cost as models are trained once but used many […]
Oct, 18

On the performance of a highly-scalable Computational Fluid Dynamics code on AMD, ARM and Intel processors

No area of computing is hungrier for performance than High Performance Computing (HPC), the demands of which continue to be a major driver for processor performance and adoption of accelerators, and also advances in memory, storage, and networking technologies. A key feature of the Intel processor domination of the past decade has been the extensive […]
Oct, 11

Deep Learning for Digital Asset Limit Order Books

This paper shows that temporal CNNs accurately predict bitcoin spot price movements from limit order book data. On a 2 second prediction time horizon we achieve 71% walk-forward accuracy on the popular cryptocurrency exchange coinbase. Our model can be trained in less than a day on commodity GPUs which could be installed into colocation centers […]
Oct, 11

Bempp-cl: A fast Python based just-in-time compiling boundary element library

The boundary element method (BEM) is a numerical method for approximating the solution of certain types of partial differential equations (PDEs) in homogeneous bounded or unbounded domains. The method finds the approximation by discretising a boundary integral equation that can be derived from the PDE. The mathematical background of BEM is covered in, for example, […]
Oct, 11

Mastering Atari with Discrete World Models

Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an […]
Oct, 11

It’s all about data movement: Optimising FPGA data access to boost performance

The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advanced in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance […]
Oct, 11

Efficient Inference For Neural Machine Translation

Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that combination of replacing decoder self-attention […]
Oct, 4

Transparent Acceleration of Java-based Deep Learning Engines

The advent of modern cloud services, along with the huge volume of data produced on a daily basis, have increased the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. In recent years, hardware accelerators have been employed as a […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us: