Posts
Oct, 18
Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
Graphics processing units (GPUs) have delivered remarkable performance for a variety of high-performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector (SpMV) computation, which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance […]
Oct, 18
A Tensor Compiler for Unified Machine Learning Prediction Serving
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure—the bespoke solutions typical in large web companies are simply untenable. Model scoring, the process of obtaining predictions from a trained model over new data, is a primary contributor to infrastructure complexity and cost as models are trained once but used many […]
Oct, 18
On the performance of a highly-scalable Computational Fluid Dynamics code on AMD, ARM and Intel processors
No area of computing is hungrier for performance than High Performance Computing (HPC), the demands of which continue to be a major driver for processor performance and adoption of accelerators, and also advances in memory, storage, and networking technologies. A key feature of the Intel processor domination of the past decade has been the extensive […]
Oct, 11
Deep Learning for Digital Asset Limit Order Books
This paper shows that temporal CNNs accurately predict bitcoin spot price movements from limit order book data. On a 2-second prediction horizon we achieve 71% walk-forward accuracy on the popular cryptocurrency exchange Coinbase. Our model can be trained in less than a day on commodity GPUs, which could be installed into colocation centers […]
Oct, 11
Bempp-cl: A fast Python based just-in-time compiling boundary element library
The boundary element method (BEM) is a numerical method for approximating the solution of certain types of partial differential equations (PDEs) in homogeneous bounded or unbounded domains. The method finds the approximation by discretising a boundary integral equation that can be derived from the PDE. The mathematical background of BEM is covered in, for example, […]
Oct, 11
Mastering Atari with Discrete World Models
Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an […]
Oct, 11
It’s all about data movement: Optimising FPGA data access to boost performance
The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance […]
Oct, 11
Efficient Inference For Neural Machine Translation
Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that a combination of replacing decoder self-attention […]
Oct, 4
Transparent Acceleration of Java-based Deep Learning Engines
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has increased the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. In recent years, hardware accelerators have been employed as a […]
Oct, 4
An OpenCL 3D FFT for Molecular Dynamics Simulations on Multiple FPGAs
3D FFTs are used to accelerate MD electrostatic forces computations but are difficult to parallelize due to communications requirements. We present a distributed OpenCL 3D FFT implementation on Intel Stratix 10 FPGAs for grids up to 128^3. We use FPGA hardware features such as HBM2 memory and multiple 100 Gbps links to provide scalable memory […]
Oct, 4
LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems
Computationally intensive applications usually consist of multiple nested or flattened loops. These loops are the main building blocks of the applications and embody a specific type of execution pattern. In order to reduce the running time of the loops, developers analyze the loops in the code and try to parallelize them on the target […]
Oct, 4
Flexible Performant GEMM Kernels on GPUs
General Matrix Multiplication (GEMM) kernels take center stage in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA’s Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming, which implies low programmer productivity, or using libraries that only offer a limited set of […]