
Posts

Jun, 11

minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping

minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. […]
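
For orientation, the chaining stage being offloaded here is a dynamic-programming pass over seed anchors. The sketch below is a deliberately simplified, hypothetical version of that pass (toy gap cost, not minimap2's actual scoring function); it only illustrates the pairwise anchor scan whose regular structure makes chaining a natural fit for an FPGA kernel.

```python
# Simplified anchor-chaining DP: anchors are (reference_pos, query_pos) seed matches,
# and the pass scores the best colinear chain ending at each anchor. The scoring is a
# toy linear gap penalty, not minimap2's cost model.

def chain_anchors(anchors, match_score=15, max_gap=5000):
    """anchors: list of (ref_pos, query_pos), sorted by ref_pos."""
    n = len(anchors)
    score = [match_score] * n      # best chain score ending at anchor i
    parent = [-1] * n              # predecessor for traceback
    for i in range(n):
        ri, qi = anchors[i]
        for j in range(i):         # this inner scan is what a hardware kernel parallelises
            rj, qj = anchors[j]
            dr, dq = ri - rj, qi - qj
            if dr <= 0 or dq <= 0 or max(dr, dq) > max_gap:
                continue           # anchors must be colinear and close enough
            gap = abs(dr - dq)
            s = score[j] + match_score - gap
            if s > score[i]:
                score[i], parent[i] = s, j
    return score, parent

# Example: three roughly colinear seed hits
print(chain_anchors([(100, 10), (160, 70), (230, 140)]))
```
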
Jun, 11

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), which often requires binary128 or higher-precision arithmetic to be solved stably. However, only some processors […]
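
To make the precision requirement concrete, the hedged sketch below emulates binary128-like arithmetic in software with mpmath (113-bit significand) around a naive GEMM loop; it is purely illustrative and unrelated to the paper's FPGA design.

```python
# Naive GEMM at binary128-like precision in software. Every multiply-add goes through
# mpmath, which is why quad-precision GEMM is slow on CPUs and attractive to offload.

from mpmath import mp, mpf

mp.prec = 113  # significand bits of IEEE 754 binary128

def gemm(A, B):
    """C = A * B for lists of lists of mpf values."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[mpf(0) for _ in range(m)] for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = mpf(0)
            for p in range(k):
                acc += A[i][p] * B[p][j]   # software multiply-add at 113-bit precision
            C[i][j] = acc
    return C

A = [[mpf(1) / 3, mpf(2)], [mpf(3), mpf(4)]]
B = [[mpf(5), mpf(6)], [mpf(7), mpf(8)]]
print(gemm(A, B)[0][0])   # each entry carries a 113-bit significand
```
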
Jun, 4

Hybrid CPU/GPU/APU accelerated query, insert, update and erase operations in hash tables with string keys

Modern computer systems can use different types of hardware acceleration to achieve massive performance improvements. Some accelerators, such as FPGAs and dedicated GPUs (dGPUs), need optimized data structures for the best performance and often use dedicated memory. In contrast, APUs, which combine a CPU and an integrated GPU (iGPU), support shared memory and […]
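
As a point of reference for the four operations named in the title, here is a plain CPU sketch of an open-addressing hash table with string keys; the names and layout are illustrative only, and the paper's contribution, running these operations on dGPU/iGPU/APU hardware with suitable memory layouts, is not reproduced here.

```python
# Minimal open-addressing hash table with string keys: query, insert, update, erase.

class StringHashTable:
    _TOMBSTONE = object()          # marks erased slots so probing can continue past them

    def __init__(self, capacity=1024):
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.capacity = capacity

    def _probe(self, key):
        i = hash(key) % self.capacity
        while True:
            yield i
            i = (i + 1) % self.capacity   # linear probing

    def insert(self, key, value):          # also serves as update
        for i in self._probe(key):
            if self.keys[i] is None or self.keys[i] is self._TOMBSTONE or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return

    def query(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return None                # reached an empty slot: key absent
            if self.keys[i] == key:
                return self.values[i]

    def erase(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return False
            if self.keys[i] == key:
                self.keys[i] = self._TOMBSTONE
                self.values[i] = None
                return True

t = StringHashTable()
t.insert("gene", 1); t.insert("gene", 2)                  # insert, then update
print(t.query("gene"), t.erase("gene"), t.query("gene"))  # 2 True None
```
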
Jun, 4

A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models

One of the main concerns in precision agriculture (PA) is the growth of weeds within a crop field. Currently, to prevent the spread of weeds, automatic techniques and computational tools are used to help identify, classify, and detect the different types of weeds found in agricultural fields. One of the technologies that can help […]
Jun, 4

GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA

We show that numerical computations based on tensor renormalization group (TRG) methods can be significantly accelerated with PyTorch on graphics processing units (GPUs) by leveraging NVIDIA’s Compute Unified Device Architecture (CUDA). We find improvement in the runtime and its scaling with bond dimension for two-dimensional systems. Our results establish that the utilization of GPU resources […]
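
The hedged sketch below shows the kind of tensor contraction plus truncated SVD that dominates TRG runtime, written with PyTorch so that the same code runs on the CPU or on an NVIDIA GPU via CUDA; the shapes and bond dimension are illustrative, not taken from the paper.

```python
# Core TRG-style step: contract two rank-4 site tensors, then truncate with an SVD.
# PyTorch dispatches both operations to CUDA when a GPU is available.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
D = 32                                         # bond dimension (illustrative)

T = torch.randn(D, D, D, D, device=device)     # rank-4 site tensor T[u, l, d, r]

# Contract two site tensors over a pair of shared bonds: the O(D^6) hot spot.
M = torch.einsum("ulxy,xydr->uldr", T, T)

# Reshape to a matrix and truncate via SVD to keep the bond dimension bounded.
U, S, Vh = torch.linalg.svd(M.reshape(D * D, D * D), full_matrices=False)
T_new = (U[:, :D] * S[:D]) @ Vh[:D, :]         # rank-D approximation of M

print(device, T_new.shape)
```
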
Jun, 4

Compiler Technologies in Deep Learning Co-Design: A Survey

With the rapid development of deep learning applications, general-purpose processors no longer suffice for deep learning workloads as Moore’s Law comes to an end. Thus, computer architecture innovation has entered a golden age for domain-specific design, which has led to a demand for new compilation technologies to facilitate cross-layer optimization. Historically, hardware and software have […]
Jun, 4

Implementation Techniques for SPMD Kernels on CPUs

More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier. This work provides […]
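
One widely used technique in this space (not necessarily the one the paper settles on) is loop fission: split the kernel at each barrier and sweep every work-item through each region in turn, so no work-item ever blocks waiting for another. The sketch below illustrates the idea with a hypothetical two-region kernel; `local_mem` and the region functions are illustrative names only.

```python
# Loop-fission view of a work-group barrier on a CPU: instead of one thread per
# work-item blocking at the barrier, run region 1 for all work-items, then region 2.

def kernel_region_1(item, local_mem, data):
    local_mem[item] = data[item] * 2          # work before the barrier

def kernel_region_2(item, local_mem, out):
    # After the barrier, every work-item may safely read its neighbour's result.
    out[item] = local_mem[item] + local_mem[(item + 1) % len(local_mem)]

def run_work_group(group_size, data):
    local_mem = [0] * group_size
    out = [0] * group_size
    for item in range(group_size):            # region 1 for ALL work-items ...
        kernel_region_1(item, local_mem, data)
    # ... the point between the two loops plays the role of the barrier ...
    for item in range(group_size):            # ... then region 2 for ALL work-items
        kernel_region_2(item, local_mem, out)
    return out

print(run_work_group(4, [1, 2, 3, 4]))        # [6, 10, 14, 10]
```
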
May, 28

Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis

Genomic analysis is the study of genes, which includes the identification, measurement, or comparison of genomic features. Genomics research is of great importance to our society because it can be used to detect diseases, create vaccines, and develop drugs and treatments. As general-purpose accelerators with massive parallel processing capability, GPUs have been […]
May, 28

Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study

In recent years, Intel introduced oneAPI as a unified and cross-architecture programming model based on the Data Parallel C++ (DPC++) language, which, in turn, is based on the C++ and SYCL standard languages. In order to facilitate the migration of legacy CUDA code originally written for NVIDIA GPUs, developers can employ the Intel DPC++ Compatibility […]
May, 28

PyTorch Hyperparameter Tuning – A Tutorial for spotPython

The goal of hyperparameter tuning (or hyperparameter optimization) is to find hyperparameter settings that improve the performance of a machine learning or deep learning model. spotPython (“Sequential Parameter Optimization Toolbox in Python”) is the Python version of the well-known hyperparameter tuner SPOT, which has been developed in the R programming environment for statistical analysis for over […]
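
As a reminder of what is being optimized, the hedged sketch below runs a generic random search over a toy PyTorch model. It deliberately avoids guessing spotPython's API, which the excerpt does not show; spotPython would replace the random sampling with sequential model-based search over the same kind of search space.

```python
# Generic hyperparameter tuning loop: sample configurations, train briefly, keep the best.

import random
import torch
import torch.nn as nn

X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()            # toy binary-classification data

def train_and_score(lr, hidden):
    model = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(50):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()                                   # lower is better

best = None
for _ in range(10):                                      # random search over the space
    cfg = {"lr": 10 ** random.uniform(-4, -1), "hidden": random.choice([8, 16, 32, 64])}
    score = train_and_score(**cfg)
    if best is None or score < best[0]:
        best = (score, cfg)
print(best)
```
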
May, 28

Communication-minimizing Asynchronous Tensor Parallelism

As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelizing tensor computations that strives to minimize the idle time incurred due to communication in parallel training of large multi-billion parameter […]
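
The hedged sketch below illustrates only the general idea of hiding communication behind independent computation using PyTorch's asynchronous collectives; it uses a single-process gloo group so it runs anywhere, and it does not reproduce Tensor3D's 3D decomposition or scheduling.

```python
# Overlapping a collective with independent local work: launch the all-reduce
# asynchronously, compute something unrelated, and only wait when the result is needed.

import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)   # single process, for illustration

grad = torch.randn(1024, 1024)                   # stand-in for a gradient shard
work = dist.all_reduce(grad, async_op=True)      # start the collective without blocking

local = torch.randn(1024, 1024)                  # independent local computation ...
partial = local @ local.T                        # ... overlaps with the communication

work.wait()                                      # block only when the reduced tensor is needed
out = partial + grad
print(out.shape)
dist.destroy_process_group()
```
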
May, 28

ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, early exit from deep models, and so on. However, the resulting control flow divergence makes batching, an important performance optimization, difficult to perform manually. In this paper, we present […]
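
To see why such divergence frustrates batching, the hedged sketch below runs a toy layer a data-dependent number of times per sample, then shows the masking a hand-batched version needs; this only illustrates the problem, not ACRoBat's compile-time solution.

```python
# Data-dependent control flow vs. batching: each sample exits after a different
# number of steps, so batching by hand requires masking out finished samples.

import torch
import torch.nn as nn

layer = nn.Linear(8, 8)
threshold = 0.5

def run_one(x):                      # per-sample execution: simple but unbatched
    steps = 0
    while x.norm() > threshold and steps < 10:
        x = torch.tanh(layer(x))     # iteration count depends on the data
        steps += 1
    return x, steps

def run_batched(xs):                 # hand-batched: masking keeps exited samples frozen
    x = torch.stack(xs)
    active = x.norm(dim=1) > threshold
    for _ in range(10):
        if not active.any():
            break
        stepped = torch.tanh(layer(x))
        x = torch.where(active.unsqueeze(1), stepped, x)  # only active samples advance
        active = active & (x.norm(dim=1) > threshold)
    return x

samples = [torch.randn(8) for _ in range(4)]
for i, s in enumerate(samples):
    _, steps = run_one(s)
    print(f"sample {i}: exited after {steps} steps")
print(run_batched(samples).shape)
```
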
