high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » CUDA » NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

A. Lonardo, F. Ameli, R. Ammendola, A. Biagioni, O. Frezza, G. Lamanna, F. Lo Cicero, M. Martinelli, P. S. Paolucci, E. Pastorelli, L. Pontisso, D. Rossetti, F. Simeone, F. Simula, M. Sozzi, L. Tosoratto, P. Vicini

INFN Sezione di Roma, Italy

arXiv:1406.3568 [physics.ins-det], (13 Jun 2014)

@{,

}

Download (PDF)

View

Source

2426

views

While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages. Although GPUs typically show deterministic behaviour in terms of latency in executing computational kernels as soon as data is available in their internal memories, assessment of real-time features of a standard GPGPU system needs careful characterization of all subsystems along data stream path. The networking subsystem results in being the most critical one in terms of absolute value and fluctuations of its response latency. Our envisioned solution to this issue is NaNet, a FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable and extensible set of network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler GPU memories. NaNet design currently supports both standard – GbE (1000BASE-T) and 10GbE (10Base-R) – and custom – 34~Gbps APElink and 2.5~Gbps deterministic latency KM3link – channels, but its modularity allows for a straightforward inclusion of other link technologies. To avoid host OS intervention on data stream and remove a possible source of jitter, the design includes a network/transport layer offload module with cycle-accurate, upper-bound latency, supporting UDP, KM3link Time Division Multiplexing and APElink protocols. After NaNet architecture description and its latency/bandwidth characterization for all supported links, two real world use cases will be presented: the GPU-based low level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data link for KM3 underwater neutrino telescope.

Tags: CUDA, FPGA, Instrumentation and Detectors, nVidia, Physics, Tesla M2070

June 16, 2014 by hgpu

Rating: 2.5/5. From 2 votes.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

Your response

Recent source codes

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

True 4-Bit Quantized CNN Training on CPU

cuFuzz: A GPU-oriented coverage-guided fuzzer for userland CUDA application

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Most viewed papers (last 30 days)

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)