Dato: A Task-Based Programming Model for Dataflow Accelerators
Cornell University
arXiv:2509.06794 [cs.PL], 8 Sep 2025
@misc{fang2025datotaskbasedprogrammingmodel,
  title={Dato: A Task-Based Programming Model for Dataflow Accelerators},
  author={Shihan Fang and Hongzheng Chen and Niansong Zhang and Jiajie Li and Han Meng and Adrian Liu and Zhiru Zhang},
  year={2025},
  eprint={2509.06794},
  archivePrefix={arXiv},
  primaryClass={cs.PL},
  url={https://arxiv.org/abs/2509.06794}
}
Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator’s spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.
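To make the abstract's idea of "a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types" more concrete, here is a minimal, hypothetical Python sketch. It is not the actual Dato API: the names `Stream`, `shard_rows`, `producer`, and `gemv_task` are invented for illustration, and plain threads and queues stand in for the accelerator's spatial fabric.

```python
# Hypothetical illustration only -- NOT the Dato API. It mimics the abstract's
# ideas (tasks wired together by explicit streams, inputs sharded by a layout)
# using ordinary Python threads and queues.
import queue
import threading
import numpy as np


class Stream:
    """An explicit, bounded channel connecting exactly two tasks."""
    def __init__(self, capacity=4):
        self._q = queue.Queue(maxsize=capacity)

    def put(self, item):
        self._q.put(item)

    def get(self):
        return self._q.get()


def shard_rows(matrix, parts):
    """A toy 'layout': split a matrix into row blocks, one per parallel task."""
    return np.array_split(matrix, parts, axis=0)


def producer(idx, block, out: Stream):
    """Task: push one shard of A, tagged with its position, into its stream."""
    out.put((idx, block))


def gemv_task(inp: Stream, x, out: Stream):
    """Task: multiply the received shard by x and forward the partial result."""
    idx, block = inp.get()
    out.put((idx, block @ x))


if __name__ == "__main__":
    A = np.random.rand(8, 8)
    x = np.random.rand(8)
    parts = 2

    results = Stream(capacity=parts)
    threads = []
    for idx, blk in enumerate(shard_rows(A, parts)):
        s = Stream()  # one explicit stream per producer/consumer pair
        threads += [
            threading.Thread(target=producer, args=(idx, blk, s)),
            threading.Thread(target=gemv_task, args=(s, x, results)),
        ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reassemble shards in layout order and check against a reference GEMV.
    pieces = sorted((results.get() for _ in range(parts)), key=lambda p: p[0])
    y = np.concatenate([blk for _, blk in pieces])
    print(np.allclose(y, A @ x))
```

In the paper's model, such streams and layouts are type-level constructs that the compiler maps first virtually and then physically onto the accelerator; this sketch only conveys the task-graph structure in host Python.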
September 21, 2025 by hgpu