Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Argonne National Laboratory, Lemont, USA
arXiv:2512.03086 [cs.PL] (29 Nov 2025)
@misc{chen2025codepairsdialoguebaseddata,
  title={Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation},
  author={Le Chen and Nuo Xu and Winson Chen and Bin Lei and Pei-Hung Lin and Dunzhi Zhou and Rajeev Thakur and Caiwen Ding and Ali Jannesari and Chunhua Liao},
  year={2025},
  eprint={2512.03086},
  archivePrefix={arXiv},
  primaryClass={cs.PL},
  url={https://arxiv.org/abs/2512.03086}
}
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran→C++ and C++→CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
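The Questioner-Solver loop from the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, prompt strings, and dialogue record layout are assumptions, and `solver`, `questioner`, and `check` stand in for the two LLMs and the external compiler/runtime feedback source.

```python
# Hypothetical sketch of a dual-LLM Questioner-Solver refinement loop.
# `solver` proposes a translation, `check` supplies external feedback
# (e.g. compile + unit-test results), and `questioner` turns that
# feedback into the next refinement prompt. All names are illustrative.

def refine(solver, questioner, check, source, max_turns=4):
    """Iterate Solver proposals until `check` accepts the translation.

    solver(prompt) -> candidate code
    questioner(code, diagnostics) -> follow-up prompt for the next turn
    check(code) -> (ok, diagnostics), e.g. a compile + unit-test harness

    Returns (final_code_or_None, dialogue), where `dialogue` is the
    multi-turn transcript a pipeline like this would keep as training data.
    """
    dialogue = []
    prompt = f"Translate to the target language:\n{source}"
    for _ in range(max_turns):
        code = solver(prompt)
        ok, diagnostics = check(code)
        dialogue.append({"prompt": prompt, "code": code, "passed": ok})
        if ok:
            return code, dialogue
        # Fold the external feedback into the next Questioner turn.
        prompt = questioner(code, diagnostics)
    return None, dialogue
```

With stub callables in place of the LLMs, each loop iteration yields one dialogue turn, and only translations that pass the external check are returned as verified, matching the paper's distinction between plain code pairs and verified multi-turn refinement data.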
December 21, 2025 by hgpu