Data-efficient LLM Fine-tuning for Code Generation

Weijie Lv, Xuan Xia, Sheng-Jun Huang
Nanjing University of Aeronautics and Astronautics, Nanjing, China
arXiv:2504.12687 [cs.CL] (17 Apr 2025)

@misc{lv2025dataefficientllmfinetuningcode,
   title={Data-efficient LLM Fine-tuning for Code Generation},
   author={Weijie Lv and Xuan Xia and Sheng-Jun Huang},
   year={2025},
   eprint={2504.12687},
   archivePrefix={arXiv},
   primaryClass={cs.CL},
   url={https://arxiv.org/abs/2504.12687}
}

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, a performance gap remains between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy to improve both the effectiveness and the efficiency of training code-focused LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a "dynamic pack" technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% obtained with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach improves model performance as well as training efficiency.
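The abstract describes two ingredients: complexity-prioritized, distribution-aligned subset selection and a "dynamic pack" batching step that reduces padding tokens. The sketch below is only an illustration of those two ideas, not the authors' implementation; the function names, the complexity proxy, the binning scheme, and all parameters are assumptions.

```python
"""Minimal sketch, assuming a scalar complexity score per sample and a
fixed maximum sequence length; not the paper's actual code."""

import random
from typing import Callable, Sequence


def select_subset(samples: Sequence[str],
                  complexity: Callable[[str], float],
                  ratio: float = 0.4,
                  n_bins: int = 10,
                  seed: int = 0) -> list[str]:
    """Keep `ratio` of the data, preferring higher-complexity samples while
    preserving the per-bin proportions of the original complexity distribution."""
    rng = random.Random(seed)
    ordered = sorted(samples, key=complexity)
    bin_size = max(1, len(ordered) // n_bins)
    bins = [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]
    picked: list[str] = []
    for b in bins:
        k = max(1, round(len(b) * ratio))
        # within each bin, keep the most complex examples (ties broken randomly)
        b = sorted(b, key=lambda s: (complexity(s), rng.random()), reverse=True)
        picked.extend(b[:k])
    return picked


def dynamic_pack(token_lengths: list[int], max_len: int = 2048) -> list[list[int]]:
    """Greedy length-sorted packing: group sequence indices so each pack's total
    length stays under `max_len`, which cuts the padding needed per batch."""
    order = sorted(range(len(token_lengths)),
                   key=lambda i: token_lengths[i], reverse=True)
    packs: list[list[int]] = []
    pack_lens: list[int] = []
    for idx in order:
        length = token_lengths[idx]
        for p, total in enumerate(pack_lens):
            if total + length <= max_len:   # first existing pack with room
                packs[p].append(idx)
                pack_lens[p] += length
                break
        else:
            packs.append([idx])             # start a new pack
            pack_lens.append(length)
    return packs
```

The packing routine is a standard first-fit-decreasing heuristic; whatever concatenation or attention-masking scheme the paper uses on top of it would sit downstream of this grouping step.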
