Long Code for Code Search

Fan Hu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Xirong Li
School of Information, Renmin University of China
arXiv:2208.11271 [cs.SE] (https://arxiv.org/pdf/2208.11271.pdf)

@misc{hu2022longcode,
   doi={10.48550/ARXIV.2208.11271},
   url={https://arxiv.org/abs/2208.11271},
   author={Hu, Fan and Wang, Yanlin and Du, Lun and Zhang, Hongyu and Han, Shi and Zhang, Dongmei and Li, Xirong},
   keywords={Software Engineering (cs.SE), FOS: Computer and information sciences},
   title={Long Code for Code Search},
   publisher={arXiv},
   year={2022},
   copyright={arXiv.org perpetual, non-exclusive license}
}

Thanks to Transformer-based pretraining models, the performance of code search has improved significantly. However, due to the computational cost of multi-head self-attention and limited GPU memory, there is a limit on the input token length. Existing pretrained code models, such as GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code (i.e., code longer than 256 tokens). Unlike a long text document, which can be regarded as a whole with complete semantics, the semantics of long code is discontinuous, as a piece of long code may contain different code modules. It is therefore unreasonable to directly apply long-text processing methods to long code. To tackle the long code problem, we propose MLCS (Modeling Long Code for Code Search) to obtain a better representation for long code. Our experimental results show the effectiveness of MLCS for long code retrieval. With MLCS, we can use Transformer-based pretraining models to model long code without changing their internal structure or re-pretraining them. Through AST-based splitting and attention-based fusion, MLCS achieves an overall mean reciprocal rank (MRR) score of 0.785, outperforming the previous state-of-the-art result of 0.713 on the public CodeSearchNet benchmark.
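
The abstract outlines a split-encode-fuse pipeline: break long code into AST-level segments, encode each segment independently with an unmodified pretrained encoder, and fuse the segment embeddings with attention. Below is a minimal sketch of that idea, not the authors' implementation: the split granularity (top-level functions/classes via Python's ast module), the additive attention form, and the helper names split_by_ast, encode_segments, and AttentionFusion are all illustrative assumptions.

# Sketch (not the paper's code): AST-based splitting + attention-based fusion
# around a frozen pretrained code encoder (GraphCodeBERT used as an example).
import ast

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/graphcodebert-base"  # any BERT-style code encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def split_by_ast(source: str) -> list[str]:
    """Split Python source into top-level functions/classes (an assumed
    stand-in for the paper's AST-based splitting)."""
    tree = ast.parse(source)
    segments = [ast.get_source_segment(source, node) for node in tree.body]
    return [s for s in segments if s] or [source]


@torch.no_grad()
def encode_segments(segments: list[str]) -> torch.Tensor:
    """Encode each segment independently; each fits within the 256-token
    limit, so the encoder needs no architectural change or re-pretraining."""
    batch = tokenizer(segments, truncation=True, max_length=256,
                      padding=True, return_tensors="pt")
    out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] embedding per segment


class AttentionFusion(nn.Module):
    """Fuse a variable number of segment embeddings into one code vector
    with learned additive attention (an assumed form of the paper's
    attention-based fusion)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seg_emb: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(seg_emb), dim=0)  # (n_seg, 1)
        return (weights * seg_emb).sum(dim=0)                # (dim,)


code = '''
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
'''
fusion = AttentionFusion()
code_vec = fusion(encode_segments(split_by_ast(code)))
print(code_vec.shape)  # torch.Size([768])

In a retrieval setup, one would presumably encode the natural-language query with the same pretrained model and rank candidates by cosine similarity against the fused code vectors; since the encoder is reused as-is, only the lightweight fusion layer needs training.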