Investigating Input Representations and Representation Models of Source Code for Machine Learning
Technische Universität Dresden
TU Dresden, 2020
@article{brauckmann2020investigating,
title={Investigating Input Representations and Representation Models of Source Code for Machine Learning},
author={Brauckmann, Alexander},
year={2020}
}
Machine Learning methods are actively used to solve various tasks on source code, such as in Compilers to improve performance of executable code, or IDEs to boost developer productivity. While the use cases are manifold, most of these methods rely on manually-defined features that require substantial engineering efforts, while not necessarily being optimal. In this thesis, we introduce a novel approach to encode programs as graphs that include compiler-internal semantics and use the recently discovered class of Graph Neural Networks to learn task-specific features automatically. Specifically, we design a framework for learning program representations based on Abstract Syntax Trees and Control- and Dataflow Graphs, extracted with the Clang/LLVM compiler infrastructure. We empirically evaluate the approach in compiler heuristic use cases and show to outperform existing methods based on Recurrent Neural Networks (RNNs) in generalization performance and inference time. In the task of code generation however, we show limitations of the graph-generative architecture we used, which cause a bias towards generating samples of less size and complexity.
July 12, 2020 by hgpu