Deep Learning for Obfuscated Code Analysis
Indiana University
School of Informatics, Computing, and Engineering, Indiana University
@phdthesis{shroyer2023deep,
title={Deep Learning for Obfuscated Code Analysis},
author={Shroyer, Alexander},
year={2023},
school={Indiana University}
}
Modern software development relies increasingly on third-party code dependencies, which enables rapid development but also increases the risk of introducing bugs, malware, or unauthorized intellectual property. The goal of this dissertation is to reduce these risks by making them easier to detect. Determining the meaning of an arbitrary program is at least as hard as solving the halting problem, which is provably undecidable. Instead, this work focuses on a narrower problem: assigning a similarity metric between a known program and an unknown one. To quantify the distance between two programs, one must account for slight variations due to diverse compilation, stripped debug symbols, or even intentional obfuscation. We address this variation by adding diversity to our training data through diverse compilation and deliberate obfuscation. These methods preserve the syntactic and structural qualities of valid code and permit augmentation of sparse datasets on a large scale. We train a variety of models to classify programs in the augmented training data. The trained models can then predict which parts of unknown programs are most similar to the training programs. In this work we train on standard library functions originally implemented in the C programming language within the musl library. This forms the basis of a novel method that can be applied to other codebases to quickly scan for similar examples in unfamiliar code.
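The augmentation by diverse compilation described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's actual pipeline: the compiler list, the optimization levels, the file name `strlen.c`, and the helper `compile_variants` are all assumptions chosen for illustration. The sketch only builds the compile commands; it does not run them.

```python
from itertools import product

# Hypothetical configuration: the abstract does not state which
# compilers or flags were used, so these values are assumptions.
COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
DEBUG_FLAGS = ["-g", ""]  # with and without debug symbols


def compile_variants(source, label):
    """Return (label, argv) pairs, one per compiler configuration.

    Every variant keeps the same label, so a classifier trained on the
    resulting object files sees many binary forms of one function.
    """
    variants = []
    for cc, opt, dbg in product(COMPILERS, OPT_LEVELS, DEBUG_FLAGS):
        # Encode the configuration in the output name to keep it unique.
        out = f"{label}.{cc}{opt}{'-g' if dbg else ''}.o"
        argv = [cc, opt] + ([dbg] if dbg else []) + ["-c", source, "-o", out]
        variants.append((label, argv))
    return variants


if __name__ == "__main__":
    for label, argv in compile_variants("strlen.c", "strlen"):
        print(label, " ".join(argv))
```

Under these assumptions, one source file yields 20 labeled variants (2 compilers, 5 optimization levels, 2 debug settings). Each command could be passed to `subprocess.run` to produce the object files, and an obfuscation pass could be added as a further axis in the same product.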
January 7, 2024 by hgpu