high performance computing on graphics processing units: hgpu.org

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin

The Ohio State University

arXiv:2312.09601 [cs.CR], (15 Dec 2023)

DOI:10.48550/arXiv.2312.09601

@misc{jin2023binary,
  title={Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models},
  author={Xin Jin and Jonathan Larson and Weiwei Yang and Zhiqiang Lin},
  year={2023},
  eprint={2312.09601},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}

Binary code summarization, while invaluable for understanding code semantics, is challenging because it is labor-intensive. This study examines the potential of large language models (LLMs) for binary code comprehension. To this end, we present BinSum, a comprehensive benchmark and dataset of over 557K binary functions, and introduce a novel method for prompt synthesis and optimization. To gauge LLM performance more accurately, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. The evaluation generated 4 billion inference tokens and consumed a total of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.

Tags: Benchmarking, Computer science, nVidia, nVidia A100, Software Engineering

December 24, 2023 by hgpu

