The datasets and source code of the NDSS 2025 paper "BinEnhance: An Enhancement Framework Based on External Environment Semantics for Binary Code Search".
################################################
Eval Part
################################################
[Eval] If you just want to evaluate the improvement of BinEnhance on the five baselines, you can complete the steps in this section.
- Install the required environment; the Python version we use is Python 3.8.
pip install -r Requirements.txt
- Download the evaluation dataset from OneDrive:
This evaluation dataset contains the embeddings of all test binary functions generated by the five methods used in the paper (HermesSim [1], TREX [2], Asteria [3], Asm2vec [4], Gemini [5]), as well as all embeddings enhanced by BinEnhance. There are 13 folders after unpacking the evaluation dataset. Asm2vec and Asm2vec+BinEnhance contain the embedding files generated by the original method and by BinEnhance (from dataset D2_norm); the other models are organized in the same way. Eval_datas is the evaluation function pool of the different models, RDFs contains all functions with readable data features, and save_models is our trained model.
- Use the storage path of the evaluation dataset downloaded in step 2 to run Eval.py:
python Eval.py --data-path="xxx/dataset2_Eval"
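For background, evaluating a binary code search model with pre-computed embeddings typically means ranking a function pool by cosine similarity for each query function and measuring how often the homologous function is ranked first. The sketch below only illustrates that idea with hypothetical array inputs; it is not the logic of Eval.py.

import numpy as np

def recall_at_1(query_embs, pool_embs, ground_truth):
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                # (num_queries, pool_size) similarity matrix
    top1 = sims.argmax(axis=1)    # index of the best-matching pool function
    return float((top1 == np.asarray(ground_truth)).mean())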
################################################
Training and Inference Part of BinEnhance
################################################
[Train] If you want to reproduce the BinEnhance framework, you can do so by following the steps below.
[PS:] The following steps require the IDA Pro tool to extract our EESG, GPUs for training, and other dependencies (such as multiprocessing). You may need to modify some settings in the code (such as the binary_dir path). If you do not meet the above requirements, you can run the Eval part instead; we have provided the intermediate results.
- Install the required environment; the Python version we use is Python 3.8.
pip install -r requirements_train.txt
- Node Initial Embedding Generation.
[Function Node] After obtaining the function embeddings of the baselines, we can run the following function to strip a leading '.' from each function name (a short usage example follows the code).
def get_unified_funcname(funcname):
    # Strip a leading '.' that some toolchains prepend to function names.
    if len(funcname) > 0:
        if '.' == funcname[0]:
            funcname = funcname[1:]
    return funcname
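For example, assuming the baseline embeddings are loaded as a {function_name: vector} mapping (the exact storage format is an assumption), the names can be unified like this:

# Hypothetical usage: unify the keys of a {funcname: embedding} mapping.
raw_embeddings = {".memcpy": [0.1, 0.2], "main": [0.3, 0.4]}
unified = {get_unified_funcname(name): emb for name, emb in raw_embeddings.items()}
# -> {"memcpy": [0.1, 0.2], "main": [0.3, 0.4]}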
Then run whitening_transformation.py to reduce the dimensionality of the embeddings (a background sketch of the whitening step follows the command).
python whitening_transformation.py --input-dir xxx --output-dir xxx --dimension xxx
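As background on what a whitening transformation of this kind does (this is a generic sketch, not the code of whitening_transformation.py): the embeddings are centered, decorrelated using the SVD of their covariance matrix, and truncated to the target dimension.

import numpy as np

def whitening(embeddings, dim):
    # embeddings: (n, d) matrix of function embeddings; dim: target dimension.
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)        # (d, d) covariance matrix
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))        # whitening matrix
    return (embeddings - mu) @ w[:, :dim]    # whitened, reduced embeddings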
[String Node] After running the IDA scripts (step 3), we can run mpnet_generate.py to generate the embeddings of the strings (an illustrative sketch follows the command).
python mpnet_generate.py --input-dir xxx --output-dir xxx --dimension xxx --model-path xxx
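The script name and the --model-path option suggest an MPNet sentence encoder is used to embed the readable strings. A minimal sketch with the sentence-transformers library (the checkpoint name below is an assumption; pass your own --model-path in practice):

from sentence_transformers import SentenceTransformer

# Assumed checkpoint; replace with the model given via --model-path.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
strings = ["usage: %s [options]", "error: cannot open file"]
string_embeddings = model.encode(strings)    # one vector per string
print(string_embeddings.shape)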
- EESG construction and SEM training.
We need to modify IDA_script/settings.py and run the IDA Python scripts in the IDA_script folder to extract the EESGs.
python extract.py --process-num 30 --output-dir xxx
Then, we need to split our EESG files, function embeddings, and string embeddings into train, validation, and test sets (8:1:1); a minimal split sketch is shown after the command below. After this, modify the corresponding paths (EESG, function embeddings, and string embeddings of each split) in train.py. Finally, run train.py to train our SEM model.
python train.py --base-path xxx --model-save xxx --fis HermesSim --name dataset2
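A minimal sketch of the 8:1:1 split mentioned above (the directory layout and the .json extension are assumptions; the same split should be applied consistently to the EESG files and both embedding sets):

import random, pathlib

def split_files(eesg_dir, seed=0):
    # Shuffle the EESG files and split them 8:1:1 into train/valid/test lists.
    files = sorted(pathlib.Path(eesg_dir).glob("*.json"))   # assumed extension
    random.Random(seed).shuffle(files)
    n = len(files)
    return files[:int(0.8 * n)], files[int(0.8 * n):int(0.9 * n)], files[int(0.9 * n):]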
- Eval.
See the Eval part.
################################################
Datasets in our paper
################################################
D2_norm and D2_noinline in the paper (the homologous function pairs for the evaluation of the function-inlining scenario can be constructed from them): these datasets can be downloaded from normal_dataset and noinline_dataset in BinKit.
################################################
References
################################################
[1] H. He, X. Lin, Z. Weng, R. Zhao, S. Gan, L. Chen, Y. Ji, J. Wang, and Z. Xue, "Code is not natural language: Unlock the power of semantics-oriented graph representation for binary code similarity detection," in 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, 2024.
[2] K. Pei, Z. Xuan, J. Yang, S. Jana, and B. Ray, "Learning approximate execution semantics from traces for binary function similarity," IEEE Transactions on Software Engineering, vol. 49, no. 4, pp. 2776–2790, 2022.
[3] S. Yang, L. Cheng, Y. Zeng, Z. Lang, H. Zhu, and Z. Shi, "Asteria: Deep learning-based AST-encoding for cross-platform binary code similarity detection," in 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2021, pp. 224–236.
[4] S. H. Ding, B. C. Fung, and P. Charland, "Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization," in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 472–489.
[5] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, "Neural network-based graph embedding for cross-platform binary code similarity detection," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 363–376.