Optimizing Pre-Trained Code Embeddings With Triplet Loss for Code Smell Detection
Citation
NİZAM, Ali, Ertuğrul İSLAMOĞLU, Ömer Kerem ADALI & Musa AYDIN. "Optimizing Pre-Trained Code Embeddings With Triplet Loss for Code Smell Detection." IEEE Access, 13 (2025): 1-16.
Abstract
Code embedding represents code semantics in vector form. Although code embedding-based
systems have been successfully applied to various source code analysis tasks, further research is required
to enhance code embeddings and improve code analysis capabilities, with the aim of surpassing the performance and functionality of static code analysis tools. In addition, standard methods for improving code embeddings are essential for developing more effective embedding-based systems, analogous to augmentation techniques in the image processing domain. This study creates a contrastive learning-based system to explore the potential of a generic method for enhancing code embeddings in code classification tasks. A triplet loss-based deep learning network is designed to increase intra-class similarity and the distance between classes. An experimental dataset containing code in the Java, Python, and PHP programming languages, labeled with four different code smells, is created by collecting code from open-source repositories on GitHub. We evaluate the proposed system’s effectiveness with the widely used BERT, CodeBERT, and GraphCodeBERT pre-trained models, using them to create code embeddings for the code classification task of code smell detection. Our findings indicate that the proposed system improves accuracy by an average of 8% and by up to 13% across the evaluated models. These results suggest that incorporating contrastive learning techniques into the code representation generation process, as a preprocessing step, can enhance performance in code analysis.
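To illustrate the kind of system the abstract describes, the following minimal sketch fine-tunes a pre-trained encoder with PyTorch’s TripletMarginLoss so that embeddings of snippets sharing a code smell move closer together while embeddings from different classes move apart. The checkpoint name, margin, learning rate, mean pooling, and example snippets are illustrative assumptions, not the authors’ released implementation.

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Pre-trained encoder; "microsoft/codebert-base" is an assumed checkpoint.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

# Triplet loss pulls anchor and positive together and pushes the negative
# away; the margin value is an assumption, not taken from the paper.
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def embed(snippets):
    # Mean-pool the last hidden state over non-padding tokens (assumed pooling).
    batch = tokenizer(snippets, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

# One training step: anchor and positive exhibit the same smell class
# (the snippets below are placeholders); the negative comes from a
# different class.
anchor   = embed(["void process() { /* very long method body */ }"])
positive = embed(["def process(self): # another example of the same smell\n    ..."])
negative = embed(["<?php echo htmlspecialchars($name); ?>"])

loss = triplet_loss(anchor, positive, negative)
loss.backward()
optimizer.step()

In such a setup, the fine-tuned encoder acts as the preprocessing step described in the abstract, and the resulting embeddings would feed a downstream code smell classifier.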