An empirical evaluation of large language models in static code analysis for PHP vulnerability detection

Çetin, Orçun and Ekmekçioğlu, Emre and Arief, Budi and Hernandez-Castro, Julio (2024) An empirical evaluation of large language models in static code analysis for PHP vulnerability detection. Journal of Universal Computer Science, 30 (9). pp. 1163-1183. ISSN 0948-695X (Print) 0948-6968 (Online)

PDF (Open Access)
Available under License Creative Commons Attribution.

Official URL: http://dx.doi.org/10.3897/jucs.134739

Abstract

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However, as is the case with any complex software system, creating sophisticated software free from any security vulnerabilities is a very challenging task. One method of enhancing software security is static code analysis. This technique can be used to identify potential vulnerabilities in the source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but the literature still contains very few studies, and those quite limited, on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbot models: ChatGPT-4, ChatGPT-3.5, Claude, Bard/Gemini, and Llama-2, assessing their abilities to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated issues related to these LLMs’ false-positive rates using 97 patched code samples. We specifically focused on PHP vulnerabilities, given the prevalence of PHP in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%.
For all models, a significant percentage of vulnerabilities were classified as partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories than the other models. Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories. Surprisingly, our findings reveal high false-positive rates across all LLMs. Even the best-performing model (ChatGPT-4) had a false-positive rate of nearly 63%, while several models performed far worse, with false-positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal improvement in vulnerability detection rates. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP.

Item Type:	Article
Uncontrolled Keywords:	ChatGPT, Claude, Bard, Gemini, Llama-2, Static code analysis, PHP vulnerabilities, Vulnerability detection, LLM in cybersecurity
Divisions:	Faculty of Engineering and Natural Sciences
Depositing User:	Orçun Çetin
Date Deposited:	20 Sep 2024 15:22
Last Modified:	16 Dec 2024 12:40
URI:	https://research.sabanciuniv.edu/id/eprint/50010
