Evaluating the efficacy of large language models for automated vulnerable code generation

Çetin, Orçun and Bıyıklı, Nazlı (2026) Evaluating the efficacy of large language models for automated vulnerable code generation. In: 18th International Conference on Innovative Security Solutions for Information Technology and Communications (SecITC 2025), Bucharest, Romania


Abstract

The increasing adoption of large language models (LLMs) in programming tasks raises the question of whether they can generate vulnerable code for testing and benchmarking static and dynamic code analyzers. This paper evaluates GPT-4o, Grok-2, and DeepSeek-R1 on generating intentionally vulnerable Python Flask applications. Twenty-seven Common Weakness Enumerations (CWEs) were selected (three per OWASP Top 10 category), and each model generated 270 code samples, totaling 810 applications. We assessed code production, execution success, Pylint quality, and presence of the intended vulnerability. GPT-4o generated code in 87.8% (237) of cases, refusing 33 prompts, while Grok-2 and DeepSeek-R1 both generated code in 100% of cases. Execution success was highest for GPT-4o at 69.3% (187), followed by Grok-2 at 60.4% (163) and DeepSeek-R1 at 55.2% (149), indicating that although GPT-4o refused some prompts, it produced a higher proportion of successfully executing code than the other LLMs. Across all generated code, regardless of whether it executed successfully, GPT-4o produced vulnerable code in 74.8% of cases, compared to 81.9% for Grok-2 and 93% for DeepSeek-R1. However, when focusing exclusively on applications that executed correctly, these rates fell to 58.9%, 48.9%, and 49.6%, respectively. Lastly, the quality of the generated code, as indicated by Pylint scores, varied by CWE, revealing model-specific strengths across vulnerability types. These findings suggest that LLMs have significant potential to generate vulnerable code, particularly for well-defined and commonly encountered CWEs, but they often struggle with successful execution and with more complex vulnerabilities.
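As a quick sanity check on the arithmetic, the generation and execution percentages in the abstract can be recomputed from the reported counts. The sketch below is illustrative only and is not the authors' evaluation code. The variable and function names are hypothetical, and the counts of 270 generated samples for Grok-2 and DeepSeek-R1 are inferred from the reported 100% rate rather than stated directly.

```python
# Each model was prompted for 270 code samples (figure from the abstract).
SAMPLES_PER_MODEL = 270

# Samples for which code was produced. 237 is reported for GPT-4o;
# 270 for the other two models is inferred from their reported 100% rate.
generated = {"GPT-4o": 237, "Grok-2": 270, "DeepSeek-R1": 270}

# Samples that executed successfully (counts reported in the abstract).
executed = {"GPT-4o": 187, "Grok-2": 163, "DeepSeek-R1": 149}


def rate(count, total=SAMPLES_PER_MODEL):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


generation_rate = {model: rate(n) for model, n in generated.items()}
execution_rate = {model: rate(n) for model, n in executed.items()}
refusals = {model: SAMPLES_PER_MODEL - n for model, n in generated.items()}

for model in generated:
    print(f"{model}: generated {generation_rate[model]}%, "
          f"executed {execution_rate[model]}%, refused {refusals[model]}")
```

Both rates use all 270 prompts per model as the denominator, which is what reproduces the reported figures. The vulnerability rates are not recomputed here because the abstract gives them only as percentages, without the underlying counts.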
Item Type: Papers in Conference Proceedings
Uncontrolled Keywords: ChatGPT; DeepSeek-R1; Grok-2; Large Language Model; OWASP TOP 10; Secure Development; Vulnerable Code Generation
Divisions: Faculty of Engineering and Natural Sciences
Depositing User: Orçun Çetin
Date Deposited: 12 Jun 2026 10:03
Last Modified: 12 Jun 2026 10:03
URI: https://research.sabanciuniv.edu/id/eprint/54145
