Software security is a critical concern for software development companies that want to deliver safe and dependable software to their clients [1]. Modern software applications are typically accessible through the internet and handle sensitive data, which leaves them continually exposed to malicious attacks. Exploiting a single vulnerability can have far-reaching repercussions both for the end user (e.g., information leakage) and for the organization that owns the affected software (e.g., financial losses and reputation damage) [2]. The software industry has therefore shifted its focus towards proactive approaches that can give developers indicative information about the security quality of their programs by detecting vulnerable hotspots in the source code.
The Vulnerability Prediction (VP) mechanism is one such approach, enabling software vulnerabilities to be predicted and mitigated early in the development cycle. Vulnerability Prediction Models (VPMs) can be used to prioritize testing and inspection efforts by allocating limited test resources to potentially risky components. Several VPMs have been proposed over the years, using a variety of software attributes as inputs, such as software metrics, static analysis warnings, and text-mining features, most notably the bag-of-words (BoW) representation [1], [3]. Although these models have shown encouraging results, there is still room for improvement. Static analysis warnings suffer from a large number of false positives alongside the genuinely severe alerts. The BoW technique appears to produce better results than static analysis alerts or software metrics; however, it is heavily dependent on the software project used for model training and therefore generalizes poorly. As a result, recent research has turned to more sophisticated techniques for detecting patterns in source code that signal the presence of a vulnerability, extracting information either from the raw source code of a given software application or from abstract representations of it, such as its Abstract Syntax Tree (AST). A minimal BoW-based sketch is shown below.
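To make the bag-of-words idea concrete, the sketch below shows how such a text-mining VPM could be assembled with scikit-learn: each component's source code is converted into a vector of token counts and fed to a conventional classifier. The file contents, labels, and choice of classifier are illustrative assumptions for this example, not the actual models evaluated in the cited studies.

```python
# Minimal sketch of a bag-of-words (BoW) vulnerability prediction model.
# The data below is a toy, hypothetical dataset; real VPMs are trained on
# large corpora of labelled software components.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical labelled data: raw source code per component and a
# vulnerable (1) / clean (0) label.
source_files = [
    "strcpy(buf, user_input); exec(cmd);",
    "int sum(int a, int b) { return a + b; }",
    "system(request.args.get('cmd'))",
    "def add(a, b): return a + b",
]
labels = [1, 0, 1, 0]

# Each component becomes a vector of token counts (the BoW representation).
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")
X = vectorizer.fit_transform(source_files)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42, stratify=labels
)

# Any off-the-shelf classifier can sit on top of the BoW features.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```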
In this work, we treat the raw text of the source code as sequences of instructions and build deep-learning (DL) models capable of predicting whether a software component is vulnerable or not, employing techniques from natural language processing (NLP) and text classification. The source code is viewed as text, and vulnerability assessment is framed, much like sentiment analysis, as a text classification problem. Using NLP techniques such as Bidirectional Encoder Representations from Transformers (BERT) [4], pre-processing and transforming the data into token sequences, and training DL models (e.g., recurrent neural networks) suited to sequential data, we detect potentially vulnerable components with a binary classifier trained primarily on text token sequences extracted from the source code. In addition, software metrics obtained from static code analyzers can be combined with the text-mining features to further improve the predictive performance of the models. A sketch of such a sequence-based classifier follows.
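As an illustration of the sequence-based approach, the following sketch (using TensorFlow/Keras) trains a bidirectional LSTM on integer-encoded token sequences. The vocabulary size, sequence length, and dummy data are assumptions made for the example and do not reflect the actual architecture or datasets developed in SmartCLIDE.

```python
# Minimal sketch of a sequence-based deep-learning vulnerability classifier,
# assuming the source code has already been tokenized into integer sequences.
# Illustrates the general embedding + recurrent-layer approach only.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000   # size of the source-code token vocabulary (assumed)
MAX_LEN = 300         # maximum token-sequence length per component (assumed)

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),       # token embeddings
    layers.Bidirectional(layers.LSTM(64)),   # learns sequential token patterns
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # vulnerable vs. non-vulnerable
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data standing in for pre-processed token sequences and labels.
X = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)

# Predicted probability that each component contains a vulnerability.
print(model.predict(X[:2], verbose=0))
```

In the same spirit, the embedding and LSTM layers could be replaced by a pre-trained Transformer encoder such as BERT, with the classifier head fine-tuned on the labelled token sequences.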
| Subcomponent name | Functionality |
|---|---|
| Quantitative Security Assessment | Responsible for assessing the security level of software applications based on the Security Assessment Model |
| Vulnerability Prediction | Responsible for predicting security issues (i.e., vulnerabilities) |
If you wish to learn more about this aspect of the SmartCLIDE project, we invite you to read the public deliverable entitled “D3.1 – Early SmartCLIDE Cloud IDE Design”.
- [1] M. Siavvas, E. Gelenbe, D. Kehagias, and D. Tzovaras, “Static Analysis-Based Approaches for Secure Software Development,” in Security in Computer and Information Sciences, Cham, 2018, pp. 142–157. doi: 10.1007/978-3-319-95189-8_13.
- [2] E. Gelenbe et al., “NEMESYS: Enhanced Network Security for Seamless Service Provisioning in the Smart Mobile Ecosystem,” in Information Sciences and Systems 2013, Cham, 2013, pp. 369–378. doi: 10.1007/978-3-319-01604-7_36.
- [3] S. M. Ghaffarian and H. R. Shahriari, “Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey,” ACM Comput. Surv., vol. 50, no. 4, p. 56:1-56:36, Aug. 2017, doi: 10.1145/3092566.
- [4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.