Vulnerability Prediction – Importance and Challenges
Building secure software is highly important for both the end users and the owning enterprises. Nowadays, software controls critical daily activities, so a security breach could have serious consequences both for user security (or even safety) and for a company's reputation and finances. For this reason, software development companies have shifted their focus towards the security-by-design paradigm in order to build software that is highly secure from the ground up. To achieve this, several tools are employed during the development process, which enable the detection and elimination of potential vulnerabilities.
One important mechanism that facilitates the identification of vulnerabilities in software is vulnerability prediction. Vulnerability prediction is responsible for the identification of security hotspots, i.e., software components that are likely to contain critical vulnerabilities. This is achieved through the construction of vulnerability prediction models (VPMs), which are mainly machine learning models that are built based on software attributes retrieved primarily from the source code of the analysed software (e.g., software metrics, text features, etc.). The results of the vulnerability prediction models are highly useful for developers and project managers, as they allow them to better prioritise their testing and fortification efforts by allocating limited test resources to high-risk (i.e., potentially vulnerable) areas.
Among the existing solutions, text mining-based VPMs have demonstrated the best predictive performance. Most of the text mining-based models proposed in the literature so far rely on one of two concepts. The first is Bag of Words (BoW), a vector of the tokens (i.e., keywords) found in the source code along with the number of their occurrences. The second is word token sequences (often represented using word embedding techniques), which correspond to the sequences of instructions in the analysed source code. Despite their promising results, the predictive performance of these solutions is not yet high enough for them to be used reliably in practice, so there is room for improvement. Recently, more advanced concepts have started being investigated in the literature in order to further enhance the predictive performance of text mining-based VPMs. One direction that has recently started gaining the attention of the research community is the examination of whether the adoption of transformers, such as the Bidirectional Encoder Representations from Transformers (BERT) and its alternatives, could lead to more accurate vulnerability prediction.
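To make the difference between the two text mining representations concrete, here is a minimal Python sketch. The regular-expression tokenizer and the function names are illustrative assumptions, not the tokenizer used in the work described here.

```python
import re
from collections import Counter
from typing import List

# Assumption: a deliberately simple tokenizer that splits source text into
# identifiers/keywords, numbers, and single punctuation characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def bag_of_words(source: str) -> Counter:
    """Map each token to its number of occurrences (order is discarded)."""
    return Counter(TOKEN_PATTERN.findall(source))


def token_sequence(source: str) -> List[str]:
    """Keep the tokens in their original order instead of counting them."""
    return TOKEN_PATTERN.findall(source)


snippet = "int n = read(); buf[n] = 0;"
bow = bag_of_words(snippet)    # counts only, e.g. bow["n"] == 2
seq = token_sequence(snippet)  # ordered tokens, starting with "int", "n", "="
```

BoW records only how often each token appears, whereas the token sequence preserves the order of the instructions, which is the information that sequence models can exploit.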
To this end, we developed deep learning (DL) models capable of predicting whether a software component is vulnerable, using the raw text of the source code in the form of sequences of instructions and methods from the fields of natural language processing (NLP) and text classification. In other words, we focused on building text mining-based VPMs utilising the popular concept of word token sequences and deep learning. We also examined whether the adoption of BERT could lead to sufficiently accurate vulnerability prediction models.
What is BERT?
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT builds on the Transformer architecture. In its vanilla form, the Transformer consists of two separate mechanisms: an encoder that reads the text input and a decoder that generates a prediction for the task. Because the goal of BERT is to generate a language model, only the encoder mechanism is required. The Transformer encoder reads the entire sequence of words at once, as opposed to directional models, which read the text input sequentially (left-to-right or right-to-left). As a result, it is regarded as bidirectional, though it would be more accurate to describe it as non-directional. This feature enables the model to learn the context of a word based on its surroundings (to the left and right of the word). The Transformer encoder is described in detail in the figure below:
As the figure above shows, the input of BERT is a series of tokens that are embedded into vectors before being processed by the neural network. The output is a sequence of H-dimensional vectors, each corresponding to the input token with the same index. The vectors produced by BERT can be utilised for building machine learning models for any classification problem, including vulnerability prediction, as we investigate in the present work.
Vulnerability Prediction Models using Text Mining and BERT
For the purposes of the present work, we utilised two popular vulnerability datasets provided by the National Institute of Standards and Technology (NIST) and by OWASP, which contain examples of vulnerable and clean software components written in the Java and C/C++ programming languages. For C/C++, we utilised the Juliet dataset provided by NIST, which contains 7651 source code files, 3438 of which are considered vulnerable and the remaining 4213 clean. For Java, we utilised the OWASP Benchmark, which contains 1415 vulnerable class files and 1325 class files considered clean.
For each dataset, the source code files were initially cleansed (i.e., comments were removed, literals were replaced with generic values, etc.) and subsequently tokenized in order to retrieve the sequences of their tokens. For these token sequences to be used for building vulnerability prediction models, they need to be turned into numerical values, since the majority of machine learning algorithms, including the neural networks that are our main focus, operate on numerical inputs. More specifically, integer encoding was employed to turn the tokens into integers, and then the embedding vectors were produced. The embedding vectors are the numerical representation of the text tokens and can be used as inputs for our models.
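The preprocessing steps described above (cleansing, tokenization, and integer encoding) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the regular expressions, the `STR` and `NUM` placeholders, the reserved ids, and the padding length are hypothetical choices, not the exact pipeline used in this work, and the embedding step is left out.

```python
import re
from typing import Dict, List

PAD, UNK = 0, 1  # assumption: ids reserved for padding and unseen tokens


def cleanse(source: str) -> str:
    """Strip C/Java-style comments and replace literals with placeholders.

    Naive on purpose: comment markers inside string literals are not handled.
    """
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    source = re.sub(r'"(?:\\.|[^"\\])*"', "STR", source)         # string literals
    source = re.sub(r"\b\d+(?:\.\d+)?\b", "NUM", source)         # numeric literals
    return source


def tokenize(source: str) -> List[str]:
    """Split into identifiers/keywords, leftover numbers, and punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d\w*|[^\s\w]", source)


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Integer-encode a token sequence, truncating or padding to max_len."""
    ids = [vocab.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


code = 'int x = 42; // answer\nputs("hi"); /* done */'
tokens = tokenize(cleanse(code))
vocab = build_vocab([tokens])
ids = encode(tokens, vocab, max_len=12)
```

The resulting fixed-length integer sequences are what an embedding layer would then map to dense vectors.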
To construct VPMs based on the selected datasets, we used a pre-trained BERT model, specifically the pre-trained BERT model for sequence classification, which belongs to the BERT base category with respect to size. The model parameters, both those of the pre-trained model and those derived after fine-tuning it for the vulnerability prediction task investigated in the present analysis, are shown below:
| Number of layers | 12 |
| Number of epochs | 2-4 |
The VPMs implemented for the Java dataset and for the C/C++ dataset were then evaluated with respect to their predictive performance. For the evaluation of the models, we employed the 10-fold cross-validation technique. As a measure of predictive performance, we decided to use the F2-score. The reasoning behind the selection of this evaluation metric is that the F2-score takes into account both the Recall and the Precision of the produced model but puts more emphasis on Recall, which matters more for vulnerability prediction because a VPM should not miss existing vulnerabilities. The results are summarized in the table below:
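For reference, the F-beta score is defined as (1 + β²) · Precision · Recall / (β² · Precision + Recall), and the F2-score is the case β = 2. The following Python sketch computes it from binary labels and also generates plain 10-fold index splits. The function names are illustrative, and the splitter neither shuffles nor stratifies, which is an assumption of this sketch and not necessarily how the evaluation above was set up.

```python
from typing import Iterator, List, Sequence, Tuple


def f_beta(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 2.0) -> float:
    """F-beta for binary labels (1 = vulnerable); beta > 1 favours Recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: Precision and Recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy example: all 4 vulnerable files are found, at the cost of 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
f2 = f_beta(y_true, y_pred, beta=2.0)  # Precision = 2/3, Recall = 1
f1 = f_beta(y_true, y_pred, beta=1.0)
folds = list(k_fold_indices(25, k=10))
```

In the toy example the F2-score is higher than the F1-score, because the model has perfect Recall and F2 penalises its false alarms less heavily than F1 does.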
| Model | F2 score (%) |
The results of the experiments indicate that the models can identify vulnerabilities in the software to a satisfactory degree. More specifically, the F2-score in both cases was found to be above 70%, which is considered sufficient in the literature, and for C/C++ the F2-score is close to 80%, which is considered high. This suggests that the utilisation of BERT may lead to VPMs with sufficient predictive performance. In the rest of the project, we will further examine the capacity of BERT to be used in vulnerability prediction. More specifically, (i) additional datasets will be considered in order to investigate the generalizability of these observations, (ii) BERT alternatives like CodeBERT will also be examined in order to see whether more code-oriented models lead to better results, and (iii) a comparison will be conducted between models utilising BERT and models based simply on text mining approaches (e.g., BoW and token sequences) without dedicated transformations.