PREDICTING CVE EXPLOITATION BASED ON NVD AND KEV OPEN DATA FOR RISK-ORIENTED PRIORITIZATION
DOI:
https://doi.org/10.28925/2663-4023.2025.30.906
Keywords:
CVE, CVSS, KEV, exploitation prediction, machine learning, logistic regression, class imbalance, patch prioritization, ML.NET, cybersecurity
Abstract
As the number of publicly disclosed software vulnerabilities grows, security teams find it increasingly difficult to identify the issues that require urgent remediation. Systems such as the Common Vulnerability Scoring System (CVSS) provide severity ratings, but they do not indicate whether a vulnerability will be exploited in practice. This study proposes a machine learning approach to predicting vulnerability exploitation using structured public data from the National Vulnerability Database (NVD) and the CISA-maintained Known Exploited Vulnerabilities (KEV) catalog. A labeled dataset of over 300,000 CVEs is built, in which actually exploited CVEs are identified by their presence in KEV. The extracted features include CVSS vectors, CWE identifiers, vendor/product metadata, and temporal characteristics. Because of the extreme class imbalance (exploited CVEs make up ~0.45% of the dataset), oversampling and decision-threshold tuning are applied. Logistic regression trained in ML.NET is used to build an interpretable model that learns meaningful patterns distinguishing exploited vulnerabilities. Evaluation across a range of decision thresholds shows high recall and increasing precision, offering a transparent and reproducible tool for prioritization. The limitation of relying on the incomplete KEV catalog as a “source of truth” is also addressed, and directions for improvement are outlined: integrating NLP embeddings of CVE descriptions, probability calibration, and time-based validation to prevent data leakage. The approach makes cybersecurity decisions more risk-based and can be integrated into vulnerability management processes in organizations of various scales.
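The decision-threshold evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the study's ML.NET implementation: the scores, labels, and function names below are hypothetical and chosen only to show how recall and precision trade off as the threshold moves on imbalanced data.

```python
from typing import List, Tuple


def precision_recall_at(scores: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Return (precision, recall) when items scoring >= threshold are flagged."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def sweep(scores: List[float], labels: List[int],
          thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Evaluate precision and recall at each candidate threshold."""
    return [(t, *precision_recall_at(scores, labels, t)) for t in thresholds]


# Hypothetical model scores and labels (1 = listed in KEV); invented for
# illustration only, not taken from the study.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

for t, p, r in sweep(scores, labels, [0.25, 0.50, 0.75]):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the lowest threshold flags every positive at the cost of more false alarms, while higher thresholds flag fewer items with higher precision. A prioritization workflow would pick the threshold that matches the team's remediation capacity.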
License
Copyright (c) 2025 Владислав Денисюк

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.