PREDICTING CVE EXPLOITATION BASED ON NVD AND KEV OPEN DATA FOR RISK-ORIENTED PRIORITIZATION

Authors

DOI:

https://doi.org/10.28925/2663-4023.2025.30.906

Keywords:

CVE, CVSS, KEV, exploitation prediction, machine learning, logistic regression, class imbalance, patch prioritization, ML.NET, cybersecurity

Abstract

As the number of publicly disclosed software vulnerabilities grows, security teams find it increasingly difficult to identify the issues that require urgent remediation. Systems such as the Common Vulnerability Scoring System (CVSS) provide severity ratings, but they do not indicate whether a vulnerability will be exploited in practice. This study proposes a machine learning approach to predicting vulnerability exploitation using structured public data from the National Vulnerability Database (NVD) and the CISA-maintained Known Exploited Vulnerabilities (KEV) catalog. A labeled dataset of over 300,000 CVEs is built, in which the actually exploited ones are identified by their presence in KEV. The extracted features include CVSS vectors, CWE identifiers, vendor/product metadata, and temporal characteristics. Because of the extreme class imbalance (exploited CVEs make up ~0.45% of the dataset), oversampling and decision threshold tuning are applied. Logistic regression trained in ML.NET is used to build an interpretable model that learns meaningful patterns distinguishing exploited vulnerabilities from the rest. Evaluation across a range of decision thresholds shows high recall and increasing precision, offering a transparent and reproducible tool for prioritization. The study also discusses the limitation of treating the incomplete KEV catalog as a "source of truth" and outlines directions for improvement: integrating NLP embeddings of CVE descriptions, probability calibration, and time-based validation to prevent data leakage. The approach strengthens risk-based decision-making in cybersecurity and can be integrated into vulnerability management processes in organizations of various sizes.
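The two imbalance-handling steps named in the abstract, oversampling and decision threshold tuning, can be illustrated with a minimal sketch. It is written in plain Python rather than ML.NET and is not the author's implementation: the function names (`oversample_minority`, `precision_recall`, `sweep_thresholds`) and the toy scores are hypothetical, and simple random duplication is assumed because the abstract does not specify the oversampling variant.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate positive (exploited) rows until classes are balanced.

    Assumption: plain random oversampling; the paper may use another variant.
    """
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or len(pos) >= len(neg):
        return list(rows), list(labels)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return list(rows) + extra, list(labels) + [1] * len(extra)


def precision_recall(scores, labels, threshold):
    """Precision and recall when CVEs with score >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    return [(t, *precision_recall(scores, labels, t)) for t in thresholds]


if __name__ == "__main__":
    # Toy model scores and KEV-style labels (1 = exploited), for illustration only.
    toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
    toy_labels = [1, 0, 1, 0, 0]
    for t, p, r in sweep_thresholds(toy_scores, toy_labels, [0.2, 0.5, 0.85]):
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Oversampling should be applied to the training split only, so that precision and recall on the evaluation split reflect the real class ratio. Sweeping the threshold makes the trade-off explicit: a lower threshold flags more CVEs and raises recall, while a higher one flags fewer and tends to raise precision.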

References

Allodi, L., & Massacci, F. (2012). A preliminary analysis of vulnerability scores for attacks in wild: The EKITS and SYM datasets. BADGERS '12: Proceedings of the 2012 ACM Workshop on Building analysis datasets and gathering experience returns for security, 17–24.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

Shalev-Shwartz, S., & Zhang, T. (2014). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2), 105–145. https://doi.org/10.1007/s10107-014-0839-0

Zadrozny, B., & Elkan, C. (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 694–699. https://doi.org/10.1145/775047.775151

Saito, T., & Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), Article e0118432. https://doi.org/10.1371/journal.pone.0118432

Lu, H., & Mazumder, R. (2020). Randomized Gradient Boosting Machine. SIAM Journal on Optimization, 30(4), 2780–2808. https://doi.org/10.1137/18m1223277

Li, X., Moreschini, S., Zhang, Z., Palomba, F., & Taibi, D. (2023). The anatomy of a vulnerability database: A systematic mapping study. Journal of Systems and Software, 111679. https://doi.org/10.1016/j.jss.2023.111679

Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. ICML '05: Proceedings of the 22nd international conference on Machine learning, 625–632. https://doi.org/10.1145/1102351.1102430

Almahmoud, Z., Yoo, P. D., Damiani, E., Choo, K.-K. R., & Yeun, C. Y. (2025). Forecasting Cyber Threats and Pertinent Mitigation Technologies. Technological Forecasting and Social Change, 210, 123836. https://doi.org/10.1016/j.techfore.2024.123836

Ferdous, J., Islam, R., Mahboubi, A., & Islam, M. Z. (2025). A Survey on ML Techniques for Multi-Platform Malware Detection: Securing PC, Mobile Devices, IoT, and Cloud Environments. Sensors, 25(4), 1153. https://doi.org/10.3390/s25041153

Lyu, J., Bai, Y., Xing, Z., Li, X., & Ge, W. (2021). A Character-Level Convolutional Neural Network for Predicting Exploitability of Vulnerability. International Symposium on Theoretical Aspects of Software Engineering (TASE), 119–126. https://doi.ieeecomputersociety.org/10.1109/TASE52547.2021.00014

Published

2025-10-26

How to Cite

Denysiuk, V. (2025). PREDICTING CVE EXPLOITATION BASED ON NVD AND KEV OPEN DATA FOR RISK-ORIENTED PRIORITIZATION. Electronic Professional Scientific Journal «Cybersecurity: Education, Science, Technique», 2(30), 428–444. https://doi.org/10.28925/2663-4023.2025.30.906