IMPACT OF OPTIMIZATION OF THE CSE–CIC–IDS2018 DATASET ON THE EFFICIENCY OF THE HYBRID STACKING MODEL FOR NETWORK INTRUSION DETECTION
DOI:
https://doi.org/10.28925/2663-4023.2025.30.963
Keywords:
cybersecurity, threats, network intrusion detection, CSE–CIC–IDS2018, SMOTE, Min-Max normalization, principal component analysis (PCA), stacking, hybrid model, machine learning.
Abstract
This paper presents a comparative analysis of a hybrid stacking model for network intrusion detection, focusing on how its performance indicators change before and after a comprehensive preprocessing procedure is applied to the CSE–CIC–IDS2018 dataset. The proposed data preparation approach combines three components: the SMOTE algorithm, which balances classes by generating synthetic samples of minority attack classes; Min-Max normalization, which scales every feature to the range [0, 1] so that each parameter contributes uniformly to training; and Principal Component Analysis (PCA), which reduces the dimensionality of the data while preserving most of its variance. To verify the results, a large-scale series of experiments was conducted, covering the training and testing of baseline machine learning algorithms as well as ten distinct configurations of the hybrid metaclassifier based on a stacking ensemble. The experiments show that this optimization of the input data allows the hybrid model to overcome overfitting to majority classes and improves its detection quality: accuracy increased by 3.87% and the F1-measure by 5.11%. The most important result for practical use is a 76.0% reduction in prediction time, which lowers the computational barriers to integrating complex ensemble methods into high-load intrusion detection systems operating in real time. The integration of SMOTE, Min-Max normalization, and PCA is therefore identified as a key prerequisite for building next-generation intrusion detection systems that are resilient to cyber threats and can effectively detect anomalies under high network traffic intensity.
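The first two preprocessing steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy feature values, the function names `min_max_scale` and `smote`, the neighbour count `k=2`, and the choice to scale before oversampling are all assumptions made for illustration. In practice one would more likely use library implementations such as imbalanced-learn's `SMOTE` and scikit-learn's `MinMaxScaler`, `PCA`, and `StackingClassifier`.

```python
import random


def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical two-feature flows: six benign samples, three attack samples.
benign = [[10.0, 200.0], [12.0, 220.0], [11.0, 210.0],
          [13.0, 260.0], [9.0, 190.0], [14.0, 240.0]]
attack = [[90.0, 900.0], [95.0, 1000.0], [85.0, 950.0]]

# Assumed order: normalize first, so that SMOTE distances are comparable
# across features, then oversample the minority class.
scaled = min_max_scale(benign + attack)
scaled_benign = scaled[:len(benign)]
scaled_attack = scaled[len(benign):]

synthetic_attack = smote(scaled_attack, n_new=len(benign) - len(attack))
balanced_attack = scaled_attack + synthetic_attack

print(len(scaled_benign), len(balanced_attack))
```

Each synthetic sample lies on a line segment between two real minority samples, so the balanced attack class stays inside the region already occupied by observed attacks rather than duplicating rows. PCA and the stacking ensemble are omitted here to keep the sketch free of external dependencies.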
License
Copyright (c) 2025 Дмитро Гамза

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.