Antonio Carević¹, Mario Dudjak²

1. Josip Juraj Strossmayer University of Osijek, Faculty of Electrical Engineering, Computer Science and Information Technology Osijek, Kneza Trpimira 2B, 31000 Osijek, Croatia
2. Josip Juraj Strossmayer University of Osijek, Faculty of Electrical Engineering, Computer Science and Information Technology Osijek, Kneza Trpimira 2B, 31000 Osijek, Croatia

Abstract: This paper defines a learning flow for training classification algorithms on datasets that describe various types of malicious attacks and specifies the individual procedures within that flow. Applying the flow produced classification models that detect malicious attacks with high quality, which confirms the flow's applicability in cybersecurity, especially in intrusion detection systems. Five machine learning algorithms were used: Naive Bayes, k-Nearest Neighbors, Decision Tree, Random Forest, and Logistic Regression. Feature selection used filters based on the Pearson correlation coefficient, mutual information, and the ANOVA F-value, as well as the sequential forward selection (SFS) wrapper. Imbalanced datasets were handled with random oversampling and undersampling. The Decision Tree algorithm achieved the best results, with an F1 score of 1.0 on most datasets, while Naive Bayes performed significantly worse, with F1 scores ranging from 0.12 to 0.98. Feature selection generally improved performance, most notably with the SFS wrapper. Of the two procedures for reducing data imbalance, random oversampling consistently improved the performance of all algorithms, whereas undersampling significantly reduced performance for some algorithms, with F1 score drops of up to 0.22. The proposed learning flow enables systematic evaluation of how different data preprocessing methods and classification algorithms affect detection performance, thereby contributing to a better understanding of malicious attack detection in imbalanced and heterogeneous datasets, and it can serve as a basis for developing more effective cybersecurity defense systems in real-world environments.

Keywords: classification, malicious attacks, imbalanced dataset, feature selection, machine learning
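
The resampling procedures and the F1 metric named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names, the toy data, the fixed seed, the binary "attack = 1" labelling, and the assumption that undersampling is also done at random are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so all classes match the smallest (assumed random)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 8 benign rows (label 0) and 2 attack rows (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Oversampling keeps every original row and only adds duplicates, while undersampling discards majority-class rows, which is one plausible reason it can cost performance. In practice, resampling is usually applied to the training split only, so that evaluation data keep their original class distribution.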


This work is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.

Received: 29.05.2025.

Accepted: 23.07.2025.
