TY - JOUR
T1 - Human-guided auto-labeling for network traffic data
T2 - The GELM approach
AU - Kim, Meejoung
AU - Lee, Inkyu
N1 - Funding Information:
This research was partially supported by the Mid-career Research Program through the NRF Grant funded by the Ministry of Science and ICT, Korean government (NRF-2019R1A2C1002706).
Funding Information:
The authors appreciate the support for the datasets used in this paper: CAIDA’s Internet Traces provided by the National Science Foundation, the US Department of Homeland Security, and CAIDA members, MIT Lab, and Hochschule Coburg.
Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/8
Y1 - 2022/8
N2 - Data labeling is crucial in various areas, including network security, and is a prerequisite for applying statistics-based classification and supervised learning techniques. Therefore, developing labeling methods that ensure good performance is important. We propose a human-guided auto-labeling algorithm that incorporates the concept of self-supervised learning, with the aim of labeling data quickly, accurately, and consistently. It consists of three processes: auto-labeling, validation, and update. In the auto-labeling process, a labeling scheme based on weighted features is proposed, while in the validation process the generalized extreme learning machine (GELM), which enables fast training, is applied to validate the assigned labels. In the update process, two different approaches to labeling new data are considered in order to investigate labeling speed and accuracy. We conduct experiments to verify the suitability and accuracy of the algorithm for network traffic, applying it to five traffic datasets, some of which include distributed denial of service (DDoS), DoS, BruteForce, and PortScan attacks. Numerical results show that the algorithm labels unlabeled datasets quickly, accurately, and consistently, and that the GELM's learning speed enables data to be labeled in real time. They also show that performance with auto-labels and with conventional labels is nearly identical on datasets containing only DDoS attacks, which implies that the algorithm is well suited to such datasets. However, the performance differences between the two kinds of labels are not negligible on datasets that include various attacks. Several possible reasons, which require further investigation, include the selected features and the reliability of the conventional labels. Even with this limitation of the current study, the algorithm will provide a criterion for the real-time labeling of data arising in many areas.
KW - Attack prediction
KW - Auto-labeling process
KW - Generalized extreme learning machine
KW - Human-guided labeling
KW - Moore–Penrose generalized inverse
KW - Network traffic
UR - http://www.scopus.com/inward/record.url?scp=85131431671&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2022.05.007
DO - 10.1016/j.neunet.2022.05.007
M3 - Article
C2 - 35660547
AN - SCOPUS:85131431671
SN - 0893-6080
VL - 152
SP - 510
EP - 526
JO - Neural Networks
JF - Neural Networks
ER -