Application of Machine Learning in Web Attack Detection

I. Background

In essence, any machine learning problem can be reduced to finding a suitable transformation function. In speech recognition, the goal is a function that maps one-dimensional audio signals into a semantic space. In image recognition, the goal is a function that maps two-dimensional images into a decision space. In face recognition, the goal is a function that maps two-dimensional face images into a feature space in which each face can be matched to a unique identity.

Historically, web application attack detection has relied primarily on blacklist rules. Web application firewalls (WAF), intrusion detection systems (IDS), and similar security tools depend on detection engines built around regular expressions. Although this approach resists most attacks, it has several limitations:

  1. Rule Base Maintenance Difficulties: When work is handed over between staff, rules written by previous authors are hard to understand, which makes the rule base difficult to maintain and update.
  2. Overly Broad or Overly Narrow Rules: Rules that are too broad produce false positives, while rules that are too narrow are easily bypassed.
  3. Regular-Expression Engine Performance Impact: A large number of regular expressions can significantly slow the regular-expression engine and cause traffic to back up, such as the backlog we experienced with Kafka flows.
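
The second limitation can be illustrated with a minimal Python sketch. Both rules and both sample strings are hypothetical, written only for this illustration and not taken from any real rule base. The broad rule flags a harmless value, while the narrow rule misses a lightly obfuscated attack.

```python
import re

# Hypothetical blacklist rules, for illustration only.
BROAD_RULE = re.compile(r"select", re.IGNORECASE)  # too broad: matches ordinary English
NARROW_RULE = re.compile(r"union select")          # too narrow: exact lowercase phrase only


def matches(rule: re.Pattern, value: str) -> bool:
    """Return True if the rule flags the parameter value as malicious."""
    return rule.search(value) is not None


benign = "please select a size"                   # harmless user input
attack = "1 UNION/**/SELECT password FROM users"  # SQL injection with comment padding

print(matches(BROAD_RULE, benign))   # True  -> false positive
print(matches(NARROW_RULE, attack))  # False -> the attack bypasses the rule
```

Tightening the broad rule or loosening the narrow one fixes these two samples but tends to move the problem elsewhere, which is why such rule bases grow and become harder to maintain.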

II. Introduction of Malicious Attack Detection System Architecture

To address these challenges, we recently introduced a malicious attack detection system architecture that incorporates machine learning. The first edition of the architecture, shown in Figure 1, consists of a whitelist filter that removes more than 97% of traffic as normal flows and sends the remaining 3% or so through the regular-expression rule engine. If a flow is judged black (malicious), it is sent to Hulk, an automated vulnerability verification system.

Figure 1: Attack Detection System Architecture (First Edition)

The system has since undergone several improvements. The most significant change is the addition of a Spark machine learning engine in front of the regular-expression rule engine. The machine learning engine uses the Spark MLlib library to build a model and predict attacks. If it identifies a flow as malicious, the flow is sent to the regular-expression rule engine for a second check. If the flow is still identified as malicious, it is sent to the Hulk vulnerability verification system.

Figure 2: Attack Detection System Architecture (Latest Version)
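
The order of the stages in the latest version can be sketched in Python as follows. This is a minimal illustration, not our production code: the function name, the stage labels, and the three toy stand-in checks are all hypothetical. It also assumes that flows the machine learning engine judges normal are not forwarded to the rule engine.

```python
import re
from typing import Callable

Check = Callable[[str], bool]


def classify_flow(flow: str, is_whitelisted: Check,
                  ml_is_malicious: Check, rules_match: Check) -> str:
    """Return a (hypothetical) label for the stage at which the flow leaves the pipeline."""
    if is_whitelisted(flow):        # whitelist filter removes known-normal traffic
        return "dropped_by_whitelist"
    if not ml_is_malicious(flow):   # assumption: ML-negative flows stop here
        return "passed_by_ml"
    if not rules_match(flow):       # second check by the regular-expression rules
        return "passed_by_rules"
    return "sent_to_hulk"           # still malicious: verify the vulnerability


# Toy stand-ins for the three engines, for illustration only.
def toy_whitelist(flow: str) -> bool:
    return flow == "/index.html"


def toy_ml(flow: str) -> bool:
    return "eval(" in flow or "../" in flow


def toy_rules(flow: str) -> bool:
    return re.search(r"eval\s*\(", flow) is not None


for flow in ["/index.html", "q=hello", "file=../notes.txt", "cmd=eval(x)"]:
    print(flow, "->", classify_flow(flow, toy_whitelist, toy_ml, toy_rules))
```

Because each stage is passed in as a function, the same skeleton covers the first edition as well: supplying an ML check that always returns True sends every non-whitelisted flow straight to the rules.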

The benefits of this architecture include:

  1. Faster Processing Speed: The machine learning engine filters out most traffic, reducing the load on the regular-expression engine and preventing backlogs.
  2. Improved Accuracy: The machine learning engine can identify attacks that the regular-expression engine does not detect, allowing for more accurate threat detection.
  3. Reduced False Positives: The machine learning engine can help identify false positives, reducing the number of flows wrongly flagged as malicious that reach the regular-expression engine.

III. Machine Learning in Web Attack Detection

The application of machine learning to web attack detection involves the following four steps:

  1. Definition of the Target Problem: Identify the core objective of the problem, which in this case is binary classification (predicting whether a flow is normal or an attack).
  2. Collecting Data and Engineering Features: Collect labeled data and extract relevant features from the data.
  3. Model Training and Evaluation: Train and evaluate the machine learning model using the collected data and features.
  4. Continuous Optimization: Continuously optimize the model and update it with new data.

IV. Definition of the Target Problem

The core objective is binary classification: the model should predict whether a flow is normal or an attack. The target is a false negative rate of less than 10%, meaning that fewer than 10% of real attacks may be classified as normal.
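
The false negative rate is the number of missed attacks divided by the total number of real attacks, FN / (FN + TP). The sketch below computes it for a small made-up set of labels. The function name and the sample labels are illustrative assumptions, with 1 meaning attack and 0 meaning normal.

```python
def false_negative_rate(y_true, y_pred):
    """Fraction of real attacks (label 1) that the model predicted as normal (label 0)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed attacks
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # detected attacks
    if fn + tp == 0:
        raise ValueError("y_true contains no attack samples")
    return fn / (fn + tp)


# Made-up labels: 20 real attacks followed by 5 normal flows; one attack is missed.
y_true = [1] * 20 + [0] * 5
y_pred = [1] * 19 + [0] + [0] * 5

fnr = false_negative_rate(y_true, y_pred)
print(f"false negative rate: {fnr:.2%}")  # 1 missed out of 20 attacks
print("meets target:", fnr < 0.10)
```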

V. Collecting Data and Engineering Features

We collected labeled data from ES logs and extracted relevant features from the data. The features include:

  • Eval Feature: The presence of the “eval” function in the URL.
  • ../ and Other Characters: The presence of “../” and other characters in the URL.
  • Punctuation: The presence of punctuation marks in the URL.

We ignored the URI path and used only the parameter values as the source of features. For example, in the statement “登录首页, world?!” (“login home page, world?!”), we extracted the feature “eval” and ignored the URI.
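
A minimal sketch of this kind of feature extraction, using only the Python standard library, might look like the following. The function name, the feature names, and the sample URL are hypothetical. Checking for the substring "eval(" is a simplifying assumption about how the presence of the “eval” function is detected.

```python
import string
from urllib.parse import parse_qsl, urlsplit


def extract_features(url: str) -> dict:
    """Build features from the query parameter values only; the URI path is ignored."""
    query = urlsplit(url).query
    # parse_qsl percent-decodes the values; the parameter names are discarded.
    values = [value for _, value in parse_qsl(query, keep_blank_values=True)]
    text = " ".join(values)
    return {
        "has_eval": int("eval(" in text.lower()),  # assumption: simple substring check
        "has_dot_dot_slash": int("../" in text),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
    }


# Hypothetical request: the path is ignored, only the two parameter values are inspected.
url = "http://example.com/page?cmd=eval(base64_decode('x'))&file=../../etc/passwd"
print(extract_features(url))
```

Each dictionary can then be turned into a numeric vector, with a fixed feature order, before it is passed to the model.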

VI. Model Training and Evaluation

We used the Spark MLlib library to train and evaluate the machine learning model. The model was trained on a dataset of 10,000 labeled flows and evaluated on a separate test set.
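
The train-then-evaluate flow can be sketched without Spark. The text above does not say which MLlib algorithm was used, so the example below substitutes a tiny pure-Python logistic regression as a stand-in; it is not the MLlib API. The feature vectors are made-up [has_eval, has_dot_dot_slash] pairs and all names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x) -> int:
    """Return 1 (attack) if the predicted probability is at least 0.5, else 0 (normal)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Made-up training data: [has_eval, has_dot_dot_slash] -> label (1 = attack).
X_train = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_train = [1, 1, 1, 0]

# Separate held-out test set, also made up.
X_test = [[1, 0], [0, 0], [1, 1]]
y_test = [1, 0, 1]

w, b = train_logistic(X_train, y_train)
predictions = [predict(w, b, x) for x in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("predictions:", predictions, "accuracy:", accuracy)
```

The important point is the separation of the data: the model is fitted only on the training set, and the reported metric comes only from flows it has not seen during training.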

VII. Continuous Optimization

We continuously optimized the model and updated it with new data. The model was improved by adding new features and updating the existing ones.

VIII. Benefits of Machine Learning in Web Attack Detection

The benefits of machine learning in web attack detection include:

  • Faster Processing Speed: The machine learning engine filters out most traffic, reducing the load on the regular-expression engine and preventing backlogs.
  • Improved Accuracy: The machine learning engine can identify attacks that the regular-expression engine does not detect, allowing for more accurate threat detection.
  • Reduced False Positives: The machine learning engine can help identify false positives, reducing the number of flows wrongly flagged as malicious that reach the regular-expression engine.

IX. Conclusion

Applying machine learning to web attack detection has improved both the accuracy and the speed of threat detection. By filtering out most traffic before it reaches the regular-expression engine, the machine learning engine reduces that engine's load and prevents backlogs. It also identifies attacks that the regular-expression rules do not detect and helps reduce false positives.