Regression
The objective of a Regression task is to build a model that predicts a Target field based on other fields in a dataset, and then to create an alert if a new value of the Target field exceeds the predicted value by a user-specified threshold. The prediction model is built during the Training phase, and alerts are created during the Inference phase. The dataset for both the Training and Inference phases is provided by running FortiSIEM reports.
- Regression Algorithms for Local Mode
- Running Regression Local Mode
- Regression Algorithms for Local Auto Mode
- Running Regression Local Auto Mode
- Regression Algorithms for AWS Mode
- Running Regression AWS Mode
- Regression Algorithms for AWS Auto Mode
- Running Regression AWS Auto Mode
Regression Algorithms for Local Mode
In this mode, the following Machine Learning algorithms can run locally within the FortiSIEM Supervisor/Worker cluster:
- Decision Tree Regressor: A supervised regression algorithm that uses a Decision Tree to predict a Target variable based on Feature variables. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- Linear Regression: A supervised regression algorithm that predicts a Target variable based on a linear combination of Feature variables. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- Random Forest Regressor: A supervised regression algorithm that uses Ensemble learning and Bootstrapping techniques to improve the accuracy of Decision Tree based prediction algorithms. Ensemble learning combines multiple models; Bootstrapping trains each model on a random sample of the dataset, and the results of all models are then averaged to improve accuracy. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- SGDRegressor: A supervised regression algorithm that uses the Stochastic Gradient Descent (SGD) method to minimize a user-specified loss function for a linear regression model. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
- SVR: Support Vector Regression (SVR) is a supervised regression algorithm that finds a function that fits the data to within a margin of error. Errors inside the margin are ignored, and errors outside the margin are penalized linearly rather than quadratically, so SVR is more robust to outliers than most other regression methods. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
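The effect of a margin-based error function on outliers can be illustrated with a small sketch. It compares the standard epsilon-insensitive loss used by SVR with the squared loss used by least-squares regression. The sample values and the `epsilon` setting are made up for illustration and do not come from FortiSIEM.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Errors within +/- epsilon cost nothing; larger errors grow linearly."""
    return max(0.0, abs(actual - predicted) - epsilon)


def squared_loss(actual, predicted):
    """Squared error, as minimized by ordinary least-squares regression."""
    return (actual - predicted) ** 2


# Hypothetical (actual, predicted) pairs; the last pair is an outlier.
pairs = [(10.0, 10.05), (12.0, 11.5), (11.0, 21.0)]

for actual, predicted in pairs:
    eps = epsilon_insensitive_loss(actual, predicted)
    sq = squared_loss(actual, predicted)
    print(f"actual={actual} predicted={predicted} eps_loss={eps:.2f} sq_loss={sq:.2f}")
```

For the outlier, the squared loss is roughly ten times larger than the epsilon-insensitive loss, which is why a single bad point pulls a least-squares fit much harder than it pulls an SVR fit.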
Running Regression Local Mode
Step 1: Design
First identify the following items:
- Field to Predict: Must be a numerical field.
- Fields to use for Prediction: Each field must be a numerical field.
- A FortiSIEM Report to get this data.
There must be several samples of the data. This can be accomplished by choosing one of the following time attributes as a report column:
- Event Receive Time
- Event Receive Hour
- Event Receive Date
Requirements
- Report must contain
- A time attribute
- One numerical Field to Predict
- One or more numerical Fields to use for Prediction
- Each field must be present in the report result; else the whole row will be ignored by the Machine Learning algorithm.
- There can be additional columns in the report and they will be ignored by the machine learning algorithm. However, it is recommended to remove unnecessary columns from the dataset to reduce the size of the dataset exchanged between App Server and phAnomaly modules, during Training and Inference.
Go to Analytics > Search and run various reports. Once you have the right report, save it in Resources > Machine Learning Jobs.
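The report requirements above can be checked on an exported sample before training. The following is a minimal sketch, not FortiSIEM code: the CSV content and the column names (`Sent Bytes`, `Received Bytes`, `Host Name`) are hypothetical, and it only mimics the documented behavior that rows with a missing field are ignored and extra columns are not used.

```python
import csv
import io

# Hypothetical report export; the column names are illustrative only.
REPORT_CSV = """Event Receive Hour,Sent Bytes,Received Bytes,Host Name
2024-01-01 00:00,1200,3400,host-a
2024-01-01 01:00,,3100,host-a
2024-01-01 02:00,1500,3900,host-b
"""

TIME_FIELD = "Event Receive Hour"
FEATURE_FIELDS = ["Sent Bytes"]    # Fields to use for Prediction (assumed)
TARGET_FIELD = "Received Bytes"    # Field to Predict (assumed)


def usable_rows(csv_text):
    """Keep only complete rows, and only the columns the model needs."""
    needed = [TIME_FIELD, *FEATURE_FIELDS, TARGET_FIELD]
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A row with any required field missing or empty is skipped entirely.
        if any(not (row.get(name) or "").strip() for name in needed):
            continue
        cleaned = {TIME_FIELD: row[TIME_FIELD]}
        for name in FEATURE_FIELDS + [TARGET_FIELD]:
            cleaned[name] = float(row[name])
        rows.append(cleaned)
    return rows


rows = usable_rows(REPORT_CSV)
print(len(rows), rows[0])
```

Here the second row is dropped because `Sent Bytes` is empty, and the unused `Host Name` column is removed from every remaining row.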
Step 2: Prepare Data
Prepare the data for training.
- Go to Analytics > Machine Learning.
- Select the data source in one of three ways:
- To prepare data from a Machine Learning Job, choose Import via Jobs and select the Job that has the associated Report and algorithm.
- To prepare data from the Report folder, choose Import via Report and select the report from the Resources > Machine Learning Jobs folder.
- To prepare data from a CSV file, choose Import Via CSV File and upload the file. In this mode, you can see how the Training algorithm performs, but you cannot schedule for inference, since the data may not be present in FortiSIEM.
- If you chose Import via Jobs or Import via Report (Cases 2a and 2b), select the Report Time Range and, for Service Provider deployments, the Organization.
- Click Run. The results are displayed in Machine Learning > Prepare tab.
Note: For FortiSIEM with ClickHouse deployments, Historical mode searches include a Result Filter panel that provides a top and bottom 100 list (click the toggle icons to switch between them) for the attributes that are part of your Display Field search. An Add to Main Filter icon is available to hone search results and identify trends related to the selected attributes with other attributes.
Step 3: Train
Train the Regression task using the dataset in Step 2.
- Go to Analytics > Machine Learning > Train.
- If you chose Import via Jobs, then make sure the Fields to use for Prediction and Fields to Predict are populated correctly.
- If you chose Import via Report or Import Via CSV File, then:
- Set Run Mode to Local
- Set Task to Regression
- Choose the Algorithm
- Choose the Fields to use for Prediction and Fields to Predict from the report fields.
- Choose the Train factor, which should be greater than 70%. A Train factor of 70%, for example, means that 70% of the data is used for Training and the remaining 30% is used for Testing.
- Click Train.
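The Train factor in the steps above can be pictured as a simple split of the prepared rows. This sketch is an assumption about the general idea only: the function name is hypothetical, and this guide does not state whether FortiSIEM shuffles rows before splitting, so the example simply takes the first portion for training.

```python
def split_by_train_factor(rows, train_factor=0.7):
    """Split rows into (training, testing) lists using a fractional Train factor."""
    if not 0.0 < train_factor < 1.0:
        raise ValueError("train_factor must be between 0 and 1")
    # round() avoids floating-point surprises such as 69.999... being truncated.
    cut = round(len(rows) * train_factor)
    return rows[:cut], rows[cut:]


samples = list(range(100))  # stand-in for 100 prepared report rows
train_rows, test_rows = split_by_train_factor(samples, 0.7)
print(len(train_rows), len(test_rows))
```

A larger Train factor leaves fewer rows for testing, so the Model Quality metrics are computed from a smaller sample.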
After you have completed the Training, the results are shown in the Train > Output tab.
Model Quality:
This shows how accurately the algorithm is able to predict the field. The following metrics are calculated:
- Max Error: Maximum of the error between predicted value and actual value over all data points. Lower value means that regression is a better fit.
- R2 score: A statistical measure of how well the regression predictions approximate the real data points. An R2 of 1 indicates that the regression predictions perfectly fit the data. R-square value of 0.8 means that 80% of the variation in the predicted attribute is explained by the feature attributes.
- Mean Absolute Error: Average of Absolute difference between predicted and actual values over all data points. Lower value means that regression is a better fit.
- Mean Squared Error: Average of Square of difference between predicted and actual values over all data points. Lower value means that regression is a better fit.
- Root Mean Squared Error (RMSE): Square root of the average of Square of difference between predicted and actual values over all data points. Lower value means that regression is a better fit. RMSE is affected by the scale of the data. RMSE can be heavily affected by a few predictions which are much worse than the rest.
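The metrics in this list follow standard definitions, and it can help to compute them by hand on a few values. The sketch below uses made-up actual and predicted values; the function name is hypothetical and this is not how FortiSIEM computes the values internally.

```python
import math


def regression_metrics(actual, predicted):
    """Compute the five Model Quality metrics from paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # squared prediction errors
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # spread of the actual values
    mse = ss_res / n
    return {
        "max_error": max(abs(e) for e in errors),
        "r2": 1.0 - ss_res / ss_tot,  # undefined if all actual values are identical
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

With these values, the Max Error is 2.0, the Mean Absolute Error is 1.0, the Mean Squared Error is 1.5, and the R2 score is 0.7. The single error of 2 contributes 4 of the 6 units of squared error, which shows how MSE and RMSE weight the worst predictions most heavily.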
The actual and predicted values and the errors are shown in 3 ways:
- Regression Result table: If the model did well, then the error column should have small values.
- Scatter plot: If the model did well, then this chart must be centered along the y=x line.
- Error histogram: If the model did well, then the chart should be clustered around the x=0 line.
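As a rough illustration of the error histogram, the following sketch groups errors (actual minus predicted) into bins and prints a text chart. The data, the bin width, and the function name are assumptions made for this example.

```python
from collections import Counter


def error_histogram(actual, predicted, bin_width=1.0):
    """Count errors (actual - predicted) per bin; bin 0 is centered on zero error."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        counts[round((a - p) / bin_width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical values; the last prediction is far off.
actual = [10.0, 12.0, 11.0, 13.0, 15.0, 30.0]
predicted = [10.2, 11.9, 11.3, 12.8, 15.1, 22.0]

hist = error_histogram(actual, predicted)
for bin_index, count in hist.items():
    print(f"{bin_index:>3} | {'#' * count}")
```

Five of the six errors fall into bin 0, which is the clustered shape a well-fitting model should produce. The isolated entry in bin 8 is the kind of point worth inspecting in the Regression Result table.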
If you want to change the algorithm parameters and re-train, then click Tune & Train, change the parameters and click Save & Train.
Step 4: Schedule
Once the training is complete, you can schedule the job for Inference.
- Input Details section shows the Report and the Org chosen for the report. These were already chosen during Prepare phase and will be used during Inference.
- Algorithm Setup shows the Machine Learning Algorithm and its parameters. These were already chosen during Train phase and will be used during Inference.
- Schedule Setup shows the Job details and schedules
- Job Id: Specifies the unique Job Id. If it is a system job, it will be overwritten with a new Job Id when it is saved as a User job. If it is a User job, you can either save it as a new User job with a different Job Id or keep the same Job Id.
- Job Name: Name of the job. You can overwrite this name. If a job with the same name already exists, a date stamp is appended.
- Job Description: Description of the job.
- Inference schedule: The frequency at which the Inference job is run.
- Retraining schedule: The frequency at which the model is retrained. Retraining is expensive, so choose this frequency carefully. The recommended retraining interval is at least 7 days.
- (Retraining) Report Window: The Report time window used during the retraining process. A long time window may cause the report to run slowly, so choose it carefully as well. It is recommended to use the same time window that was chosen during the Prepare process.
- Job Group: Shows the folder under Resources > Machine Learning Jobs where this job will be saved.
- Action on Inference: Specifies the action to be taken when an anomaly is found during the Inference process.
- Only one choice is available: Creating an incident when Error is #. Enter a number in the Create an Incident when Error is greater than field.
- Check Enabled to ensure that Inference is enabled.
Finally, click Save to save the job to the database. If it is a system job, a new User job is created. If it is a User job, you can either save it as a new User job with a different Job Id or overwrite the current job.
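The Action on Inference setting can be thought of as a threshold check on each inference row. The sketch below assumes that Error means the actual value minus the predicted value, which matches the description of alerting when the Target field exceeds the prediction. The row layout and function name are hypothetical.

```python
def find_incidents(rows, threshold):
    """Return rows whose actual value exceeds the prediction by more than threshold."""
    incidents = []
    for row in rows:
        error = row["actual"] - row["predicted"]  # assumed definition of Error
        if error > threshold:
            incidents.append({**row, "error": error})
    return incidents


# Hypothetical inference results for three report intervals.
inference_rows = [
    {"time": "10:00", "actual": 105.0, "predicted": 100.0},
    {"time": "11:00", "actual": 180.0, "predicted": 110.0},
    {"time": "12:00", "actual": 60.0, "predicted": 120.0},
]

incidents = find_incidents(inference_rows, threshold=50.0)
print(incidents)
```

Only the 11:00 row is flagged. The 12:00 row is far below its prediction, and under the one-sided assumption used here it does not create an incident.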
Regression Algorithms for Local Auto Mode
In this mode, FortiSIEM automatically chooses the best algorithm with the optimal parameters. The following algorithms are considered:
- Decision Tree Regressor
- Random Forest Regressor
- SGDRegressor
- SVR
Note: The Max Run Time parameter limits the amount of time this job runs. By default, it is set to 5 minutes. The longer the job runs, the better the results it can potentially generate.
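The trade-off behind Max Run Time can be sketched as a search that stops when its time budget is spent. This is only an illustration of the general idea and not FortiSIEM's actual selection procedure. The candidate callables and their scores are placeholders, not measured results.

```python
import time


def pick_best_model(candidates, max_run_time_seconds):
    """Evaluate candidates until the time budget is spent; return the best (name, score)."""
    deadline = time.monotonic() + max_run_time_seconds
    best_name, best_score = None, float("-inf")
    for name, evaluate in candidates:
        if time.monotonic() >= deadline:
            break  # budget exhausted: remaining candidates are never tried
        score = evaluate()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Each callable stands in for "train this algorithm and return a quality score".
candidates = [
    ("Decision Tree Regressor", lambda: 0.81),
    ("Random Forest Regressor", lambda: 0.88),
    ("SGDRegressor", lambda: 0.64),
    ("SVR", lambda: 0.79),
]

best = pick_best_model(candidates, max_run_time_seconds=300)
print(best)
```

With a shorter budget, fewer candidates or parameter settings can be tried, which is why a longer run can potentially produce a better model.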
Running Regression Local Auto Mode
To run, follow the steps in Running Regression Local Mode, but in Step 3 (sub-step 3a), set Run Mode to Local Auto.
Regression Algorithms for AWS Mode
The following algorithm runs in AWS mode.
- Linear Learner Algorithm for Regression: Linear models are supervised learning algorithms used for solving either classification or regression problems. See https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html for more information.
Running Regression AWS Mode
Step 0: Set Up AWS
Set up AWS SageMaker by following the instructions in Set Up AWS SageMaker.
Configure AWS in FortiSIEM by following the instructions in Configure FortiSIEM to use AWS SageMaker.
Step 1: Design
First identify the following items:
- Field to Predict: Must be a numerical field.
- Fields to use for Prediction: Each field must be a numerical field.
- A FortiSIEM Report to get this data.
There must be several samples of the data. This can be accomplished by choosing one of the following time attributes as a report column:
- Event Receive Time
- Event Receive Hour
- Event Receive Date
Requirements
- Report must contain
- A time attribute
- One numerical Field to Predict
- One or more numerical Fields to use for Prediction
- Each field must be present in the report result; else the whole row will be ignored by the Machine Learning algorithm.
- There can be additional columns in the report and they will be ignored by the machine learning algorithm. However, it is recommended to remove unnecessary columns from the dataset to reduce the size of the dataset exchanged between App Server and phAnomaly modules, during Training and Inference.
Go to Analytics > Search and run various reports. Once you have the right report, save it in Resources > Machine Learning Jobs.
Step 2: Prepare Data
Prepare the data for training.
- Go to Analytics > Machine Learning.
- Select the data source in one of three ways:
- To prepare data from a Machine Learning Job, choose Import via Jobs and select the Job that has the associated Report and algorithm.
- To prepare data from the Report folder, choose Import via Report and select the report from the Resources > Machine Learning Jobs folder.
- To prepare data from a CSV file, choose Import Via CSV File and upload the file. In this mode, you can see how the Training algorithm performs, but you cannot schedule for inference, since the data may not be present in FortiSIEM.
- If you chose Import via Jobs or Import via Report (Cases 2a and 2b), select the Report Time Range and, for Service Provider deployments, the Organization.
- Click Run. The results are displayed in Machine Learning > Prepare tab.
Note: For FortiSIEM with ClickHouse deployments, Historical mode searches include a Result Filter panel that provides a top and bottom 100 list (click the toggle icons to switch between them) for the attributes that are part of your Display Field search. An Add to Main Filter icon is available to hone search results and identify trends related to the selected attributes with other attributes.
Step 3: Train
Train the Regression task using the dataset in Step 2.
- Go to Analytics > Machine Learning > Train.
- If you chose Import via Jobs, then make sure the Fields to use for Prediction and Fields to Predict are populated correctly.
- If you chose Import via Report or Import Via CSV File, then:
- Set Run Mode to AWS
- Set Task to Regression
- Choose the Algorithm
- Choose the Fields to use for Prediction and Fields to Predict from the report fields.
- Choose the Train factor, which should be greater than 70%. A Train factor of 70%, for example, means that 70% of the data is used for Training and the remaining 30% is used for Testing.
- Click Train.
After you have completed the Training, the results are shown in the Train > Output tab.
Model Quality:
This shows how accurately the algorithm is able to predict the field. The following metrics are calculated:
- Mean Absolute Error (MAE): Average of Absolute difference between predicted and actual values over all data points. Lower value means that regression is a better fit.
- Mean Squared Error (MSE): The mean square error of the final model on the validation dataset. This objective metric is only valid for regression.
- R2 score: A statistical measure of how well the regression predictions approximate the real data points. An R2 of 1 indicates that the regression predictions perfectly fit the data. R-square value of 0.8 means that 80% of the variation in the predicted attribute is explained by the feature attributes.
- Root Mean Squared Error (RMSE): The root mean square error of the final model on the validation dataset. This objective metric is only valid for regression.
The actual and predicted values and the errors are shown in 3 ways:
- Regression Result table: If the model did well, then the error column should have small values.
- Scatter plot: If the model did well, then this chart must be centered along the y=x line.
- Error histogram: If the model did well, then the chart should be clustered around the x=0 line.
If you want to change the algorithm parameters and re-train, then click Tune & Train, change the parameters and click Save & Train.
Step 4: Schedule
Once the training is complete, you can schedule the job for Inference.
- Input Details section shows the Report and the Org chosen for the report. These were already chosen during Prepare phase and will be used during Inference.
- Algorithm Setup shows the Machine Learning Algorithm and its parameters. These were already chosen during Train phase and will be used during Inference.
- Schedule Setup shows the Job details and schedules
- Job Id: Specifies the unique Job Id. If it is a system job, it will be overwritten with a new Job Id when it is saved as a User job. If it is a User job, you can either save it as a new User job with a different Job Id or keep the same Job Id.
- Job Name: Name of the job. You can overwrite this name. If a job with the same name already exists, a date stamp is appended.
- Job Description: Description of the job.
- Inference schedule: The frequency at which the Inference job is run.
- Retraining schedule: The frequency at which the model is retrained. Retraining is expensive, so choose this frequency carefully. The recommended retraining interval is at least 7 days.
- (Retraining) Report Window: The Report time window used during the retraining process. A long time window may cause the report to run slowly, so choose it carefully as well. It is recommended to use the same time window that was chosen during the Prepare process.
- Job Group: Shows the folder under Resources > Machine Learning Jobs where this job will be saved.
- Action on Inference: Specifies the action to be taken when an anomaly is found during the Inference process.
- Only one choice is available: Creating an incident when Error is #. Enter a number in the Create an Incident when Error is greater than field.
- Check Enabled to ensure that Inference is enabled.
Finally, click Save to save the job to the database. If it is a system job, a new User job is created. If it is a User job, you can either save it as a new User job with a different Job Id or overwrite the current job.
Regression Algorithms for AWS Auto Mode
In this mode, FortiSIEM automatically chooses the best algorithm with the optimal parameters. Depending on the size of your dataset (whether it is greater or smaller than 100MB), algorithms from HPO mode or from ensembling mode are considered. For more information, see https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-model-support-validation.html. The definitions below are taken from the Amazon SageMaker Developer Guide.
HPO mode (Dataset > 100MB)
- Linear learner: A supervised learning algorithm that can solve either classification or regression problems.
- XGBoost: A supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
- Deep learning algorithm: A multilayer perceptron (MLP) and feedforward artificial neural network. This algorithm can handle data that is not linearly separable.
Ensembling mode (Dataset <=100MB)
- LightGBM: An optimized framework that uses tree-based algorithms with gradient boosting. This algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.
- CatBoost: A framework that uses tree-based algorithms with gradient boosting. Optimized for handling categorical variables.
- XGBoost: A framework that uses tree-based algorithms with gradient boosting that grows in depth, rather than breadth.
- Random Forest: A tree-based algorithm that uses several decision trees on random sub-samples of the data with replacement. The trees are split into optimal nodes at each level. The decisions of each tree are averaged together to prevent overfitting and improve predictions.
- Extra Trees: A tree-based algorithm that uses several decision trees on the entire dataset. The trees are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to improve predictions. Extra trees add a degree of randomization in comparison to the random forest algorithm.
- Linear Models: A framework that uses a linear equation to model the relationship between two variables in observed data.
- Neural network torch: A neural network model that is implemented using Pytorch.
- Neural network fast.ai: A neural network model that is implemented using fast.ai.
Note: The Max Run Time parameter limits the amount of time this job runs. By default, it is set to 225 minutes. The longer the job runs, the better the results it can potentially generate.
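The size rule that separates the two modes can be written as a one-line check. This sketch only restates the rule described in this section. The function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
HPO_THRESHOLD_BYTES = 100 * 1024 * 1024


def auto_training_mode(dataset_size_bytes):
    """Return which set of algorithms is considered for a dataset of this size."""
    return "HPO" if dataset_size_bytes > HPO_THRESHOLD_BYTES else "Ensembling"


print(auto_training_mode(20 * 1024 * 1024))   # small dataset
print(auto_training_mode(500 * 1024 * 1024))  # large dataset
```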
Running Regression AWS Auto Mode
To run, follow the steps in Running Regression AWS Mode, but in Step 3 (sub-step 3a), set Run Mode to AWS Auto.
For Step 3, the following model quality information pertains.
Model Quality:
This shows how accurately the algorithm is able to predict the field. The following metrics are calculated:
- MAE: The mean absolute error (MAE) is a measure of how different the predicted and actual values are, when they're averaged over all values. MAE is commonly used in regression analysis to understand model prediction error. If there is linear regression, MAE represents the average distance from a predicted line to the actual value. MAE is defined as the sum of absolute errors divided by the number of observations. Values range from 0 to infinity, with smaller numbers indicating a better model fit to the data.
- RMSE: Root mean squared error (RMSE) measures the square root of the squared difference between predicted and actual values, and is averaged over all values. It is used in regression analysis to understand model prediction error. It's an important metric to indicate the presence of large model errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets of different sizes.
- MSE: The mean squared error (MSE) is the average of the squared differences between the predicted and actual values. It is used for regression. MSE values are always positive. The better a model is at predicting the actual values, the smaller the MSE value is.
- R2: R2, also known as the coefficient of determination, is used in regression to quantify how much a model can explain the variance of a dependent variable. Values range from one (1) to negative one (-1). Higher numbers indicate a higher fraction of explained variability. R2 values close to zero (0) indicate that very little of the dependent variable can be explained by the model. Negative values indicate a poor fit and that the model is outperformed by a constant function. For linear regression, this is a horizontal line.
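The meaning of positive, zero, and negative R2 values can be seen on three tiny sets of predictions. The sketch below uses the standard definition of R2 with made-up values. The function name is hypothetical.

```python
def r2_score(actual, predicted):
    """R2 = 1 - (sum of squared errors) / (total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]

perfect = r2_score(actual, [1.0, 2.0, 3.0])    # predictions match exactly
constant = r2_score(actual, [2.0, 2.0, 2.0])   # always predicts the mean
worse = r2_score(actual, [2.0, 3.0, 4.0])      # consistently off by one

print(perfect, constant, worse)
```

The perfect predictions score 1.0, the constant mean prediction scores 0.0, and the shifted predictions score -0.5, which means they are outperformed by simply predicting the mean every time.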