User Guide

Clustering

The objective of a Clustering Task is to group similar items based on a set of fields in a dataset, and then create an alert if, based on its current values, an item moves to a different group. The groups are learned during the Training phase, and alerts are created during the Inference phase. The dataset for both the Training and Inference phases is provided by running FortiSIEM reports.
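The train-then-infer idea described above can be illustrated with a minimal Python sketch. It uses scikit-learn's KMeans purely for illustration; the host names, field values, and the way a cluster change is detected are assumptions made for this example, not a description of FortiSIEM internals.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: one row per host (Id field), two numerical fields.
host_ids = ["host-a", "host-b", "host-c", "host-d"]
train_X = np.array([[1.0, 2.0], [1.2, 1.8], [9.0, 9.5], [9.3, 9.1]])

# Training phase: learn the groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_X)
trained = dict(zip(host_ids, model.labels_))

# Inference phase: the same hosts with current values; host-b has drifted.
current_X = np.array([[1.1, 2.1], [9.1, 9.0], [9.2, 9.4], [9.0, 9.2]])
current = dict(zip(host_ids, model.predict(current_X)))

# An alert candidate is any host whose cluster differs from the trained one.
changed = [h for h in host_ids if trained[h] != current[h]]
print(changed)
```

In this sketch only host-b is reported, because its current values place it in the other group.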

Clustering Algorithm for Local Mode

In this mode, the following algorithms can run locally within the FortiSIEM Supervisor/Worker cluster.

  • BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is an unsupervised clustering algorithm for grouping particularly large datasets. It takes a hierarchical approach: it first summarizes a large dataset into smaller, dense regions called Clustering Feature (CF) entries and then clusters the smaller dataset. It only works for numerical entries. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised clustering algorithm that groups together points that are closely packed (points with many nearby neighbors) and discards data points that lie alone in low-density regions (whose nearest neighbors are too far away). The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
  • GMeans: An unsupervised clustering algorithm for grouping data points. It extends K-means by trying to determine the number of clusters automatically, using a Gaussian normality test. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  • KMeans: An unsupervised clustering algorithm that groups data points into a user-specified number of groups (K) so that each data point belongs to one group. It iteratively tries to minimize intra-cluster distance and maximize inter-cluster distance. Note that you need to specify the number of clusters based on your knowledge of the data. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  • Spectral Clustering: An unsupervised graph-based clustering algorithm for grouping data points. It first constructs a similarity graph, then embeds the data points in a lower-dimensional space, and then applies a classical algorithm such as KMeans to partition the data points. The parameters for this algorithm are described here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
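To get a feel for how these algorithms behave, the scikit-learn estimators linked above can be tried on a small synthetic dataset. The following is only a sketch: the data, the parameter values (such as eps=1.0), and the cluster_counts name are assumptions chosen for illustration, and GMeans is omitted because scikit-learn does not ship a GMeans estimator.

```python
from sklearn.cluster import Birch, DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic numerical data with three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

estimators = {
    "BIRCH": Birch(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "SpectralClustering": SpectralClustering(n_clusters=3, random_state=0),
}

cluster_counts = {}
for name, estimator in estimators.items():
    labels = estimator.fit_predict(X)
    # DBSCAN labels noise points as -1; noise is not counted as a cluster.
    cluster_counts[name] = len(set(labels) - {-1})
    print(name, cluster_counts[name])
```

Note that KMeans, BIRCH, and Spectral Clustering are given the number of clusters here, while DBSCAN derives it from the density parameters.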

Running Clustering Local Mode

Step 1: Design

First identify the following items:

  • Fields to use for Clustering: Each field must be a numerical field.
  • Id field: The identity of a row of data, typically the host name or host IP.
  • A FortiSIEM Report to get this data.

The Time field is not recommended for a Clustering task, since an item may belong to different clusters at different points in time.

Requirements

  1. Report must contain
    • an Id field
    • One or more numerical fields to use for clustering
  2. Each field must be present in the report result; otherwise, the whole row will be ignored by the Machine Learning algorithm.
  3. There can be additional columns in the report and they will be ignored by the machine learning algorithm. However, it is recommended to remove unnecessary columns from the dataset to reduce the size of the dataset exchanged between App Server and phAnomaly modules, during Training and Inference.
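The effect of these requirements can be sketched in a few lines of Python. The column names (hostName, avgCpu, avgMem, comment) and the CSV content are hypothetical; the sketch only mimics the stated rules, namely that rows missing a required field are skipped and extra columns are not used.

```python
import csv
import io

# Hypothetical report export: hostName is the Id field,
# avgCpu and avgMem are the fields to use for clustering.
report_csv = """hostName,avgCpu,avgMem,comment
web01,12.5,40.1,ok
web02,,38.0,missing cpu
db01,71.0,88.2,ok
,15.0,41.0,missing id
"""

id_field = "hostName"
cluster_fields = ["avgCpu", "avgMem"]

rows = []
for record in csv.DictReader(io.StringIO(report_csv)):
    needed = [record[id_field]] + [record[f] for f in cluster_fields]
    if any(value.strip() == "" for value in needed):
        continue  # a required field is missing: skip the whole row
    # Keep only the Id field and the numerical clustering fields.
    rows.append((record[id_field], [float(record[f]) for f in cluster_fields]))

print(rows)
```

Only web01 and db01 survive the filter, and the comment column never reaches the clustering input.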

Go to Analytics > Search and run various reports. Once you have the right report, save it in Resources > Machine Learning Jobs.

Step 2: Prepare Data

Prepare the data for training.

  1. Go to Analytics > Machine Learning.
  2. Select the data source in one of three ways:
    1. To prepare data from a Machine Learning Job, choose Import via Jobs and select the Job that has an associated Report and algorithm.
    2. To prepare data from the Report folder, choose Import via Report and select the report from the Resources > Machine Learning Jobs folder.
    3. To prepare data from a CSV file, choose Import Via CSV File and upload the file. In this mode, you can see how the Training algorithm performs, but you cannot schedule the job for Inference, since the data may not be present in FortiSIEM.
  3. For cases 2a and 2b, select the Report Time Range and, for Service Provider deployments, the Organization.
  4. Click Run. The results are displayed in the Machine Learning > Prepare tab.

Step 3: Train

Train the Clustering task using the dataset in Step 2.

  1. Go to Analytics > Machine Learning > Train.
  2. If you chose Import via Jobs, then make sure the Id Field and Fields to use for Clustering are populated correctly.
  3. If you chose Import via Report or Import via CSV File, then
    1. Set Run Mode to Local
    2. Set Task to Clustering
    3. Choose the Algorithm
    4. Check the algorithm parameters. For example, for KMeans, choose the cluster size (the number of clusters) as an initial guess.
    5. Choose the Id Field and Fields to use for Clustering from the report fields.
  4. Choose the Train factor, which should be greater than 70%. For example, a Train factor of 80% means that 80% of the data will be used for Training and the remaining 20% for Testing.
  5. Click Train.
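The Train factor from the steps above can be pictured as a simple split of the prepared rows. This is a minimal sketch under stated assumptions: the function name split_by_train_factor is hypothetical, and whether FortiSIEM shuffles the rows before splitting is not stated in this guide.

```python
import random

def split_by_train_factor(rows, train_factor, seed=0):
    """Shuffle rows and split them into (training, testing) lists."""
    if not 0.0 < train_factor < 1.0:
        raise ValueError("train_factor must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    cut = round(len(shuffled) * train_factor)
    return shuffled[:cut], shuffled[cut:]

# 100 prepared rows with a Train factor of 80%.
rows = list(range(100))
train_rows, test_rows = split_by_train_factor(rows, train_factor=0.8)
print(len(train_rows), len(test_rows))
```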

After you have completed the Training, the results are shown in the Train > Output tab.

Model Quality:

The following metrics show the quality of the clusters found.

  • Calinski-Harabasz Score: The Calinski-Harabasz Score (also known as the Variance Ratio Criterion) is calculated as the ratio of the sum of inter-cluster dispersion to the sum of intra-cluster dispersion for all clusters (where the dispersion is the sum of squared distances). A high Calinski-Harabasz Score means better clustering, since observations in each cluster are closer together (more dense), while the clusters themselves are further away from each other (well separated).
  • Davies-Bouldin Score: This is calculated as the average similarity of each cluster with the cluster most similar to it. A low Davies-Bouldin Score means clusters are well separated.
  • Silhouette Score: This is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. A Silhouette Score near 1 means clusters are well apart from each other and clearly distinguished. A score near 0 means that clusters overlap, that is, the distance between clusters is not significant. A score near -1 means samples are assigned to the wrong clusters.
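All three metrics are available in scikit-learn, so their behavior can be explored outside FortiSIEM. The sketch below is an illustration only: the synthetic data and the choice of comparing 2 against 6 clusters are assumptions, picked so that one clustering fits the data well and the other over-splits it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with two well-separated groups.
X, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [8, 8]], cluster_std=1.0, random_state=0
)

scores = {}
for k in (2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "silhouette": silhouette_score(X, labels),                # closer to 1 is better
    }
    print(k, scores[k])
```

With this data, the 2-cluster result should score a higher Silhouette Score and a lower Davies-Bouldin Score than the 6-cluster result, which splits each natural group into several pieces.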

The actual and predicted values and the errors are shown in the following ways:

  • Clustering Result table: this shows which entity is in which cluster
  • Clustering Membership > Heatmap View: this shows entity-cluster membership along with the values. Entities in the same cluster should have the same color.

If you want to change the algorithm parameters and re-train, then click Tune & Train, change the parameters and click Save & Train.

Step 4: Schedule

Once the training is complete, you can schedule the job for Inference.

  • The Input Details section shows the Report and the Org chosen for the report. These were already chosen during the Prepare phase and will be used during Inference.
  • Algorithm Setup shows the Machine Learning Algorithm and its parameters. These were already chosen during the Train phase and will be used during Inference.
  • Schedule Setup shows the Job details and schedules:
    • Job Id: Specifies the unique Job Id. If it is a system job, it will be overwritten with a new Job Id when it is saved as a User job. If it is a User job, you have the option to save it as a new User job with a different Job Id or to keep the same Job Id.
    • Job Name: Name of the job. You can overwrite this one. If a job with the same name exists, a date stamp will be appended.
    • Job Description: Description of the job.
    • Inference schedule: The frequency at which the Inference job will be run.
    • Retraining schedule: The frequency at which the model will be retrained. Retraining is expensive, so the frequency should be carefully considered. The recommended retraining interval is at least 7 days.
    • (Retraining) Report Window: The Report time window used during the retraining process. A long time window may cause the report to run slowly, so this should be carefully considered as well. It is recommended to choose the same time window chosen during the Prepare process.
    • Job Group: Shows the folder under Resources > Machine Learning Jobs where this job will be saved.
  • Action on Inference: Specifies the action to be taken when an item's cluster changes during the Inference process.
    • Two choices are available: creating a FortiSIEM Incident or sending an email. Specify the email addresses if you want emails to be sent. Make sure that the email server is specified in Admin > Settings > Email.
    • Check Enabled to ensure that Inference is enabled.

Finally, click Save to save the job to the database. If it is a system job, a new User job will be created. If it is a User job, you have the option to save it as a new User job with a different Job Id or to overwrite the current job.

Clustering Algorithms for Local Auto Mode

In this mode, FortiSIEM picks the best algorithm from the following:

  • BIRCH
  • DBSCAN
  • KMeans

Note: The Max Run Time parameter limits the amount of time this job runs. By default, it is set to 10 minutes. The longer the job runs, the better the result may be.
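The idea of choosing the best algorithm within a time budget can be sketched as follows. This is not FortiSIEM's selection logic, which this guide does not describe: the pick_best function, the candidate list, and the use of the Silhouette Score as the ranking criterion are all assumptions made for illustration.

```python
import time

from sklearn.cluster import Birch, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def pick_best(X, candidates, max_run_time_s):
    """Return (name, score) of the best candidate tried before the deadline."""
    deadline = time.monotonic() + max_run_time_s
    best_name, best_score = None, float("-inf")
    for name, estimator in candidates:
        if time.monotonic() >= deadline:
            break  # time budget exhausted: keep the best result so far
        labels = estimator.fit_predict(X)
        if len(set(labels)) < 2:
            continue  # the Silhouette Score needs at least two distinct labels
        score = silhouette_score(X, labels)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

X, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [6, 6], [-6, 6]], cluster_std=0.6, random_state=0
)
candidates = [
    ("KMeans k=2", KMeans(n_clusters=2, n_init=10, random_state=0)),
    ("KMeans k=3", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("BIRCH k=3", Birch(n_clusters=3)),
    ("DBSCAN eps=1.0", DBSCAN(eps=1.0, min_samples=5)),
]

# 600 seconds corresponds to the default Max Run Time of 10 minutes.
best_name, best_score = pick_best(X, candidates, max_run_time_s=600)
print(best_name, round(best_score, 3))
```

A larger budget lets more candidates (or more parameter combinations) be tried, which is why a longer run can yield a better result.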

Running Clustering Local Auto Mode

To run, follow the steps in Running Clustering Local Mode, but in Step 3, substep 3a, set Run Mode to Local Auto.

Clustering Algorithm for AWS Mode

In this mode, the following algorithm runs in AWS.

  • KMeans

Running Clustering AWS Mode

Step 0: Set Up AWS

Set up AWS SageMaker by following the instructions in Set Up AWS SageMaker.

Configure AWS in FortiSIEM by following the instructions in Configure FortiSIEM to use AWS SageMaker.

Step 1: Design

First identify the following items:

  • Fields to use for Clustering: Each field must be a numerical field.
  • Id field: The identity of a row of data, typically the host name or host IP.
  • A FortiSIEM Report to get this data.

The Time field is not recommended for a Clustering task, since an item may belong to different clusters at different points in time.

Requirements

  1. Report must contain
    • an Id field
    • One or more numerical fields to use for clustering
  2. Each field must be present in the report result; otherwise, the whole row will be ignored by the Machine Learning algorithm.
  3. There can be additional columns in the report and they will be ignored by the machine learning algorithm. However, it is recommended to remove unnecessary columns from the dataset to reduce the size of the dataset exchanged between App Server and phAnomaly modules, during Training and Inference.

Go to Analytics > Search and run various reports. Once you have the right report, save it in Resources > Machine Learning Jobs.

Step 2: Prepare Data

Prepare the data for training.

  1. Go to Analytics > Machine Learning.
  2. Select the data source in one of three ways:
    1. To prepare data from a Machine Learning Job, choose Import via Jobs and select the Job that has an associated Report and algorithm.
    2. To prepare data from the Report folder, choose Import via Report and select the report from the Resources > Machine Learning Jobs folder.
    3. To prepare data from a CSV file, choose Import Via CSV File and upload the file. In this mode, you can see how the Training algorithm performs, but you cannot schedule the job for Inference, since the data may not be present in FortiSIEM.
  3. For cases 2a and 2b, select the Report Time Range and, for Service Provider deployments, the Organization.
  4. Click Run. The results are displayed in the Machine Learning > Prepare tab.

Step 3: Train

Train the Clustering task using the dataset in Step 2.

  1. Go to Analytics > Machine Learning > Train.
  2. If you chose Import via Jobs, then make sure the Id Field and Fields to use for Clustering are populated correctly.
  3. If you chose Import via Report or Import via CSV File, then
    1. Set Run Mode to AWS
    2. Set Task to Clustering
    3. Choose the Algorithm
    4. Check the algorithm parameters. For example, for KMeans, choose the cluster size (the number of clusters) as an initial guess.
    5. Choose the Id Field and Fields to use for Clustering from the report fields.
  4. Choose the Train factor, which should be greater than 70%. For example, a Train factor of 80% means that 80% of the data will be used for Training and the remaining 20% for Testing.
  5. Click Train.

After you have completed the Training, the results are shown in the Train > Output tab.

Model Quality:

The following metrics show the quality of the clusters found.

  • Calinski-Harabasz Score: The Calinski-Harabasz Score (also known as the Variance Ratio Criterion) is calculated as the ratio of the sum of inter-cluster dispersion to the sum of intra-cluster dispersion for all clusters (where the dispersion is the sum of squared distances). A high Calinski-Harabasz Score means better clustering, since observations in each cluster are closer together (more dense), while the clusters themselves are further away from each other (well separated).
  • Davies-Bouldin Score: This is calculated as the average similarity of each cluster with the cluster most similar to it. A low Davies-Bouldin Score means clusters are well separated.
  • Silhouette Score: This is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. A Silhouette Score near 1 means clusters are well apart from each other and clearly distinguished. A score near 0 means that clusters overlap, that is, the distance between clusters is not significant. A score near -1 means samples are assigned to the wrong clusters.
  • 'msd' (Mean Squared Distance): This metric is the average of the squared distances between each data point and its assigned cluster center. The mean squared distance is used to assess the overall quality of the clustering, where lower values indicate better clustering performance.
  • 'ssd' (Sum of Squared Distances): This metric is the sum of the squared distances between each data point and its assigned cluster center. It is also known as the within-cluster sum of squares (WCSS) and is commonly used to evaluate the compactness of the clusters.
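The two distance metrics follow directly from their definitions, as the short sketch below shows. The points, labels, and cluster centers are made-up values, and the helper names are hypothetical; how and on which portion of the data the AWS job evaluates these metrics is not covered here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ssd_and_msd(points, labels, centers):
    """Return (ssd, msd) for points assigned to the given cluster centers."""
    ssd = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return ssd, ssd / len(points)

# Four points in two clusters; every point is exactly 1 unit from its center.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centers = {0: (1.0, 0.0), 1: (10.0, 11.0)}

ssd, msd = ssd_and_msd(points, labels, centers)
print(ssd, msd)  # 4.0 1.0
```

Because ssd grows with the number of data points while msd is an average, msd is the easier of the two to compare across datasets of different sizes.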

The actual and predicted values and the errors are shown in the following ways:

  • Clustering Result table: this shows which entity is in which cluster
  • Clustering Membership > Heatmap View: this shows entity-cluster membership along with the values. Entities in the same cluster should have the same color.

If you want to change the algorithm parameters and re-train, then click Tune & Train, change the parameters and click Save & Train.

Step 4: Schedule

Once the training is complete, you can schedule the job for Inference.

  • The Input Details section shows the Report and the Org chosen for the report. These were already chosen during the Prepare phase and will be used during Inference.
  • Algorithm Setup shows the Machine Learning Algorithm and its parameters. These were already chosen during the Train phase and will be used during Inference.
  • Schedule Setup shows the Job details and schedules:
    • Job Id: Specifies the unique Job Id. If it is a system job, it will be overwritten with a new Job Id when it is saved as a User job. If it is a User job, you have the option to save it as a new User job with a different Job Id or to keep the same Job Id.
    • Job Name: Name of the job. You can overwrite this one. If a job with the same name exists, a date stamp will be appended.
    • Job Description: Description of the job.
    • Inference schedule: The frequency at which the Inference job will be run.
    • Retraining schedule: The frequency at which the model will be retrained. Retraining is expensive, so the frequency should be carefully considered. The recommended retraining interval is at least 7 days.
    • (Retraining) Report Window: The Report time window used during the retraining process. A long time window may cause the report to run slowly, so this should be carefully considered as well. It is recommended to choose the same time window chosen during the Prepare process.
    • Job Group: Shows the folder under Resources > Machine Learning Jobs where this job will be saved.
  • Action on Inference: Specifies the action to be taken when an item's cluster changes during the Inference process.
    • Two choices are available: creating a FortiSIEM Incident or sending an email. Specify the email addresses if you want emails to be sent. Make sure that the email server is specified in Admin > Settings > Email.
    • Check Enabled to ensure that Inference is enabled.

Finally, click Save to save the job to the database. If it is a system job, a new User job will be created. If it is a User job, you have the option to save it as a new User job with a different Job Id or to overwrite the current job.
