Training and maintaining the Bayesian databases
Bayesian scanning uses databases to determine if an email is spam. For Bayesian scanning to be effective, the databases must be trained with known-spam and known-good email messages so the scanner can learn the differences between the two types of email. To maintain its effectiveness, false positives and false negatives must be sent to the FortiMail unit so the Bayesian scanner can learn from its mistakes.
Be aware that, without ongoing training, Bayesian scanning becomes significantly less effective over time. For this reason, Fortinet does not recommend enabling the Bayesian scanning feature.
The Security > Bayesian submenu lets you manage the databases used to store statistical information for Bayesian antispam processing, and to configure the email addresses used for remote control and training of the Bayesian databases.
To use a Bayesian database, you must enable the Bayesian scan in the antispam profile. For more information, see Managing antispam profiles.
This section contains the following topics:
- Types of Bayesian databases
- Training the Bayesian databases
- Example: Bayesian training
- Backing up, batch training, and monitoring the Bayesian databases
- Configuring the Bayesian training control accounts
Types of Bayesian databases
FortiMail units have two types of Bayesian databases:
- Global
- Group
Both types contain Bayesian statistical data that can be used by Bayesian scans to detect spam, and should be trained in order to be most accurate for detecting spam within their respective scopes. For more information on training each type of Bayesian database, see Training the Bayesian databases.
Only one Bayesian database is used by any individual Bayesian scan; which type will be used depends on the directionality of the email and your configuration of the FortiMail unit’s protected domains and antispam profiles. For information, see Use global Bayesian database.
Global
The global Bayesian database is a single database that contains Bayesian statistics that can be used to detect spam for any email user.
Outgoing antispam profiles can use only the global Bayesian database. Incoming antispam profiles can use global or domain Bayesian databases.
If all spam sent to all protected domains has similar characteristics and you do not require your Bayesian scans to be tailored specifically to the email of a protected domain, using the global database for all Bayesian scanning may be an ideal choice, because there is only one database to train and maintain.
To use the global database for email that does not already require it (that is, incoming email), you must disable use of the per-domain Bayesian databases. For information on configuring protected domains to use the global Bayesian database, see Use global Bayesian database.
Group
Group Bayesian databases, also known as per-domain Bayesian databases, contain Bayesian statistics that can be used to detect spam for email users in a specific protected domain. FortiMail units can have multiple group Bayesian databases: one for each protected domain.
If you require Bayesian scans to be tailored specifically to the email received by each protected domain, using per-domain Bayesian databases may provide greater accuracy and fewer false positives.
For example, medical terms are a common characteristic of many spam messages. However, those terms may be a poor indicator of spam if the protected domain belongs to a hospital. In this case, you may want to train a separate, per-domain Bayesian database in which medical terms are not statistically likely to indicate spam.
If you want to use a per-domain database, you must disable use of the global Bayesian database. For information on disabling use of the global Bayesian database for a protected domain, see Use global Bayesian database.
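The selection rules described above can be summarized in a short Python sketch. This is only an illustration of the documented behavior, not FortiMail code: the `ProtectedDomain` class, the `select_database` function, and the returned labels are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class ProtectedDomain:
    """Hypothetical stand-in for a protected domain's configuration."""
    name: str
    # Mirrors the "Use global Bayesian database" option of the protected domain.
    use_global_bayesian_database: bool


def select_database(direction: str, domain: ProtectedDomain) -> str:
    """Return a label for the single database a Bayesian scan would consult."""
    if direction == "outgoing":
        # Outgoing antispam profiles can use only the global database.
        return "global"
    if direction == "incoming":
        # Incoming profiles use global or per-domain, depending on the domain setting.
        if domain.use_global_bayesian_database:
            return "global"
        return f"group:{domain.name}"
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, `select_database("incoming", ProtectedDomain("example.net", False))` yields the per-domain label for example.net, while any outgoing scan yields the global label.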
Training the Bayesian databases
Bayesian scans analyze the words (or “tokens”) in the message header and message body of an email to determine the probability that it is spam. For every token, the FortiMail unit calculates the probability that the email is spam based on the percentage of times that the word has previously been associated with spam or non-spam email. If a Bayesian database has not yet been trained, the Bayesian scan does not yet know the spam or non-spam association of many tokens, and does not have enough information to determine the statistical likelihood of an email being spam. By training a Bayesian database to recognize words that are and are not likely to be associated with spam, Bayesian scans become increasingly accurate.
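To make the idea concrete, the following Python sketch shows a generic token-based Bayesian scorer. It is a textbook-style illustration and an assumption throughout: the actual tokenization, smoothing, and combining rules used by the FortiMail unit are not described here, and `ToyBayesDatabase` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated tokens (a simplifying assumption)."""
    return re.findall(r"\S+", text.lower())


class ToyBayesDatabase:
    """Counts how many spam and non-spam training messages contained each token."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(tokenize(text)))  # count each token once per message

    def token_spam_probability(self, token):
        s, h = self.spam[token], self.ham[token]
        # Add-one smoothing: a token never seen in training scores a neutral 0.5.
        return (s + 1) / (s + h + 2)

    def spam_probability(self, text):
        # Combine per-token probabilities as summed log-odds (naive Bayes style).
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_spam_probability(token)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))
```

In this sketch an untrained database scores every message at exactly 0.5, which mirrors why an untrained database cannot yet separate spam from legitimate email.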
However, spammers are constantly trying to invent new ways to defeat antispam filters. In one technique commonly used in an attempt to avoid antispam filters, spammers alter words commonly identified as characteristic of spam, inserting symbols such as periods ( . ), or using nonstandard but human-readable spellings, such as substituting Å, Ç, Ë, or Í for A, C, E, or I. These altered words are technically different tokens to a Bayesian database, so mature Bayesian databases may require some ongoing training to recognize new spam tokens.
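The effect of such alterations can be sketched in Python. The tokenizer below (lowercase, split on whitespace) is an assumption made for illustration; how the FortiMail unit actually splits text into tokens is not stated here, and `unseen_tokens` is a hypothetical helper.

```python
import re


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative assumption)."""
    return set(re.findall(r"\S+", text.lower()))


def unseen_tokens(text, known_tokens):
    """Return the tokens in text that the trained set has never encountered."""
    return tokenize(text) - known_tokens


# Tokens a database might already associate with spam.
known_spam_tokens = tokenize("cheap pills")

for variant in ("cheap pills", "c.h.e.a.p p.i.l.l.s", "Çheap pÍlls"):
    # Altered spellings produce tokens with no existing spam association.
    print(variant, "->", sorted(unseen_tokens(variant, known_spam_tokens)))
```

Only the unaltered phrase matches known tokens; the altered variants are entirely new to the database until it is trained on them.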
You generally will not want to enable Bayesian scans until you have performed initial training of your Bayesian databases, as using untrained Bayesian databases can increase your rate of spam false positives and false negatives.
To initially train the Bayesian databases
- Train the global database by uploading mailbox (.mbox) files. For details, see Backing up, batch training, and monitoring the Bayesian databases.
By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training the global database ensures that outgoing antispam profiles in which you have enabled Bayesian scanning, and incoming antispam profiles for protected domains that you have configured to use the global database, can recognize spam.
If you have configured the FortiMail unit for email archiving, you can make mailbox files from archived email and spam. For details, see Managing archived email.
You can leave the global database untrained if both these conditions are true:
- no outgoing antispam profile has Bayesian scanning enabled
- no protected domain is configured to use the global Bayesian database
- Train the per-domain databases by uploading mailbox (.mbox) files. For details, see Backing up, batch training, and monitoring the Bayesian databases.
By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training per-domain databases ensures that incoming antispam profiles for protected domains that you have configured to use the per-domain database can recognize spam.
You can leave a per-domain database untrained if either of these conditions is true:
- the protected domain is configured to use the global Bayesian database
- no incoming antispam profiles exist for the protected domain
Before instructing email users to train the Bayesian databases, verify that you have enabled the FortiMail unit to accept training messages. If you have not enabled the “Accept training messages from users” option in the antispam profile for policies which match training messages, the training messages will be discarded without notification to the sender, and no training will occur.
FortiMail units apply training messages to either the global or per-domain Bayesian database, whichever is enabled for the sender’s protected domain.
Example: Bayesian training
In this example, Company X has set up a FortiMail unit to protect its email server. With over 1,000 email users, Company X plans to enable Bayesian scanning for incoming email. You, the system administrator, have been asked to configure Bayesian scanning, perform initial training of the Bayesian databases, and configure Bayesian control email addresses for ongoing training.
The local domain name of the FortiMail unit itself is example.com.
Company X has email users in two existing protected domains:
- example.net
- example.org
Each protected domain receives email with slightly different terminology, which could be considered spam to the other protected domain, and so will use separate per-domain Bayesian databases.
To facilitate initial training of each per-domain Bayesian database, you have used your email client software to collect samples of spam and non-spam email from each protected domain, and exported them into mailbox files:
- example-net-spam.mbox
- example-net-not-spam.mbox
- example-org-spam.mbox
- example-org-not-spam.mbox
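If your email client cannot export mailbox files directly, files in mbox format can also be produced with a script. The following is a minimal sketch using Python's standard `mailbox` module; the `write_mbox` helper, the default recipient, and the sample values are hypothetical, and real training samples should be actual spam and non-spam messages rather than synthetic ones.

```python
import mailbox
from email.message import EmailMessage


def write_mbox(path, samples, recipient="user@example.net"):
    """Append (sender, subject, body) samples to an mbox file; return its message count."""
    box = mailbox.mbox(path)  # creates the file if it does not exist yet
    try:
        for sender, subject, body in samples:
            msg = EmailMessage()
            msg["From"] = sender
            msg["To"] = recipient
            msg["Subject"] = subject
            msg.set_content(body)
            box.add(msg)
        box.flush()  # write pending messages to disk
        return len(box)
    finally:
        box.close()
```

For instance, calling `write_mbox("example-net-spam.mbox", samples)` with a list of collected spam samples would produce a file named like the first one in the list above.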
After initial training, email users will use the default Bayesian control email addresses to perform any required ongoing training for each of their per-domain Bayesian databases.
To enable use of per-domain Bayesian databases
- Go to Domain & User > Domain > Domain.
- Select the row corresponding to example.net and click Edit.
- Click the arrow to expand Advanced Setting and click Other.
- Disable Use global bayesian database.
- Click OK.
Repeat the above steps for the protected domain example.org.
To initially train each per-domain Bayesian database using mailbox files
- Go to Security > Bayesian > Domain.
- From Select a domain, select a domain. This example uses example.net and example.org.
- In the Operations area, click Train group Bayesian database with email samples.
A dialog appears.
- In Clean emails, click Browse and locate example-net-not-spam.mbox.
- In Spam emails, click Browse and locate example-net-spam.mbox.
- Click OK.
Repeat the above steps for the protected domain example.org and its sample Bayesian database files.
To enable Bayesian scanning
- Go to Profile > AntiSpam > AntiSpam.
- In the row corresponding to an antispam profile that is selected in a policy that matches recipients in the protected domain example.net, click Edit.
- Enable Bayesian.
- Click the arrow to expand Bayesian.
- Enable the option Accept training messages from users.
- Click OK.
Repeat the above steps for all incoming antispam profiles that are selected in policies that match recipients in the protected domain example.org.
To perform ongoing training of each per-domain Bayesian database
- Notify email users that they can train the Bayesian database for their protected domain by sending them an email similar to the following:
This procedure assumes the default Bayesian control email addresses. To configure the Bayesian control email addresses, go to Security > Bayesian > Control Account.
All employees,
We have enabled a new email system feature that can be trained to recognize the differences between spam and legitimate email. You can help to train this feature. This message describes how to train our email system.
If you have old email messages and spam...
Forward the old spam to learn-is-spam@example.com from your company email account.
Forward any old email messages that are not spam to learn-is-not-spam@example.com from your company email account.
If you receive any new spam, or if a legitimate email is mistakenly classified as spam...
Forward spam that was not recognized to is-spam@example.com from your company email account.
Forward legitimate email that was incorrectly classified as spam to is-not-spam@example.com from your company email account.
- default-grp@example.net
- default-grp@example.org
For example, when forwarding a training message from the sender (From:) email address default-grp@example.net, the FortiMail unit will apply the training message to the per-domain Bayesian database of example.net.
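The four default control addresses used in the sample message can be summarized as a small lookup, sketched here in Python. The local parts and the example.com local domain come from this example; the `control_address` helper and the `"initial"`/`"correction"` labels are hypothetical names introduced for illustration.

```python
# Default control address local parts, keyed by (training stage, message type).
CONTROL_LOCAL_PARTS = {
    ("initial", "spam"): "learn-is-spam",          # old spam
    ("initial", "not-spam"): "learn-is-not-spam",  # old legitimate email
    ("correction", "spam"): "is-spam",             # spam that was not recognized
    ("correction", "not-spam"): "is-not-spam",     # legitimate email classified as spam
}


def control_address(stage, kind, local_domain="example.com"):
    """Build the control address from a local part and the unit's local domain name."""
    return f"{CONTROL_LOCAL_PARTS[(stage, kind)]}@{local_domain}"
```

If you change the control addresses on the Control Account tab, the local parts in the table would need to be adjusted to match.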
See also
Training the Bayesian databases
Backing up, batch training, and monitoring the Bayesian databases
Configuring the Bayesian training control accounts
Configuring global quarantine report settings
Backing up, batch training, and monitoring the Bayesian databases
You can train, back up, restore, and reset the global and per-domain Bayesian databases. You can also view a summary of the number of email messages that have been used to train each Bayesian database.
You can alternatively train Bayesian databases by forwarding spam and non-spam email to Bayesian control email addresses. For more information, see Training the Bayesian databases.
You can alternatively back up, restore, and reset all Bayesian databases at once. For more information, see Backup and restore.
Domain administrators cannot access the global Bayesian settings.
For details, see About administrator account permissions and domains.
To individually train, view and manage Bayesian databases
- Go to Security > Bayesian > Domain.
- Select the type of the Bayesian database:
- For the global Bayesian database, from Select a domain, select System. For more information, see Use global Bayesian database.
- For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.
The Summary area displays the total number of email messages that the Bayesian database has learned as spam or not spam.
- To train a Bayesian database using mailbox files
- To back up a Bayesian database
- To restore a Bayesian database
- To reset a Bayesian database
To train a Bayesian database using mailbox files
Uploading mailbox files trains a Bayesian database with many email messages at once, which is especially useful for initial training of the Bayesian database until it reaches maturity. Because this method appends to the Bayesian database rather than overwriting, you may also perform this procedure periodically with new samples of spam and non-spam email for batch maintenance training.
If you have configured the FortiMail unit for email archiving, you can make mailbox files from archived email and spam. For details, see Managing archived email.
- Go to Security > Bayesian > Domain.
- Select the type of the Bayesian database that you want to train.
- For the global Bayesian database, from Select a domain, select System.
- For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.
- Train global Bayesian database with mbox files
- Train group Bayesian database with mbox files
A pop-up window appears enabling you to specify which mailbox files to upload.
For best results, the mailbox file should contain a representative sample of spam for the specific FortiMail unit, protected domain, or email user.
Your management computer uploads the file to the FortiMail unit to train the database, and the pop-up window closes. Time required varies by the size of the file and the speed of your network connection. To update the training summary display in the Summary area with the new number of learned spam and non-spam messages, refresh the page by selecting the tab.
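Before uploading, it can be useful to know how many messages a mailbox file contains, so that you can compare that number with the change shown in the Summary area afterwards. The following Python sketch counts messages with the standard `mailbox` module; `count_messages` is a hypothetical helper, and whether the Summary totals increase by exactly this number (for example, if some messages cannot be parsed) is an assumption.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in an existing mbox file."""
    # create=False raises mailbox.NoSuchMailboxError instead of creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```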
To back up a Bayesian database
- Go to Security > Bayesian > Domain.
- Select the type of the Bayesian database that you want to back up.
- For the global Bayesian database, from Select a domain, select System.
- For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.
- Backup global Bayesian database
- Backup group Bayesian database
A pop-up window appears enabling you to download the database backup file.
The Bayesian database backup file is downloaded to your management computer. Time required varies by the size of the file and the speed of your network connection.
To restore a Bayesian database
Back up the Bayesian database before beginning this procedure. Restoring a Bayesian database replaces all training data stored in the database. For more information on backing up Bayesian database files, see To back up a Bayesian database or Backup and restore.
- Go to Security > Bayesian > Domain.
- Select the type of the Bayesian database that you want to restore.
- For the global Bayesian database, from Select a domain, select System.
- For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.
- Restore global Bayesian database
- Restore group Bayesian database
A pop-up window appears enabling you to upload a database backup file.
The Bayesian database backup file is uploaded from your management computer, and a success message appears. Time required varies by the size of the file and the speed of your network connection.
If a database operation error message appears, you can attempt to repair database errors. For more information, see Backup and restore.
To reset a Bayesian database
Back up the Bayesian database before beginning this procedure. Resetting a Bayesian database deletes all training data stored in the database. For more information on backing up Bayesian database files, see To back up a Bayesian database or Backup and restore.
- Go to Security > Bayesian > Domain.
- Select the type of the Bayesian database that you want to reset.
- For the global Bayesian database, from Select a domain, select System.
- For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.
- Reset global Bayesian database
- Reset group Bayesian database
A pop-up window appears asking for confirmation.
A status message notifies you that the FortiMail unit has emptied the contents of the Bayesian database.
See also
Training the Bayesian databases
Configuring the Bayesian training control accounts
Configuring the Bayesian training control accounts
The Control Account tab lets you configure the email addresses used for remote training of the Bayesian databases.
To train the Bayesian databases through email, email users and FortiMail administrators forward spam and non-spam email (also called training messages) to the appropriate Bayesian control email address. Bayesian control email addresses consist of the user name portion (also known as the local-part) of the email address configured on this tab and the local domain name of the FortiMail unit. For example, if the local domain name of the FortiMail unit is example.com, you might forward spam to learn-is-spam@example.com.
If the FortiMail unit is configured to accept training messages, it will use the email to train one or more Bayesian databases. To accept a training message:
- The training message must match a recipient-based policy.
- The matching recipient-based policy must specify use of an antispam profile in which the “Accept training messages from users” option is enabled. For more information, see Accept training messages from users.
If either of these conditions is not met, the FortiMail unit will silently discard the training message without using it for training.
If these conditions are both met, the FortiMail unit accepts the training message and examines the user name portion and domain name portion of the sender address. Depending on whether the sender’s protected domain is configured to use the global or per-domain Bayesian database (see Use global Bayesian database), that database will be trained.
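The acceptance and routing logic described above can be sketched in Python as follows. This is an illustration of the documented rules only: `ProtectedDomain`, `route_training_message`, and the returned labels are hypothetical names, and the handling of a sender whose domain is not a protected domain is not covered in this section, so the sketch raises an error for that case as an explicit assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProtectedDomain:
    """Hypothetical stand-in for a protected domain's configuration."""
    name: str
    use_global_bayesian_database: bool  # the "Use global Bayesian database" option


def route_training_message(
    sender: str,
    matches_recipient_policy: bool,
    profile_accepts_training: bool,
    domains: Dict[str, ProtectedDomain],
) -> Optional[str]:
    """Return the database label to train, or None if the message is silently discarded."""
    # Both conditions must hold, otherwise the message is discarded without notice.
    if not (matches_recipient_policy and profile_accepts_training):
        return None
    # The database is chosen from the domain portion of the sender address.
    domain_name = sender.rsplit("@", 1)[-1].lower()
    domain = domains.get(domain_name)
    if domain is None:
        # Assumption: behavior for non-protected sender domains is not documented here.
        raise LookupError(f"{domain_name} is not a protected domain")
    if domain.use_global_bayesian_database:
        return "global"
    return f"group:{domain.name}"
```

With example.net configured to use its per-domain database, a training message from default-grp@example.net would be routed to that domain's database, provided both acceptance conditions are met.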
To configure the Bayesian control email addresses, go to Security > Bayesian > Control Account.
See also
Training the Bayesian databases
Backing up, batch training, and monitoring the Bayesian databases