Training and maintaining the Bayesian databases

Bayesian scanning uses databases to determine if an email is spam. For Bayesian scanning to be effective, the databases must be trained with known-spam and known-good email messages so the scanner can learn the differences between the two types of email. To maintain the scanner’s effectiveness, false positives and false negatives must be sent to the FortiMail unit so that the scanner can learn from its mistakes.

Be aware that, without ongoing training, Bayesian scanning becomes significantly less effective over time. For this reason, Fortinet does not recommend enabling the Bayesian scanning feature.

The Security > Bayesian submenu lets you manage the databases used to store statistical information for Bayesian antispam processing, and configure the email addresses used for remote control and training of the Bayesian databases.

To use a Bayesian database, you must enable the Bayesian scan in the antispam profile. For more information, see Managing antispam profiles.

This section contains the following topics:

Types of Bayesian databases
Training the Bayesian databases
Example: Bayesian training
Backing up, batch training, and monitoring the Bayesian databases
Configuring the Bayesian training control accounts

Types of Bayesian databases

FortiMail units have two types of Bayesian databases:

Global
Group

Both types contain Bayesian statistical data that can be used by Bayesian scans to detect spam, and should be trained in order to be most accurate for detecting spam within their respective scopes. For more information on training each type of Bayesian database, see Training the Bayesian databases.

Only one Bayesian database is used by any individual Bayesian scan; which type will be used depends on the directionality of the email and your configuration of the FortiMail unit’s protected domains and antispam profiles. For information, see Use global Bayesian database.

Global

The global Bayesian database is a single database that contains Bayesian statistics that can be used to detect spam for any email user.

Outgoing antispam profiles can use only the global Bayesian database. Incoming antispam profiles can use either the global or a per-domain Bayesian database.

If all spam sent to all protected domains has similar characteristics and you do not require your Bayesian scans to be tailored specifically to the email of a protected domain, using the global database for all Bayesian scanning may be an ideal choice, because there is only one database to train and maintain.

Email that is not required to use the global database can still use it. To do so, you must disable use of the per-domain Bayesian databases. For information on configuring protected domains to use the global Bayesian database, see Use global Bayesian database.

Group

Group Bayesian databases, also known as per-domain Bayesian databases, contain Bayesian statistics that can be used to detect spam for email users in a specific protected domain. FortiMail units can have multiple group Bayesian databases: one for each protected domain.

If you require Bayesian scans to be tailored specifically to the email received by each protected domain, using per-domain Bayesian databases may provide greater accuracy and fewer false positives.

For example, medical terms are a common characteristic of many spam messages. However, those terms may be a poor indicator of spam if the protected domain belongs to a hospital. In this case, you may want to train a separate, per-domain Bayesian database in which medical terms are not statistically likely to indicate spam.

If you want to use a per-domain database, you must disable use of the global Bayesian database. For information on disabling use of the global Bayesian database for a protected domain, see Use global Bayesian database.
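The selection rules described in this section can be restated as a short sketch. The function name, parameter names, and returned labels below are illustrative only and are not part of any FortiMail interface; the logic simply mirrors the rules above: outgoing antispam profiles use only the global database, and incoming email follows the protected domain's Use global Bayesian database setting.

```python
def select_bayesian_database(direction: str, use_global: bool, domain: str) -> str:
    """Return a label for the single database a Bayesian scan would consult.

    All names here are hypothetical; this only restates the documented rules.
    """
    if direction not in ("incoming", "outgoing"):
        raise ValueError("direction must be 'incoming' or 'outgoing'")
    # Outgoing antispam profiles can use only the global database.
    if direction == "outgoing":
        return "global"
    # Incoming email follows the protected domain's setting.
    if use_global:
        return "global"
    return f"group:{domain}"


print(select_bayesian_database("incoming", False, "example.net"))
```

Only one label is ever returned, which reflects the point that any individual Bayesian scan uses exactly one database.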

Training the Bayesian databases

Bayesian scans analyze the words (or “tokens”) in the message header and message body of an email to determine the probability that it is spam. For every token, the FortiMail unit calculates the probability that the email is spam based on the percentage of times that the word has previously been associated with spam or non-spam email. If a Bayesian database has not yet been trained, the Bayesian scan does not yet know the spam or non-spam association of many tokens, and does not have enough information to determine the statistical likelihood of an email being spam. By training a Bayesian database to recognize words that are and are not likely to be associated with spam, Bayesian scans become increasingly accurate.
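The general idea can be illustrated with a toy naive Bayes classifier. This is a minimal sketch under stated assumptions, not FortiMail's actual algorithm or database format: tokens are split on whitespace, each token's probability is derived from how often it appeared in spam versus non-spam training messages, unknown tokens are treated as neutral (0.5), and per-token probabilities are combined in log-odds form. The class and method names are hypothetical.

```python
import math
from collections import Counter


class ToyBayesDatabase:
    """Toy token-statistics store; not FortiMail's actual database format."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> spam messages containing it
        self.ham_tokens = Counter()   # token -> non-spam messages containing it
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text, is_spam):
        tokens = set(text.split())  # count each token once per message
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def token_probability(self, token):
        spam_rate = self.spam_tokens[token] / max(self.spam_messages, 1)
        ham_rate = self.ham_tokens[token] / max(self.ham_messages, 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # never seen: no evidence either way
        p = spam_rate / (spam_rate + ham_rate)
        return min(max(p, 0.01), 0.99)  # clamp so one token is never decisive

    def spam_probability(self, text):
        log_odds = 0.0
        for token in set(text.split()):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(min(log_odds, 700.0), -700.0)  # keep math.exp in range
        return 1.0 / (1.0 + math.exp(-log_odds))


db = ToyBayesDatabase()
print(db.spam_probability("cheap pills"))  # untrained: every token is neutral
db.train("cheap pills now", is_spam=True)
db.train("meeting agenda attached", is_spam=False)
print(db.spam_probability("cheap pills") > 0.9)
```

Before training, every token is unknown and the score stays at the neutral midpoint, which is the situation the paragraph above describes for an untrained database.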

However, spammers are constantly trying to invent new ways to defeat antispam filters. In one technique commonly used in an attempt to avoid antispam filters, spammers alter words commonly identified as characteristic of spam, inserting symbols such as periods ( . ), or using nonstandard but human-readable spellings, such as substituting Å, Ç, Ë, or Í for A, C, E, or I. These altered words are technically different tokens to a Bayesian database, so mature Bayesian databases may require some ongoing training to recognize new spam tokens.
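A small sketch shows why altered spellings evade an existing database: with exact token matching, each variant is a token the database has never seen. Exact, case-sensitive matching is an assumption made for this illustration; how FortiMail normalizes tokens is not described here.

```python
def unseen_variants(known_tokens, candidates):
    """Return the candidate words that do not exactly match a known token."""
    return [word for word in candidates if word not in known_tokens]


known_spam_tokens = {"cheap", "pills"}
candidates = ["pills", "p.i.l.l.s", "pÍlls"]
print(unseen_variants(known_spam_tokens, candidates))
```

The two altered spellings are reported as unseen, so they carry no spam association until training messages containing them are submitted.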

You generally will not want to enable Bayesian scans until you have performed initial training of your Bayesian databases, as using untrained Bayesian databases can increase your rate of spam false positives and false negatives.

To initially train the Bayesian databases

Train the global database by uploading mailbox (.mbox) files. For details, see Backing up, batch training, and monitoring the Bayesian databases.

By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training the global database ensures that outgoing antispam profiles in which you have enabled Bayesian scanning, and incoming antispam profiles for protected domains that you have configured to use the global database, can recognize spam.

If you have configured the FortiMail unit for email archiving, you can make mailbox files from archived email and spam. For details, see Managing archived email.

You can leave the global database untrained if both these conditions are true:

no outgoing antispam profile has Bayesian scanning enabled
no protected domain is configured to use the global Bayesian database

Train the per-domain databases by uploading mailbox (.mbox) files. For details, see Backing up, batch training, and monitoring the Bayesian databases.

By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training per-domain databases ensures that incoming antispam profiles for protected domains that you have configured to use the per-domain database can recognize spam.

You can leave a per-domain database untrained if either of these conditions is true:

the protected domain is configured to use the global Bayesian database
no incoming antispam profiles exist for the protected domain

If you have enabled incoming antispam profiles to train Bayesian databases when the FortiMail unit receives training messages, and have selected those antispam profiles in recipient-based policies that match training messages, instruct FortiMail administrators and email users to forward sample spam and non-spam email to the Bayesian control email addresses. For more information, see Configuring the Bayesian training control accounts, Accept training messages from users, and Training Bayesian databases.

Before instructing email users to train the Bayesian databases, verify that you have enabled the FortiMail unit to accept training messages. If you have not enabled the “Accept training messages from users” option in the antispam profile for policies which match training messages, the training messages will be discarded without notification to the sender, and no training will occur.

FortiMail units apply training messages to either the global or per-domain Bayesian database, whichever is enabled for the sender’s protected domain.
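The conditions in this procedure under which a database can be left untrained can be expressed as two small checks. This is only a restatement of the rules above; the function and parameter names are hypothetical, and the counts would come from your own review of the configuration.

```python
def global_needs_training(outgoing_profiles_with_bayesian: int,
                          domains_using_global: int) -> bool:
    """The global database can stay untrained only if BOTH counts are zero."""
    return outgoing_profiles_with_bayesian > 0 or domains_using_global > 0


def domain_needs_training(uses_global_database: bool,
                          incoming_profiles_for_domain: int) -> bool:
    """A per-domain database can stay untrained if EITHER condition holds:
    the domain uses the global database, or it has no incoming profiles."""
    return (not uses_global_database) and incoming_profiles_for_domain > 0


print(global_needs_training(0, 0))
print(domain_needs_training(False, 2))
```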

Example: Bayesian training

In this example, Company X has set up a FortiMail unit to protect its email server. With over 1,000 email users, Company X plans to enable Bayesian scanning for incoming email. You, the system administrator, have been asked to configure Bayesian scanning, perform initial training of the Bayesian databases, and configure Bayesian control email addresses for ongoing training.

The local domain name of the FortiMail unit itself is example.com.

Company X has email users in two existing protected domains:

example.net
example.org

Each protected domain receives email with slightly different terminology, which could be considered spam to the other protected domain, and so will use separate per-domain Bayesian databases.

To facilitate initial training of each per-domain Bayesian database, you have used your email client software to collect samples of spam and non-spam email from each protected domain, and exported them into mailbox files:

example-net-spam.mbox
example-net-not-spam.mbox
example-org-spam.mbox
example-org-not-spam.mbox
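If your email client cannot export mailbox files directly, a file like those listed above can be assembled with Python's standard `mailbox` module. This is a sketch under the assumption that the sample messages have been saved as individual `.eml` files in one directory; the function name and directory layout are hypothetical.

```python
import mailbox
from email import message_from_bytes
from pathlib import Path


def build_mbox(eml_dir, mbox_path):
    """Append every .eml file in eml_dir to an mbox file; return the count added."""
    box = mailbox.mbox(str(mbox_path))  # created if it does not exist
    count = 0
    try:
        box.lock()
        for eml in sorted(Path(eml_dir).glob("*.eml")):
            box.add(message_from_bytes(eml.read_bytes()))
            count += 1
        box.flush()
    finally:
        box.close()  # also releases the lock if it was acquired
    return count
```

For example, `build_mbox("samples/example-net/spam", "example-net-spam.mbox")` would produce one of the four files used in this example, assuming that directory holds the saved spam samples.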

After initial training, email users will use the default Bayesian control email addresses to perform any required ongoing training for each of their per-domain Bayesian databases.

To enable use of per-domain Bayesian databases

Go to Domain & User > Domain > Domain.
Select the row corresponding to example.net and click Edit.
Click the arrow to expand Advanced Setting and click Other.
Disable Use global bayesian database.
Click OK.

Repeat the above steps for the protected domain example.org.

To initially train each per-domain Bayesian database using mailbox files

Go to Security > Bayesian > Domain.
From Select a domain, select a domain.

This example uses example.net and example.org.

In the Operations area, click Train group Bayesian database with email samples.

A dialog appears.

In Clean emails, click Browse and locate example-net-not-spam.mbox.
In Spam emails, click Browse and locate example-net-spam.mbox.
Click OK.

Repeat the above steps for the protected domain example.org and its sample Bayesian database files.

To enable Bayesian scanning

Go to Profile > AntiSpam > AntiSpam.
In the row corresponding to an antispam profile that is selected in a policy that matches recipients in the protected domain example.net, click Edit.
Enable Bayesian.
Click the arrow to expand Bayesian.
Enable the option Accept training messages from users.
Click OK.

Repeat the above steps for all incoming antispam profiles that are selected in policies that match recipients in the protected domain example.org.

To perform ongoing training of each per-domain Bayesian database

This procedure assumes the default Bayesian control email addresses. To configure the Bayesian control email addresses, go to Security > Bayesian > Control Account.

Notify email users that they can train the Bayesian database for their protected domain by sending them an email similar to the following:

All employees,

We have enabled a new email system feature that can be trained to recognize the differences between spam and legitimate email. You can help to train this feature. This message describes how to train our email system.

If you have old email messages and spam...

Forward the old spam to learn-is-spam@example.com from your company email account.
Forward any old email messages that are not spam to learn-is-not-spam@example.com from your company email account.

If you receive any new spam, or if a legitimate email is mistakenly classified as spam...

Forward spam that was not recognized to is-spam@example.com from your company email account.
Forward legitimate email that was incorrectly classified as spam to is-not-spam@example.com from your company email account.

Notify other FortiMail administrators that they can train the per-domain Bayesian databases for those protected domains by forwarding email to the Bayesian control accounts, described in the previous step. To do so, they must configure their email client software with the following sender addresses:

default-grp@example.net
default-grp@example.org

For example, when an administrator forwards a training message from the sender (From:) email address default-grp@example.net, the FortiMail unit applies the training message to the per-domain Bayesian database of example.net.
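The mapping from an administrator's sender address to a per-domain database can be sketched as follows. This is an illustration only: it assumes the default training group user name `default-grp` and case-insensitive matching, neither of which is guaranteed for your configuration, and the function name is hypothetical.

```python
from email.utils import parseaddr


def training_group_domain(from_header, training_group="default-grp"):
    """Return the domain whose database a training-group sender targets,
    or None if the sender is not a training-group address."""
    _, address = parseaddr(from_header)
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign or local_part.lower() != training_group:
        return None
    return domain.lower()


print(training_group_domain("Mail Admin <default-grp@example.net>"))
```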

See also

Training the Bayesian databases

Types of Bayesian databases

Backing up, batch training, and monitoring the Bayesian databases

Configuring the Bayesian training control accounts

Configuring global quarantine report settings

Backing up, batch training, and monitoring the Bayesian databases

You can train, back up, restore, and reset the global and per-domain Bayesian databases. You can also view a summary of the number of email messages that have been used to train each Bayesian database.

You can alternatively train Bayesian databases by forwarding spam and non-spam email to Bayesian control email addresses. For more information, see Training the Bayesian databases.

You can alternatively back up, restore, and reset all Bayesian databases at once. For more information, see Backup and restore.

To access this part of the web UI, your administrator account’s access profile must have Read or Read-Write permission to the Policy category.

Domain administrators cannot access the global Bayesian settings.

For details, see About administrator account permissions and domains.

To individually train, view and manage Bayesian databases

Go to Security > Bayesian > Domain.
Select the type of the Bayesian database:

For the global Bayesian database, from Select a domain, select System. For more information, see Use global Bayesian database.
For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.

The Summary area displays the total number of email messages that the Bayesian database has learned as spam or not spam.

For either type of Bayesian database, select an operation:

To train a Bayesian database using mailbox files
To back up a Bayesian database
To restore a Bayesian database
To reset a Bayesian database

To train a Bayesian database using mailbox files

Uploading mailbox files trains a Bayesian database with many email messages at once, which is especially useful for initial training of the Bayesian database until it reaches maturity. Because this method appends to the Bayesian database rather than overwriting, you may also perform this procedure periodically with new samples of spam and non-spam email for batch maintenance training.

If you have configured the FortiMail unit for email archiving, you can make mailbox files from archived email and spam. For details, see Managing archived email.

Go to Security > Bayesian > Domain.
Select the type of the Bayesian database that you want to train.

For the global Bayesian database, from Select a domain, select System.
For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.

In the Operation area, click the link appropriate to the type that you selected in the previous step, either:

Train global Bayesian database with mbox files
Train group Bayesian database with mbox files

A pop-up window appears enabling you to specify which mailbox files to upload.

In the Innocent mailbox field, click Browse, then select a mailbox file containing email that is not spam.

In the Spam mailbox field, click Browse, then select a mailbox file containing email that is spam.

For best results, the mailbox file should contain a representative sample of spam for the specific FortiMail unit, protected domain, or email user.

Click OK.

Your management computer uploads the file to the FortiMail unit to train the database, and the pop-up window closes. Time required varies by the size of the file and the speed of your network connection. To update the training summary display in the Summary area with the new number of learned spam and non-spam messages, refresh the page by selecting the tab.

To back up a Bayesian database

Go to Security > Bayesian > Domain.
Select the type of the Bayesian database that you want to back up.

For the global Bayesian database, from Select a domain, select System.
For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.

In the Operation area, click the link appropriate to the type that you selected in the previous step, either:

Backup global Bayesian database
Backup group Bayesian database

A pop-up window appears enabling you to download the database backup file.

Select a location in which to save the database backup file and save it.

The Bayesian database backup file is downloaded to your management computer. Time required varies by the size of the file and the speed of your network connection.

To restore a Bayesian database

Back up the Bayesian database before beginning this procedure. Restoring a Bayesian database replaces all training data stored in the database. For more information on backing up Bayesian database files, see To back up a Bayesian database or Backup and restore.

Go to Security > Bayesian > Domain.
Select the type of the Bayesian database that you want to restore.

For the global Bayesian database, from Select a domain, select System.
For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.

In the Operation area, click the link appropriate to the type that you selected in the previous step, either:

Restore global Bayesian database
Restore group Bayesian database

A pop-up window appears enabling you to upload a database backup file.

Click Browse to locate and select the Bayesian database backup file.

Click OK.

The Bayesian database backup file is uploaded from your management computer, and a success message appears. Time required varies by the size of the file and the speed of your network connection.

If a database operation error message appears, you can attempt to repair database errors. For more information, see Backup and restore.

To reset a Bayesian database

Back up the Bayesian database before beginning this procedure. Resetting a Bayesian database deletes all training data stored in the database. For more information on backing up Bayesian database files, see To back up a Bayesian database or Backup and restore.

Go to Security > Bayesian > Domain.
Select the type of the Bayesian database that you want to reset.

For the global Bayesian database, from Select a domain, select System.
For a per-domain Bayesian database, from Select a domain, select the name of the protected domain, such as example.com.

In the Operation area, click the link appropriate to the type that you selected in the previous step, either:

Reset global Bayesian database
Reset group Bayesian database

A pop-up window appears asking for confirmation.

Click Yes.

A status message notifies you that the FortiMail unit has emptied the contents of the Bayesian database.

See also

Training the Bayesian databases

Types of Bayesian databases

Configuring the Bayesian training control accounts

Backup and restore

Configuring the Bayesian training control accounts

The Control Account tab lets you configure the email addresses used for remote training of the Bayesian databases.

To train the Bayesian databases through email, email users and FortiMail administrators forward spam and non-spam email (also called training messages) to the appropriate Bayesian control email address. Bayesian control email addresses consist of the user name portion (also known as the local-part) of the email address configured on this tab and the local domain name of the FortiMail unit. For example, if the local domain name of the FortiMail unit is example.com, you might forward spam to learn-is-spam@example.com.
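The way control addresses are formed can be shown in a few lines. This is a sketch only: the dictionary keys are hypothetical labels, and the local-parts are the example values used elsewhere in this section, which may differ from what is configured on your unit.

```python
# Example local-parts taken from this section; actual values are configurable.
EXAMPLE_LOCAL_PARTS = {
    "is_really_spam": "is-spam",
    "is_not_really_spam": "is-not-spam",
    "learn_is_spam": "learn-is-spam",
    "learn_is_not_spam": "learn-is-not-spam",
}


def control_addresses(local_domain, local_parts=EXAMPLE_LOCAL_PARTS):
    """Join each configured local-part with the unit's local domain name."""
    return {name: f"{part}@{local_domain}" for name, part in local_parts.items()}


print(control_addresses("example.com")["learn_is_spam"])
```

Note that the domain part is the local domain name of the FortiMail unit itself, not the protected domain of the user who forwards the message.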

If the FortiMail unit is configured to accept training messages, it will use the email to train one or more Bayesian databases. To accept a training message:

The training message must match a recipient-based policy.
The matching recipient-based policy must specify use of an antispam profile in which the “Accept training messages from users” option is enabled. For more information, see Accept training messages from users.

If either of these conditions is not met, the FortiMail unit will silently discard the training message without using it for training.

If these conditions are both met, the FortiMail unit accepts the training message and examines the user name portion and domain name portion of the sender address. The following factors determine which Bayesian database or databases will be trained:

whether the sender’s protected domain is configured to use the global or per-domain Bayesian database (see Use global Bayesian database)
whether per-user Bayesian databases are enabled in the antispam profile (see Use personal database)

Depending on those factors, the FortiMail unit uses the training message to train either the global or per-domain Bayesian database.
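The acceptance conditions and the resulting choice of database can be summarized in a sketch. The function name, parameters, and returned labels are hypothetical and exist only to restate the rules above; `None` stands for a silently discarded training message.

```python
from typing import Optional


def route_training_message(matches_recipient_policy: bool,
                           profile_accepts_training: bool,
                           sender_domain: str,
                           domain_uses_global: bool) -> Optional[str]:
    """Return a label for the database to be trained, or None if discarded."""
    # Both conditions must hold, otherwise the message is dropped silently.
    if not (matches_recipient_policy and profile_accepts_training):
        return None
    # The sender's protected domain setting decides which database is trained.
    return "global" if domain_uses_global else f"group:{sender_domain}"


print(route_training_message(True, False, "example.net", False))
```

Because a discarded message produces no notification to the sender, a result of `None` here corresponds to the case where users believe they are training the database but no training occurs.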

To access this part of the web UI, your administrator account’s:

Domain must be System
access profile must have Read or Read-Write permission to the Policy category

For details, see About administrator account permissions and domains.

To configure the Bayesian control email addresses, go to Security > Bayesian > Control Account.

GUI item	Description
"is really spam" user name	Enter the user name portion of the email address, such as `is-spam`, to which email users will forward spam false negatives. Forwarding false negatives corrects the Bayesian database when it inaccurately classifies spam as being legitimate email.
"is not really spam" user name	Enter the user name portion of the email address, such as `is-not-spam`, to which email users will forward spam false positives. Forwarding false positives corrects the Bayesian database when it inaccurately classifies legitimate email as being spam.
"learn is spam" user name	Enter the user name portion of the email address, such as `learn-is-spam`, to which email users will forward spam that the Bayesian scanner has not previously scanned.
"learn is not spam" user name	Enter the user name portion of the email address, such as `learn-is-not-spam`, to which email users will forward legitimate (non-spam) email that the Bayesian scanner has not previously scanned.
training group	Enter the user name portion of the email address, such as `default-grp`, that FortiMail administrators can use as their sender email address when forwarding email to the “learn is spam” email address or “learn is not spam” email address. Training messages sent from this sender email address will be used to train the global or per-domain Bayesian database (whichever is selected in the protected domain).
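The choice among the four control accounts in the table depends on two questions: whether the message is spam, and whether the Bayesian scanner has already scanned (and misclassified) it. A minimal sketch of that decision follows; the function name is hypothetical, and the returned local-parts are the example values from the table rather than guaranteed settings.

```python
def pick_control_account(is_spam: bool, previously_scanned: bool) -> str:
    """Return the example local-part a message should be forwarded to."""
    if previously_scanned:
        # Corrections: the scanner saw this message and classified it wrongly.
        return "is-spam" if is_spam else "is-not-spam"
    # Fresh samples: the scanner has not seen this message before.
    return "learn-is-spam" if is_spam else "learn-is-not-spam"


print(pick_control_account(is_spam=False, previously_scanned=True))
```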