FortiSIEM Reference Architecture Using ClickHouse

In-Cluster Resilience

Several features provide in-cluster resilience against node failure and help deliver a highly available solution.

FortiSIEM supports a single Supervisor node per cluster. Supervisor availability can be increased by:

  • Deploying the Supervisor node on an enterprise-class hypervisor platform with redundant disk arrays and PSUs, backed by an enterprise-class uninterruptible power supply (UPS) and generator

  • Taking regular snapshots and backups of the Supervisor node

  • Using hypervisor guest HA mechanisms, such as VMware High Availability

In a multi-node distributed deployment, limited system operation continues in the short term if the Supervisor node is lost. Collectors continue to ingest logs. If the Worker and ClickHouse Keeper nodes are still available, Collectors continue to upload logs to them; otherwise, they temporarily cache the logs on their local disk. Workers continue to write logs to the ClickHouse database, provided the cluster is available and in a read-write state. The following main features are unavailable while the Supervisor is down:

  • User interface

  • Correlation / Alerts / Incidents

  • Scheduled reports

  • Configuration changes, CMDB and SVN operations
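The Collector behaviour described above is essentially store-and-forward. The following Python sketch illustrates the idea under stated assumptions: `ClusterState`, `CollectorBuffer`, `unavailable_features` and `SUPERVISOR_DEPENDENT` are hypothetical names invented for this example, an in-memory deque stands in for the Collector's on-disk cache, and the code does not describe how FortiSIEM is actually implemented.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

# Hypothetical constant: the features listed above as unavailable without the Supervisor.
SUPERVISOR_DEPENDENT = (
    "User interface",
    "Correlation / Alerts / Incidents",
    "Scheduled reports",
    "Configuration changes, CMDB and SVN operations",
)


@dataclass
class ClusterState:
    """Hypothetical snapshot of which node roles are reachable."""
    supervisor_up: bool
    worker_up: bool
    keeper_up: bool


def unavailable_features(state: ClusterState) -> list[str]:
    # Only Supervisor-hosted functions stop; log ingestion is not affected.
    return [] if state.supervisor_up else list(SUPERVISOR_DEPENDENT)


@dataclass
class CollectorBuffer:
    """Store-and-forward sketch: the deque stands in for the local disk cache."""
    spool: deque = field(default_factory=deque)
    uploaded: list = field(default_factory=list)

    def ingest(self, event: str, state: ClusterState) -> None:
        self.spool.append(event)  # the Collector keeps accepting logs regardless
        self.flush(state)

    def flush(self, state: ClusterState) -> None:
        # The Supervisor flag is deliberately not checked on the upload path.
        if not (state.worker_up and state.keeper_up):
            return  # leave events cached until Workers and Keeper are reachable
        while self.spool:
            self.uploaded.append(self.spool.popleft())
```

The upload path never consults `supervisor_up`, which mirrors the point that log collection and storage continue without the Supervisor while the user interface, correlation, reporting and configuration functions do not.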

The loss of a Worker node can affect the system in several ways:

  • If the Worker node performs query functions, then query and reporting performance may be impacted

  • If the Worker node is an upload target for Collectors, the Collectors using it temporarily lose connectivity. They automatically reconnect to another Worker, if one is available

  • If the Worker node is part of a ClickHouse shard, the shard is impacted. If it is the only node in the shard, the shard is lost and cannot be queried or written to. If the shard has another active replica, operation continues but resilience is reduced.

  • If the Worker node is running the ClickHouse Keeper process, system availability may be affected. See the section ‘ClickHouse Database Resilience’ for more information
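The shard and Keeper rules above can be expressed as a small availability check. This Python sketch is illustrative only: the `Shard` type, the Worker host names and the helper functions are hypothetical. It assumes that ClickHouse Keeper, which uses the Raft consensus protocol, needs a strict majority of its configured nodes to keep quorum, and that writes to replicated tables require Keeper quorum (ClickHouse switches replicated tables to read-only mode when Keeper is unavailable).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical shard description: which Worker hosts hold a replica."""
    name: str
    replicas: tuple[str, ...]


def shard_status(shard: Shard, down: set[str]) -> str:
    live = [r for r in shard.replicas if r not in down]
    if not live:
        return "lost"  # no replica left: cannot be queried or written to
    if len(live) < len(shard.replicas):
        return "degraded"  # still serving, but with reduced redundancy
    return "healthy"


def keeper_has_quorum(keeper_nodes: set[str], down: set[str]) -> bool:
    # Raft majority: strictly more than half of the configured Keeper nodes must be up.
    alive = len(keeper_nodes - down)
    return alive > len(keeper_nodes) // 2


def shard_writable(shard: Shard, down: set[str], keeper_nodes: set[str]) -> bool:
    # Assumption: a replicated shard accepts writes only with a live replica and Keeper quorum.
    return shard_status(shard, down) != "lost" and keeper_has_quorum(keeper_nodes, down)
```

Under this majority rule a three-node Keeper cluster tolerates the loss of one node, while a two-node cluster tolerates none.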

Collector node resilience for syslog ingestion is possible in several ways:

  • Using multiple Collector nodes and a load balancer to create a ‘virtual Collector’ architecture

  • Deploying virtual Collectors on a hypervisor that provides HA capabilities

  • Quickly redeploying failed Collectors: Collector virtual machines are not licensed, so they can be replaced rapidly in the event of a failure
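One way a load balancer can present a ‘virtual Collector’ is to map each sending device to a healthy Collector deterministically, so that a device's log stream stays on one Collector and only the devices served by a failed Collector are moved. The Python sketch below uses rendezvous (highest-random-weight) hashing to do this. The technique, the function names and the Collector names are assumptions for illustration, not a description of FortiSIEM or of any particular load balancer product.

```python
from __future__ import annotations

import hashlib


def _score(source_ip: str, collector: str) -> int:
    # Deterministic pseudo-random weight for a (device, Collector) pair.
    digest = hashlib.sha256(f"{source_ip}|{collector}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_collector(source_ip: str, collectors: list[str], healthy: set[str]) -> str | None:
    """Hypothetical selection: the healthy Collector with the highest weight wins."""
    candidates = [c for c in collectors if c in healthy]
    if not candidates:
        return None  # no healthy Collector behind the virtual address
    return max(candidates, key=lambda c: _score(source_ip, c))
```

Because each device's choice depends only on its own weights, removing a failed Collector reassigns just the devices that were mapped to it; every other device keeps its current Collector. Health detection itself is outside the scope of this sketch.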

Do not configure a monitored device to send the same syslog stream to multiple Collector addresses; each copy is ingested separately, resulting in duplicate log entries in FortiSIEM.

Active monitoring jobs, such as SNMP and WMI polling, do not automatically fail over between Collectors; monitored devices must be rediscovered from another Collector to restart active device monitoring.

In an MSSP environment, Collectors can be deployed in a multi-tenant Collector pool to provide additional resilience for specific use cases:

  • FortiSIEM agent upload

  • Cloud API log pulling
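The contrast between pinned active-monitoring jobs and pool-based jobs can be sketched as follows. In this hypothetical Python model, `Job`, `reassess` and all job and Collector names are invented for illustration: a pinned job (such as SNMP or WMI polling) stays with the Collector that discovered the device and must be rediscovered elsewhere if that Collector fails, while a pool job (FortiSIEM agent upload or cloud API log pulling) can continue on any surviving member of its Collector pool. Choosing the first surviving pool member is an arbitrary simplification.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical monitoring or collection job."""
    name: str
    kind: str                     # "pinned" (e.g. SNMP/WMI polling) or "pool"
    collector: str | None = None  # pinned jobs: Collector that discovered the device
    pool: tuple[str, ...] = ()    # pool jobs: members of the multi-tenant Collector pool


def reassess(jobs: list[Job], down: set[str]) -> tuple[dict[str, str], list[str], list[str]]:
    """Return (running job -> Collector, jobs needing rediscovery, stalled pool jobs)."""
    running: dict[str, str] = {}
    rediscover: list[str] = []
    stalled: list[str] = []
    for job in jobs:
        if job.kind == "pinned":
            if job.collector in down:
                rediscover.append(job.name)  # no automatic failover for active monitoring
            else:
                running[job.name] = job.collector
        else:
            survivors = [c for c in job.pool if c not in down]
            if survivors:
                running[job.name] = survivors[0]  # any surviving pool member can take over
            else:
                stalled.append(job.name)  # whole pool down: nothing to fail over to
    return running, rediscover, stalled
```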