Design for Resilience
When designing resilience into the solution, consider:
-
Supervisor node availability and scaling
-
ClickHouse replicas
-
ClickHouse keeper node resilience
-
Architectural resilience provided by load balancers
-
Underlying hypervisor resilience features. Many hypervisors provide features to increase the resilience of their hosted VMs
-
Host resilience features, such as redundant PSUs, NICs, and fans, and storage array resilience
The key resilience points to review are:
Supervisor Resilience
FortiSIEM supports high availability Supervisor nodes as a licensed feature. This capability provides:
-
Up to 5 Supervisor nodes
-
CMDB, Incident, and SVN data are synchronized between the Supervisor nodes.
-
One Supervisor node acts as the Primary Leader; each additional Supervisor node acts as a Primary Follower.
-
If the Primary Leader Supervisor node is unavailable, then a Primary Follower Supervisor node can be promoted to take the Leader role.
-
Scale-out capacity for concurrent GUI users on the platform.
For more information on High Availability, see the documentation here: https://docs.fortinet.com/document/fortisiem/6.7.7/high-availability-and-disaster-recovery-procedures-clickhouse/933956/high-availability-and-disaster-recovery-clickhouse
ClickHouse Database Resilience
A ClickHouse replica is a copy of the data within a shard stored on another host in the shard. This provides resilience against the failure of a node within the shard - if a shard has two replicas and one of the hosts fails, the system will continue to use the remaining replica on the remaining host. Shards can have more than two replicas, which will increase the resilience but also the cost of the solution. For increased resilience, each replica should be hosted on a separate server and the data stored on a separate storage array. Hosting replicas on the same storage array leaves the solution more vulnerable to data loss due to a hardware failure.
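The placement guidance above can be checked mechanically against a planned layout. The following is a minimal sketch, not a FortiSIEM or ClickHouse tool: the `Replica` record, the host names, and the storage array names are all hypothetical, and the layout would need to be entered by hand from the design.

```python
from collections import Counter
from typing import NamedTuple


class Replica(NamedTuple):
    """One replica in a planned layout (all field values are illustrative)."""
    shard: int
    host: str
    storage_array: str


def shared_failure_domains(replicas):
    """Return (shard, kind, name) for every host or storage array
    that holds more than one replica of the same shard."""
    problems = []
    for kind in ("host", "storage_array"):
        counts = Counter((r.shard, getattr(r, kind)) for r in replicas)
        problems += [(shard, kind, name)
                     for (shard, name), n in counts.items() if n > 1]
    return sorted(problems)


# Hypothetical layout: shard 1 keeps both replicas on the same array.
layout = [
    Replica(shard=1, host="host-a", storage_array="array-1"),
    Replica(shard=1, host="host-b", storage_array="array-1"),
    Replica(shard=2, host="host-c", storage_array="array-1"),
    Replica(shard=2, host="host-d", storage_array="array-2"),
]

if __name__ == "__main__":
    for shard, kind, name in shared_failure_domains(layout):
        print(f"shard {shard}: replicas share {kind} {name}")
```

In this example only shard 1 is flagged, because both of its replicas depend on the same storage array. Shard 2 also uses `array-1`, but for just one of its replicas, so a failure of that array still leaves a copy of the shard available.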
The ClickHouse keeper process is essential to the functioning of the system. If the keeper process is not available, then the system will operate in read-only mode and new events will be dropped. Take the following steps to increase the resilience of the keeper process:
-
Deploy the keeper process on a separate, dedicated Worker node. Because the keeper is then not co-hosted with the Supervisor or active Worker node processes, the keeper node is less likely to need a reboot.
-
Consider deploying three keeper nodes in a cluster for maximum resilience. In a three-keeper node cluster, one node can fail and the cluster will remain operational.
-
Note that a two-node cluster does not provide resilience, because a multi-node cluster requires a quorum (a majority) of nodes: if either node fails, the remaining node cannot form a majority. See the ClickHouse documentation for more information on this concept.
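The quorum arithmetic behind the keeper recommendations can be worked through in a few lines. This is a minimal sketch assuming a simple majority quorum; the function names are illustrative and are not part of FortiSIEM or ClickHouse.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} keeper node(s): quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption a two-node cluster needs both nodes to form a majority, so it tolerates no failures, while a three-node cluster needs two nodes and tolerates one failure.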
Additional Architectural Resilience
FortiSIEM Supervisor and Worker nodes should be deployed on a high-performance, resilient, data-center-class network. High-performance LAN switches should be deployed in a hierarchical topology with resilient, high-bandwidth uplinks and fast failure-recovery mechanisms. If cluster traffic traverses a wide-area network (WAN), this should be an enterprise-class, resilient WAN that provides high bandwidth and low latency. FortiSIEM cluster traffic should be prioritized across the LAN and WAN to minimize latency between nodes.
Load balancers can be deployed at various points within the system both for scalability and increased responsiveness to a failure. This is discussed in more depth throughout the document; some examples include:
-
Load balancers can be installed in front of the Worker node cluster, with Collectors configured to upload to a shared virtual IP (VIP) on the load balancer that is balanced across a group of Workers. This makes the failure of a Worker less noticeable to the Collectors.
-
This is optional. By default, FortiSIEM has an inbuilt load-sharing mechanism which distributes collectors across the Worker cluster and fails over to another Worker node in the event of a Worker failure.
-
Load balancers can be installed in front of a group of Collectors to provide resilience for inbound syslog and FortiSIEM agent connections.
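To illustrate the general idea of spreading Collectors across Workers and failing over when a Worker is lost, here is a toy model. It is not FortiSIEM's actual load-sharing algorithm, which this document does not describe, nor any load balancer's scheduling policy; the function name and the node names are hypothetical.

```python
def assign_collectors(collectors, workers, healthy):
    """Spread collectors round-robin over the workers that are healthy.

    Toy model only: a real mechanism may keep existing assignments
    and move only the collectors of the failed worker.
    """
    available = [w for w in workers if w in healthy]
    if not available:
        raise RuntimeError("no healthy Worker nodes available")
    return {c: available[i % len(available)]
            for i, c in enumerate(collectors)}


collectors = ["c1", "c2", "c3", "c4"]
workers = ["w1", "w2", "w3"]

if __name__ == "__main__":
    print(assign_collectors(collectors, workers, healthy={"w1", "w2", "w3"}))
    # Simulate the loss of w2: its collectors are served by the survivors.
    print(assign_collectors(collectors, workers, healthy={"w1", "w3"}))
```

The point of the sketch is only that Collectors keep a valid upload target as long as at least one Worker remains healthy, whether the redistribution is done by an external load balancer or by a built-in mechanism.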
Many hypervisors include advanced features to increase the resilience and uptime of the VMs they host. Extensive hardware features are also available to increase server and storage resilience. Work with the server team to take full advantage of these features when designing the solution. Also make sure you understand the limitations and potential points of failure in the hosting environment, as they may affect the performance of the solution and the availability and integrity of the data hosted on it.