
High Availability and Disaster Recovery Procedures - ClickHouse


The following section provides an introduction, recommendations, and requirements for High Availability and Disaster Recovery for ClickHouse.

Understanding High Availability (HA)

The FortiSIEM High Availability feature enables you to run multiple Active-Active Supervisor nodes. This means that:

  • Users can log in to any Supervisor node and perform GUI operations.

  • Agents, Collectors, and Worker nodes can communicate with any Supervisor node.

  • Analytics operations, such as streaming rule and query processing, are distributed across the Supervisor nodes.

  • The following databases are replicated among the Supervisor nodes:

    • The CMDB, residing in a PostgreSQL database

    • CMDB device configurations, residing on the file system and managed by SVN-lite

    • Profile data, residing in SQLite databases on the Supervisor node

  • Event data is replicated between ClickHouse Data nodes using ClickHouse's own replication mechanisms.
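For context, ClickHouse's built-in replication coordinates Data nodes through a Keeper ensemble, and each Data node identifies itself with shard/replica macros. The sketch below shows what such a per-node configuration looks like in plain ClickHouse terms; all hostnames, ports, and IDs are illustrative placeholders, since FortiSIEM generates the actual ClickHouse configuration for you.

```xml
<!-- Illustrative per-Data-node ClickHouse settings; FortiSIEM generates the
     real configuration, so treat every value here as a placeholder. -->
<clickhouse>
    <!-- Keeper ensemble used to coordinate replication -->
    <zookeeper>
        <node><host>keeper1.example.com</host><port>9181</port></node>
        <node><host>keeper2.example.com</host><port>9181</port></node>
        <node><host>keeper3.example.com</host><port>9181</port></node>
    </zookeeper>
    <!-- Identity of this node: the shard it belongs to and its replica name -->
    <macros>
        <shard>01</shard>
        <replica>worker1.example.com</replica>
    </macros>
</clickhouse>
```

Replicated tables created under the same shard path (for example, with the ReplicatedMergeTree engine) then keep all replicas of that shard in sync automatically.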

An Active-Active Supervisor cluster is built around the concept of Leader and Follower Supervisor nodes, configured as a linked list.

  1. The first Supervisor, where you install the FortiSIEM license, is the Leader. The Leader's UUID matches the UUID in the license. On the Leader, the Redis and PostgreSQL database services run in Master mode, meaning that all PostgreSQL database writes from all Supervisors go to the Leader node.

  2. Next, you add a Follower, which follows the Leader in the sense that its PostgreSQL database is replicated from the Leader's. On the Follower node, both the Redis and PostgreSQL database services run in Follower (that is, read-only) mode. CMDB and Profile data are also copied via rsync from the Leader to the Follower node.

  3. You can add another Follower that follows the Follower added in Step 2. Its PostgreSQL database is replicated from the Supervisor added in Step 2, and both the Redis and PostgreSQL database services run in Follower (read-only) mode. CMDB and Profile data are also copied via rsync from the Step 2 Follower to the Step 3 Follower. This forms a replication linked list: Leader -> Step 2 Follower -> Step 3 Follower.

  4. You can add more Follower nodes that follow the last Follower node in the chain.

Detailed Leader/Follower Supervisor configuration steps are in Configuring High Availability. It is recommended that you configure a load balancer in front of the Active-Active Supervisor cluster, so that FortiSIEM GUI users, Collectors, and Agents can reach any of the Supervisors through it. Workers are aware of the Supervisors and their Leader/Follower roles and communicate with the appropriate Supervisor in a load-balanced manner.

If the Leader Supervisor goes down, you need to log in to the (immediate) Follower Supervisor and promote it to Leader by running a script on the Follower; see High Availability Operations for details. Note that while the Leader is down:

  • GUI users cannot log in to the Supervisor cluster, since the Master PostgreSQL database on the Leader is unavailable.

  • Incidents may be lost, since the Rule Master cannot upload Incidents to the App Server; the App Server needs to authenticate the Rule Master against the Master PostgreSQL database on the Leader.

  • Event parsing and event insertion into the database continue.

The Supervisor cluster comes back up once the Follower is promoted to Leader. Because the FortiSIEM license is tied to the failed Leader's UUID, you will see a message prompting you to install a new license with the new Leader's UUID within a 2-week grace period from the time of failure. If you fail to do this, you will not be able to log in to the system.

If a Follower Supervisor goes down, there is no impact:

  • The load balancer routes GUI users to an available Supervisor.

  • Rules assigned to the failed Supervisor are re-assigned to one of the available Supervisors.

Once the failed Supervisor comes back up, it recovers automatically. You also have the option to delete the failed Supervisor node, clean it, and add it back to the cluster as a follower of the last node in the chain.

Detailed failover steps are provided in High Availability Operations.

Understanding Disaster Recovery (DR)

Disaster Recovery provides availability in case the entire Supervisor/Worker cluster goes down. To satisfy this requirement, a separate, fully licensed FortiSIEM instance needs to run in a Secondary Site (Site 2) in addition to the Primary Site (Site 1). The two sites need to be set up identically in terms of Supervisors, Workers, and event storage, with the exception that Site 2 does not allow multiple Supervisors.

Note that all the Supervisors in Primary Site 1 are Active-Active, while the Supervisor in Secondary Site 2 is a Hot Standby with limited read-only GUI functionality.

Under normal operations:

  • Collectors upload events to Site 1 Worker nodes. The events are first stored on Site 1 ClickHouse nodes and then replicated to Site 2 nodes using ClickHouse's own replication mechanisms.

  • The following databases residing on the Site 1 (Leader) Supervisor are replicated to the Site 2 Supervisor:

    • The CMDB, residing in a PostgreSQL database

    • CMDB device configurations, residing on the file system and managed by SVN-lite on the Supervisor node

    • Profile data, residing in SQLite databases on the Supervisor node

On the Site 2 Supervisor, the PostgreSQL database services run in Follower (that is, read-only) mode. Users can log on to the Site 2 Supervisor but can perform only very limited operations:

  • They can view CMDB, Incidents, Cases, Tasks, Resources, and all settings on the ADMIN page, except the License Usage page.

  • They cannot run queries on the ANALYTICS page; all Dashboard widgets and report-related graphs, such as the License Usage page, show no data.

  • They cannot perform editing operations on any GUI page.

  • No actions related to update operations work.

When disaster strikes and the entire Site 1 is down:

  1. The user needs to promote the Site 2 Supervisor from Secondary to Primary.

  2. The user must make DNS changes so that users can log on to the Site 2 Supervisor, and Collectors can send events to Site 2 Workers.

When the old Primary (Site 1) is recovered and powered up, the user needs to add Site 1 back as a Secondary site. Site 2 then syncs the missing data to Site 1.

Once the replication is complete, the user can decide to return to the pre-disaster setup by promoting Site 1 back to Primary.

Detailed Disaster Recovery Operations are available here.

Recommended HA+DR Deployments

Single Site HA Deployment

If you have one site and want to deploy FortiSIEM with ClickHouse in High Availability mode, then deploy as follows:

  • 2 Supervisors: a Primary Leader and a Follower.

  • 3 Workers as Keeper-only nodes.

  • N shards, depending on your EPS requirements, with 2 Workers in each shard. Workers should have both the Data (Ingest) and Query flags set, meaning each Worker stores events locally and replicates them to the other node in the shard.
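As an illustration of the resulting topology, the shard layout above corresponds to a ClickHouse cluster definition like the following. The cluster name and hostnames are placeholders, and FortiSIEM builds the real definition from the GUI settings; this sketch only shows the shape of the result, here with N = 2 shards.

```xml
<!-- Illustrative cluster definition: N shards, 2 Data Workers per shard.
     Shown for N = 2; all names are placeholders. -->
<clickhouse>
    <remote_servers>
        <fsiem_cluster>
            <shard>
                <!-- Writes to either replica are replicated to the other -->
                <internal_replication>true</internal_replication>
                <replica><host>worker1a.example.com</host><port>9000</port></replica>
                <replica><host>worker1b.example.com</host><port>9000</port></replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica><host>worker2a.example.com</host><port>9000</port></replica>
                <replica><host>worker2b.example.com</host><port>9000</port></replica>
            </shard>
        </fsiem_cluster>
    </remote_servers>
</clickhouse>
```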

For configuring Supervisors, see Configuration under Configuring and Maintaining Active-Active Supervisor Cluster in the Online Help.

For configuring ClickHouse topology of Keeper and Data Clusters, see ClickHouse Configuration in the Online Help.

Multi-Site HA Deployment

If you have two sites and want to deploy FortiSIEM with ClickHouse in High Availability mode, then deploy as follows:

  • 2 Supervisors: a Primary Leader and a Follower, one in each site.

  • 3 Workers as Keeper-only nodes: 2 in one site and 1 in the other.

  • N shards, depending on your EPS requirements, with 2 Workers in each shard and 1 Worker in each site. Workers should have both the Data (Ingest) and Query flags set, meaning each Worker stores events locally and then replicates them to the other node in the shard.

For replication to work efficiently, the latency between the sites must be less than 100 ms; the lower, the better.

The concept can be extended to more than 2 sites as follows.

  • 3 or 4 Supervisors: 1 Primary Leader in one site and the remaining Follower Supervisors in the other sites, depending on user access patterns.

  • 3 Workers as Keeper-only nodes: 2 in one site and 1 in another. Two Keeper nodes are co-located in the same site to make reaching quorum easier, for low-latency reasons.

  • N shards, depending on your EPS requirements. For each shard, place 1 Worker in each site. Workers should have both the Data (Ingest) and Query flags set, meaning each Worker stores events locally and then replicates them to the other nodes in the shard.

  • For replication to work efficiently, the latency between the sites must be less than 100 ms; the lower, the better.

For configuring Supervisors, see Configuration under Configuring and Maintaining Active-Active Supervisor Cluster in the Online Help.

For configuring ClickHouse topology of Keeper and Data Clusters, see ClickHouse Configuration in the Online Help.

Multi-Site HA + DR Deployment

If you have many sites and want to designate one site solely for Disaster Recovery, deploy as follows. Note that the Disaster Recovery site does not have full functionality until the Secondary is made Primary. The Secondary site holds the combined events from all sites.

  • 3 or 4 Supervisors:

    • 1 Primary Leader in one Primary site,

    • 1 Secondary Supervisor in the Secondary site, and

    • 1 or 2 Follower Supervisors in the remaining Primary sites.

    • Place Supervisors in sites that are close to users.

  • 3 Workers as Keeper-only nodes: 2 in one Primary site and 1 in the Secondary site.

  • N shards, depending on your EPS requirements. For each shard:

    • If you have 1 Primary site, place 2 Workers in that site.

    • If you have more than 1 Primary site, place 1 Worker in each Primary site.

    • The number of Workers in the Secondary site must equal the total number of Workers in all Primary sites.

    • Primary site Workers should have both the Data (Ingest) and Query flags set, meaning they store events locally and replicate them to the other nodes in the shard.

    • Secondary site Workers should have both the Data (Ingest) and Query flags unset, meaning they only store events replicated from Primary site Workers and do not participate in queries.

You need to follow these steps:

  1. Configure Primary Site Workers, including ClickHouse configuration for that site.

  2. Configure Secondary Site Workers.

  3. Clean up the ClickHouse configuration on the Secondary site.

  4. Set up Disaster Recovery from Primary Site.

  5. Configure ClickHouse from the Primary Site by including the Secondary Workers in the Keeper and Data Cluster configurations.
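Step 5 effectively extends each shard's replica list with the matching Secondary site Worker, which in plain ClickHouse terms looks like the sketch below. All names are placeholders, and FortiSIEM manages the actual configuration; this only illustrates why the Secondary Worker receives a full copy of the shard's events.

```xml
<!-- Illustrative shard after Step 5: the Secondary Site Worker is added as
     an extra replica, so it receives all events replicated from the
     Primary Site Workers. Placeholder names throughout. -->
<clickhouse>
    <remote_servers>
        <fsiem_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <!-- Primary Site Workers: Data (Ingest) and Query flags set -->
                <replica><host>primary-worker1.example.com</host><port>9000</port></replica>
                <replica><host>primary-worker2.example.com</host><port>9000</port></replica>
                <!-- Secondary Site Worker: stores replicated data only -->
                <replica><host>dr-worker1.example.com</host><port>9000</port></replica>
            </shard>
        </fsiem_cluster>
    </remote_servers>
</clickhouse>
```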

For configuring Supervisors, see Configuration under Configuring and Maintaining Active-Active Supervisor Cluster in the Online Help.

For configuring ClickHouse topology of Keeper and Data Clusters, see ClickHouse Configuration in the Online Help.

For configuring Disaster Recovery, see Configuring Disaster Recovery - Step 5 and Step 6.

HA Deployment with Appliances

If you want to deploy FortiSIEM hardware appliances or all-in-one VMs in High Availability mode, then deploy as follows:

  • 1 Appliance in each site.

  • Designate one Appliance as the Leader and the other as the Follower. By convention, the Leader is in the Primary site.

  • 1 Worker acting as a Keeper in the Primary site.

  • The Keeper cluster consists of the 2 Appliances and the Worker node.

  • 1 shard consisting of the two Appliances, both with the Data (Ingest) and Query flags set.

For configuring Supervisors, see Configuration under Configuring and Maintaining Active-Active Supervisor Cluster in the Online Help.

For configuring ClickHouse topology of Keeper and Data Clusters, see ClickHouse Configuration in the Online Help.
