High Availability and Disaster Recovery Procedures - ClickHouse

Disaster Recovery Operations

Primary (Site 1) Down, Secondary (Site 2) becomes Primary

Primary (Site 1) has failed and is unavailable, so the current Secondary (Site 2) needs to become Primary. Because the Site 1 Workers are down, the Collectors start buffering events.

Follow the steps below to promote Site 2 to Primary.

Step 1. Make Secondary the New Primary

Log on to Site 2 (Secondary) as root and run the following command:

phsecondary2primary

Site 2 becomes the current Primary, and you can log on to it to continue your work.

Note: The process should take approximately 10 minutes to complete. Both FortiSIEM nodes become independent after running the command.
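Because the promotion must run as root and takes several minutes, it can help to check the preconditions before starting. The following is a minimal sketch, not part of the product: the function names `have_command` and `promote_to_primary` are hypothetical, and only the `phsecondary2primary` command itself comes from this guide.

```shell
#!/bin/sh
# Hypothetical pre-flight wrapper around phsecondary2primary.

# Return 0 if the given command name can be found on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Refuse to start unless we are root and the promotion tool is installed.
promote_to_primary() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "must be run as root" >&2
        return 1
    fi
    if ! have_command phsecondary2primary; then
        echo "phsecondary2primary not found on PATH" >&2
        return 2
    fi
    # The actual promotion step described in Step 1.
    phsecondary2primary
}
```

After sourcing the file on Site 2, calling `promote_to_primary` runs the promotion only when both checks pass.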

Step 2. Restart phMonitor on All Secondary Workers

Restart phMonitor on all Secondary (Site 2) Workers.

Step 3. Remove Failed ClickHouse Keeper Nodes in Site 1

  1. Log on to the Site 2 GUI.

  2. Go to ADMIN > Settings > Database > ClickHouse Config.

  3. In Keeper Cluster, select the Site 1 nodes and delete them.

  4. Click Test. Once the test succeeds, click Deploy.

Site 1 Up and Supervisor / Worker Recovered

Site 1 has recovered after the failure, meaning that the Supervisor and Workers are up. You first need to make Site 1 the Secondary, and then (optionally) switch roles if you want Site 1 to become Primary again.

Follow the instructions below to make Site 1 the Secondary.

  1. Log on to the current Primary FortiSIEM node (this should be Site 2) using the GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed (this should be Site 1). It should appear as Inactive under the Replication Status column.

  4. Click Edit.

  5. Review the information to ensure that all the information is correct.
    Note: The information is read only.

  6. When done, click Save. The Original Primary (Site 1) now becomes the Secondary in your Disaster Recovery configuration. The Replication Status changes from Inactive to Active.

  7. Now, if you want to switch roles so Site 1 becomes Primary again, follow the instructions in Switching Primary and Secondary Roles.

  8. Add back the two recovered Site 1 ClickHouse Keeper nodes:

    1. Clean up the Keeper registry by running the following command.

      rm -rf /data-clickhouse-hot-1/clickhouse-keeper

    2. Go to ADMIN > Settings > Database > ClickHouse Config and add the two Workers to the ClickHouse Keeper cluster.
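The Keeper clean-up in step 8 uses an unconditional `rm -rf`, so a guarded version may be safer to run by hand. The sketch below is an illustration only: the function name `clean_keeper_dir` and its optional path argument are hypothetical, and the default path is the one given in this guide.

```shell
#!/bin/sh
# Hypothetical guarded version of the Keeper registry clean-up.

clean_keeper_dir() {
    # Default to the path from this guide; the argument exists for dry runs and testing.
    dir="${1:-/data-clickhouse-hot-1/clickhouse-keeper}"

    # Only accept paths that end in the expected directory name.
    case "$dir" in
        */clickhouse-keeper) ;;
        *)
            echo "refusing to remove unexpected path: $dir" >&2
            return 1
            ;;
    esac

    # Nothing to do if the directory is already gone.
    if [ ! -d "$dir" ]; then
        echo "nothing to clean: $dir does not exist"
        return 0
    fi

    rm -rf -- "$dir"
    echo "removed $dir"
}
```

Calling `clean_keeper_dir` with no argument removes the default Keeper directory. Any path that does not end in `clickhouse-keeper` is rejected.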

Site 1 Up but Supervisor / Worker Cannot be Recovered

In this case, Site 1 is up, but the Supervisor and Workers cannot be recovered after failure.

In this situation, you first need to reinstall a Supervisor and Workers on Site 1, and then make Site 1 the Secondary.

  1. Log on to the current Primary FortiSIEM node (Site 2) using the GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed. It should appear as Inactive under the Replication Status column.

  4. Click Delete to remove it from the Disaster Recovery configuration.

  5. Log out from Site 2.

  6. Re-install Supervisor and Workers in Site 1.

  7. Log back on to the Site 2 GUI.

  8. Add Site 1 as a new Secondary by following the instructions in Configuring Disaster Recovery.

Now, if you want to switch roles so Site 1 becomes Primary again, follow the instructions in Switching Primary and Secondary Roles.

Switching Primary and Secondary Roles

If you need to change your Disaster Recovery setup so that Site 2 will be Secondary, and Site 1 will be Primary, take the following steps.

  1. Log on to the current Secondary FortiSIEM node (Site 1) as root, and run the following command:

    phsecondary2primary

    When the job completes, Site 1 is the Primary.

  2. Log on to the Site 1 (Primary) UI.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the Site 2 (Secondary) FortiSIEM node listed and click Edit.

  5. Review the information to ensure that all the information is correct.

  6. When done, click Save.
    Site 1 becomes Primary and Site 2 becomes Secondary. Remember to change the DNS addresses after this role switch so that users log on to the Primary and Collectors send events to the Primary Workers.
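After changing DNS, it can be useful to confirm that the name users and Collectors connect to now resolves to the new Primary. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the host name and address in the usage note are placeholders, and the lookup relies on the standard `getent hosts` utility being available on the machine where you run the check.

```shell
#!/bin/sh
# Hypothetical post-switch DNS check.

# Succeed if the first argument equals any of the remaining arguments.
address_in_list() {
    expected="$1"
    shift
    for addr in "$@"; do
        [ "$addr" = "$expected" ] && return 0
    done
    return 1
}

# Resolve NAME and check that EXPECTED_IP is among the returned addresses.
dns_points_to() {
    name="$1"
    expected_ip="$2"
    # getent prints "ADDRESS NAME..." per line; keep only the address column.
    resolved=$(getent hosts "$name" | awk '{print $1}')
    # Left unquoted on purpose so each resolved address becomes its own argument.
    address_in_list "$expected_ip" $resolved
}
```

For example, `dns_points_to fortisiem.example.com 10.0.0.5` (both values are placeholders) returns success only if the name resolves to that address from the machine running the check. Cached DNS entries on clients may still point to the old Primary until they expire.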

Recovering from Human Error

If the phsecondary2primary command is executed by mistake, it turns the Site 2 (Secondary) node into a Primary. At this point, you have two independent Primary nodes. To recover, take the following steps:

  1. Log on to the FortiSIEM node that you want to be the current Primary.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary node.

  4. Click Edit.

  5. Review the information to ensure that all the information is correct.

  6. When done, click Save.

Turning Off Disaster Recovery

To turn off Disaster Recovery, take the following steps.

  1. Log on to the current Primary GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed and click Delete.

  4. Click Yes to confirm the operation.

Removing ClickHouse Data and Keeper Nodes

To clean up nodes that have been used as ClickHouse data nodes only, or as ClickHouse Keeper and ClickHouse data nodes, run the following script:

cleanup_replicated_tables_and_keeper.sh

The script can be found here: /opt/phoenix/phscripts/clickhouse/cleanup_replicated_tables_and_keeper.sh

Note: Because not every node acts as both a ClickHouse Keeper node and a ClickHouse data node, the script might warn that tables or the Keeper directory cannot be found. These warning messages can be ignored.
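A small wrapper can confirm that the cleanup script exists on a node before you run it. This sketch is illustrative only: the function name `run_cleanup_script` and its optional path argument are hypothetical, the default path is the one given above, and invoking the script through `sh` with no arguments is an assumption, since this guide does not state how the script should be called.

```shell
#!/bin/sh
# Hypothetical wrapper around the ClickHouse cleanup script.

run_cleanup_script() {
    # Default to the location given in this guide; the argument exists for testing.
    script="${1:-/opt/phoenix/phscripts/clickhouse/cleanup_replicated_tables_and_keeper.sh}"

    if [ ! -f "$script" ]; then
        echo "cleanup script not found: $script" >&2
        return 1
    fi

    # Assumption: the script runs without arguments.
    sh "$script"
}
```

Run `run_cleanup_script` on each node to be cleaned. Warnings about missing tables or a missing Keeper directory are expected on nodes that had only one of the two roles.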
