High Availability and Disaster Recovery Procedures - ClickHouse

Disaster Recovery Operations

Primary (Site 1) Down, Secondary (Site 2) becomes Primary

Primary (Site 1) has failed and is unavailable. The current Secondary (Site 2) needs to become Primary. The Collectors will start buffering events because the Site 1 Workers are down.

Follow the steps below to promote Site 2 to Primary.

Step 1. Make Secondary the New Primary

Log on to Site 2 (Secondary) as root and run the following command:

phsecondary2primary

Site 2 (Secondary) becomes the new Primary, and you can log on to Site 2 to continue your work.

Note: The process should take approximately 10 minutes to complete. Both FortiSIEM nodes become independent after running the command.

Step 2. Restart phMonitor on All Secondary Workers

Restart the phMonitor process on all Site 2 (Secondary) Workers.
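The restart can be scripted across the Workers over SSH. The sketch below only previews the commands: the Worker IPs are placeholders, and phMonitor is assumed to be managed as a systemd unit, so use the restart method documented for your FortiSIEM version.

```shell
# Sketch only: preview the restart command for each Secondary Worker.
# The IPs below are placeholders for your Site 2 Workers.
for ip in 172.30.58.96 172.30.58.98; do
  cmd="ssh root@$ip systemctl restart phMonitor"
  echo "$cmd"   # preview only; run the ssh command for real on your deployment
done
```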

Step 3. Remove Failed ClickHouse Keeper Nodes in Site 1

Since Site 1 is down, more than one ClickHouse Keeper node is down, and these nodes need to be removed from the ClickHouse configuration using the steps below.

Step A. Delete the failed nodes from ClickHouse Keeper config file.

On all surviving ClickHouse Keeper nodes, modify the Keeper config file located at /data-clickhouse-hot-1/clickhouse-keeper/conf/keeper.xml:

<yandex>
    <keeper_server>
        <raft_configuration>
            <server>
                -- server 1 info --
            </server>
            <server>
                -- server 2 info --
            </server>
        </raft_configuration>
    </keeper_server>
</yandex>

Remove any <server></server> block corresponding to the nodes that are down.
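If you prefer to script the edit, GNU sed can delete a whole <server>...</server> block. The following is a self-contained sketch against a sample file in /tmp: the block contents and the failed-node IP 172.30.58.97 are purely illustrative, and on a real node you would back up and edit /data-clickhouse-hot-1/clickhouse-keeper/conf/keeper.xml instead.

```shell
# Sample file standing in for /data-clickhouse-hot-1/clickhouse-keeper/conf/keeper.xml.
cat > /tmp/keeper.xml <<'EOF'
<yandex>
  <keeper_server>
    <raft_configuration>
      <server>
        <id>1</id><hostname>172.30.58.96</hostname><port>9234</port>
      </server>
      <server>
        <id>2</id><hostname>172.30.58.97</hostname><port>9234</port>
      </server>
    </raft_configuration>
  </keeper_server>
</yandex>
EOF
# Gather each <server>...</server> block; delete it if it names the failed node.
sed -i -e '/<server>/{' -e ':a' -e 'N' -e '/<\/server>/!ba' \
    -e '/172\.30\.58\.97/d' -e '}' /tmp/keeper.xml
grep -c '<server>' /tmp/keeper.xml   # prints 1: only the surviving block remains
```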

Step B. Prepare systemd argument for recovery

On all surviving ClickHouse Keeper nodes, make sure the /data-clickhouse-hot-1/clickhouse-keeper/.systemd_argconf file contains a single line, as follows:

ARG1=--force-recovery
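Writing and verifying the file can be done as sketched below. The /tmp path is used only so the sketch is self-contained; on a real node the target is /data-clickhouse-hot-1/clickhouse-keeper/.systemd_argconf.

```shell
# Write the single-line argument file (illustrative /tmp path; see note above).
conf=/tmp/.systemd_argconf
echo 'ARG1=--force-recovery' > "$conf"
# Verify: exactly one line, with the expected content.
wc -l < "$conf"
cat "$conf"
```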

Step C. Restart ClickHouse Keeper

On all surviving ClickHouse Keeper nodes, run the following command.

systemctl restart ClickHouseKeeper

Verify that ClickHouse Keeper restarted with the --force-recovery option by running the following command.

systemctl status ClickHouseKeeper

The output of the status command shows whether --force-recovery is being applied on the "Process:" line, for example:

Process: 487110 ExecStart=/usr/bin/clickhouse-keeper --daemon --force-recovery --config=/data-clickhouse-hot-1/clickhouse-keeper/conf/keeper.xml (code=exited, status=0/SUCCESS)
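This check can be scripted. The sketch below matches against the sample "Process:" line shown above, which stands in for live `systemctl status ClickHouseKeeper` output on a real node.

```shell
# Sample "Process:" line captured from `systemctl status ClickHouseKeeper`.
status_line='Process: 487110 ExecStart=/usr/bin/clickhouse-keeper --daemon --force-recovery --config=/data-clickhouse-hot-1/clickhouse-keeper/conf/keeper.xml (code=exited, status=0/SUCCESS)'
# Flag whether the --force-recovery option was applied on this restart.
case "$status_line" in
  *--force-recovery*) result="force-recovery active" ;;
  *)                  result="WARNING: force-recovery missing" ;;
esac
echo "$result"   # force-recovery active
```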

Step D. Check to make sure ClickHouse Server(s) are up

After Step C is done on all surviving ClickHouse Keeper nodes, wait a couple of minutes for the synchronization between ClickHouse Keeper and ClickHouse Server to finish successfully. Use the four-letter commands to check the status of the ClickHouse Keepers.

On each surviving ClickHouse Keeper node, issue the following command.

echo stat | nc localhost 2181

If the "clickhouse-client" shell is available, then the ClickHouse Server is up and fully synced with the ClickHouse Keeper cluster.

Additional four-letter commands such as mntr (monitor) and srvr (server) provide overlapping and complementary information.
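For scripted health checks, the mntr output (tab-separated key/value pairs) can be parsed directly. The sketch below uses a small captured sample instead of a live `echo mntr | nc localhost 2181` query, so the key values shown are illustrative.

```shell
# Build a sample of `echo mntr | nc localhost 2181` output (tab-separated pairs).
mntr_out="$(printf 'zk_server_state\tleader\nzk_synced_followers\t1\n')"
# Extract this node's role in the Keeper cluster (leader or follower).
state=$(printf '%s\n' "$mntr_out" | awk -F'\t' '$1=="zk_server_state"{print $2}')
echo "$state"   # leader
```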

Step E. Update Redis key for ClickHouse Keepers

On the Supervisor Leader's Redis, set the following key to the comma-separated list of the surviving ClickHouse Keeper nodes' IP addresses. For example, if 172.30.58.96 and 172.30.58.98 are the surviving ClickHouse Keeper nodes, the command is:

set cache:ClickHouse:clickhouseKeeperNodes "172.30.58.96,172.30.58.98"

You can use the following command to check the value before and after setting it:

get cache:ClickHouse:clickhouseKeeperNodes
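The comma-separated value can be assembled from a list of surviving IPs before issuing the Redis commands. The sketch below only builds and prints redis-cli one-shot invocations (the bare set/get forms above run inside an interactive redis-cli session); the IPs are the example survivors from the text, and no live Redis is assumed.

```shell
# Surviving Keeper node IPs (example values from the text above).
survivors="172.30.58.96 172.30.58.98"
# Join the space-separated list into the comma-separated Redis value.
value=$(echo $survivors | tr ' ' ',')
# Print the redis-cli commands to run on the Supervisor Leader.
echo "redis-cli get cache:ClickHouse:clickhouseKeeperNodes"
echo "redis-cli set cache:ClickHouse:clickhouseKeeperNodes \"$value\""
```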

Step F. Update ClickHouse Config via GUI.

After all the changes from Steps A-E above have been successfully executed, you can delete all the down ClickHouse Keeper nodes from the ADMIN > Settings > Database > ClickHouse Config page.

Site 1 Up and Supervisor / Worker Recovered

Site 1 has recovered after failure, meaning that the Supervisor and Workers are up. You first need to make Site 1 the Secondary and then (optionally) switch roles if you want Site 1 to become Primary again.

Follow the instructions below to make Site 1 the Secondary.

  1. Logon to the current Primary FortiSIEM node (this should be Site 2) using the GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed (this should be Site 1). It should appear as Inactive under the Replication Status column.

  4. Click Edit.

  5. Review the information to ensure that all the information is correct.
    Note: The information is read only.

  6. When done, click Save. The Original Primary (Site 1) now becomes the Secondary in your Disaster Recovery configuration. The Replication Status changes from Inactive to Active.

  7. Now, if you want to switch roles so Site 1 becomes Primary again, follow the instructions in Switching Primary and Secondary Roles.

  8. Add back the 2 recovered Site 1 ClickHouse Keeper nodes:

    1. Clean up the Keeper registry by running the following command.

      rm -rf /data-clickhouse-hot-1/clickhouse-keeper

    2. Go to ADMIN > Settings > Database > ClickHouse Config and add the 2 workers to the ClickHouse Keeper cluster.

Site 1 Up but Supervisor / Worker Cannot be Recovered

In this case, Site 1 is up, but the Supervisor and Workers cannot be recovered after failure. You need to first reinstall the Supervisor and Workers on Site 1 and then make Site 1 the Secondary.

  1. Logon to the current Primary FortiSIEM node (Site 2) using the GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed. It should appear as Inactive under the Replication Status column.

  4. Click Delete to remove it from the Disaster Recovery configuration.

  5. Log out from Site 2.

  6. Re-install Supervisor and Workers in Site 1.

  7. Log back onto Site 2 GUI.

  8. Add Site 1 as a new Secondary by following the instructions in Configuring Disaster Recovery.

Now, if you want to switch roles so Site 1 becomes Primary again, follow the instructions in Switching Primary and Secondary Roles.

Switching Primary and Secondary Roles

If you need to change your Disaster Recovery setup so that Site 2 will be Secondary, and Site 1 will be Primary, take the following steps.

  1. Logon to the current Secondary FortiSIEM node (Site 1) as root, and run the following command:

    phsecondary2primary

    When the job is completed, Site 1 is now the Primary.

  2. Logon to the Site 1 (Primary) UI.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the Site 2 (Secondary) FortiSIEM node listed and click Edit.

  5. Review the information to ensure that all the information is correct.

  6. When done, click Save.
    Site 1 will become Primary and Site 2 will be Secondary. Remember to change the DNS addresses after this role switch so that users are logging on to the Primary and Collectors are sending to Primary Workers.

Recovering from Human Error

If the phsecondary2primary command is executed by mistake, it turns the Site 2 (Secondary) node into a Primary. At this point, you have two independent Primary nodes. To recover, take the following steps:

  1. Logon to the FortiSIEM you wish to be the current Primary.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary node.

  4. Click Edit.

  5. Review the information to ensure that all the information is correct.

  6. When done, click Save.

Turning Off Disaster Recovery

To turn off Disaster Recovery, take the following steps.

  1. Logon to the current Primary GUI.

  2. Navigate to ADMIN > License > Nodes.

  3. Select the Secondary FortiSIEM node listed and click Delete.

  4. Click Yes to confirm the operation.

Removing ClickHouse Data and Keeper Nodes

To clean up nodes that have been used as ClickHouse data nodes only, or as ClickHouse Keeper and ClickHouse data nodes, run the following script:

cleanup_replicated_tables_and_keeper.sh

The script can be found here: /opt/phoenix/phscripts/clickhouse/cleanup_replicated_tables_and_keeper.sh

Note: Since not every node acts as both a ClickHouse Keeper node and a ClickHouse data node, the script might warn that tables or the keeper directory are not found. These warnings can be ignored.
