
Administration Guide

High Availability support in FortiSOAR


High Availability (HA) can be achieved using the following methods:

  • Nightly database backups and incremental VM snapshots: FortiSOAR provides backup scripts that are scheduled to run at pre-defined intervals and take a full database backup on a shared or backed-up drive. The full backups have to be supplemented with incremental Virtual Machine (VM) snapshots whenever there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc. For more information, see the Backing up and Restoring FortiSOAR chapter.
  • HA provided by the underlying virtualization platform: Your Virtualization platform also provides HA, such as VMware HA and AWS EBS snapshots. This method relies on your expertise and infrastructure.
  • Externalized Database: This method allows you to externalize your PostgreSQL database and uses your own database's HA solution. VM snapshots have to be taken when there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc.
    For more information on externalizing PostgreSQL database, see the Externalization of your FortiSOAR PostgreSQL database chapter.
  • HA clusters: FortiSOAR provides a clustering solution with more than one FortiSOAR node joined to form an HA cluster. When you deploy a FortiSOAR instance, the FortiSOAR Configuration Wizard configures the instance as a single-node cluster, and it is created as an active primary node. You can join more nodes to this node to form a multi-node cluster. This method is explained in detail in this chapter.

FortiSOAR implements HA Clustering with the use of PostgreSQL database clustering. It supports Active/Active and Active/Passive configurations with both internal and external PostgreSQL databases. HA clusters can be used to fulfill the following two use cases: Disaster Recovery (DR) and Scaling. For DR you can configure an Active/Passive cluster that has the passive node located in a remote datacenter. For scaling workflow execution across multiple nodes, you can use co-located Active/Active cluster nodes.

High Availability Types supported with FortiSOAR

You can configure FortiSOAR with either an externalized PostgreSQL database or an internal PostgreSQL database. For both cases you can configure Active-Active or Active-Passive high availability clusters.

High Availability with an internal PostgreSQL database

FortiSOAR HA/DR is based on internal clustering that takes care of replicating data (PostgreSQL) to all cluster nodes, and provides an administration CLI (csadm) to manage the cluster and perform the "Takeover" operation, when necessary. FortiSOAR uses PostgreSQL streaming replication, which is asynchronous in nature. For more information, see PostgreSQL: Documentation.

You can configure FortiSOAR for high availability (HA) with an internal PostgreSQL database in the following two ways:

  • In an Active-Active HA cluster configuration, at least two nodes are actively running the same kind of service simultaneously. The main aim of the active-active cluster is to achieve load balancing and horizontal scaling, while data is being replicated asynchronously. You should front multiple active nodes with a proxy or a load balancer to effectively direct requests to all nodes.
    FortiSOAR™ with an internal database and an Active/Active configuration
  • In an Active-Passive HA cluster configuration, one or more passive or standby nodes are available to take over if the primary node fails. Processing is done only by the primary node. However, when the primary node fails, then a standby node can be promoted as the primary node. In this configuration, you can have one active node and one or more passive nodes configured in a cluster, which provides redundancy, while data is being replicated asynchronously.
    FortiSOAR™ with an internal database and an Active/Passive configuration

High Availability with an externalized PostgreSQL database

With an externalized database, you use your own database's HA solution. FortiSOAR ensures that file system changes on any cluster node, arising from connector installation or uninstallation or from changes to module definitions, are synced across every node so that a secondary or passive node can take over in the least time if the primary node fails.

FortiSOAR™ with an external database and an Active/Active configuration

Cluster Licensing

FortiSOAR version 6.4.4 and later does not require the 'Additional Users' entitlement to be the same across all cluster nodes, i.e., you do not need to buy additional user licenses for clustered nodes. The user count entitlement is now always validated from the primary node. The secondary nodes can have the basic two-user entitlement.

The HA cluster shares the user count details from the primary node of the cluster. Hence, all 'Concurrent Users' count restrictions apply as per the primary node. If a node leaves the cluster, the restriction applies as per its own original license. For more information about FortiSOAR licensing, see the Licensing FortiSOAR chapter in the "Deployment Guide."

Viewing and updating the license of an HA cluster

If your FortiSOAR instance is part of a High Availability cluster and you have added secondary node(s), the License Manager page also displays information about the nodes in the cluster, as shown in the following image:
License Manager page when your FortiSOAR™ instance is part of a High Availability cluster
As shown in the above image, the primary node is Node 2, which is licensed for 7 users; therefore, the Allowed User Seats count displays 7 users.
To update the license for each node, click Update License and upload the license for that node.

Note

If you upload a license that does not match the system UUID, then you will get a warning on the UI while updating the license. If you deploy the same license in more than one environment, the license is detected as a duplicate and you must correct the license; otherwise, your FortiSOAR UI will be blocked in 2 hours.

If a license on one node of an HA cluster expires, you will not be able to access any nodes of that HA cluster. All nodes in that HA cluster will display the same FortiSOAR UI page, asking you to deploy a new valid license for the expired nodes:
FortiSOAR UI displaying nodes in an HA cluster that have an expired license

Prerequisites to configuring High Availability

  • Your FortiSOAR instance must be version 5.0.0 or later, either as a fresh install of 5.0.0 or later or upgraded to 5.0.0 or later.
  • All nodes of a cluster should be DNS resolvable from each other.
  • Ensure that the SSH session does not time out by entering screen mode. For more information, see the Handle session timeouts while running the FortiSOAR upgrade article present in the Fortinet Knowledge Base.
  • If you have a security group (AWS) or an external firewall between the HA nodes, then you must open the following ports between the HA nodes on AWS or the external firewall (a sample firewall-cmd sketch follows this list):
    For PostgreSQL: 5432, for MQ TCP traffic: 5671, and for ElasticSearch: 9200
  • Fronting and accessing the FortiSOAR HA Cluster with a Load Balancer such as HAProxy, Gobetween, or a Reverse Proxy is recommended so that the address remains unchanged on takeover.
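
If the cluster nodes themselves run firewalld, the following is a minimal sketch of opening the ports listed above toward the other HA nodes; the public zone is an assumption, and you can additionally restrict the rules to the peer node addresses:

    # Open the HA ports (PostgreSQL, RabbitMQ, ElasticSearch) in the assumed public zone
    sudo firewall-cmd --zone=public --add-port=5432/tcp --permanent
    sudo firewall-cmd --zone=public --add-port=5671/tcp --permanent
    sudo firewall-cmd --zone=public --add-port=9200/tcp --permanent
    # Reload the firewall to apply the changes
    sudo firewall-cmd --reload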

Process for configuring High Availability

Steps to configure FortiSOAR HA cluster with an internal PostgreSQL database

If you are configuring HA with an internal PostgreSQL database, ensure that you have met all the Prerequisites criteria (see the Prerequisites to configuring High Availability section) and then perform the following steps:

Important: You must join nodes to an HA cluster sequentially.

  1. Use the FortiSOAR Admin CLI (csadm) to configure HA for your FortiSOAR instances. For more information, see the FortiSOAR Admin CLI chapter. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA:
    FortiSOAR™ Admin CLI - csadm ha command output
  2. To configure a node as a secondary node, ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
    # csadm ha join-cluster --status <active, passive> --role secondary --primary-node <DNS_Resolvable_Primary_Node_Name>
    Once you enter this command, you will be prompted to enter the SSH password to access your primary node.
    In case of a cloud environment, where authentication is key-based, you must run the following command:
    # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --primary-node <DNS_Resolvable_Primary_Node_Name> --primary-node-ssh-key <Path_To_Pem_File>
    This will add the node as a secondary node in the cluster.
    Note: When you join a node to an HA cluster, the list-nodes command does not display that a node is in the process of joining the cluster. The newly added node will be displayed in the list-nodes command only after it has been added to the HA cluster.
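    For example, a sketch of joining a node as an active secondary to a primary node with the hypothetical DNS name fsr-primary.example.com:
    csadm ha join-cluster --status active --role secondary --primary-node fsr-primary.example.com
    If authentication is key-based, the same hypothetical example with a key file path of your own would be:
    csadm ha join-cluster --status active --role secondary --primary-node fsr-primary.example.com --primary-node-ssh-key /home/csadmin/fsr-primary.pem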
Note

If you have upgraded FortiSOAR and are joining a freshly provisioned node, using the join-cluster operation, to a cluster that has some connectors installed, then you must manually reinstall the connectors that were present on the existing nodes onto the new node.

Also, note that if you have built your own custom connector, then you must upload the .tgz file of the connector on all the nodes within the HA cluster.
When you are uploading the .tgz file on all the nodes, you must ensure that you select the Delete all existing versions checkbox. You must also ensure that you have uploaded the same version of the connector to all the nodes.

Steps to configure FortiSOAR HA cluster with an external PostgreSQL database

If you are configuring HA with an external PostgreSQL database, perform the following steps:

  1. Externalize the PostgreSQL database for the primary node of your HA configuration. For the procedure for externalizing PostgreSQL databases, see the Externalization of your FortiSOAR PostgreSQL database chapter.
  2. Add the hostnames of the secondary nodes to the allowlist in the external database.
  3. Add the hostnames of the secondary nodes to the pg_hba.conf (/var/lib/pgsql/12/data/pg_hba.conf) file in the external database (a sample entry sketch follows this list). This ensures that the external database trusts the FortiSOAR server for incoming connections.
  4. Ensure that you have met all the Prerequisites criteria (see the Prerequisites to configuring High Availability section).
  5. Create the HA cluster by following the steps mentioned in the Steps to configure FortiSOAR HA cluster with an internal PostgreSQL database section.
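
For illustration, a minimal sketch of pg_hba.conf entries for two hypothetical secondary nodes; the database name (das), user (cyberpgsql), and md5 authentication method shown here are assumptions that must match your externalized database configuration:

    # TYPE  DATABASE  USER        ADDRESS                      METHOD
    host    das       cyberpgsql  fsr-secondary1.example.com   md5
    host    das       cyberpgsql  fsr-secondary2.example.com   md5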

Takeover

Use the csadm ha takeover command to perform a takeover when your active primary node is down. Run this command on the secondary node that you want to configure as your active primary node.

If during takeover you specify no to the Do you want to invoke ‘join-cluster’ on other cluster nodes? prompt, or if any node(s) is not reachable, then you will have to reconfigure all the nodes (or the node(s) that were not reachable) in the cluster to point to the new active primary node using the csadm ha join-cluster command.
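
As an illustration, assuming the failed primary was fsr-node-a.example.com and you are promoting the secondary fsr-node-b.example.com (both hypothetical hostnames), the sequence is roughly:

    # On fsr-node-b.example.com, the node being promoted:
    csadm ha takeover
    # On any remaining cluster node that was not reconfigured automatically:
    csadm ha join-cluster --status active --role secondary --primary-node fsr-node-b.example.com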

During the takeover operation, if the secondary node's license user entitlement is less than that of the primary node, then the licenses get swapped between the new primary node (node B) and the old primary node (node A). To prevent any undesirable node lockouts, FortiSOAR checks the user count entitlement of both licenses before exchanging the licenses between Node A and Node B. If Node B already has a higher user entitlement, then the licenses are not swapped. Therefore, if the cluster nodes have matching user entitlements, no duplicate license violation occurs once Node A comes back online.

The swapping of licenses during takeover leads to the following scenarios:

  • If Node A is alive at the time of the takeover operation, then whether Node A joins back the HA cluster or not, it synchronizes to the Fortinet Licensing Portal with the license previously associated with Node B.
  • If Node A is not alive at the time of the takeover operation, then it synchronizes with FDN with its old license, which is also being used by Node B; this might cause a node lockout if it is not corrected manually, by deploying the old Node B license onto Node A, within the grace window of two hours. Note that FortiSOAR allows a grace period of two hours even when FDN reports a duplicate license.
Note

After you have performed a takeover and configured a secondary node as the active primary node, you will observe that the log forwarder configurations are not present on the new primary node. This is because Syslog settings are not replicated to the passive node, since the passive node could be in a remote datacenter with network latencies between datacenters. Also, the same Syslog server might not be the ideal choice for log forwarding from the DR node. If you want to forward logs from the passive node, you must enable it manually using the csadm log forward command. For more information, see the FortiSOAR Admin CLI chapter.

Usage of the csadm ha command

Certain operations, such as takeover, join-cluster, etc., might take a longer period of time to run; therefore, you must ensure that your SSH session does not time out by entering screen mode. For more information, see the Handle session timeouts while running the FortiSOAR upgrade article present in the Fortinet Knowledge Base.

You can get help for the csadm ha command and subcommands using the --help parameter.

Note

It is recommended that you perform operations such as join-cluster, leave-cluster, etc., sequentially. For example, when you are adding nodes to a cluster, it is recommended that you add the nodes in a sequence, i.e., one after the other rather than adding them in parallel.

The following table lists all the subcommands that you can use with the csadm ha command:

Subcommand Brief Description
list-nodes Lists all the nodes that are available in the cluster with their respective node names and ID, status, role, and a comment that contains information about which nodes have joined the specific HA cluster and the primary server.
List Nodes command output
You can filter nodes for specific status, role, etc.
For example, if you want to retrieve only those nodes that are active use the following command: csadm ha list-nodes --active, or if you want to retrieve secondary active nodes, then use the following command: csadm ha list-nodes --active --secondary.
Note: The list-nodes command will not display a node that is in the process of joining the cluster, i.e., it will display the newly added node only after it has been added to the HA cluster.
export-conf Exports the configuration details of the active primary node to a configuration file named ha.conf. For more details on export-conf, see the Process for configuring HA section.
allowlist Adds the hostnames of the secondary nodes in the HA cluster to the allowlist on the active primary node. For more details on allowlist, see the Process for configuring HA section.
Important: Ensure that incoming TCP traffic from the IP address(es) [xxx.xxx.xx.xxx] of your FortiSOAR instance(s) on port(s) 5432, 9200, and 5671 is not blocked by your organization's firewall.
join-cluster Adds a node to the cluster with the role and status you have specified. For more details on join-cluster, see the Process for configuring HA section.
get-replication-stat Displays the replication statistics, i.e., the replication lag and status between cluster nodes.
In the case of secondary nodes, information about total lag and time elapsed from last sync is displayed. In the case of the primary node, information about sending lag, receiving lag, relaying lag, and total lag is displayed.
You can use the following option with this subcommand:
--verbose: Displays detailed replication statistics.
In the case of secondary nodes, information about receive lsn, replay lsn, total lag, and time elapsed from last sync is displayed.
Do not use the --verbose option on the primary node since no additional details are displayed for the primary node.
Note: If you have configured FortiSOAR with an externalized PostgreSQL database, then replication statistics will not be displayed for the cluster nodes.
show-health Displays the health information for the current node.
You can use the following option with this subcommand:
--all nodes: Displays the health information for all the nodes in an HA cluster. This information is also available for a single node, and can be used to setup monitoring and sending health statistics of a FortiSOAR instance to external monitoring applications.
--json: Displays the health information in the JSON format.
firedrill Tests your disaster recovery configuration.
You can perform a firedrill on a secondary (active or passive) node only. Running the firedrill suspends the replication to the node's database and sets it up as a standalone node pointing to its local database. Since the firedrill is primarily performed to ensure that the database replication is set up correctly, it is not applicable when the database is externalized.
Once you have completed the firedrill, ensure that you perform a restore to get the node back into replication mode (a usage sketch follows this table).
Licenses on a firedrilled node:
- If the node license had a user license entitlement matching the primary node user entitlement, all users can login to the firedrilled node.
- If the node license had a basic user entitlement and the HA cluster had more active users, then only the csadmin user can login to the UI of the firedrilled node. The csadmin user can then activate two users who need to test the firedrill and make the rest of the users inactive.
Note: This does not cause any impact to the primary node or other nodes in the HA cluster. Post-restore, the firedrilled node rejoins the cluster and the maximum number of active users as per the entitlement is honored.
Schedules on a firedrilled node:
The node on which a firedrill is being performed has its schedules and playbooks stopped, i.e., celerybeatd is disabled on this node. This is done intentionally because configured schedules or playbooks should not run when the node is in the firedrill mode.
restore Restores the node back to its original state in the cluster after you have performed a firedrill. That is, csadm ha restore restores the node that was converted to the active primary node after the firedrill back to its original state of a secondary node.
The restore command discards all activity, such as record creation, done during the firedrill, since that data is assumed to be test data. This command restores the database from the content backed up prior to the firedrill.
takeover Performs a takeover when your active primary node is down. Therefore, you must run the csadm ha takeover command on the secondary node that you want to configure as your active primary node.
leave-cluster Removes a node from the cluster and the node goes back to the state it was in before joining the cluster.
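
For example, a typical disaster recovery drill sketched with the subcommands above would be:

    # On the secondary (active or passive) node being tested:
    csadm ha firedrill
    # Validate the UI and data on the firedrilled node, then discard the test data and resume replication:
    csadm ha restore
    # Confirm that replication has resumed; run on the primary node:
    csadm ha get-replication-stat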

Overview of nodes in a FortiSOAR HA cluster

  • A FortiSOAR HA cluster can have only one active primary node, all the other nodes are either active secondary nodes or passive nodes. Uniqueness of the primary node is due to the following:
    • In case of an internal database, all active nodes talk to the database of the primary node for all reads/writes. The database of all other nodes is in the read-only mode and setup for replication from the primary node.
    • Although the queued workflows are distributed amongst all active nodes, the Workflow scheduler runs only on the primary node.
    • All active nodes index the data for quick search into ElasticSearch at the primary node.
    • All integrations or connectors that have a listener configured for notifications, such as IMAP, Exchange, Syslog, etc., run the listeners only on the primary node.
      Therefore, if the primary node goes down, one of the other nodes in the cluster must be promoted as the new primary node and the other nodes should rejoin the cluster connecting to the new primary.
  • Active secondary nodes connect to the database of the active primary node and serve FortiSOAR requests. However, passive nodes are used only for disaster recovery and they do not serve any FortiSOAR requests.

Checking replication between nodes in an active-passive configuration

When using an active-passive configuration with internal databases, ensure that replication between the nodes is working correctly using the following steps:

  • Perform the firedrill operation at regular intervals to ensure that the passive node can take over successfully, when required.
  • Schedule full nightly backups at the active primary node using the FortiSOAR backup and restore scripts. For more information on backup and restore, see the Backing up and Restoring FortiSOAR chapter.

Upgrading an HA cluster

For the procedure on how to upgrade a FortiSOAR High Availability Cluster to 6.4.4, see the Upgrading a FortiSOAR High Availability Cluster to 6.4.4 section in the "Upgrade Guide."

Load Balancer

The clustered instances should be fronted by a TCP Load Balancer such as HAProxy or by gobetween, and clients should connect to the cluster using the address of the proxy.

Setting up HAProxy as a TCP load balancer fronting the two clustered nodes

The following steps describe how to install and configure "HAProxy" as a load balancer on a CentOS Virtual Machine:

  1. # yum install haproxy
  2. In the /etc/haproxy/haproxy.cfg file, add the policy as shown in the following image (a sample configuration sketch also follows these steps):
    Load balancer policy in the haproxy configuration file
  3. To reload the firewall, run the following commands:
    $ sudo firewall-cmd --zone=public --add-port=<portspecifiedwhilebindingHAProxy>/tcp --permanent
    $ sudo firewall-cmd --reload
  4. Restart haproxy using the following command:
    # systemctl restart haproxy
  5. Use the bind address (instead of the IP address of the node in the cluster) for accessing the FortiSOAR UI.
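
Because the policy above is shown only as an image, the following is a minimal TCP pass-through sketch for the /etc/haproxy/haproxy.cfg file, assuming two hypothetical FortiSOAR nodes and HAProxy bound to port 443; adjust the bind port, node names, and health checks to your deployment:

    frontend fortisoar_frontend
        bind *:443
        mode tcp
        option tcplog
        default_backend fortisoar_nodes

    backend fortisoar_nodes
        mode tcp
        balance roundrobin
        server fsr-node1 fsr-node1.example.com:443 check
        server fsr-node2 fsr-node2.example.com:443 check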

Using Gobetween load balancer

Gobetween is a minimalistic yet powerful high-performance L4 TCP, TLS, and UDP based load balancer.

It works on multiple platforms like Windows, Linux, Docker, Darwin, etc., and you can build your own load balancer from source code. Balancing is done based on the following algorithms that you can choose in the configuration:

  • IP hash

  • World famous - Round Robin

  • Least bandwidth

  • Least connection

  • Weight

Configuring Gobetween for a FortiSOAR A-A HA Cluster

Installation:

Gobetween can be installed either on the Linux platform or on Windows. For details on installing gobetween, see the 'Installation' section of the gobetween documentation.

Configuration:

Edit the gobetween.toml configuration file and then restart the gobetween service for the changes to take effect. A sample configuration follows:

The configuration has three sections:

  • The first one describes the protocol to be used and defines the port to which the load balancer will be bound:
    [servers.fsr]
    protocol = "tcp"
    bind = "0.0.0.0:3000"

  • The second describes how the FortiSOAR nodes are discovered:
    [servers.fsr.discovery]
    kind = "static"
    static_list = [
    "qa-env5.fortisoar.in:443 weight=25 priority=1",
    "qa-env7.fortisoar.in:443 weight=25 priority=1",
    "qa-env9.fortisoar.in:443 weight=25 priority=1",
    "qa-env10.fortisoar.in:443 weight=25 priority=1"
    ]

    In the node discovery section, you need to add FortiSOAR nodes and provide their weight and priority to determine how requests to the load balancer will be addressed.

  • The last one checks the ‘health’ status of each node:
    [servers.fsr.healthcheck]
    fails = 1
    passes = 1
    interval = "2s"
    timeout="1s"
    kind = "ping"
    ping_timeout_duration = "500ms"

For more details about configuration, see the gobetween documentation.

Configuring Gobetween for an MQ Cluster

The initial procedure for setting up a RabbitMQ cluster, such as setting up the hosts file, installing the RabbitMQ server, etc., should already have been completed. For more information, see the How to Set up RabbitMQ Cluster on CentOS 7 article. Once the initial setup is completed, do the following:

  1. Set up the RabbitMQ cluster: To set up the RabbitMQ cluster, ensure that the .erlang.cookie file is the same on all nodes. To achieve this, copy the '.erlang.cookie' file from the /var/lib/rabbitmq directory of the primary node to the other nodes. For our example, let us assume the primary node is 'node1' and the secondary nodes are 'node2' and 'node3'. To copy the '.erlang.cookie' file, use the scp command from the primary node ('node1'). For example:
    scp /var/lib/rabbitmq/.erlang.cookie root@node2:/var/lib/rabbitmq/
    scp /var/lib/rabbitmq/.erlang.cookie root@node3:/var/lib/rabbitmq/
    Ensure that there are no errors on both servers, then join node2 and node3 to node1, using the join-cluster command, to create a RabbitMQ cluster. For more information, see the Process for configuring High Availability section.
  2. Configure RabbitMQ Queue Mirroring: You must configure an 'ha policy' on the cluster for queue mirroring and replication to all cluster nodes. If the node that hosts the queue master fails, the oldest mirror is promoted to be the new master as long as it is synchronized, depending on the 'ha-mode' and 'ha-params' policies.
    Following are some examples of the RabbitMQ ha policies:
    Set up an ha policy named 'ha-all' so that all queues on the RabbitMQ cluster are mirrored to all nodes in the cluster:
    sudo rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
    Set up an ha policy named 'ha-nodes' so that all queues whose names start with 'nodes' are mirrored to the two specific nodes 'node02' and 'node03' in the cluster:
    sudo rabbitmqctl set_policy ha-nodes "^nodes\." \
    '{"ha-mode":"nodes","ha-params":["rabbit@node02", "rabbit@node03"]}'
    You can check all the available policies using the following command:
    sudo rabbitmqctl list_policies;
    If you want to remove a specific policy, use the following command:
    sudo rabbitmqctl clear_policy <name_of_policy>
  3. Ensure that the SSL certificates that you specify while configuring the secure message exchange are the same on all the nodes, and that they contain the secure message exchange's CN name or are wildcard certificates.
    For information on adding a secure message exchange, see the Deploying FortiSOAR chapter in the "Deployment Guide." When you are adding or configuring the secure message exchange, in the Add New Secure Message Exchange dialog ensure the following:
    • In the TCP Port field, ensure that you enter the same TCP port that you have specified while configuring the secure message exchange. Also, ensure that the FortiSOAR node has outbound connectivity to the secure message exchange at this port.
    • In the Certificate field, you must copy-paste the certificate text of the Certificate Authority (CA) that has signed the secure message exchange certificate in the pem format. If it is a chain, then the complete chain must be provided. By default, the CA certificate for the FortiSOAR self-signed certificate is present at the following location: /opt/cyops/configs/rabbitmq/ssl/cyopsca/cacert.pem
    • Enter the required details in the other fields and save the secure message exchange configuration.
  4. Edit the gobetween.toml configuration file on each of the nodes in the MQ cluster and then restart the gobetween service for the changes to take effect. A sample configuration follows:
    The configuration has three sections:
    • The first one describes the protocol to be used and defines the ports to which the load balancer will be bound on various nodes of the MQ cluster. Ensure that you enter the same TCP port that you have specified while configuring the secure message exchange and added in the Add New Secure Message Exchange dialog.
      For example, on node 1 it could be:
      [servers.routerapi]
      protocol = "tcp"
      bind = "0.0.0.0:3000"

      For example, on node 2 it could be:
      [servers.routertcp]
      protocol = "tcp"
      bind = "0.0.0.0:3000"

    • The second describes how the MQ cluster nodes are discovered:
      For example, on node 1 it could be:
      [servers.routerapi.discovery]
      kind = "static"
      static_list = [
      "router-node1.fortisoar.in:15671 weight=25 priority=1",
      "router-node2.fortisoar.in:54549 weight=25 priority=1",
      "router-node3.fortisoar.in:54549 weight=25 priority=2"
      ]

      For example, on node 2 it could be:
      [servers.routertcp.discovery]
      kind = "static"
      static_list = [
      "router-node1.fortisoar.in:5671 weight=25 priority=1",
      "router-node2.fortisoar.in:54558 weight=25 priority=1",
      "router-node3.fortisoar.in:54559 weight=25 priority=1"
      ]

      In the node discovery section, you need to add the secure message exchange for the nodes and provide their weight and priority to determine how requests to the load balancer will be addressed.

    • The last one checks the ‘health’ status of the MQ cluster:
      For example, on node 1 it could be:
      [servers.routerapi.healthcheck]
      fails = 1
      passes = 1
      interval = "2s"
      timeout="1s"
      kind = "ping"
      ping_timeout_duration = "500ms"

      For example, on node 2 it could be:
      [servers.routertcp.healthcheck]
      fails = 1
      passes = 1
      interval = "2s"
      timeout="1s"
      kind = "ping"
      ping_timeout_duration = "500ms"

  5. Test your RabbitMQ cluster by opening your web browser and typing the IP address of a node, for example, node 1, whose port is set as '5671'.
    http://<node1IP>:5671/
    Type in the username and password you have configured. If everything is set up correctly, you will see the RabbitMQ admin Dashboard with the status of all the members of the cluster, i.e., node1, node2, and node3, displaying as up and running. You can click the Admin tab and click the Users menu to view the list of active users and the Policies menu to view the list of created policies. A command-line check of the cluster status follows these steps.
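
You can also verify cluster membership from the command line; cluster_status is a standard rabbitmqctl subcommand, and the node names are the ones assumed in the example above:

    # Run on any RabbitMQ node, for example node1; lists the running cluster members
    sudo rabbitmqctl cluster_status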

Behavior that might be observed while publishing modules when you are accessing HA clusters using a load balancer

When you have initiated a publish for any module management activity and you are accessing your HA cluster with one or more active secondary nodes using a load balancer such as "HAProxy", then you might observe the following behaviors:

  • While the Publish operation is in progress, you might see many publish status messages on the UI.
  • If you have added a new field to the module, or you have removed a field from the module, then you might observe that these changes are not reflected on the UI. In such cases, you must log out of FortiSOAR and log back into FortiSOAR.
  • After a successful publish of the module(s), you might observe that the Publish button is still enabled and the modules still have the asterisk (*) sign. In such cases, you must log out of FortiSOAR and log back into FortiSOAR to view the correct state of the Publish operation.

Tunables

You can tune the following configurations:

  • max_wal_senders = 10
    This attribute defines the maximum number of walsender processes. By default, this is set to 10.
  • wal_keep_segments = 320
    This attribute retains a maximum of 5 GB of WAL data (320 segments of 16 MB each).
    Important: Both max_wal_senders and wal_keep_segments attributes are applicable when the database is internal.

Every secondary/passive node needs one walsender process on the primary node, which means that the above setting supports a maximum of 10 secondary/passive nodes.

If you have more than 10 secondary/passive nodes, then you need to edit the value of the max_wal_senders attribute in the /var/lib/pgsql/12/data/postgresql.conf file on the primary node and restart the PostgreSQL server using the following command: systemctl restart postgresql-12
Note: You might find multiple occurrences of the max_wal_senders attribute in the postgresql.conf file. You always need to edit the last occurrence of the max_wal_senders attribute in the postgresql.conf file.
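
For instance, you can locate every occurrence with standard commands, edit the last one listed, and then restart PostgreSQL:

    # List all occurrences of max_wal_senders with their line numbers; edit the last one
    grep -n 'max_wal_senders' /var/lib/pgsql/12/data/postgresql.conf
    # After saving the change, restart the PostgreSQL server
    systemctl restart postgresql-12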

The wal_keep_segments attribute has been set to 320, which means that the secondary nodes can lag behind by a maximum of 5 GB. If the lag is more than 5 GB, then replication will not work properly, and you will need to reconfigure the secondary node by running the join-cluster command.

Also note that settings changes that are done in any configuration file on an instance, such as changing the log level, etc., apply only to that instance. Therefore, if you want to apply the changed setting to all the nodes, you have to make those changes across all the cluster nodes.

Best practices

  • Fronting and accessing the FortiSOAR HA cluster with a Load Balancer or a Reverse Proxy is recommended so that the address remains unchanged on takeover.
  • You must ensure that the SIEM and other endpoints that FortiSOAR connects to are reachable on the virtualized host name (DNS) that would remain intact even after a failover (local or geo wide).
  • The FortiSOAR node connects outbound to the SIEM to periodically pull the "Alerts" (the terminology differs for each SIEM, e.g., 'Offense', 'Correlated Event', 'Notable'). The "pull" model also ensures resiliency. In the case of downtime, once the FortiSOAR node comes back up, it pulls the alerts from the last pulled time, ensuring there is no data loss even during downtime.
  • Tune the wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file. When there are heavy writes on the primary node, for example, when a lot of playbooks are being run or a lot of alerts are created, there is a lot of data to be replicated. If the secondary node is in a distant data center, the replication speed might not be able to cope with the writes on the primary node. There is a fixed size for the buffered data (default is 5 GB) that the primary node keeps for replication, after which the data rolls over. If a secondary node has not yet copied the data before the data has rolled over, it can become 'out of sync' and require a full synchronization, which causes failures.
    You might want to increase the wal_keep_segments setting in the following scenarios:
    The secondary node is in a distant datacenter and the network speed is slow.
    The secondary node is offline for a very long time due to some issues or for maintenance, etc.
    The secondary node is firedrilled and you want the restore operation to be faster using a differential sync instead of a full sync.
    It is recommended to increase the wal_keep_segments setting to 20 GB (instead of the default, i.e., 5GB) in the /var/lib/pgsql/12/data/postgresql.conf file as follows:
    The wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file appears as follows:
    wal_keep_segments = 320 # in logfile segments, 16MB each; 0 disables; keeps upto 5GB wal
    Note: postgresql.conf can set the wal_keep_segments value multiple times in the file. You must change the last occurrence of wal_keep_segments in the postgresql.conf file. To increase the last occurrence of the wal_keep_segments setting to 20 GB, change it as follows:
    wal_keep_segments = 1280 # in logfile segments, 16MB each; 0 disables; keeps upto 20GB
  • If you are planning to configure high availability in a multi-tenancy environment, i.e., for your master or tenant nodes, you must first configure high availability and then configure MSSP. For more information on MSSP, see the "Multi-Tenancy support in FortiSOAR Guide".

Monitoring health of HA clusters

All secondary nodes in the cluster exchange HA heartbeat packets with the primary node so that the primary node can monitor and verify the status of all the secondary nodes and the secondary nodes can verify the status of the primary node.

Your system administrator can configure the monitoring of heartbeats in the System Configuration > Application Configuration > System & Cluster Health Monitoring > Cluster Health section. Once you have configured monitoring of heartbeats, if any node in the HA cluster is unreachable, then the other active nodes in the cluster, which are operational, send email notifications and write log messages to alert the system administrator that a failure has occurred. For more information, see the Configuring System and Cluster Health Monitoring topic in the System Administration chapter.
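
If you want to feed these health statistics into an external monitoring application, you can also poll a node directly with the show-health subcommand described in the csadm ha table earlier; a minimal sketch:

    # Health of the current node in JSON format, suitable for scraping by a monitoring tool
    csadm ha show-health --json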

Understanding HA Cluster Health Notifications

HA cluster health notification checks on the primary node

On every scheduled monitoring interval, which defaults to 5 minutes, the HA cluster health notification check on the primary node verifies the following for every secondary/passive node:

  • If there is a heartbeat miss from the secondary/passive node(s) in the last 15 minutes by taking the default values of monitoring interval (5 minutes) * missed heartbeat count (3). If there is a missed heartbeat, then the health notification check sends a "heartbeat failure" notification and exits.
  • If the data replication from the primary node is broken. If yes, then the health notification check sends a notification containing the replication lag with respect to the last known replay_lsn of secondary node and exits.
    Following is a sample notification:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Current lag with primary node is 97 kB
    
    Failure reason:
    	1. The postgres database replication to the secondary node is not working due to data log rotation at the primary node.
        2. The secondary node has been shutdown/halted.
        3. PostgreSQL not running on node(s).
        4. nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster.
    
    If node is up and running,
    	1. Check the status of PostgreSQL service using 'systemctl status postgresql-<postgresql-version-here> -l' on node to get more details.
    	2. If you see 'FATAL: could not receive data from WAL stream: requested WAL segment has already been removed' in the PostgreSQL service status, you need to re-join the cluster using 'csadm ha join-cluster --fetch-fresh-backup'
    
  • If the replication lag reaches or crosses the set threshold specified, then the health notification check sends a notification containing the replication lag as shown in the following sample notification:
    Replication lag threshold is reached on following node(s):
    Node: hasecondary.myorgdomain
    Current lag with primary node : 3431 MB
    Configured WAL size : 5120 MB
    Configured Threshold : 3072 MB (60% of the Configured WAL size)
    
  • If any service is not running, then the health notification check sends a "service failure" notification and exits.
  • If a firedrill is in progress on a secondary/passive node. If yes, then the health notification check sends the following notification and exits.
    Firedrill is in progress on following node(s):
    Node: hasecondary.myorgdomain
    Current lag with primary node : 52 kB
    You can ignore the lag that is displayed in this case since this lag indicates the amount of data the firedrill node needs to sync when csadm ha restore is performed.
    You can also check the lag using the get-replication-stat command on the primary node.

HA cluster health notification checks on the secondary node

On every scheduled monitoring interval, which defaults to 5 minutes, the HA cluster health notification check on the secondary node verifies the following:

  • If there is a heartbeat miss from the primary node in the last 15 minutes by taking the default values of health beat interval (5minutes) * missed heartbeat count (3). If there is a missed heartbeat, then the health notification check sends a "heartbeat failure" notification and exits.
  • If there is no heartbeat failure but there is a service failure, then the health notification check sends a "service failure" notification and exits.

HA cluster health notification checks when the HA cluster is set up with an external PostgreSQL database

If the PostgreSQL database is externalized, the email notifications generated by the primary node are different from those generated when the PostgreSQL database is not externalized. On the primary node, for every secondary/passive node, the HA cluster health notification check verifies the following:

  • If there is a heartbeat miss from the secondary/passive node(s) in the last 15 minutes by taking the default values of health beat interval (5 minutes) * missed heartbeat count (3). If there is a missed heartbeat, then the health notification check sends a "heartbeat failure" notification as follows and exits:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Failure reason: Heartbeat failure. Check if the 'cyops-ha' service is running or not using 'systemctl status cyops-ha'.
  • If any service is not running, then the health notification check sends a "service failure" notification as follows and exits:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Failure reason: cyops-auth service(s) not running.

HA cluster health notification checks when a secondary node is firedrilled

When a firedrill is in progress on a secondary/passive node, you do not receive any 'out of sync' notification; instead, the health notification check sends the following email notification and exits.
Firedrill is in progress on following node(s):
Node: hasecondary.myorgdomain
Current lag with primary node : 52 kB
You can ignore the lag that is displayed in this case since this lag indicates the amount of data the firedrill node needs to sync when csadm ha restore is performed.
You can also check the lag using the get-replication-stat command on the primary node. If a firedrill is in progress on a secondary/passive node, then you can ignore the lag displayed. This is because the 'total_lag' that gets shown in the get-replication-stat messages indicates the amount of data the secondary/passive node will need to sync when the csadm ha restore operation is performed on the node once the firedrill completes.

HA cluster health notification checks during takeover

  1. When a takeover is in progress, the previous primary node might send 'out of sync' email notifications for the node that is taken over, because the previous primary sees it as not replicating data anymore. These can be ignored. After the takeover is completed, the previous primary node is marked as faulted. Therefore, you will not see any replication statistics on the old primary node.
  2. After the takeover is performed, you can ignore the messages of the get-replication-stat command on the new primary node. You can also ignore the 'out of sync' email notifications that are generated by the new primary node: when the takeover is performed, the entries of all the nodes in the cluster are still included in csadm ha list-nodes, and because the remaining nodes still need to join the new primary node, the new primary node keeps generating the notification for all those nodes.
  3. When all the other nodes of the HA cluster join back to the new primary node, the health notification check resumes normal operation and no ignorable notifications are generated.

Troubleshooting issues based on the notifications

The following section provides details on how to check and fix the possible reasons for the failures that are listed in the email notifications sent by the HA cluster check.

To troubleshoot HA issues, you can use the HA log located at: /var/log/cyops/cyops-auth/ha.log.

Heartbeat Failure

Resolution:

When you get a heartbeat failure notification for a secondary node, then do the following:

  1. Check if the cyops-ha service is running on that node, using the systemctl status cyops-ha command.
  2. If it is not running, then you must restart the cyops-ha service.
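
A minimal sketch of this check and restart on the affected node:

    # Check whether the HA service is running
    systemctl status cyops-ha
    # Restart the service if it is not running
    systemctl restart cyops-ha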

Node name differs from actual FQDN

Resolution:

Correct the issue reported in a notification such as nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster using the following steps:

  1. Log in to the node for which you are receiving the above notification using SSH.
  2. Use the following command to correct the FQDN of the node:
    csadm ha set-node-name <enter-correct-FQDN-here>

Secondary/Passive node is out of sync with the Primary node

This issue could occur due to the following reasons:

  • PostgreSQL service status shows requested WAL segment <some-number-here> has already been removed
    OR
    The csadm ha get-replication-stat command shows a higher time lapsed from the last sync when compared to the general time lapsed.
    Resolution:
    In these cases, since the secondary/passive node is completely out of sync with the primary node, you need to perform the following steps on the secondary/passive node:
    1. Run the touch /home/csadmin/.joincluster_in_progress command to create the .joincluster_in_progress file.
    2. Rejoin the cluster as follows:
    csadm ha join-cluster --status active/passive --role secondary --primary-node <Primary-node-FQDN> --fetch-fresh-backup
  • When there is heavy write on the primary node and the secondary node has not yet copied the data before the data has rolled over, it will be 'out of sync' and a full synchronization is needed, which can cause the above failures.
    Resolution:
    Increase the wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file as described in the Best Practices section.

PostgreSQL service is down on the primary node or PostgreSQL service is down on the externalized database host

If the PostgreSQL service is down on the primary node, then the cyops-ha service on all the nodes will be down and no notifications will be generated since the whole cluster is down; due to this, you will also not be able to log in to the FortiSOAR UI.

Resolution:

  1. Check the reason for the failure using the systemctl status postgresql-<postgresql-version-here> -l command on the primary node or the externalized database host.
  2. Fix the issue based on the reason for failure.

Higher time lapsed from the last sync when compared to the general time lapsed

When you run the csadm ha get-replication-stat command on the secondary node, you might get a message stating that the 'time_elasped_from_last_sync' value is higher than usual.
This issue could occur due to:

  • PostgreSQL down on the primary server.
  • No activity on primary server.
  • Secondary node is out of sync with the Primary node.

Resolution:

  1. Run the systemctl status postgresql-12 command and ensure that you are not getting the FATAL: could not receive data from WAL stream: requested WAL segment has already been removed. message.
  2. Run the csadm ha get-replication-stat command and check its results:
    • If you get the Value of 'sending_lag' is higher than usual on primary when checked message, this means that there is load on the primary node. Run the top command to check which process is taking more CPU time. You can also use the following command for a quick check:
      ps -eo pid,cmd,%mem,%cpu --sort=-%mem | head
    • If you get the Value of 'replaying_lag' is higher than usual on primary when checked message, this means that there is load on the secondary node. Run the top command to check which process is taking more CPU time. You can also use the following command for a quick check:
      ps -eo pid,cmd,%mem,%cpu --sort=-%mem | head

Sample scale test that was done in the lab to understand the behavior of 'csadm ha get-replication-stat'

What was done before observing the behavior:

First we stopped the PostgreSQL service on the secondary/passive node.

Next, we generated data on the primary node using the following script. You need to kill the script after some time, when enough data is generated on the primary node.

[root@cybersponse csadmin]# cat data_load.sh
#!/bin/sh

psql -U cyberpgsql -d das -c "CREATE TABLE scale_data (
   section NUMERIC NOT NULL,
   id1     NUMERIC NOT NULL,
   id2     NUMERIC NOT NULL
);"


psql -U cyberpgsql -d das -c "
INSERT INTO scale_data
SELECT sections.*, gen.*
     , CEIL(RANDOM()*100)
  FROM GENERATE_SERIES(1, 300)     sections,
       GENERATE_SERIES(1, 900000) gen
 WHERE gen <= sections * 3000;"
[root@cybersponse csadmin]#

During the data generation process, we ran csadm ha get-replication-stat on the primary node; you can observe that the secondary node is lagging by 4702 MB.

get-replication-stat on the primary node:

[root@cybersponse csadmin]# csadm ha get-replication-stat
------------------------------------------------
Warning:
Following could be the issues with the nodes:
1. The postgres database replication to the secondary node is not working due to data log rotation at the primarynode.
2. The secondary node has been shutdown/halted.
3. PostgreSQL not running on node(s).
4. nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster.
5. If a firedrill is in progress on the node, no action is required. The 'lag' that is displayed indicates the amount of data the firedrill node needs to sync when 'csadm ha restore' will be performed.

If node is up and running,
1.  Check the status of PostgreSQL service using 'systemctl status postgresql-12 -l' on node to get more details.
2. If you see 'FATAL: could not receive data from WAL stream: requested WAL segment has already been removed' in the PostgreSQL service status, you need to re-join the cluster using 'csadm ha join-cluster --fetch-fresh-backup' for this node.

------------------------------------------------

nodeId                            nodeName                 status    role       comment                                      total_lag
--------------------------------  -----------------------  --------  ---------  --------------------------------------------  -----------
469c6330613a332c30dd4d8e3a607cf2  hasecondary.myorgdomain  active    secondary  Joined cluster with haprimary.myorgdomain  4702 MB
[root@cybersponse csadmin]# 

Next, we started PostgreSQL on the secondary node and observed the 'replication-stat' output on both the primary and secondary nodes:

On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:27:31 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  4458 MB        11 MB            213 MB           4683 MB
                                                                                       

On the Secondary node:

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  287 MB       00:07:59.113185     


On the Primary node:

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  3600 MB        3456 kB          727 MB           4330 MB
                                                                                     

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:27:49 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  854 MB       00:08:18.360359   


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:05 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  2774 MB        5632 kB          1273 MB          4052 MB  


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:07 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1486 MB      00:07:35.238068   


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:28 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  1910 MB        6784 kB          1803 MB          3719 MB   


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:29 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1952 MB      00:07:56.70475    




On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:44 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  1153 MB        1408 kB          2278 MB          3433 MB
                                                                                         
On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:39 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2286 MB      00:07:04.28739
                                                         

On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:00 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  452 MB         3200 kB          2726 MB          3181 MB         

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:12 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2941 MB      00:07:33.857054 


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:25 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        0 bytes          2658 MB          2658 MB 

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:30 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2519 MB      00:06:48.870481  


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:46 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        154 kB           2172 MB          2172 MB    


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:53 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1985 MB      00:07:11.244842 


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:06 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        0 bytes          1687 MB          1687 MB     

On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:11 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1552 MB      00:06:25.877238   

On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:57 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2288 bytes   00:00:55.861428

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:31:23 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  0 bytes      00:00:19.235799 

Troubleshooting

To troubleshoot HA issues, you can use the HA log located at: /var/log/cyops/cyops-auth/ha.log. To understand and troubleshoot the HA cluster health notifications, see the Monitoring health of HA clusters section.

Failure to create an HA cluster

If the process to configure HA using the automated join-cluster fails and the HA cluster is not created, for reasons such as proxy setup, you can perform the steps in the following procedure to configure HA manually; a consolidated command sketch follows the procedure:

  1. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA.
  2. To configure a node as a secondary node, perform the following steps:
    1. SSH to the active primary node and run the csadm ha export-conf command to export the configuration details of the active primary node to a configuration file named ha.conf.
      You must copy the ha.conf file from the active primary node to the node that you want to configure as a secondary node.
    2. On the active primary server, add the hostnames of the secondary nodes to the allowlist, using the following command:
      # csadm ha allowlist --nodes
      Add the comma-separated list of hostnames of the cluster nodes that you want to add to the allowlist after the --nodes argument.
      Important: In case of an externalized database, you need to add all the nodes in a cluster to the allowlist in the pg_hba.conf file.
    3. Ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
      # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --conf <location of the ha.conf file>
      For example, # csadm ha join-cluster --status passive --role secondary --conf tmp/ha.conf
      This will add the node as a secondary node in the cluster.
      Note: If you run the csadm ha join-cluster command without adding the hostnames of the secondary nodes to the allowlist, then you will get an error such as, Failed to verify....
      Also, when you join a node to an HA cluster, the list-nodes command does not display that a node is in the process of joining the cluster. The newly added node will be displayed in the list-nodes command only after it has been added to the HA cluster.
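
Put together, the manual configuration flow might look like the following sketch. The hostnames and the ha.conf path are placeholders, and only the csadm subcommands described in the procedure above are used; adapt the values to your environment.

# On the active primary node: export the configuration and allow the new node
csadm ha export-conf
scp ha.conf root@hasecondary.mydomain:/tmp/ha.conf
csadm ha allowlist --nodes hasecondary.mydomain

# On the node that is being added as a secondary node
csadm ha join-cluster --status active --role secondary --conf /tmp/ha.conf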

Unable to add a node to an HA cluster using join-cluster, and the node gets stuck at a service restart

This issue occurs when you run join-cluster on a node and the operation gets stuck at a service restart, specifically at the PostgreSQL restart.

Resolution

Terminate the join-cluster process and retry join-cluster with the additional --fetch-fresh-backup parameter.
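
For example, on the node that was stuck (the primary node hostname is a placeholder):

csadm ha join-cluster --status active --role secondary --primary-node haprimary.mydomain --fetch-fresh-backup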

Fixing the HA cluster when the Primary node of that cluster is halted and then resumed

If your primary node is halted due to a system crash or a similar event, and a new cluster is formed by the other nodes in the HA cluster, the list-nodes command on those nodes displays the halted primary node in the Faulted state. Because the administrator triggered the takeover on the other cluster nodes, the administrator is aware of the faulted primary node. Also, note that even after the halted primary node resumes, it still considers itself the primary node of its own cluster; therefore, after the resume, the list-nodes command on that node displays it as Primary Active.

Resolution

To fix the HA cluster so that only one node is the active primary node, do the following:

  1. On the resumed primary node, run leave-cluster, which removes this node from the HA cluster.
  2. Run the join-cluster command to join this node to the HA cluster that has the new primary node.
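
For example, on the resumed old primary node (the hostname of the new active primary node is a placeholder):

# Remove the resumed node from its stale single-node cluster
csadm ha leave-cluster
# Rejoin the cluster, pointing to the new active primary node
csadm ha join-cluster --status active --role secondary --primary-node newprimary.mydomain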

Unable to join a node to an HA cluster when a proxy is enabled

You are unable to join a node to an HA cluster using the join-cluster command when you have enabled a proxy through which clients connect to the HA cluster.

Resolution

Run the following commands on your primary node:

$ sudo firewall-cmd --zone=trusted --add-source=<CIDR> --add-port=<ElasticSearchPort>/tcp --permanent

$ sudo firewall-cmd --reload

For example,

$ sudo firewall-cmd --zone=trusted --add-source=64.39.96.0/20 --add-port=9200/tcp --permanent

$ sudo firewall-cmd --reload

Changes made in nodes in an active-active cluster fronted with a load balancer take some time to reflect

In the case of a FortiSOAR active-active cluster that is fronted with a load balancer or reverse proxy such as HAProxy, changes such as adding a module to the FortiSOAR navigation, updating or adding the permissions of the logged-in user, or updates to the logged-in user's parent, child, and sibling hierarchy do not get reflected immediately.

These issues occur due to local caching of these settings at the individual cluster nodes.

Resolution

Log off and log back into the FortiSOAR user interface after ten minutes to see the recent updates.

OR

If you want the settings to reflect immediately, run the following command on the "active" nodes in the cluster:
php /opt/cyops-api/app/console app:cache:clear --env=prod
Important: You do not need to run the above command on the "passive" nodes of the cluster.

Post Takeover the nodes in an HA cluster do not point to the new active primary node

This issue occurs when during the takeover process either the previous primary node is down or automatic join-cluster fails. In case of an internal database cluster, when the failed primary node comes online after the takeover, it still thinks of itself as the active primary node with all its services running. In case of an external database cluster, when the failed primary node comes online after the takeover, it detects its status as "Faulted" and disables all its services.

Resolution

Run the csadm ha join-cluster command to point all the nodes to the new active primary node. For details on join-cluster, see Process for configuring HA.

After performing the leave-cluster operation, the license is not found on a secondary node

In the case of an internal DB, after you have performed the leave-cluster operation on a secondary node, for example to upgrade that node, you might see the following error when you rejoin the node to the cluster after the upgrade: "License not found on the system". You might also see this error while trying to perform the 'restore' operation on a secondary node after completing the 'firedrill' operation.

Resolution

Run the following script as a root user on the secondary node on which you are getting the "License not found on the system" error:

#!/bin/bash

init(){
    current_pg_version=$(/bin/psql --version | egrep -o '[0-9]{1,}\.' | cut -d'.' -f1)
    re_join_cluster_sql="/opt/cyops-auth/.re-join-cluster.sql"
    db_config="/opt/cyops/configs/database/db_config.yml"
}

re_join_cluster(){
    if [ ! -f "$re_join_cluster_sql" ]; then
        echo "File [$re_join_cluster_sql] does not exist. Contact the Fortinet support team for further assistance"
        exit 1
    fi
    csadm services --stop
    if [ ! -d "/var/lib/pgsql/${current_pg_version}.bkp" ]; then
        mv /var/lib/pgsql/$current_pg_version /var/lib/pgsql/${current_pg_version}.bkp 
    fi
    # The rm below is required in case the user re-runs the script.
    rm -rf /var/lib/pgsql/${current_pg_version}
    rm -f /opt/cyops/configs/cyops_pg_${current_pg_version}_configured
    mkdir -p /var/lib/pgsql/${current_pg_version}/data
    chown -R postgres:postgres /var/lib/pgsql/${current_pg_version}
    chmod -R 700 /var/lib/pgsql/${current_pg_version}
    /opt/cyops-postgresql/config/config.sh ${current_pg_version}
    local hkey=$(csadm license --get-device-uuid)
    sudo -Hiu postgres psql -U postgres -c "ALTER USER cyberpgsql WITH ENCRYPTED PASSWORD '$hkey';"
    createdb -U cyberpgsql -e -w --no-password das -O cyberpgsql -E UTF8
    psql -U cyberpgsql -d das < $re_join_cluster_sql
    touch /home/csadmin/.joincluster_in_progress
    if [ ! -f "${db_config}.bkp" ]; then
        yes| cp ${db_config} ${db_config}.bkp
    fi
    local db_pass_encrypted=$(python3 /opt/cyops/configs/scripts/manage_passwords.py --encrypt $hkey)
    /opt/cyops/configs/scripts/confUtil.py -f $db_config -k 'pg_password' -v "$db_pass_encrypted"
    systemctl start postgresql-${current_pg_version}
}

echo_manual_steps(){
echo "Perform below steps manually"
echo "
    1. If node is passive, then run below command, else skip it.
       csadm ha join-cluster --status passive --primary-node <primary-node> --fetch-fresh-backup
    2. If node is active/secondary.
       csadm ha join-cluster --status active --role secondary --primary-node <primary-node> --fetch-fresh-backup
    3. rm -rf /var/lib/pgsql/${current_pg_version}.bkp
    4. rm -f /opt/cyops/configs/database/db_config.yml.bkp
    5. rm -rf /var/lib/pgsql/test_dr_backups"
}
####################
# Main/main/MAIN starts here
####################

# Stop right after any command failure
set -e
# Debug mode
set -x
init
re_join_cluster
####################
# You need to perform the remaining steps manually.
########
# Turn off debug mode so that the manual steps are displayed clearly
set +x
echo_manual_steps
exit 0

The leave-cluster operation fails at the "Starting PostgreSQL Service" step when a node in the cluster is faulted

This issue occurs in the case of an active-active-passive cluster that has an internal DB and whose HA cluster contains a node whose status is 'Faulted'. In this case, when you run the leave-cluster operation, it fails at the "Starting service postgresql-12" step.

Resolution

To resolve this issue, run the following commands:

systemctl stop postgresql-12
rm -f /var/lib/pgsql/12/data/standby.signal
systemctl start postgresql-12

After you have completed the above steps, run the csadm ha leave-cluster command.

Resetting the password for an instance that is part of an active/active cluster prevents logging in to FortiSOAR from the other instances of that cluster

If you reset the password of an instance that is part of an active/active cluster, then the FortiSOAR login page is not displayed for the other instances of this cluster. You will also observe database login failure errors in the ha.log and the prod.log files, for both active/active and active/passive clusters.

Resolution

On the other instances that are part of the cluster, do the following:

  1. Copy the encrypted password from the db_config.yml file located at /opt/cyops/configs/database/db_config.yml on the active node and then update the new password in the db_config.yml file on the secondary nodes.
  2. Run the cache:clear command:
    $ sudo -u nginx php /opt/cyops-api/app/console cache:clear
  3. Restart FortiSOAR services:
    # csadm services --restart

High Availability support in FortiSOAR

High Availability (HA) can be achieved using the following methods:

  • Nightly database backups and incremental VM snapshots: FortiSOAR provides backup scripts that are scheduled to run at pre-defined intervals and take full database backup on a shared or backed up drive. The full backups have to be supplemented with incremental Virtual Machine (VM) snapshots whenever there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc. For more information, see the Backing up and Restoring FortiSOAR chapter.
  • HA provided by the underlying virtualization platform: Your Virtualization platform also provides HA, such as VMware HA and AWS EBS snapshots. This method relies on your expertise and infrastructure.
  • Externalized Database: This method allows you to externalize your PostgreSQL database and uses your own database's HA solution. VM snapshots have to be taken when there are changes made to the file system, such as connector installation changes, config file changes, upgrades, schedule changes, etc.
    For more information on externalizing PostgreSQL database, see the Externalization of your FortiSOAR PostgreSQL database chapter.
  • HA clusters: FortiSOAR provides a clustering solution with more than one FortiSOAR node joined to form an HA cluster. When you deploy FortiSOAR instance, the FortiSOAR Configuration Wizard configures the instance as a single node cluster, and it is created as an active primary node. You can join more nodes to this node to form a multi-node cluster. This method is explained in detail in this chapter.

FortiSOAR implements HA Clustering with the use of PostgreSQL database clustering. It supports Active/Active and Active/Passive configurations with both internal and external PostgreSQL databases. HA clusters can be used to fulfill the following two use cases: Disaster Recovery (DR) and Scaling. For DR you can configure an Active/Passive cluster that has the passive node located in a remote datacenter. For scaling workflow execution across multiple nodes, you can use co-located Active/Active cluster nodes.

High Availability Types supported with FortiSOAR

You can configure FortiSOAR with either an externalized PostgreSQL database or an internal PostgreSQL database. For both cases you can configure Active-Active or Active-Passive high availability clusters.

High Availability with an internal PostgreSQL database

FortiSOAR HA/DR is based on internal clustering that takes care of replicating data (PostgreSQL) to all cluster nodes, and provides an administration CLI (csadm) to manage the cluster and perform the "Takeover" operation, when necessary. FortiSOAR uses PostgreSQL streaming replication, which is asynchronous in nature. For more information, see PostgreSQL: Documentation.

You can configure FortiSOAR for high availability (HA) with an internal PostgreSQL database in the following two ways:

  • In an Active-Active HA cluster configuration, at least two nodes are actively running the same kind of service simultaneously. The main aim of the active-active cluster is to achieve load balancing and horizontal scaling, while data is being replicated asynchronously. You should front multiple active nodes with a proxy or a load balancer to effectively direct requests to all nodes.
    FortiSOAR™ with an internal database and an Active/Active configuration
  • In an Active-Passive HA cluster configuration, one or more passive or standby nodes are available to take over if the primary node fails. Processing is done only by the primary node. However, when the primary node fails, then a standby node can be promoted as the primary node. In this configuration, you can have one active node and one or more passive nodes configured in a cluster, which provides redundancy, while data is being replicated asynchronously.
    FortiSOAR™ with an internal database and an Active/Passive configuration

High Availability with an externalized PostgreSQL database

In the case of an externalized database, you use your own database's HA solution. FortiSOAR ensures that file system changes on any of the cluster nodes, arising from connector installation or uninstallation or from changes in the module definitions, are synced across every node, so that a secondary or passive node can take over in the least time if the primary node fails.

FortiSOAR™ with an external database and an Active/Active configuration

Cluster Licensing

FortiSOAR version 6.4.4 and later does not mandate the 'Additional Users' entitlement to be the same across all cluster nodes, i.e., you do not need to buy additional user licenses for clustered nodes. The user count entitlement is always validated from the primary node. The secondary nodes can have the basic two-user entitlement.

The HA cluster shares the user count details from the primary node of the cluster. Hence, all 'Concurrent Users' count restrictions apply as per the primary node. If a node leaves the cluster, the restriction applies as per its own original license. For more information about FortiSOAR licensing, see the Licensing FortiSOAR chapter in the "Deployment Guide."

Viewing and updating the license of an HA cluster

If your FortiSOAR instance is part of a High Availability cluster and you have added secondary node(s), the License Manager page also displays information about the nodes in the cluster, as shown in the following image:
License Manager Page in case of your FortiSOAR™ instance is part of a High Availability cluster
As shown in the above image, the primary node is Node 2 and that node is licensed for 7 users; therefore, the Allowed User Seats count displays 7 users.
To update the license for each node, click Update License and upload the license for that node.

Note

If you update a license that does not match the system UUID, you will get a warning on the UI while updating the license. If you update the same license in more than one environment, the license is detected as a duplicate and you need to correct it; otherwise, your FortiSOAR UI will be blocked in 2 hours.

If a license on one node of an HA cluster expires, you will not be able to access any nodes of that HA cluster. All nodes in that HA cluster will display the same FortiSOAR UI page, asking you to deploy a new valid license for the expired nodes:
FortiSOAR UI displaying nodes in an HA cluster that have an expired license

Prerequisites to configuring High Availability

  • Your FortiSOAR instance must be version 5.0.0 or later, either a fresh install of 5.0.0 or later or an instance upgraded to 5.0.0 or later.
  • All nodes of a cluster should be DNS resolvable from each other.
  • Ensure that the SSH session does not time out by entering screen mode. For more information, see the Handle session timeouts while running the FortiSOAR upgrade article present in the Fortinet Knowledge Base.
  • If you have a security group (AWS) or an external firewall between the HA nodes, then you must open the following ports between the HA nodes on AWS or the external firewall (a sample firewall-cmd sketch follows this list):
    For PostgreSQL: 5432, for MQ TCP traffic: 5671, and for ElasticSearch: 9200
  • Fronting and accessing the FortiSOAR HA Cluster with a Load Balancer such as HAProxy, Gobetween, or a Reverse Proxy is recommended so that the address remains unchanged on takeover.
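
While security group or external firewall rules are specific to your environment, if you also use firewalld on the nodes themselves, a minimal sketch for allowing these ports might look like the following; the zone and source CIDR are assumptions that you must adapt to your network.

# Run on each HA node; 5432 = PostgreSQL, 5671 = MQ TCP traffic, 9200 = ElasticSearch
sudo firewall-cmd --zone=trusted --add-source=10.0.0.0/24 --permanent
sudo firewall-cmd --zone=trusted --add-port=5432/tcp --add-port=5671/tcp --add-port=9200/tcp --permanent
sudo firewall-cmd --reload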

Process for configuring High Availability

Steps to configure FortiSOAR HA cluster with an internal PostgreSQL database

If you are configuring HA with an internal PostgreSQL database, ensure that you have met all the Prerequisites criteria (see the Prerequisites to configuring High Availability section) and then perform the following steps:

Important: You must join nodes to an HA cluster sequentially, one after the other.

  1. Use the FortiSOAR Admin CLI (csadm) to configure HA for your FortiSOAR instances. For more information, see the FortiSOAR Admin CLI chapter. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA:
    FortiSOAR™ Admin CLI - csadm ha command output
  2. To configure a node as a secondary node, ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
    # csadm ha join-cluster --status <active, passive> --role secondary --primary-node <DNS_Resolvable_Primary_Node_Name>
    Once you enter this command, you will be prompted to enter the SSH password to access your primary node.
    In a cloud environment where authentication is key-based, you need to run the following command:
    # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --primary-node <DNS_Resolvable_Primary_Node_Name> --primary-node-ssh-key <Path_To_Pem_File>
    This will add the node as a secondary node in the cluster.
    Note: When you join a node to an HA cluster, the list-nodes command does not display that a node is in the process of joining the cluster. The newly added node will be displayed in the list-nodes command only after it has been added to the HA cluster.
Note

If you have upgraded FortiSOAR and are joining a freshly provisioned node, using the join-cluster operation, to a cluster that has some connectors installed, then you must manually reinstall, on the new node, the connectors that were present on the existing node.

Also, note that if you have built your own custom connector, then you must upload the .tgz file of the connector on all the nodes within the HA cluster.
When you are uploading the .tgz file on all the nodes, you must ensure that you select the Delete all existing versions checkbox. You must also ensure that you have uploaded the same version of the connector to all the nodes.

Steps to configure FortiSOAR HA cluster with an external PostgreSQL database

If you are configuring HA with an external PostgreSQL database, perform the following steps:

  1. Externalize the PostgreSQL database for the primary node of your HA configuration. For the procedure for externalizing PostgreSQL databases, see the Externalization of your FortiSOAR PostgreSQL database chapter.
  2. Add the hostnames of the secondary nodes to the allowlist in the external database.
  3. Add the hostnames of the secondary nodes to the pg_hba.conf (/var/lib/pgsql/12/data/pg_hba.conf) file in the external database. This ensures that the external database trusts the FortiSOAR server for incoming connections. A sample entry is shown after these steps.
  4. Ensure that you have met all the Prerequisites criteria (see the Prerequisites to configuring High Availability section).
  5. Create the HA cluster by following the steps mentioned in the Steps to configure FortiSOAR HA cluster with an internal PostgreSQL database section.
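
For step 3, an entry in the pg_hba.conf file might look like the following sketch. The database, user, and hostname values are assumptions based on the defaults used elsewhere in this chapter; adapt them to your external database and reload the PostgreSQL service afterwards.

# /var/lib/pgsql/12/data/pg_hba.conf on the external database host
# TYPE  DATABASE  USER        ADDRESS                  METHOD
host    all       cyberpgsql  hasecondary.mydomain     md5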

Takeover

Use the csadm ha takeover command to perform a takeover when your active primary node is down. Run this command on the secondary node that you want to configure as your active primary node.

If during takeover you specify no to the Do you want to invoke ‘join-cluster’ on other cluster nodes? prompt, or if any node(s) is not reachable, then you will have to reconfigure all the nodes (or the node(s) that were not reachable) in the cluster to point to the new active primary node using the csadm ha join-cluster command.
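
As an illustration, a takeover followed by re-pointing a node that was not reachable during the takeover might look like the following; the hostname is a placeholder.

# On the secondary node that should become the new active primary node
csadm ha takeover

# Later, on any node that was not reachable during the takeover
csadm ha join-cluster --status active --role secondary --primary-node newprimary.mydomain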

During the takeover operation, if the secondary node's licensed user entitlement is lower than that of the primary node, the licenses get swapped between the new primary node (Node B) and the old primary node (Node A). To prevent any undesirable node lockouts, FortiSOAR checks the user count entitlement of both licenses before exchanging the licenses between Node A and Node B. If Node B already has a higher user entitlement, then the licenses are not swapped. Therefore, no duplicate license violation will occur once Node A comes back online in case of matching user entitlements of cluster nodes.

The swapping of licenses during takeover leads to the following scenarios:

  • If Node A is alive at the time of the takeover operation, then whether Node A joins back the HA cluster or not, it synchronizes to the Fortinet Licensing Portal with the license previously associated with Node B.
  • If Node A is not alive at the time of the takeover operation, then it synchronizes with FDN using its old license, which is also being used by Node B; this might cause a node lockout if it is not corrected manually, by deploying the old Node B license onto Node A, within the grace window of two hours. Note that FortiSOAR allows a grace period of two hours even when FDN reports a duplicate license.
Note

After you have performed a takeover and configured a secondary node as the active primary node, you will observe that the log forwarder configurations are not present on the new primary node. This is because Syslog settings are not replicated to the passive node, since the passive node could be in a remote datacenter with network latencies between datacenters. Also, the same Syslog server might not be the ideal choice for log forwarding from the DR node. If you want to forward logs from the passive node, you must enable it manually using the csadm log forward command. For more information, see the FortiSOAR Admin CLI chapter.

Usage of the csadm ha command

Certain operations, such as takeover, join-cluster, etc., might take a long time to run; therefore, you must ensure that your SSH session does not time out by entering screen mode. For more information, see the Handle session timeouts while running the FortiSOAR upgrade article present in the Fortinet Knowledge Base.

You can get help for the csadm ha command and subcommands using the --help parameter.

Note

It is recommended that you perform operations such as join-cluster, leave-cluster, etc sequentially. For example, when you are adding nodes to a cluster, it is recommended that you add the nodes in a sequence, i.e., one after the other rather than adding them in parallel.

The following table lists all the subcommands that you can use with the csadm ha command:

Subcommand Brief Description
list-nodes Lists all the nodes that are available in the cluster with their respective node names and ID, status, role, and a comment that contains information about which nodes have joined the specific HA cluster and the primary server.
List Nodes command output
You can filter nodes for specific status, role, etc.
For example, if you want to retrieve only those nodes that are active use the following command: csadm ha list-nodes --active, or if you want to retrieve secondary active nodes, then use the following command: csadm ha list-nodes --active --secondary.
Note: The list-nodes command will not display a node that is in the process of joining the cluster, i.e., it will display the newly added node only after it has been added to the HA cluster.
export-conf Exports the configuration details of the active primary node to a configuration file named ha.conf. For more details on export-conf, see the Process for configuring HA section.
allowlist Adds the hostnames of the secondary nodes in the HA cluster to the allowlist on the active primary node. For more details on allowlist, see the Process for configuring HA section.
Important: Ensure that incoming TCP traffic from the IP address(es) [xxx.xxx.xx.xxx] of your FortiSOAR instance(s) on port(s) 5432, 9200, and 5671 is not blocked by your organization's firewall.
join-cluster Adds a node to the cluster with the role and status you have specified. For more details on join-cluster, see the Process for configuring HA section.
get-replication-stat Displays the replication statistics, i.e., the replication lag and status between cluster nodes.
In the case of secondary nodes, information about total lag and time elapsed from last sync is displayed. In the case of the primary node, information about sending lag, receiving lag, replaying lag, and total lag is displayed.
You can use the following option with this subcommand:
--verbose: Displays detailed replication statistics.
In the case of secondary nodes, information about receive lsn, replay lsn, total lag, and time elapsed from last sync is displayed.
Do not use this option on the primary node, since no additional details are displayed for the primary node when you use --verbose.
Note: If you have configured FortiSOAR with an externalized PostgreSQL database, then replication statistics will not be displayed for the cluster nodes.
show-health Displays the health information for the current node.
You can use the following option with this subcommand:
--all nodes: Displays the health information for all the nodes in an HA cluster. This information is also available for a single node, and can be used to set up monitoring and to send health statistics of a FortiSOAR instance to external monitoring applications.
--json: Displays the health information in the JSON format.
firedrill Tests your disaster recovery configuration.
You can perform a firedrill on a secondary (active or passive) node only. Running the firedrill suspends the replication to the node's database and sets it up as a standalone node pointing to its local database. Since the firedrill is primarily performed to ensure that database replication is set up correctly, it is not applicable when the database is externalized.
Once you have completed the firedrill, ensure that you perform restore to get the node back into replication mode.
Licenses on a firedrilled node:
- If the node license had a user license entitlement matching the primary node's user entitlement, all users can log in to the firedrilled node.
- If the node license had a basic user entitlement and the HA cluster had more active users, then only the csadmin user can log in to the UI of the firedrilled node. The csadmin user can then activate two users who need to test the firedrill and make the rest of the users inactive.
Note: This does not cause any impact to the primary node or other nodes in the HA cluster. Post-restore, the firedrilled node rejoins the cluster and the maximum number of active users as per the entitlement is honored.
Schedules on a firedrilled node:
The node on which a firedrill is being performed has its schedules and playbooks stopped, i.e., celerybeatd is disabled on this node. This is done intentionally because configured schedules or playbooks should not run when the node is in firedrill mode.
restore Restores the node back to its original state in the cluster after you have performed a firedrill. That is, csadm ha restore restores the node that was converted to the active primary node after the firedrill back to its original state of a secondary node.
The restore command discards all activities, such as record creation, performed during the firedrill, since that data is assumed to be test data. This command restores the database from the content backed up prior to the firedrill.
takeover Performs a takeover when your active primary node is down. Therefore, you must run the csadm ha takeover command on the secondary node that you want to configure as your active primary node.
leave-cluster Removes a node from the cluster and the node goes back to the state it was in before joining the cluster.
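
As a quick reference, the subcommands listed above can be combined for routine cluster checks. The following sketch uses only the options documented in this table; jq is an assumption, used here only to pretty-print the JSON output.

# List only the active secondary nodes
csadm ha list-nodes --active --secondary

# Display the health of the current node in the JSON format
csadm ha show-health --json | jq .

# Watch the replication lag from the primary node, refreshing every 2 seconds
watch -n 2 csadm ha get-replication-stat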

Overview of nodes in a FortiSOAR HA cluster

  • A FortiSOAR HA cluster can have only one active primary node, all the other nodes are either active secondary nodes or passive nodes. Uniqueness of the primary node is due to the following:
    • In case of an internal database, all active nodes talk to the database of the primary node for all reads/writes. The database of all other nodes is in the read-only mode and setup for replication from the primary node.
    • Although the queued workflows are distributed amongst all active nodes, the Workflow scheduler runs only on the primary node.
    • All active nodes index the data for quick search into ElasticSearch at the primary node.
    • All integrations or connectors that have a listener configured for notifications, such as IMAP, Exchange, Syslog, etc run the listeners only on the primary node.
      Therefore, if the primary node goes down, one of the other nodes in the cluster must be promoted as the new primary node and the other nodes should rejoin the cluster connecting to the new primary.
  • Active secondary nodes connect to the database of the active primary node and serve FortiSOAR requests. However, passive nodes are used only for disaster recovery and they do not serve any FortiSOAR requests.

Checking replication between nodes in an active-passive configuration

When using an active-passive configuration with internal databases, ensure that replication between the nodes is working correctly using the following steps:

  • Perform the firedrill operation at regular intervals to ensure that the passive node can take over successfully, when required (see the sketch after this list).
  • Schedule full nightly backups at the active primary node using the FortiSOAR backup and restore scripts. For more information on backup and restore, see the Backing up and Restoring FortiSOAR chapter.
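
A periodic disaster-recovery check on a passive node might look like the following sketch; both subcommands are described in the csadm ha table earlier in this chapter.

# On the passive node: suspend replication and run the node standalone for testing
csadm ha firedrill
# ... validate logins, playbooks, and data on the firedrilled node ...
# Return the node to replication mode; data created during the firedrill is discarded
csadm ha restore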

Upgrading an HA cluster

For the procedure on how to upgrade a FortiSOAR High Availability Cluster to 6.4.4, see the Upgrading a FortiSOAR High Availability Cluster to 6.4.4 section in the "Upgrade Guide."

Load Balancer

The clustered instances should be fronted by a TCP Load Balancer such as HAProxy or by gobetween, and clients should connect to the cluster using the address of the proxy.

Setting up HAProxy as a TCP load balancer fronting the two clustered nodes

The following steps describe how to install "HAProxy" as a load balancer on a CentOS Virtual Machine:

  1. # yum install haproxy
  2. In the /etc/haproxy/haproxy.cfg file, add the TCP load-balancing policy, as shown in the following image; a sample policy sketch also follows this procedure:
    Load balancer policy in the haproxy configuration file
  3. To reload the firewall, run the following commands:
    $ sudo firewall-cmd --zone=public --add-port=<portspecifiedwhilebindingHAProxy>/tcp --permanent
    $ sudo firewall-cmd --reload
  4. Restart haproxy using the following command:
    # systemctl restart haproxy
  5. Use the bind address (instead of the IP address of the node in the cluster) for accessing the FortiSOAR UI.
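
Because the policy referenced in step 2 is shown only as an image, the following is a minimal TCP pass-through sketch of what such a policy might look like in the /etc/haproxy/haproxy.cfg file. The bind port, backend name, and node hostnames are assumptions that you must adapt to your cluster.

frontend fortisoar_front
    mode tcp
    bind *:443
    default_backend fortisoar_nodes

backend fortisoar_nodes
    mode tcp
    balance roundrobin
    # Active FortiSOAR cluster nodes (placeholder hostnames)
    server node1 fsr-node1.mydomain:443 check
    server node2 fsr-node2.mydomain:443 check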

Using Gobetween load balancer

Gobetween is a minimalistic yet powerful high-performance L4 TCP, TLS, and UDP based load balancer.

It works on multiple platforms such as Windows, Linux, Docker, Darwin, etc., and you can build your own load balancer from source code. Balancing is done based on the following algorithms, which you can choose in the configuration:

  • IP hash

  • World famous - Round Robin

  • Least bandwidth

  • Least connection

  • Weight

Configuring Gobetween for a FortiSOAR A-A HA Cluster

Installation:

Gobetween can be installed either on the Linux platform or on Windows. For details on installing gobetween, see the 'Installation' section of the gobetween documentation.

Configuration:

Edit the gobetween.toml configuration file and then restart the gobetween service for the changes to take effect. A sample configuration follows:

The configuration has three sections,

  • The first one describes the protocol to be used and defines the port to which the load balancer will be bound:
    [servers.fsr]
    protocol = "tcp"
    bind = "0.0.0.0:3000"

  • The second describes how the FortiSOAR nodes are discovered:
    [servers.fsr.discovery]
    kind = "static"
    static_list = [
    "qa-env5.fortisoar.in:443 weight=25 priority=1",
    "qa-env7.fortisoar.in:443 weight=25 priority=1",
    "qa-env9.fortisoar.in:443 weight=25 priority=1",
    "qa-env10.fortisoar.in:443 weight=25 priority=1"
    ]

    In the node discovery section, you need to add FortiSOAR nodes and provide their weight and priority to determine how requests to the load balancer will be addressed.

  • The last one checks the ‘health’ status of each node:
    [servers.fsr.healthcheck]
    fails = 1
    passes = 1
    interval = "2s"
    timeout="1s"
    kind = "ping"
    ping_timeout_duration = "500ms"

For more details about configuration, see the gobetween documentation.

Configuring Gobetween for an MQ Cluster

The initial procedure for setting up a RabbitMQ cluster, such as setting up the hosts file, installing the RabbitMQ server, etc., should already have been completed. For more information, see the How to Set up RabbitMQ Cluster on CentOS 7 article. Once the initial setup is completed, do the following:

  1. Set up the RabbitMQ cluster: To setup the RabbitMQ cluster, ensure that the .erlang.cookie file is the same on all nodes. To achieve this, copy the '.erlang.cookie' file from the /var/lib/rabbitmq directory of the primary node to the other nodes. For our example, let us assume the primary node is 'node1' and secondary nodes are 'node2' and 'node3'. To copy the '.erlang.cookie' file use the scp command from the primary node ('node1'). For example:
    scp /var/lib/rabbitmq/.erlang.cookie root@node2:/var/lib/rabbitmq/
    scp /var/lib/rabbitmq/.erlang.cookie root@node3:/var/lib/rabbitmq/
    Ensure that there are no errors on both the servers, then join the node2 and node3 to node1, using the join-cluster command, to create a RabbitMQ cluster. For more information, see the Process for configuring High Availability section.
  2. Configure RabbitMQ queue mirroring: You must configure the 'ha policy' for queue mirroring and replication to all cluster nodes. If the node that hosts the queue master fails, the oldest mirror is promoted to the new master as long as it is synchronized, depending on the 'ha-mode' and 'ha-params' policies.
    Following are some examples of the RabbitMQ ha policies:
    Set up an HA policy named 'ha-all' so that all queues on the RabbitMQ cluster are mirrored to all nodes in the cluster:
    sudo rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
    Set up an HA policy named 'ha-nodes' so that all queues whose names start with 'nodes' are mirrored to the two specific nodes 'node02' and 'node03' in the cluster:
    sudo rabbitmqctl set_policy ha-nodes "^nodes\." \
    '{"ha-mode":"nodes","ha-params":["rabbit@node02", "rabbit@node03"]}'
    You can check all the available policies using the following command:
    sudo rabbitmqctl list_policies;
    If you want to remove a specific policy, use the following command:
    sudo rabbitmqctl clear_policy <name_of_policy>
  3. Ensure that the SSL certificates that you specify while configuring the secure message exchange are the same on all the nodes and have the secure message exchange's CN name or a wildcard.
    For information on adding a secure message exchange, see the Deploying FortiSOAR chapter in the "Deployment Guide." When you are adding or configuring the secure message exchange, in the Add New Secure Message Exchange dialog ensure the following:
    • In the TCP Port field, ensure that you enter the same TCP port that you have specified while configuring the secure message exchange. Also, ensure that the FortiSOAR node has outbound connectivity to the secure message exchange at this port.
    • In the Certificate field, you must copy-paste the certificate text of the Certificate Authority (CA) that has signed the secure message exchange certificate in the pem format. If it is a chain, then the complete chain must be provided. By default, the CA certificate for the FortiSOAR self-signed certificate is present at the following location: /opt/cyops/configs/rabbitmq/ssl/cyopsca/cacert.pem
    • Enter the required details in the other fields and save the secure message exchange configuration.
  4. Edit the gobetween.toml configuration file on each of the nodes in the MQ cluster and then restart the gobetween service for the changes to take effect. A sample configuration follows:
    The configuration has three sections,
    • The first one describes the protocol to be used and defines the ports to which the load balancer will be bound on the various nodes of the MQ cluster. Ensure that you enter the same TCP port that you have specified while configuring the secure message exchange and added in the Add New Secure Message Exchange dialog.
      For example, on node 1 it could be:
      [servers.routerapi]
      protocol = "tcp"
      bind = "0.0.0.0:3000"

      For example, on node 2 it could be:
      [servers.routertcp]
      protocol = "tcp"
      bind = "0.0.0.0:3000"

    • The second describes how the MQ cluster nodes are discovered:
      For example, on node 1 it could be:
      [servers.routerapi.discovery]
      kind = "static"
      static_list = [
      "router-node1.fortisoar.in:15671 weight=25 priority=1",
      "router-node2.fortisoar.in:54549 weight=25 priority=1",
      "router-node3.fortisoar.in:54549 weight=25 priority=2"
      ]

      For example, on node 2 it could be:
      [servers.routertcp.discovery]
      kind = "static"
      static_list = [
      "router-node1.fortisoar.in:5671 weight=25 priority=1",
      "router-node2.fortisoar.in:54558 weight=25 priority=1",
      "router-node3.fortisoar.in:54559 weight=25 priority=1"
      ]

      In the node discovery section, you need to add the secure message exchange for the nodes and provide their weight and priority to determine how requests to the load balancer will be addressed.

    • The last one checks the ‘health’ status of the MQ cluster:
      For example, on node 1 it could be:
      [servers.routerapi.healthcheck]
      fails = 1
      passes = 1
      interval = "2s"
      timeout="1s"
      kind = "ping"
      ping_timeout_duration = "500ms"

      For example, on node 2 it could be:
      [servers.routertcp.healthcheck]
      fails = 1
      passes = 1
      interval = "2s"
      timeout="1s"
      kind = "ping"
      ping_timeout_duration = "500ms"

  5. Test your RabbitMQ cluster by opening your web browser and typing the IP address of a node, for example, node 1, whose port is set as '5671'.
    http://<node1IP>:5671/
    Type in the username and password you have configured. If everything is set up correctly, you will see the RabbitMQ admin dashboard with the status of all the members of the cluster, i.e., node1, node2, and node3, displayed as up and running. You can click the Admin tab and click the Users menu to view the list of active users and the Policies menu to view the list of created policies.

Behavior that might be observed while publishing modules when you are accessing HA clusters using a load balancer

When you have initiated a publish for any module management activity and you are accessing your HA cluster with one or more active secondary nodes using a load balancer such as "HAProxy", then you might observe the following behaviors:

  • While the Publish operation is in progress, you might see many publish status messages on the UI.
  • If you have added a new field to the module, or you have removed a field from the module, then you might observe that these changes are not reflected on the UI. In such cases, you must log out of FortiSOAR and log back into FortiSOAR.
  • After a successful publish of the module(s), you might observe that the Publish button is still enabled and the modules still have the asterisk (*) sign. In such cases, you must log out of FortiSOAR and log back into FortiSOAR to view the correct state of the Publish operation.

Tunables

You can tune the following configurations:

  • max_wal_senders = 10
    This attribute defines the maximum number of walsender processes. By default, this is set to 10.
  • wal_keep_segments = 320
    With this value, a maximum of 5 GB of WAL data is retained.
    Important: Both the max_wal_senders and wal_keep_segments attributes are applicable only when the database is internal.

Every secondary/passive node needs one walsender process on the primary node, which means that the above setting supports a maximum of 10 secondary/passive nodes.

If you have more than 10 secondary/passive nodes, then you need to edit the value of the max_wal_senders attribute in the /var/lib/pgsql/12/data/postgresql.conf file on the primary node and restart the PostgreSQL server using the following command: systemctl restart postgresql-12
Note: You might find multiple occurrences of the max_wal_senders attribute in the postgresql.conf file. You always need to edit the last occurrence of the max_wal_senders attribute in the postgresql.conf file.
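
For example, to locate the last occurrence and apply the change (the path assumes PostgreSQL 12, as used elsewhere in this chapter):

# Show every occurrence of the setting; edit the last one listed in your preferred editor
grep -n 'max_wal_senders' /var/lib/pgsql/12/data/postgresql.conf
vi /var/lib/pgsql/12/data/postgresql.conf
# Restart PostgreSQL so that the new value takes effect
systemctl restart postgresql-12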

The wal_keep_segments attribute has been set to 320, which means that the secondary nodes can lag behind by a maximum of 5 GB. If the lag is more than 5 GB, replication will not work properly, and you will need to reconfigure the secondary node by running the join-cluster command.

Also note that settings changes made in any configuration file on an instance, such as changing the log level, apply only to that instance. Therefore, if you want to apply the changed setting to all the nodes, you have to make those changes across all the cluster nodes.

Best practices

  • Fronting and accessing the FortiSOAR HA cluster with a Load Balancer or a Reverse Proxy is recommended so that the address remains unchanged on takeover.
  • You must ensure that the SIEM and other endpoints that FortiSOAR connects to are reachable on the virtualized host name (DNS) that would remain intact even after a failover (local or geo wide).
  • The FortiSOAR node connects outbound to the SIEM to periodically pull the "Alerts" (the terminology differs for each SIEM, e.g., 'Offense', 'Correlated Event', 'Notable'). The "pull" model also ensures resiliency. In the case of downtime, once the FortiSOAR node comes back up, it pulls the alerts from the last pulled time, ensuring there is no data loss even during downtime.
  • Tune the wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file. When you have a heavy write load on the primary node, for example, a lot of playbooks are being run or a lot of alerts are created, then there is a lot of data to be replicated. If the secondary node is in a distant data center, the replication speed might not be able to cope with the write load on the primary node. There is a fixed size for the buffer data (default is 5 GB) that the primary node keeps for data to be replicated, after which the data rolls over. If a secondary node has not yet copied the data before the data has rolled over, it can become 'out of sync' and require a full synchronization, which would cause failures.
    You might want to increase the wal_keep_segments setting in the following scenarios:
    The secondary node is in a distant datacenter and the network speed is slow.
    The secondary node is offline for a very long time due to some issues or for maintenance, etc.
    The secondary node is firedrilled and you want the restore operation to be faster using a differential sync instead of a full sync.
    It is recommended to increase the wal_keep_segments setting to 20 GB (instead of the default, i.e., 5GB) in the /var/lib/pgsql/12/data/postgresql.conf file as follows:
    The wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file appears as follows:
    wal_keep_segments = 320 # in logfile segments, 16MB each; 0 disables; keeps up to 5 GB of WAL
    Note: postgresql.conf can set the wal_keep_segments value multiple times in the file. You must change the last occurrence of wal_keep_segments in the postgresql.conf file. To increase the last occurrence of the wal_keep_segments setting to 20 GB, change it as follows:
    wal_keep_segments = 1280 # in logfile segments, 16MB each; 0 disables; keeps up to 20 GB of WAL
  • If you are planning to configure high availability in case of a multi-tenancy environment, i.e., for your master or tenant nodes, you must first configure high availability then configure MSSP. For more information on MSSP, see the "Multi-Tenancy support in FortiSOAR Guide".

Monitoring health of HA clusters

All secondary nodes in the cluster exchange HA heartbeat packets with the primary node so that the primary node can monitor and verify the status of all the secondary nodes and the secondary nodes can verify the status of the primary node.

Your system administrator can configure the monitoring of heartbeats in the System Configuration > Application Configuration > System & Cluster Health Monitoring > Cluster Health section. Once you have configured the monitoring of heartbeats, if any node in the HA cluster is unreachable, the other active nodes in the cluster that are operational send email notifications and write log messages to alert the system administrator that a failure has occurred. For more information, see the Configuring System and Cluster Health Monitoring topic in the System Administration chapter.

Understanding HA Cluster Health Notifications

HA cluster health notification checks on the primary node

On every scheduled monitoring interval, which defaults to 5 minutes, the HA cluster health notification check on the primary node verifies the following for every secondary/passive node:

  • If there is a heartbeat miss from the secondary/passive node(s) in the last 15 minutes, taking the default values of monitoring interval (5 minutes) * missed heartbeat count (3), then the health notification check sends a "heartbeat failure" notification and exits.
  • If the data replication from the primary node is broken, then the health notification check sends a notification containing the replication lag with respect to the last known replay_lsn of the secondary node and exits.
    Following is a sample notification:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Current lag with primary node is 97 kB
    
    Failure reason:
    	1. The postgres database replication to the secondary node is not working due to data log rotation at the primary node.
        2. The secondary node has been shutdown/halted.
        3. PostgreSQL not running on node(s).
        4. nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster.
    
    If node is up and running,
    	1. Check the status of PostgreSQL service using 'systemctl status postgresql-<postgresql-version-here> -l' on node to get more details.
    	2. If you see 'FATAL: could not receive data from WAL stream: requested WAL segment has already been removed' in the PostgreSQL service status, you need to re-join the cluster using 'csadm ha join-cluster --fetch-fresh-backup'
    
  • If the replication lag reaches or crosses the specified threshold, then the health notification check sends a notification containing the replication lag, as shown in the following sample notification:
    Replication lag threshold is reached on following node(s):
    Node: hasecondary.myorgdomain
    Current lag with primary node : 3431 MB
    Configured WAL size : 5120 MB
    Configured Threshold : 3072 MB (60% of the Configured WAL size)
    
  • If any service is not running, then the health notification check sends a "service failure" notification and exits.
  • If a firedrill is in progress on a secondary/passive node, then the health notification check sends the following notification and exits.
    Firedrill is in progress on following node(s):
    Node: hasecondary.myorgdomain
    Current lag with primary node : 52 kB
    You can ignore the lag that is displayed in this case since this lag indicates the amount of data the firedrill node needs to sync when csadm ha restore is performed.
    You can also check the lag using the get-replication-stat command on the primary node.

HA cluster health notification checks on the secondary node

On every scheduled monitoring interval, which defaults to 5 minutes, the HA cluster health notification check on the secondary node verifies the following:

  • If there is a heartbeat miss from the primary node in the last 15 minutes, taking the default values of heartbeat interval (5 minutes) * missed heartbeat count (3), then the health notification check sends a "heartbeat failure" notification and exits.
  • If there is no heartbeat failure but there is a service failure, then the health notification check sends a "service failure" notification and exits.

HA cluster health notification checks when the HA cluster is set up with an external PostgreSQL database

If the PostgreSQL database is externalized, the email notifications generated by the primary node are different from when the PostgreSQL database is not externalized. On the primary node, for every secondary/passive node, the HA cluster health notification check verifies the following:

  • If there is a heartbeat miss from the secondary/passive node(s) in the last 15 minutes, taking the default values of heartbeat interval (5 minutes) * missed heartbeat count (3), then the health notification check sends a "heartbeat failure" notification as follows and exits:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Failure reason: Heartbeat failure. Check if the 'cyops-ha' service is running or not using 'systemctl status cyops-ha'.
  • If any service is not running, then the health notification check sends a "service failure" notification as follows and exits:
    Following secondary FortiSOAR node(s) seem to have failed -
    Node: hasecondary.myorgdomain
    Failure reason: cyops-auth service(s) not running.

HA cluster health notification checks when a secondary node is firedrilled

When a firedrill is in progress on a secondary/passive node, you do not receive any 'out of sync' notification; instead, the health notification check sends the following email notification and exits.
Firedrill is in progress on following node(s):
Node: hasecondary.myorgdomain
Current lag with primary node : 52 kB
You can ignore the lag that is displayed in this case since this lag indicates the amount of data the firedrill node needs to sync when csadm ha restore is performed.
You can also check the lag using the get-replication-stat command on the primary node. If a firedrill is in progress on a secondary/passive node, then you can ignore the lag displayed. This is because the 'total_lag' that gets shown in the get-replication-stat messages indicates the amount of data the secondary/passive node will need to sync when the csadm ha restore operation is performed on the node once the firedrill completes.

HA cluster health notification checks during takeover

  1. When takeover is in progress, the previous primary node might send 'out of sync' email notifications for the node on which the takeover was performed, because the previous primary node sees it as no longer replicating data. These notifications can be ignored. After the takeover is completed, the previous primary node is marked as faulted; therefore, you will not see any replication statistics on the old primary node.
  2. After the takeover is performed, you can ignore the messages of the get-replication-stat command on the new primary node. You can also ignore the 'out of sync' email notifications generated by the new primary node: after the takeover, the entries of all the cluster nodes are still included in csadm ha list-nodes, and because the remaining nodes have not yet joined the new primary node, the new primary node keeps generating notifications for those nodes.
  3. When all the other nodes of the HA cluster have joined back to the new primary node, the health notification check works as expected and no ignorable notifications are generated.

Troubleshooting issues based on the notifications

The following section provides details on how to check and fix the possible causes of the failures listed in the email notifications sent by the HA cluster check.

To troubleshoot HA issues, you can use the HA log located at: /var/log/cyops/cyops-auth/ha.log.
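For example, to follow the HA log in real time while reproducing an issue, you can tail it; tail and grep are standard Linux commands and the search keywords are only illustrative:

# Follow the HA log while reproducing the issue (press Ctrl+C to stop)
tail -f /var/log/cyops/cyops-auth/ha.log

# Or scan the most recent entries for failures
grep -iE 'error|fail' /var/log/cyops/cyops-auth/ha.log | tail -n 50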

Heartbeat Failure

Resolution:

When you get a heartbeat failure notification for a secondary node, then do the following:

  1. Check if the cyops-ha service is running on that node, using the systemctl status cyops-ha command.
  2. If it is not running, restart the cyops-ha service, as shown in the example after these steps.
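The following check-and-restart sequence on the affected secondary node is a minimal example; it assumes the service simply stopped and that no deeper issue exists:

# Check whether the cyops-ha service is running
systemctl status cyops-ha

# If it is inactive or failed, restart it and confirm it stays up
systemctl restart cyops-ha
systemctl status cyops-ha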

Node name differs from actual FQDN

Resolution:

Correct the issue reported in the notification, such as "nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster.", using the following steps:

  1. Log in, using SSH, to the node for which you are receiving the above notification.
  2. Use the following command to correct the FQDN of the node (a fuller example follows these steps):
    csadm ha set-node-name <enter-correct-FQDN-here>
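The following example assumes the operating system hostname of the node is already set to the correct FQDN; substitute the actual FQDN directly if it differs:

# Confirm the FQDN the node should report
hostname -f

# Set the node name to the correct FQDN
csadm ha set-node-name $(hostname -f)

# Verify that nodeName now matches in the cluster listing
csadm ha list-nodes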

Secondary/Passive node is out of sync with the Primary node

This issue could occur due to the following reasons:

  • PostgreSQL service status shows requested WAL segment <some-number-here> has already been removed
    OR
    The csadm ha get-replication-stat command shows a higher time lapsed from the last sync when compared to the general time lapsed.
    Resolution:
    In these cases, since the secondary/passive node is completely out of sync with the primary node, you need to perform the following steps on the secondary/passive node:
    1. Run the touch /home/csadmin/.joincluster_in_progress command to create the .joincluster_in_progress file.
    2. Rejoin the cluster as follows:
    csadm ha join-cluster --status active/passive --role secondary --primary-node <Primary-node-FQDN> --fetch-fresh-backup
  • When there are heavy writes on the primary node and the secondary node has not copied the data before the data rolls over, the secondary node goes 'out of sync' and a full synchronization is needed, which can cause the above failures.
    Resolution:
    Increase the wal_keep_segments setting in the /var/lib/pgsql/12/data/postgresql.conf file as described in the Best Practices section (see the sketch after this list).
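A minimal sketch of that change, assuming PostgreSQL 12 and an illustrative value of 512 (size it according to your write volume and the Best Practices section):

# Raise wal_keep_segments in the PostgreSQL 12 configuration (the value 512 is only an example)
sed -i 's/^#\?wal_keep_segments.*/wal_keep_segments = 512/' /var/lib/pgsql/12/data/postgresql.conf

# Restart PostgreSQL so that the new setting takes effect
systemctl restart postgresql-12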

PostgreSQL service is down on the primary node or PostgreSQL service is down on the externalized database host

If the PostgreSQL service is down on the primary node, then the cyops-ha service on all the nodes will be down and no notifications will be generated since the whole cluster is down. You will also not be able to log in to the FortiSOAR UI.

Resolution:

  1. Check the reason for the failure using the systemctl status postgresql-<postgresql-version-here> -l command on the primary node or the externalized database host.
  2. Fix the issue based on the reason for the failure (illustrative commands follow these steps).
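For a node running PostgreSQL 12, the checks might look like the following; journalctl is a standard systemd command and the time window is only illustrative:

# Check the service status and the most recent error on the primary node or the externalized database host
systemctl status postgresql-12 -l

# Inspect recent journal entries for the failure reason
journalctl -u postgresql-12 --since "1 hour ago" | tail -n 100

# Once the underlying issue is fixed, start the service again
systemctl start postgresql-12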

Higher time lapsed from the last sync when compared to the general time lapsed

When you run the csadm ha get-replication-stat command on the secondary node, you might get a message stating that the 'time_elasped_from_last_sync' value is higher than usual.
This issue could occur due to:

  • PostgreSQL down on the primary server.
  • No activity on primary server.
  • Secondary node is out of sync with the Primary node.

Resolution:

  1. Run the systemctl status postgresql-12 command and ensure that you are not getting the "FATAL: could not receive data from WAL stream: requested WAL segment has already been removed" message.
  2. Run the csadm ha get-replication-stat command and check its results (you can also run it under watch, as shown after this list):
    • If you get the "Value of 'sending_lag' is higher than usual on primary when checked" message, there is load on the primary node. Run the top command on the primary node to check which process is taking more CPU time. You can also use the following command for a quick check:
      ps -eo pid,cmd,%mem,%cpu --sort=-%mem | head
    • If you get the "Value of 'replaying_lag' is higher than usual on primary when checked" message, there is load on the secondary node. Run the top command on the secondary node to check which process is taking more CPU time. You can also use the following command for a quick check:
      ps -eo pid,cmd,%mem,%cpu --sort=-%mem | head
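To observe how these lag values trend over time, as in the sample scale test below, you can run the command under watch; the 2-second refresh interval is only an example:

# Refresh the replication statistics every 2 seconds
watch -n 2 csadm ha get-replication-stat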

Sample scale test that was done in the lab to understand the behavior of 'csadm ha get-replication-stat'

What was done before observing the behavior:

First, we stopped the PostgreSQL service on the secondary/passive node.

Next, we generated data on the primary node using the following script. Kill the script after some time, once enough data has been generated on the primary node (see the usage note after the script).

[root@cybersponse csadmin]# cat data_load.sh
#!/bin/sh

psql -U cyberpgsql -d das -c "CREATE TABLE scale_data (
   section NUMERIC NOT NULL,
   id1     NUMERIC NOT NULL,
   id2     NUMERIC NOT NULL
);"


psql -U cyberpgsql -d das -c "
INSERT INTO scale_data
SELECT sections.*, gen.*
     , CEIL(RANDOM()*100)
  FROM GENERATE_SERIES(1, 300)     sections,
       GENERATE_SERIES(1, 900000) gen
 WHERE gen <= sections * 3000;"
[root@cybersponse csadmin]#
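One illustrative way to run the script and stop it later is to start it as a background job and kill that job once enough data has been generated; any other way of terminating the psql sessions also works:

# Start the data generation in the background
bash data_load.sh &

# ... wait until enough data has been generated, then stop the job
kill %1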

During the data generation process, we ran csadm ha get-replication-stat on the primary node; you can observe that the secondary node was lagging by 4702 MB.

get-replication-stat on the primary node:

[root@cybersponse csadmin]# csadm ha get-replication-stat
------------------------------------------------
Warning:
Following could be the issues with the nodes:
1. The postgres database replication to the secondary node is not working due to data log rotation at the primarynode.
2. The secondary node has been shutdown/halted.
3. PostgreSQL not running on node(s).
4. nodeName from 'csadm ha list-nodes' differs from actual FQDN used during join-cluster.
5. If a firedrill is in progress on the node, no action is required. The 'lag' that is displayed indicates the amount of data the firedrill node needs to sync when 'csadm ha restore' will be performed.

If node is up and running,
1.  Check the status of PostgreSQL service using 'systemctl status postgresql-12 -l' on node to get more details.
2. If you see 'FATAL: could not receive data from WAL stream: requested WAL segment has already been removed' in the PostgreSQL service status, you need to re-join the cluster using 'csadm ha join-cluster --fetch-fresh-backup' for this node.

------------------------------------------------

nodeId                            nodeName                 status    role       comment                                      total_lag
--------------------------------  -----------------------  --------  ---------  --------------------------------------------  -----------
469c6330613a332c30dd4d8e3a607cf2  hasecondary.myorgdomain  active    secondary  Joined cluster with haprimary.myorgdomain  4702 MB
[root@cybersponse csadmin]# 

Next, we started PostgreSQL on the secondary node and observed the replication statistics on both the primary and secondary nodes:

On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:27:31 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  4458 MB        11 MB            213 MB           4683 MB
                                                                                       

On the Secondary node:

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  287 MB       00:07:59.113185     


On the Primary node:

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  3600 MB        3456 kB          727 MB           4330 MB
                                                                                     

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:27:49 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  854 MB       00:08:18.360359   


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:05 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  2774 MB        5632 kB          1273 MB          4052 MB  


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:07 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1486 MB      00:07:35.238068   


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:28 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  1910 MB        6784 kB          1803 MB          3719 MB   


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:29 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1952 MB      00:07:56.70475    




On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:44 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  1153 MB        1408 kB          2278 MB          3433 MB
                                                                                         
On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:28:39 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2286 MB      00:07:04.28739
                                                         

On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:00 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  452 MB         3200 kB          2726 MB          3181 MB         

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:12 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2941 MB      00:07:33.857054 


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:25 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        0 bytes          2658 MB          2658 MB 

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:30 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2519 MB      00:06:48.870481  


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:46 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        154 kB           2172 MB          2172 MB    


On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:29:53 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1985 MB      00:07:11.244842 


On the Primary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:06 2020

------------------------------------------------
Note:
'sending_lag' indicates load on the primary node
'receiving_lag' indicates network delay or load on the passive/secondary node
'replaying_lag' indicates load on the passive/secondary node
------------------------------------------------

node_hostname            sending_lag    receiving_lag    replaying_lag    total_lag
-----------------------  -------------  ---------------  ---------------  -----------
hasecondary.mydomain  0 bytes        0 bytes          1687 MB          1687 MB     

On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:11 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  1552 MB      00:06:25.877238   

On the Secondary node:


Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:30:57 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  2288 bytes   00:00:55.861428

On the Secondary node:

Every 2.0s: csadm ha get-replication-stat                                                       Tue May 12 05:31:23 2020

primary_hostname          total_lag    time_elasped_from_last_sync
------------------------  -----------  -----------------------------
haprimary.mydomain  0 bytes      00:00:19.235799 

Troubleshooting

To troubleshoot HA issues, you can use the HA log located at: /var/log/cyops/cyops-auth/ha.log. To understand and troubleshoot the HA cluster health notifications, see the Monitoring health of HA clusters section.

Failure to create an HA cluster

If the process to configure HA using the automated join-cluster fails and the HA cluster is not created due to reasons such as proxy setup, etc., you can perform the steps mentioned in the following procedure to configure HA:

  1. Connect to your VM as a root user and run the following command:
    # csadm ha
    This will display the options available to configure HA.
  2. To configure a node as a secondary node, perform the following steps:
    1. SSH to the active primary node and run the csadm ha export-conf command to export the configuration details of the active primary node to a configuration file named ha.conf.
      You must copy the ha.conf file from the active primary node to the node that you want to configure as a secondary node.
    2. On the active primary server, add the hostnames of the secondary nodes to the allowlist, using the following command:
      # csadm ha allowlist --nodes
      Add the comma-separated list of hostnames of the cluster nodes that you want to add to the allowlist after the --nodes argument.
      Important: In case of an externalized database, you need to add all the nodes in a cluster to the allowlist in the pg_hba.conf file.
    3. Ensure that all HA nodes are resolvable through DNS and then SSH to the server that you want to configure as a secondary node and run the following command:
      # csadm ha join-cluster --status <active, passive> --role <primary, secondary> --conf <location of the ha.conf file>
      For example, # csadm ha join-cluster --status passive --role secondary --conf tmp/ha.conf
      This will add the node as a secondary node in the cluster.
      Note: If you run the csadm ha join-cluster command without adding the hostnames of the secondary nodes to the allowlist, then you will get an error such as 'Failed to verify...'.
      Also, when you join a node to an HA cluster, the list-nodes command does not show a node that is in the process of joining the cluster; the newly added node is displayed by the list-nodes command only after it has been added to the HA cluster. A consolidated example of this procedure is shown after these steps.
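The following consolidated run is illustrative; the host names, the csadmin user, and the /tmp path are examples, and it assumes that export-conf writes ha.conf to the current working directory:

# On the active primary node: export the configuration and allowlist the new node
csadm ha export-conf
scp ha.conf csadmin@hasecondary.myorgdomain:/tmp/
csadm ha allowlist --nodes hasecondary.myorgdomain

# On the node being added, after confirming DNS resolution: join as a passive secondary node
csadm ha join-cluster --status passive --role secondary --conf /tmp/ha.conf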

Unable to add a node to an HA cluster using join-cluster, and the node gets stuck at a service restart

This issue occurs when you are performing join-cluster on a node and that node gets stuck at a service restart, specifically at the PostgreSQL restart.

Resolution

Terminate the join-cluster process and retry join-cluster with the additional --fetch-fresh-backup parameter.
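For example, after terminating the stuck process, the retry for an active secondary node could look like the following; the status, role, and primary node FQDN depend on your topology:

# Retry the join, fetching a fresh backup of the primary database
csadm ha join-cluster --status active --role secondary --primary-node <Primary-node-FQDN> --fetch-fresh-backup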

Fixing the HA cluster when the Primary node of that cluster is halted and then resumed

If your primary node is halted due to a system crash or a similar event and a new cluster is formed with the other nodes in the HA cluster, the list-nodes command on those other nodes displays the primary node in the Faulted state. Since the administrator triggered the takeover on the other cluster nodes, the administrator is aware of the faulted primary node. Note that even after the halted primary node resumes, it still considers itself the primary node of its own cluster; therefore, after the resume, the list-nodes command on that node displays it as Primary Active.

Resolution

To fix the HA cluster to have only one node as primary active node, do the following:

  1. On the resumed (old) primary node, run leave-cluster, which removes this node from the HA cluster.
  2. Run the join-cluster command to join this node to the HA cluster with the new primary node (an illustrative sequence follows these steps).
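The following sequence, run on the resumed (old) primary node, is illustrative; the new primary node FQDN is an example, and --fetch-fresh-backup is included on the assumption that the data on this node has diverged:

# Remove this node from its own single-node cluster
csadm ha leave-cluster

# Join it to the cluster headed by the new primary node
csadm ha join-cluster --status active --role secondary --primary-node hanewprimary.myorgdomain --fetch-fresh-backup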

Unable to join a node to an HA cluster when a proxy is enabled

You are unable to join a node to an HA cluster using the join-cluster command when you have enabled a proxy through which clients connect to the HA cluster.

Resolution

Run the following commands on your primary node:

$ sudo firewall-cmd --zone=trusted --add-source=<CIDR> --add-port=<ElasticSearchPort>/tcp --permanent

$ sudo firewall-cmd --reload

For example,

$ sudo firewall-cmd --zone=trusted --add-source=64.39.96.0/20 --add-port=9200/tcp --permanent

$ sudo firewall-cmd --reload

Changes made in nodes in an active-active cluster fronted with a load balancer take some time to reflect

In the case of a FortiSOAR active-active cluster that is fronted with a load balancer or reverse proxy such as HAProxy, changes such as adding a module to the FortiSOAR navigation, updating or adding the permissions of the logged-in user, or updates to the logged-in user's parent, child, and sibling hierarchy do not get reflected immediately.

These issues occur due to local caching of these settings at the individual cluster nodes.

Resolution

Log off and log back into the FortiSOAR user interface after ten minutes to see the recent updates.

OR

If you want the settings to reflect immediately, run the following command on the "active" nodes in the cluster:
php /opt/cyops-api/app/console --env=prod app:cache:clear
Important: You do not need to run the above command on the "passive" nodes of the cluster.

Post takeover, the nodes in an HA cluster do not point to the new active primary node

This issue occurs when, during the takeover process, either the previous primary node is down or the automatic join-cluster fails. In the case of an internal database cluster, when the failed primary node comes back online after the takeover, it still considers itself the active primary node with all its services running. In the case of an external database cluster, when the failed primary node comes back online after the takeover, it detects its status as "Faulted" and disables all its services.

Resolution

Run the csadm ha join-cluster command to point all the nodes to the new active primary node. For details on join-cluster, see Process for configuring HA.

After performing the leave-cluster operation, the license is not found on a secondary node

In the case of an internal DB, after you have performed the leave-cluster operation on a secondary node, for example to upgrade the node, you might see the following error when rejoining the node to the cluster once the upgrade is done: "License not found on the system". You might also see this error while trying to perform the 'restore' operation on a secondary node after completing the 'firedrill' operation.

Resolution

Run the following script as a root user on the secondary node on which you are getting the "License not found on the system" error:

#!/bin/bash

init(){
    current_pg_version=$(/bin/psql --version | egrep -o '[0-9]{1,}\.' | cut -d'.' -f1)
    re_join_cluster_sql="/opt/cyops-auth/.re-join-cluster.sql"
    db_config="/opt/cyops/configs/database/db_config.yml"
}

re_join_cluster(){
    if [ ! -f "$re_join_cluster_sql" ]; then
        echo "File [$re_join_cluster_sql] does not exist. Contact the Fortinet support team for further assistance"
        exit 1
    fi
    csadm services --stop
    if [ ! -d "/var/lib/pgsql/${current_pg_version}.bkp" ]; then
        mv /var/lib/pgsql/$current_pg_version /var/lib/pgsql/${current_pg_version}.bkp 
    fi
    # The rm below is required in case the user re-runs the script.
    rm -rf /var/lib/pgsql/${current_pg_version}
    rm -f /opt/cyops/configs/cyops_pg_${current_pg_version}_configured
    mkdir -p /var/lib/pgsql/${current_pg_version}/data
    chown -R postgres:postgres /var/lib/pgsql/${current_pg_version}
    chmod -R 700 /var/lib/pgsql/${current_pg_version}
    /opt/cyops-postgresql/config/config.sh ${current_pg_version}
    local hkey=$(csadm license --get-device-uuid)
    sudo -Hiu postgres psql -U postgres -c "ALTER USER cyberpgsql WITH ENCRYPTED PASSWORD '$hkey';"
    createdb -U cyberpgsql -e -w --no-password das -O cyberpgsql -E UTF8
    psql -U cyberpgsql -d das < $re_join_cluster_sql
    touch /home/csadmin/.joincluster_in_progress
    if [ ! -f "${db_config}.bkp" ]; then
        yes| cp ${db_config} ${db_config}.bkp
    fi
    local db_pass_encrypted=$(python3 /opt/cyops/configs/scripts/manage_passwords.py --encrypt $hkey)
    /opt/cyops/configs/scripts/confUtil.py -f $db_config -k 'pg_password' -v "$db_pass_encrypted"
    systemctl start postgresql-${current_pg_version}
}

echo_manual_steps(){
echo "Perform below steps manually"
echo "
    1. If node is passive, then run below command, else skip it.
       csadm ha join-cluster --status passive --primary-node <primary-node> --fetch-fresh-backup
    2. If node is active/secondary.
       csadm ha join-cluster --status active --role secondary --primary-node <primary-node> --fetch-fresh-backup
    3. rm -rf /var/lib/pgsql/${current_pg_version}.bkp
    4. rm -f /opt/cyops/configs/database/db_config.yml.bkp
    5. rm -rf /var/lib/pgsql/test_dr_backups"
}
####################
# Main/main/MAIN starts here
####################

# Stop right after any command failure
set -e
# Debug mode
set -x
init
re_join_cluster
####################
# You need to perform the below steps manually.
########
# Turn off the debug mode so that the manual steps are displayed clearly
set +x
echo_manual_steps
exit 0

The leave-cluster operation fails at the "Starting PostgreSQL Service" step when a node in the cluster is faulted

This issue occurs in the case of an active-active-passive cluster that has an internal DB and whose HA cluster contains a node whose status is 'Faulted'. In this case, when you run the leave-cluster operation, it fails at the "Starting service postgresql-12" step.

Resolution

To resolve this issue, run the following commands:

systemctl stop postgresql-12
rm -f /var/lib/pgsql/12/data/standby.signal
systemctl start postgresql-12

Once you have completed the above steps, run the csadm ha leave-cluster command.

Resetting the password for an instance that is part of an active/active cluster causes the other instances of that cluster to be unable to log in to FortiSOAR

If you reset the password of an instance that is part of an active/active cluster, then the FortiSOAR login page is not displayed for the other instances of this cluster. You will also observe database login failure errors in the ha.log and the prod.log files, in the case of both active/active and active/passive clusters.

Resolution

On the other instances that are part of the cluster, do the following:

  1. Copy the encrypted password (the pg_password value) from the db_config.yml file located at /opt/cyops/configs/database/db_config.yml on the active node, and then update it in the db_config.yml file on each secondary node (an illustrative sketch follows these steps).
  2. Run the cache:clear command:
    $ sudo -u nginx php /opt/cyops-api/app/console cache:clear
  3. Restart FortiSOAR services:
    # csadm services --restart
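The sketch below assumes the encrypted value is stored under the pg_password key in db_config.yml (as used by the script earlier in this section); grep is only one way to view it, and you can edit the file on the secondary nodes with any editor:

# On the active node: view the encrypted password value
grep pg_password /opt/cyops/configs/database/db_config.yml

# On each secondary node, after updating pg_password in /opt/cyops/configs/database/db_config.yml:
sudo -u nginx php /opt/cyops-api/app/console cache:clear
csadm services --restart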