High Availability and Disaster Recovery - Elasticsearch

The following section provides an introduction and requirements for High Availability and Disaster Recovery for Elasticsearch.

Understanding High Availability (HA)

The FortiSIEM High Availability feature enables you to run multiple Active-Active Supervisor nodes. This means that:

  • Users can log in to any Supervisor node and perform GUI operations

  • Agents, Collectors, and Worker nodes can communicate with any Supervisor node

  • Analytics operations, such as streaming rule evaluation and query processing, are distributed across the Supervisor nodes

  • The databases are replicated among the Supervisor nodes as follows:

    • The CMDB, residing in a PostgreSQL database

    • CMDB device configurations, residing on the file system and managed by SVN-lite

    • Profile data, residing in SQLite databases on the Supervisor node

An Active-Active Supervisor cluster is built around the concept of Leader and Follower Supervisor nodes, configured as a linked list.

  1. The first Supervisor, where you install the FortiSIEM license, is the Leader. The Leader’s UUID matches the UUID in the license. On the Leader, the Redis and PostgreSQL database services run in Master mode, meaning that all PostgreSQL database writes from all Supervisors go to the Leader node.

  2. Next, you add a Follower, which follows the Leader in the sense that its PostgreSQL database is replicated from that of the Leader. On the Follower node, both the Redis and PostgreSQL database services run in Follower (that is, read-only) mode. CMDB and Profile data are also copied via rsync from the Leader to the Follower node.

  3. You can add another Follower to follow the Follower node added in Step 2. Its PostgreSQL database is replicated from the Supervisor created in Step 2. On this node, both the Redis and PostgreSQL database services run in Follower (that is, read-only) mode. CMDB and Profile data are also copied via rsync from the Follower node in Step 2 to the Follower node in Step 3. A replication linked list is formed: Leader -> Follower in Step 2 -> Follower in Step 3.

  4. You can add more Follower nodes to follow the last Follower node in the chain. (A sketch for verifying the replication chain follows this list.)
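
Because the chain is built on PostgreSQL streaming replication, you can verify it from each node’s shell. The following is a minimal sketch using standard PostgreSQL views; the database user and connection details are illustrative and may differ in your deployment.

    # On the Leader: list the Follower(s) streaming from this node.
    psql -U postgres -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

    # On each Follower: show the upstream node this Supervisor replicates from.
    psql -U postgres -c "SELECT status, sender_host FROM pg_stat_wal_receiver;"

    # A Follower should report that it is in recovery (read-only) mode.
    psql -U postgres -c "SELECT pg_is_in_recovery();"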

Detailed Leader/Follower Supervisor configuration steps are in Configuring High Availability. It is recommended that you configure a Load Balancer in front of the Active-Active Supervisor cluster, so that FortiSIEM GUI users, Collectors, and Agents can reach any of the Supervisors through the Load Balancer. Workers are aware of the Supervisors and their Leader/Follower roles, and communicate with the appropriate Supervisor in a load-balanced manner.

If the Leader Supervisor goes down, you need to log in to the (immediate) Follower Supervisor and promote it to be the Leader. This is done by running a script on the Follower; see High Availability Operations for details. Note that while the Leader is down:

  • GUI users cannot log in to the Supervisor cluster, since the master PostgreSQL database on the Leader is not available.

  • Incidents may be lost, since the Rule Master cannot upload Incidents to the App Server; the App Server needs to authenticate the Rule Master via the master PostgreSQL database located on the Leader.

  • Event parsing and event insertion into the database will continue.

The Supervisor cluster will come back up once the Follower is promoted to the Leader. Since the FortiSIEM license is tied to the failed Leader’s UUID, you will see a message prompting you to install a new license with the new Leader’s UUID within a two-week grace period from the time of failure. If you fail to do this, you will not be able to log in to the system.

If a Follower Supervisor goes down, there is no impact:

  • The Load Balancer will route GUI users to an available Supervisor

  • Rules assigned to the failed Supervisor will be re-assigned to one of the available Supervisors

Once the failed Supervisor comes back up, it will automatically recover. You also have the option to delete the failed Supervisor node, clean it, and add it back to the cluster as a Follower of the last node in the chain.

Detailed failover steps are provided in High Availability Operations.

Understanding Disaster Recovery (DR)

Disaster Recovery provides availability in case the entire Supervisor, Worker, and Elasticsearch cluster goes down. To satisfy this requirement, in addition to the Primary Site (Site 1), a separate, fully licensed FortiSIEM instance needs to run in a Secondary Site (Site 2). The two sites need to be set up identically in terms of Supervisors, Workers, and event storage, with the exception that Site 2 does not allow multiple Supervisors.

Note that the Supervisors in Primary Site 1 are all Active-Active, while the Supervisor in Secondary Site 2 is a Hot Standby with limited read-only GUI functionality.

Under normal operations:

  • Collectors upload events to Site 1 Worker nodes, which store them in the Site 1 Elasticsearch cluster.

  • The Site 1 Elasticsearch cluster replicates the events to the Site 2 Elasticsearch cluster.

  • The following databases residing on the Site 1 (Leader) Supervisor are replicated to the Site 2 Supervisor:

    • The CMDB, residing in a PostgreSQL database

    • CMDB device configurations, residing on the file system and managed by SVN-lite on the Supervisor node

    • Profile data, residing in SQLite databases on the Supervisor node

The event replication between the Site 1 and Site 2 Elasticsearch clusters uses the bi-directional Cross-Cluster Replication (CCR) mechanism; from each site, the other site’s cluster is configured as a remote cluster, as sketched below.
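
FortiSIEM sets up this replication for you, but the underlying mechanism is standard Elasticsearch CCR. The sketch below only illustrates the two building blocks, registering the peer site as a remote cluster and creating a follower index; the hostnames, credentials, and index names are assumptions, not the configuration FortiSIEM actually applies.

    # On Site 2: register Site 1 as a remote cluster. The seed address uses the
    # Elasticsearch transport port (9300 by default); adjust to your deployment.
    curl -sk -u elastic:PASSWORD -X PUT "https://site2-es.example.com:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "persistent": {
        "cluster": { "remote": { "site1": { "seeds": ["site1-es.example.com:9300"] } } }
      }
    }'

    # On Site 2: create a follower index that replicates a leader index on Site 1.
    curl -sk -u elastic:PASSWORD -X PUT "https://site2-es.example.com:9200/fortisiem-events-follower/_ccr/follow" \
      -H 'Content-Type: application/json' -d '
    {
      "remote_cluster": "site1",
      "leader_index": "fortisiem-events"
    }'

In a bi-directional setup, the same two steps are mirrored on Site 1, so each site leads its locally written indices and follows those of the other site.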

On the Site 2 Supervisor, PostgreSQL database services run in Follower (that is, read-only) mode. Users can log on to the Site 2 Supervisor but can perform only very limited operations:

  • Can view CMDB, Incidents, Cases, Tasks, Resources, and all settings on the ADMIN page except the License Usage page.

  • Cannot run queries on the ANALYTICS page; all Dashboard widgets and report-related graphs, such as the License Usage page, show no data.

  • Cannot perform editing operations on any GUI page.

  • All update-related actions are unavailable.

When disaster strikes and the entire Site 1 is down:

  1. Promote the Site 2 Supervisor from Secondary to Primary.

  2. Make DNS changes so that users can log on to the Site 2 Supervisor and Collectors can send events to the Site 2 Workers.

  3. Site 2 Workers write events to the Site 2 Elasticsearch cluster. Elasticsearch queries span both clusters visible from Site 2: the local Site 2 cluster and the Site 1 remote cluster, as illustrated in the sketch after this list.
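
The cross-site query behavior in Step 3 corresponds to Elasticsearch cross-cluster search, where indices on a registered remote cluster are addressed with a cluster-name prefix. A minimal sketch, again with illustrative hostnames, credentials, and index names:

    # Run against Site 2: search the local index and the same index on the
    # Site 1 remote cluster in a single request.
    curl -sk -u elastic:PASSWORD \
      "https://site2-es.example.com:9200/fortisiem-events,site1:fortisiem-events/_search?pretty" \
      -H 'Content-Type: application/json' -d '
    { "query": { "match_all": {} }, "size": 1 }'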

When the old Primary (Site 1) is recovered and powered up, you need to add Site 1 back as a Secondary site. Site 2 will then sync the missing data to Site 1.

Once the replication is complete, you can return to the pre-disaster setup by promoting Site 1 back to Primary.

Detailed failover and failback steps are provided in Disaster Recovery Operations.

Requirements for Successful HA Implementation

  • All Supervisor nodes must have the same hardware resources (CPU, memory, disk).

  • Make sure the Load Balancer DNS name (or the Supervisor DNS names, if a Load Balancer is not used) is resolvable; a quick check follows this list.
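
The quick check referenced above can be run from any host that needs to reach the cluster (Collectors, Agents, user workstations). The hostnames are placeholders; substitute your own.

    # Report any name that does not resolve.
    for host in fortisiem-lb.example.com super1.example.com super2.example.com; do
      getent hosts "$host" > /dev/null || echo "UNRESOLVED: $host"
    done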

Requirements for Successful DR Implementation

  • Two separate FortiSIEM licenses - one for each site.
  • The installation at both sites must be identical - Workers, storage type, archive setup, and hardware resources (CPU, memory, disk) of the FortiSIEM nodes.
  • DNS names are used for the Supervisor nodes at the two sites. Elasticsearch clusters should also be set up identically on the two sites. Make sure that users, Collectors, and Agents can access both Supervisor nodes by their DNS names.
  • DNS names are used for the Worker upload addresses.
  • TCP ports for HTTPS (TCP/443), SSH (TCP/22), PostgreSQL replication (TCP/5432), Elasticsearch replication (TCP/9200), and the private SSL communication port between phMonitor processes (TCP/7900) must be open between the two sites; a connectivity check follows this list.
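
The connectivity check referenced above can be done with netcat from each site toward the other; the peer hostname is a placeholder.

    # Probe the DR ports toward the peer site (3-second timeout per port).
    for port in 443 22 5432 9200 7900; do
      nc -zv -w 3 site2-super.example.com "$port"
    done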
