
How to recover from an unhealthy service status

The services in the Security Event Manager are highly available and fault tolerant, with data replicated three times across different data hosts. If any one of the BigData hosts goes down, you can expect some service degradation (such as a reduced insert rate and slower query performance), but all basic functionality (such as FortiView and LogView) is preserved with no data loss. While the system is mostly self-healing, certain failure incidents require manual intervention.

The Monitor page contains tools to help you monitor the status and health of the hosts and services (see Monitor). We suggest scheduling a routine monitoring and maintenance window and setting up system alerts to enable rapid remediation and fault prevention. If you need to shut down your FortiAnalyzer-BigData, follow the best practices (see General maintenance and best practices) to avoid damaging your database.

Stateful workloads occasionally require manual intervention to recover from incidents. When unhealthy workloads are detected, check the status of all the BigData hosts to ensure they are functioning. In general, address host-level incidents before moving on to the service level.
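If you have network access to the cluster's management addresses, a quick reachability probe can confirm that each BigData host is at least answering on the network before you look at service-level status. The short Python sketch below is only an illustration: the host addresses and the choice of TCP port 22 (SSH) are assumptions, and the authoritative host status is still the one shown in Cluster Manager.

# Quick reachability probe for the BigData hosts (illustrative only).
# The addresses below are placeholders -- substitute the management
# addresses of your own FortiAnalyzer-BigData hosts.
import socket

HOSTS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # hypothetical addresses
PORT = 22        # SSH; any port the hosts are known to listen on works
TIMEOUT = 3      # seconds

def is_reachable(host: str, port: int = PORT, timeout: float = TIMEOUT) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in HOSTS:
        state = "reachable" if is_reachable(host) else "NOT reachable"
        print(f"{host}: {state}")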

The following sections contain troubleshooting tips for when FortiAnalyzer-BigData services have an unhealthy status:

Core services

Core / Query

If the Query service is unhealthy, or if FortiView or LogView stops working, you can try the following:

  1. From Cluster Manager > Services, check if the Data Lake service group is healthy; if not, fix it first.
  2. From Cluster Manager > Services > Core, manually Restart the Query service, and then wait a few minutes to see if the issue is fixed.
Core / Ingestion

If the Ingestion service is unhealthy, or if the log insert rate remains at zero while the receive rate is higher, you can try the following (a consumer-lag sketch follows the list):

  1. From Cluster Manager > Services, check if the Message Broker service group is healthy; if not, fix it first.
  2. In Cluster Manager > Services > Core, manually Restart the Ingestion service, and then wait a few minutes to see if the issue is fixed.
  3. If the issue persists after the restart, go to Cluster Manager > Jobs > Create Custom Job, and select Kafka Deep Clean as the template.
  4. Find the newly created "Kafka Deep Clean" job in the job list and click Run.
    Caution
    This will purge all the data in the queue and start a fresh Pipeline. Any unprocessed data will be lost.
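If you can reach the cluster's Kafka brokers from a machine with Python, consumer lag is a useful signal for telling "logs are not arriving" apart from "logs are arriving but not being consumed". The sketch below uses the kafka-python package; the broker address, topic name, and consumer group are placeholders because the internal names used by FortiAnalyzer-BigData are not documented here, so treat this as a diagnostic idea rather than a supported procedure.

# Illustrative consumer-lag check with kafka-python (pip install kafka-python).
# Broker address, topic, and consumer-group names are placeholders.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "10.0.0.11:9092"   # hypothetical broker address
TOPIC = "log-ingest"           # hypothetical topic name
GROUP = "pipeline-consumers"   # hypothetical consumer group of the Pipeline

# A consumer bound to the same group can read that group's committed offsets.
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, group_id=GROUP,
                         enable_auto_commit=False)

partitions = [TopicPartition(TOPIC, p)
              for p in sorted(consumer.partitions_for_topic(TOPIC) or [])]
end_offsets = consumer.end_offsets(partitions)   # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0      # last offset the group consumed
    lag = end_offsets[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: end={end_offsets[tp]} "
          f"committed={committed} lag={lag}")
print(f"total lag: {total_lag}")
consumer.close()

A lag that keeps growing while the insert rate stays at zero suggests the Pipeline is not consuming, which matches the restart and deep-clean steps above.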
Core / Pipeline

If the Pipeline service is unhealthy, or if the Pipeline Health Check in Monitor > Health remains unhealthy for hours, you can try the following:

  1. In Cluster Manager > Services, check if the Data Lake and Message Broker service groups are healthy; if not, fix them first.
  2. In Cluster Manager > Services > Core, manually Restart the Pipeline service, and then wait a few minutes to see if the issue is fixed.
  3. If the issue persists after a few hours, go to Cluster Manager > Jobs > Create Custom Job and select Purge Data Pipeline as the template.
  4. Find the newly created "Purge Data Pipeline" job in the job list and click Run.
    Caution
    This will purge all the data in the queue and start a fresh Pipeline. Any unprocessed data will be lost.

Data Lake services

Data Lake / Impala

If the Impala service is unhealthy, you can try the following (a connectivity-test sketch follows the list):

  1. Check if the Metastore service group is healthy; if not, fix it first.
  2. From Cluster Manager > Services > Data Lake, manually Restart the Impala service and wait a few minutes to see if the issue is fixed.
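As a supplementary check, you can confirm that an Impala daemon accepts queries at all by running a trivial SELECT. The sketch below uses the impyla package; the host, the port, and whether the Impala daemon endpoint is reachable from outside the appliance are assumptions.

# Minimal Impala connectivity probe using impyla (pip install impyla).
from impala.dbapi import connect

IMPALA_HOST = "10.0.0.11"   # hypothetical Impala daemon address
IMPALA_PORT = 21050         # default HiveServer2 port for Impala daemons

try:
    conn = connect(host=IMPALA_HOST, port=IMPALA_PORT, timeout=10)
    cur = conn.cursor()
    cur.execute("SELECT 1")          # trivial query: proves the daemon accepts work
    print("Impala responded:", cur.fetchall())
    cur.close()
    conn.close()
except Exception as exc:             # connection refused, timeout, auth errors, etc.
    print("Impala check failed:", exc)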
Data Lake / Kudu

If the Kudu service is unhealthy, you can try the following:

  1. From Cluster Manager > Services, manually Stop the Core service group.
  2. Check if the Metastore service group is healthy; if not, fix it first.
  3. From Cluster Manager > Services > Data Lake, manually Restart the Kudu service and wait a few minutes to see if the issue is fixed.
  4. If the issue persists after the restart and the log indicates that Kudu failed to synchronize time, go to Cluster Manager > Jobs > Create Custom Job and select NTP Sync as the template (a clock-offset sketch follows this list).
  5. Find the newly created NTP Sync job in the job list and click Run.
  6. After the job finishes running, manually Start the Kudu service again to see if the status becomes healthy.
  7. Once the Kudu service is healthy, manually Start the Core service group again.
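Because Kudu depends on closely synchronized clocks, a quick way to corroborate the "failed to synchronize time" symptom is to measure a host's clock offset against an NTP server. The sketch below uses the ntplib package; the server shown is a placeholder for whatever time source your cluster is actually configured to use, and the one-second threshold is only an illustrative cutoff.

# Rough clock-offset check against an NTP server (pip install ntplib).
# Run it on (or against) each BigData host whose clock you suspect.
import ntplib

NTP_SERVER = "pool.ntp.org"   # placeholder; use your configured time source

client = ntplib.NTPClient()
try:
    response = client.request(NTP_SERVER, version=3, timeout=5)
    print(f"clock offset vs {NTP_SERVER}: {response.offset:+.3f} seconds")
    if abs(response.offset) > 1.0:
        print("offset is large; time synchronization likely needs attention")
except ntplib.NTPException as exc:
    print("NTP query failed:", exc)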

If the Kudu Health Check in Monitor > Health remains unhealthy for hours but the Kudu service status is healthy, you can try the following (a cluster-check sketch follows the list):

Note

The Kudu Health Check may temporarily fail when the Storage Group Restore or Data Rebalance job is running. Once the jobs are finished running, the status will automatically clear. Make sure those jobs are not running before troubleshooting.

  1. From Cluster Manager > Services, manually Stop the Core service group.
  2. Wait about 15 minutes and then navigate to Monitor > Health to rerun the Kudu Health Check.
  3. If the health check returns as healthy, return to the Services page to manually Start the Core service group.
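If the kudu command-line tool is available on a host that can reach the Kudu masters (an assumption; the appliance may not expose it), the standard kudu cluster ksck check reports exactly which tablet servers or tablets are unhealthy, which helps to decide whether simply waiting is likely to help. The master addresses below are placeholders.

# Illustrative wrapper around the standard Kudu consistency check
# ("kudu cluster ksck"). Master addresses are placeholders; 7051 is the
# default Kudu master RPC port.
import subprocess

KUDU_MASTERS = "10.0.0.11:7051,10.0.0.12:7051,10.0.0.13:7051"

result = subprocess.run(
    ["kudu", "cluster", "ksck", KUDU_MASTERS],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    # ksck exits non-zero when it finds unhealthy tablets or servers.
    print("ksck reported problems:\n" + result.stderr)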

Message Broker services

Message Broker / Kafka

If the Kafka service is unhealthy, you can try the following (a broker-reachability sketch follows the list):

  1. From Cluster Manager > Services, manually Stop the Core service group.
  2. Go to Cluster Manager > Services > Message Broker, manually Restart the Kafka service, and check that the status becomes healthy.
  3. If the issue remains after the restart, go to Cluster Manager > Jobs > Create Custom Job and select Kafka Deep Clean as the template.
  4. Find the newly created "Kafka Deep Clean" job in the job list and click Run.
    Caution
    This will purge all the data in the queue and start a fresh Pipeline. Any unprocessed data will be lost.
  5. Return to Cluster Manager > Services and manually Start the Core service group.
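Before restarting, it can help to confirm that the Kafka brokers respond to metadata requests at all. The sketch below uses the kafka-python admin client; the broker address is a placeholder, since the internal listener used by FortiAnalyzer-BigData is not documented here.

# Illustrative broker-metadata probe with kafka-python (pip install kafka-python).
from kafka.admin import KafkaAdminClient

BOOTSTRAP = "10.0.0.11:9092"   # hypothetical broker address

try:
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP,
                             client_id="health-probe",
                             request_timeout_ms=10000)
    cluster = admin.describe_cluster()      # brokers, controller, cluster id
    print("brokers:", cluster.get("brokers"))
    print("topics:", admin.list_topics())
    admin.close()
except Exception as exc:
    print("Kafka broker check failed:", exc)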

Metastore services

Metastore / HDFS

If the HDFS service is unhealthy, or if the HDFS-related Health Checks in Monitor > Health remain unhealthy, you can try the following (a safe-mode check sketch follows the list):

  1. From Cluster Manager > Services > Metastore, manually Restart the HDFS service, and then wait a few minutes to see if the status changes to healthy.
  2. If the issue persists after the restart and the logs indicate that HDFS is in safe mode, go to Cluster Manager > Jobs > Create Custom Job and select HDFS Safemode Leave as the template.
  3. Find the newly created "HDFS Safemode Leave" job in the job list and click Run.
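The HDFS Safemode Leave job presumably wraps the equivalent of the standard Hadoop admin command. If you have shell access to a host where the hdfs CLI is available (an assumption for an appliance), you can confirm the safe-mode state directly as sketched below; the supported path remains the custom job described above.

# Illustrative safe-mode check using the standard Hadoop CLI.
import subprocess

status = subprocess.run(
    ["hdfs", "dfsadmin", "-safemode", "get"],
    capture_output=True, text=True,
)
print(status.stdout.strip())   # e.g. "Safe mode is ON" / "Safe mode is OFF"

if "ON" in status.stdout:
    print("NameNode is in safe mode; run the HDFS Safemode Leave job "
          "from Cluster Manager > Jobs.")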
