Administration Guide

How to fix Kudu consensus mismatch

In rare situations, such as when a host is powered off ungracefully, Kudu tablet consensus can break. When this occurs, Monitor > Health > Kudu Health Check fails and reports CONSENSUS_MISMATCH in the check result.

Example:

Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:

69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): RUNNING

538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING

077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]

All reported replicas are:

A = 69c7e95e57f748bd801be9562db9684e

B = 538735a93bb8421b8fc2794fb31c52a7

C = 077b5c932f2e4266820d13fb23442964

D = d06a84c881704e5d9f363a77cfd721d5
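When the check result is long, it can help to pull the tablet ID and replica hosts out of the text programmatically. The following Python sketch assumes the result has been copied into a string and that its lines follow the layout of the example above; the names parse_conflict, Replica, and SAMPLE are hypothetical and are not part of the product.

```python
import re
from dataclasses import dataclass

# Patterns are modeled on the example output only; other versions may
# format these lines differently (assumption).
TABLET_RE = re.compile(
    r"Tablet (?P<tablet_id>[0-9a-f]{32}) of table '?(?P<table>[^' ]+)'? is conflicted"
)
REPLICA_RE = re.compile(
    r"(?P<uuid>[0-9a-f]{32}) \((?P<host>[^:()]+):(?P<port>\d+)\): "
    r"(?P<state>[A-Za-z_ ]+?)(?P<leader> \[LEADER\])?$"
)


@dataclass
class Replica:
    uuid: str
    host: str
    port: int
    state: str
    is_leader: bool


def parse_conflict(text):
    """Return (tablet_id, table, replicas) for the first conflicted tablet, or None."""
    header = TABLET_RE.search(text)
    if header is None:
        return None
    replicas = []
    for line in text.splitlines():
        m = REPLICA_RE.match(line.strip())
        if m:
            replicas.append(
                Replica(m["uuid"], m["host"], int(m["port"]),
                        m["state"], m["leader"] is not None)
            )
    return header["tablet_id"], header["table"], replicas


# The example output from this section, pasted as a string.
SAMPLE = """
Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:
69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): RUNNING
538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING
077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
"""

if __name__ == "__main__":
    tablet_id, table, replicas = parse_conflict(SAMPLE)
    print(tablet_id, [r.host for r in replicas if not r.is_leader])
```

The non-leader hosts printed by the sketch are the candidates for the Tablet Server Address field described in the procedure.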

To fix the consensus mismatch:
  1. Go to Cluster Manager > Jobs > Create Custom Job.
  2. From the Template dropdown, select Kudu Replica Rebuild and configure the following settings:
    • Tablet Server Address: Use the hostname of either of the two non-leader replicas from the conflict output.

    • Tablet Id: Use the tablet ID that appears in the first line of the conflict output.

    • Reason: Enter a description of the error.

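      In the example above, Tablet Server Address would be blade-10-0-1-6 or blade-10-0-1-5 (the two non-leader replicas), and Tablet Id would be fcdb22e988f54674bf7bd81957d96d99.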

  3. In the Jobs table view, find the Kudu Replica Rebuild row, and click Run in Actions.
  4. If the Kudu Health Check result lists multiple conflicts, repeat steps 1-3 to run one job for each conflicted tablet.
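When several tablets are conflicted, the values for each job can be prepared in advance. The sketch below only builds the field values to enter in the Create Custom Job form. It does not submit anything, and no job-submission API is assumed. The function name plan_rebuild_jobs, the input shape, and the default Reason text are hypothetical.

```python
def plan_rebuild_jobs(conflicts, reason="CONSENSUS_MISMATCH reported by Kudu Health Check"):
    """Build one set of form values per conflicted tablet.

    conflicts maps a tablet ID to a list of (hostname, is_leader) pairs,
    as read from the health check output.
    """
    jobs = []
    for tablet_id, replicas in conflicts.items():
        # Only non-leader replicas are valid targets for the rebuild.
        followers = [host for host, is_leader in replicas if not is_leader]
        if not followers:
            raise ValueError(f"no non-leader replica listed for tablet {tablet_id}")
        jobs.append({
            "Tablet Server Address": followers[0],  # either non-leader may be used
            "Tablet Id": tablet_id,
            "Reason": reason,
        })
    return jobs


if __name__ == "__main__":
    example = {
        "fcdb22e988f54674bf7bd81957d96d99": [
            ("blade-10-0-1-6", False),
            ("blade-10-0-1-5", False),
            ("blade-10-0-1-7", True),
        ],
    }
    for job in plan_rebuild_jobs(example):
        print(job)
```

The sketch picks the first non-leader replica listed for each tablet; the other non-leader is an equally valid choice.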

After the job is submitted, the tablet enters recovery (see the example Kudu Health Check result below). Recovery may take several minutes, depending on the tablet size. Rerun the Kudu Health Check until it succeeds.

Example:

Tablet fcdb22e988f54674bf7bd81957d96d99 of table 'db_log_public.__root_fgt_hyperscale' is recovering: 1 on-going tablet copies

69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): not running

State: INITIALIZED

Data state: TABLET_DATA_COPYING

Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)

538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING

077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
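To judge how far a recovery has progressed between health check runs, the "Last status" line can be read as a ratio. The following sketch assumes the two numbers in parentheses are blocks copied so far and total blocks, which is how the example reads, and that the result text has been copied into a string. The names copy_progress and is_recovering are hypothetical.

```python
import re

# Matches the "Last status" line shown in the example output (assumed layout).
PROGRESS_RE = re.compile(
    r"Tablet Copy: Downloading block \d+ \((?P<done>\d+)/(?P<total>\d+)\)"
)


def is_recovering(text):
    """True if the result text reports a tablet that is still recovering."""
    return "is recovering" in text


def copy_progress(text):
    """Return the copy progress as a fraction between 0 and 1, or None if absent."""
    m = PROGRESS_RE.search(text)
    if m is None:
        return None
    done, total = int(m["done"]), int(m["total"])
    if total == 0:
        return None
    return done / total


if __name__ == "__main__":
    status = "Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)"
    fraction = copy_progress(status)
    print(f"{fraction:.1%} of blocks copied")
```

For the example above, the ratio 8961/25704 works out to roughly 35 percent of blocks copied.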
