Administration Guide

How to fix Kudu metadata corruption

In rare situations, such as when a host is powered off ungracefully, Kudu tablet consensus may break and the tablet metadata may become corrupted. When this occurs, Monitor > Health > Kudu Health Check fails and reports the affected tablets as conflicted, unavailable, or under-replicated in the check result.

Example 1:

Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:
  69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): RUNNING
  538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING
  077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
All reported replicas are:
  A = 69c7e95e57f748bd801be9562db9684e
  B = 538735a93bb8421b8fc2794fb31c52a7
  C = 077b5c932f2e4266820d13fb23442964
  D = d06a84c881704e5d9f363a77cfd721d5

Example 2:

Tablet 13762b6c83fa4e18843bdabccb4ecdc6 of table 'db_log_public.__kavsprwq_fgt_traffic' is unavailable: 2 replica(s) not RUNNING
  2504215476754ff59ad3f0fcb0d58355 (blade-198-18-1-9:7050): RUNNING
  04735952e86f4a4a98f748d1c2546fd8 (blade-198-18-1-8:7050): not running [LEADER]
    State: FAILED
    Data state: TABLET_DATA_READY
    Last status: Corruption: Failed log replay. Reason: ...
  799029fbd6e54dbf8d7c52f4bb837111 (blade-198-18-1-2:7050): not running
    State: FAILED
    Data state: TABLET_DATA_READY
    Last status: Corruption: Failed log replay. Reason: ...
All reported replicas are:
  A = 2504215476754ff59ad3f0fcb0d58355
  B = 04735952e86f4a4a98f748d1c2546fd8
  C = 799029fbd6e54dbf8d7c52f4bb837111

Example 3:

Tablet 2aaf23af4cc241e1865d292462418046 of table 'db_log_public.__kavsprwq_fgt_traffic' is under-replicated: 1 replica(s) not RUNNING
  4df0f22bb51948fbbc0540383b0cfe74 (blade-198-18-1-5:7050): RUNNING
  e84166ee4a1f4fb58753d5570273eaa1 (blade-198-18-1-10:7050): RUNNING [LEADER]
  f02719025c0a41db92c64e0f37b94c53 (blade-198-18-1-4:7050): not running
    State: FAILED
    Data state: TABLET_DATA_READY
    Last status: Corruption: Failed log replay. Reason: ...
All reported replicas are:
  A = 4df0f22bb51948fbbc0540383b0cfe74
  B = e84166ee4a1f4fb58753d5570273eaa1
  C = f02719025c0a41db92c64e0f37b94c53
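
When a health check result lists many tablets, it can help to classify the summary lines by status before deciding whether recovery is needed. The following is a minimal sketch, assuming the check result has been copied into a text string; the function name `parse_tablet_issues` and the regular expression are illustrative and are derived only from the three example outputs above, not from any documented output format.

```python
import re
from collections import Counter

# Pattern inferred from the example summary lines above (an assumption, not a
# documented format). Quotes around the table name are treated as optional
# because Example 1 shows the name unquoted.
TABLET_LINE = re.compile(
    r"^Tablet (?P<tablet>[0-9a-f]{32}) of table '?(?P<table>[^' ]+)'? "
    r"is (?P<status>[a-z-]+): (?P<detail>.*)$"
)


def parse_tablet_issues(report: str) -> list[dict]:
    """Return one dict per 'Tablet ... is <status>: ...' summary line."""
    issues = []
    for line in report.splitlines():
        match = TABLET_LINE.match(line.strip())
        if match:
            issues.append(match.groupdict())
    return issues


# Summary lines copied from Examples 1 to 3.
sample = """\
Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:
Tablet 13762b6c83fa4e18843bdabccb4ecdc6 of table 'db_log_public.__kavsprwq_fgt_traffic' is unavailable: 2 replica(s) not RUNNING
Tablet 2aaf23af4cc241e1865d292462418046 of table 'db_log_public.__kavsprwq_fgt_traffic' is under-replicated: 1 replica(s) not RUNNING
"""

issues = parse_tablet_issues(sample)
status_counts = Counter(issue["status"] for issue in issues)
for issue in issues:
    print(issue["tablet"], issue["table"], issue["status"])
```

Replica lines and the "All reported replicas are" section do not match the pattern and are skipped, so the full check result can be passed in unchanged.
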

To fix the metadata corruption:
  1. Rerun the Kudu Health Check a few times to see whether the issue persists. Kudu can sometimes heal itself.
  2. If the issue persists after a few retries, go to Cluster Manager > Jobs > Create Custom Job.
  3. From the Template dropdown, select Kudu Recover Corrupted Tablets and click Create.
  4. In the Jobs table view, find the Kudu Recover Corrupted Tablets row and click Run under Actions.

    After the job is submitted, the tablet goes into recovering mode (see the example Kudu Health Check result below). Recovery may take several minutes, depending on the tablet size. Rerun the Kudu Health Check periodically until it reports success.
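
The repeated health check runs in this procedure amount to a bounded polling loop. The following is a minimal sketch of that pattern only: the `check` callable is hypothetical and stands in for whatever way you run the Kudu Health Check and decide that it passed (this guide describes doing so from the GUI and does not document an API for it). The interval and attempt limit are also illustrative assumptions.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    interval_s: float = 60.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it returns True and return the attempt count.

    check is a hypothetical stand-in for one health check run; it should
    return True when the check reports success. Raises TimeoutError if the
    check has not succeeded after max_attempts runs.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        if attempt < max_attempts:
            sleep(interval_s)  # give the tablet copy time to progress
    raise TimeoutError(f"health check still failing after {max_attempts} attempts")


# Usage with a simulated check that fails twice and then succeeds.
simulated_results = iter([False, False, True])
attempts = wait_until_healthy(
    lambda: next(simulated_results),
    interval_s=0,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(attempts)
```

Capping the number of attempts keeps the loop from running indefinitely if the tablet never recovers, in which case the failure should be investigated rather than retried.
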

Example:

Tablet fcdb22e988f54674bf7bd81957d96d99 of table 'db_log_public.__root_fgt_hyperscale' is recovering: 1 on-going tablet copies
  69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): not running
    State: INITIALIZED
    Data state: TABLET_DATA_COPYING
    Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)
  538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING
  077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
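
The "Last status" line in the recovering example can give a rough sense of how far the tablet copy has progressed. The sketch below assumes that the two numbers in parentheses are the count of blocks downloaded so far and the total number of blocks to download; that reading is an interpretation of the example line, and the helper name `copy_progress` is illustrative.

```python
import re
from typing import Optional

# Pattern inferred from the "Last status" line in the recovering example.
COPY_STATUS = re.compile(
    r"Tablet Copy: Downloading block (?P<block>\d+) \((?P<done>\d+)/(?P<total>\d+)\)"
)


def copy_progress(last_status: str) -> Optional[float]:
    """Return the approximate copy progress in percent, or None if the line
    does not describe an in-progress tablet copy."""
    match = COPY_STATUS.search(last_status)
    if match is None:
        return None
    done, total = int(match["done"]), int(match["total"])
    if total == 0:
        return None
    return 100.0 * done / total


progress = copy_progress(
    "Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)"
)
if progress is not None:
    print(f"Tablet copy roughly {progress:.1f}% complete")
```

Comparing the value across successive health check runs shows whether the copy is still advancing or has stalled.
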
