How to fix Kudu metadata corruption
In rare situations, such as when a host is powered off ungracefully, Kudu tablet consensus may break and the tablet metadata may become corrupted. When this occurs, Monitor > Health > Kudu Health Check fails and reports the affected tablets as conflicted, unavailable, or under-replicated in the check result.
Example 1:
Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:
69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): RUNNING
538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING
077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
All reported replicas are:
A = 69c7e95e57f748bd801be9562db9684e
B = 538735a93bb8421b8fc2794fb31c52a7
C = 077b5c932f2e4266820d13fb23442964
D = d06a84c881704e5d9f363a77cfd721d5
Example 2:
Tablet 13762b6c83fa4e18843bdabccb4ecdc6 of table 'db_log_public.__kavsprwq_fgt_traffic' is unavailable: 2 replica(s) not RUNNING
2504215476754ff59ad3f0fcb0d58355 (blade-198-18-1-9:7050): RUNNING
04735952e86f4a4a98f748d1c2546fd8 (blade-198-18-1-8:7050): not running [LEADER]
State: FAILED
Data state: TABLET_DATA_READY
Last status: Corruption: Failed log replay. Reason: ...
799029fbd6e54dbf8d7c52f4bb837111 (blade-198-18-1-2:7050): not running
State: FAILED
Data state: TABLET_DATA_READY
Last status: Corruption: Failed log replay. Reason: ...
All reported replicas are:
A = 2504215476754ff59ad3f0fcb0d58355
B = 04735952e86f4a4a98f748d1c2546fd8
C = 799029fbd6e54dbf8d7c52f4bb837111
Example 3:
Tablet 2aaf23af4cc241e1865d292462418046 of table 'db_log_public.__kavsprwq_fgt_traffic' is under-replicated: 1 replica(s) not RUNNING
4df0f22bb51948fbbc0540383b0cfe74 (blade-198-18-1-5:7050): RUNNING
e84166ee4a1f4fb58753d5570273eaa1 (blade-198-18-1-10:7050): RUNNING [LEADER]
f02719025c0a41db92c64e0f37b94c53 (blade-198-18-1-4:7050): not running
State: FAILED
Data state: TABLET_DATA_READY
Last status: Corruption: Failed log replay. Reason: ...
All reported replicas are:
A = 4df0f22bb51948fbbc0540383b0cfe74
B = e84166ee4a1f4fb58753d5570273eaa1
C = f02719025c0a41db92c64e0f37b94c53
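When the check result lists many tablets, it can help to pull out just the summary lines. The following is a minimal sketch, assuming the result has been copied into a string and that each summary line follows the "Tablet <id> of table <name> is <status>:" shape shown in the examples above. The function name summarize and the SAMPLE text are illustrative only and are not part of the product.

```python
import re
from collections import Counter
from typing import Dict, List

# Matches summary lines such as:
#   Tablet <id> of table '<name>' is under-replicated: 1 replica(s) not RUNNING
# The quotes around the table name are treated as optional, as in Example 1.
SUMMARY_RE = re.compile(
    r"^Tablet (?P<tablet>[0-9a-f]{32}) of table '?(?P<table>[^' ]+)'? "
    r"is (?P<status>[a-z-]+):"
)


def summarize(report: str) -> List[Dict[str, str]]:
    """Return one dict (tablet, table, status) per tablet summary line."""
    found = []
    for line in report.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            found.append(match.groupdict())
    return found


# Sample text copied from the examples above (replica details omitted).
SAMPLE = """\
Tablet fcdb22e988f54674bf7bd81957d96d99 of table db_log_public.__root_fgt_hyperscale is conflicted: 1 replicas' active configurations disagree with the leader master's:
Tablet 13762b6c83fa4e18843bdabccb4ecdc6 of table 'db_log_public.__kavsprwq_fgt_traffic' is unavailable: 2 replica(s) not RUNNING
Tablet 2aaf23af4cc241e1865d292462418046 of table 'db_log_public.__kavsprwq_fgt_traffic' is under-replicated: 1 replica(s) not RUNNING
"""

if __name__ == "__main__":
    tablets = summarize(SAMPLE)
    for entry in tablets:
        print(entry["status"], entry["tablet"], entry["table"])
    # Count how many tablets are in each state.
    print(Counter(entry["status"] for entry in tablets))
```

Lines that do not match the pattern, such as the per-replica details, are skipped, so the sketch only reports the tablet-level state.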
To fix the metadata corruption:
- Rerun the Kudu Health Check a few times to see whether the issue persists. Kudu can sometimes heal itself.
- If the issue persists after a few retries, go to Cluster Manager > Jobs > Create Custom Job.
- From the Template dropdown, select Kudu Recover Corrupted Tablets and click Create.
- In the Jobs table view, find the Kudu Recover Corrupted Tablets row, and click Run in Actions.
After the job is submitted, the tablet goes into the recovering state (see the example Kudu Health Check result below). The recovery may take several minutes, depending on the tablet size. Rerun the Kudu Health Check periodically until it returns success.
Example:
Tablet fcdb22e988f54674bf7bd81957d96d99 of table 'db_log_public.__root_fgt_hyperscale' is recovering: 1 on-going tablet copies
69c7e95e57f748bd801be9562db9684e (blade-10-0-1-6:7050): not running
State: INITIALIZED
Data state: TABLET_DATA_COPYING
Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)
538735a93bb8421b8fc2794fb31c52a7 (blade-10-0-1-5:7050): RUNNING
077b5c932f2e4266820d13fb23442964 (blade-10-0-1-7:7050): RUNNING [LEADER]
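To get a rough sense of how far a recovery has progressed, the "Last status" line can be parsed. The sketch below assumes the two numbers in parentheses are the blocks downloaded so far and the total number of blocks to copy; that reading is an assumption based on the example output, and the function name copy_progress is illustrative.

```python
import re
from typing import Optional

# Matches e.g. "Tablet Copy: Downloading block 0000000022163966 (8961/25704)".
COPY_RE = re.compile(r"Tablet Copy: Downloading block \d+ \((\d+)/(\d+)\)")


def copy_progress(status_line: str) -> Optional[float]:
    """Return the copied fraction (0.0 to 1.0), or None if the line does not match."""
    match = COPY_RE.search(status_line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None  # avoid dividing by zero on an unexpected status line
    return done / total


if __name__ == "__main__":
    line = "Last status: Tablet Copy: Downloading block 0000000022163966 (8961/25704)"
    fraction = copy_progress(line)
    if fraction is not None:
        print(f"Tablet copy is about {fraction:.0%} complete")
```

Comparing the fraction between two consecutive health checks gives an indication of whether the copy is still advancing, although it does not predict the remaining time.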