Recently our Elasticsearch cluster kept going down. Checking its health showed that one index had a shard stuck in the UNASSIGNED state, so I set out to fix the problem and recorded the solution here.
1. Find the index with the abnormal shard
curl -s -X GET '10.10.161.1:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Result:
test_index 3 r UNASSIGNED ALLOCATION_FAILED
This tells us that the replica (r) of shard 3 of test_index failed to allocate.
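Before retrying anything, the cluster allocation explain API (available since 5.0) can report in detail why a single shard copy is unassigned. A minimal sketch, assuming the shard identified above:

GET /_cluster/allocation/explain
{
  "index": "test_index",
  "shard": 3,
  "primary": false
}

The response includes the per-node allocation decisions, which is handy when the _cat output alone is not enough.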
2. Attempting a fix
2.1 Retry the failed shard allocation in Kibana
POST _cluster/reroute?retry_failed=true
The failed allocation came back as follows:
{ "state": "INITIALIZING", "primary": false, "node": "KbRtC52uQWS_6C5l1kocKA", "relocating_node": null, "shard": 3, "index": "test_index", "recovery_source": { "type": "PEER" }, "allocation_id": { "id": "V6ePWlEBSCGve4sIXXWm1w" }, "unassigned_info": { "reason": "ALLOCATION_FAILED", "at": "2019-09-12T06:38:16.568Z", "failed_attempts": 5, "delayed": false, "details": "failed recovery, failure RecoveryFailedException[[test_index][3]: Recovery failed from {5cCOK1o}{5cCOK1oWT_KrCgerQGfcaA}{Lk2vCAhzQJK6r5C6UzXLZw}{10.10.161.102}{10.10.161.102:9300} into {KbRtC52}{KbRtC52uQWS_6C5l1kocKA}{0KKoe57ATfyA6oKehRoArw}{10.10.161.103}{10.10.161.103:9300}]; nested: RemoteTransportException[[5cCOK1o][10.10.161.102:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [test_index][3] from primary shard with sync id but number of docs differ: 4609345 (5cCOK1o, primary) vs 4609267(KbRtC52)]; ", "allocation_status": "no_attempt" } }
The key part of the error is "number of docs differ: 4609345 (primary) vs 4609267": the primary and the replica carry the same sync id but different doc counts, so the shard cannot be reallocated by a simple retry.
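The doc counts of each shard copy can also be confirmed directly from the cat API; a quick sketch, adding the docs column to the output:

curl -s -X GET '10.10.161.1:9200/_cat/shards/test_index?h=index,shard,prirep,docs,store,node'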
2.2 Check the index settings:
GET /test_index/_settings
Output:
{ "test_index": { "settings": { "index": { "creation_date": "1567406954255", "number_of_shards": "5", "number_of_replicas": "1", "uuid": "1g3CpeaVRX-fp3-PBniCJg", "version": { "created": "5020299" }, "provided_name": "test_index" } } } }
So the index has 5 primary shards and 1 replica. Next, try allocating the replica manually:
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "test_index",
        "shard": 3,
        "node": "10.10.161.103"
      }
    }
  ]
}
This returned the following error:
{ "error": { "root_cause": [ { "type": "remote_transport_exception", "reason": "[5cCOK1o][10.10.161.102:9300][cluster:admin/reroute]" } ], "type": "illegal_argument_exception", "reason": "[allocate_replica] allocation of [test_index][3] on node {KbRtC52}{KbRtC52uQWS_6C5l1kocKA}{0KKoe57ATfyA6oKehRoArw}{10.10.161.103}{10.10.161.103:9300} is not allowed, reason: [NO(shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-09-12T09:02:57.343Z], failed_attempts[6], delayed=false, details[failed recovery, failure RecoveryFailedException[[test_index][3]: Recovery failed from {5cCOK1o}{5cCOK1oWT_KrCgerQGfcaA}{Lk2vCAhzQJK6r5C6UzXLZw}{10.10.161.102}{10.10.161.102:9300} into {KbRtC52}{KbRtC52uQWS_6C5l1kocKA}{0KKoe57ATfyA6oKehRoArw}{10.10.161.103}{10.10.161.103:9300}]; nested: RemoteTransportException[[5cCOK1o][10.10.161.102:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [test_index][3] from primary shard with sync id but number of docs differ: 4609345 (5cCOK1o, primary) vs 4609267(KbRtC52)]; ], allocation_status[no_attempt]]])][YES(primary shard for this replica is already active)][YES(explicitly ignoring any disabling of allocation due to manual allocation commands via the reroute API)][YES(target node version [5.2.2] is the same or newer than source node version [5.2.2])][YES(the shard is not being snapshotted)][YES(node passes include/exclude/require filters)][YES(the shard does not exist on the same node)][YES(enough disk for shard on node, free: [3tb], shard size: [0b], free after allocating shard: [3tb])][YES(below shard recovery limit of outgoing: [0 < 2] incoming: [0 < 2])][YES(total shard limits are disabled: [index: -1, cluster: -1] <= 0)][YES(allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it)]" }, "status": 400 }
So manual allocation is blocked as well: every other allocation decider says YES, but the shard has already exceeded the maximum number of allocation retries (5).
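In theory one could also raise the per-index retry limit (the dynamic setting index.allocation.max_retries, default 5) and retry the reroute; a sketch of that route, assuming a limit of 10:

PUT /test_index/_settings
{
  "index.allocation.max_retries": 10
}

POST /_cluster/reroute?retry_failed=true

In this case, though, every recovery attempt fails on the same doc-count mismatch, so extra retries alone would most likely just fail again.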
2.3 Final solution
First drop the replica count to 0 (note this temporarily removes redundancy; the broken replica is simply discarded):
PUT /test_index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
Check the index's shards:
GET /_cat/shards/test_index
Now every shard is a primary with no replica, so the primary/replica (p/r) pairing no longer exists:
test_index 3 p STARTED 4609345 1.3gb 10.10.161.102 5cCOK1o
test_index 2 p STARTED 4609050 1.6gb 10.10.161.103 KbRtC52
test_index 1 p STARTED 4607156 1.5gb 10.10.161.103 KbRtC52
test_index 4 p STARTED 4596510 1.3gb 10.10.161.102 5cCOK1o
test_index 0 p STARTED 4605846 1.4gb 10.10.161.102 5cCOK1o
Then set the replica count back to 1:
PUT /test_index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
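While the new replicas are being rebuilt from the primaries, progress can be watched with the cat recovery API; a sketch (active_only limits the output to in-flight recoveries):

GET /_cat/recovery/test_index?active_only=true&h=index,shard,type,stage,files_percent,bytes_percent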
After a while the index has its replica again. Verify there are no more problem shards:
curl -s -X GET '10.10.161.1:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Output:
test_index 3 r UNASSIGNED REPLICA_ADDED
test_index 2 r UNASSIGNED REPLICA_ADDED
test_index 1 r UNASSIGNED REPLICA_ADDED
test_index 4 r UNASSIGNED REPLICA_ADDED

These UNASSIGNED entries now show the reason REPLICA_ADDED: they are the freshly added replicas that are still being allocated. Once recovery finishes, the grep returns nothing and the cluster goes back to green.
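Instead of polling with grep, the cluster health API can block until the index is fully allocated; a sketch, assuming a 60-second timeout is acceptable:

curl -s -X GET '10.10.161.1:9200/_cluster/health/test_index?wait_for_status=green&timeout=60s'

If the timeout elapses first, the call returns "timed_out": true along with the current (still yellow) status, so it is safe to re-run.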