During an upgrade we accidentally configured every node with the master role only (no data role) and a wrong data path. As a result, all shards became unassigned and the cluster status went red, which meant lost and corrupted shard data.
For example, here is the cluster status after the upgrade, as reported by the cluster health API.
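The check itself is the standard _cluster/health endpoint; ?pretty only formats the output:

curl "localhost:9200/_cluster/health?pretty"

It returned: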
{ "cluster_name":"xxx", "status":"red", "timed_out":false, "number_of_nodes":5, // no data nodes "number_of_data_nodes":0, // no primary shards "active_primary_shards":0, "active_shards":0, "relocating_shards":0, "initializing_shards":0, // all unassigned "unassigned_shards":33, "delayed_unassigned_shards":0, "number_of_pending_tasks":0, "number_of_in_flight_fetch":0, "task_max_waiting_in_queue_millis":0, "active_shards_percent_as_number":0 }
In this case you should take a look at the node status. It turned out the configuration was wrong: every node was set to ingest + master only ("im", no data role), and the data path was wrong too:
curl "localhost:9200/_cat/nodes"
// columns: ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.16.0.141  5 86 1 0.00 0.10 0.19 im - 172.16.0.141
172.16.0.140 24 86 2 0.02 0.14 0.16 im - 172.16.0.140
172.16.0.138  4 66 0 0.27 0.18 0.24 im - 172.16.0.138
172.16.0.137  4 73 0 0.00 0.07 0.12 im - 172.16.0.137
172.16.0.152  4 86 1 0.00 0.04 0.09 im * 172.16.0.152
// For the data path: master and data nodes may use different paths,
// but nodes of the same kind should all use the same path.
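To fix this, each node's elasticsearch.yml needs the correct roles and data path. The snippet below is a minimal sketch, not our exact config: it assumes the Elasticsearch 7.x node.roles syntax (older versions use the node.master / node.data booleans) and an example path /var/lib/elasticsearch; substitute the roles and path.data your deployment actually uses.

# elasticsearch.yml (hypothetical example values)
# a dedicated master node would instead use: node.roles: [ master ]
node.roles: [ data, ingest ]
# must point at the directory that actually holds the shard data;
# nodes of the same kind should use the same path
path.data: /var/lib/elasticsearch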
With the correct configuration (node roles and data path) in place, restart the whole cluster. Usually, once the nodes rejoin, the unassigned shards become assigned and then started; if they do not, the shard data may be lost or corrupted, so the shards stay unassigned. The cluster allocation explain API tells you why.
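A hedged example of that query (the endpoint is the standard _cluster/allocation/explain API; with no request body Elasticsearch explains a randomly chosen unassigned shard, which is exactly what the note in the response below says):

curl "localhost:9200/_cluster/allocation/explain?pretty"

In our case it returned: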
{ "note":"No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.", "index":"elastalert-status", "shard":0, "primary":true, "current_state":"unassigned", "unassigned_info":{ "reason":"CLUSTER_RECOVERED", "at":"2021-12-20T23:54:16.720Z", "last_allocation_status":"no_valid_shard_copy" }, "can_allocate":"no_valid_shard_copy", "allocate_explanation":"cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster" }