Here are some common Elasticsearch APIs for checking different objects:
Elastic Stack Compatibility
The table shows the stack components’ version compatibility.
Some Strategies
If the cluster is unhealthy, check the shard status via the health API and the node status via the node APIs: are all nodes present with their expected roles? Also check path.data on the data nodes. Check this post
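A quick way to see path.data per node is the node info API; a minimal sketch, assuming the default localhost:9200 endpoint (the filter_path may need adjusting for your version):

# show the configured path settings on every node
GET _nodes/settings?filter_path=**.settings.path
curl -s "http://localhost:9200/_nodes/settings?filter_path=**.settings.path&pretty"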
If a node is removed and rejoined intentionally within a short time (e.g. upgrade, OS maintenance, etc.), you can delay the re-allocation of its unassigned replica shards. This setting goes into every index, so applying it may take some time; you can revert the setting after the node rejoins.
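A minimal sketch of that delay, using the standard index.unassigned.node_left.delayed_timeout setting (the 30m duration and the _all index pattern are placeholders, adjust to your maintenance window):

# delay replica re-allocation for 30 minutes when a node leaves
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "30m"
  }
}

# revert to the default after the node has rejoined
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": null
  }
}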
# cluster health, examine:
# total shard number, primary shard number, etc
# 1. yellow or red (relocating or unassigned shards?)
# 2. node number: master and data (any lost?)
# 3. huge number of pending tasks (may get the whole cluster stuck;
#    then check what kind of pending tasks they are)
GET _cluster/health
curl -s "http://localhost:9200/_cluster/health?pretty"
# total index number in cluster
GET _cluster/stats?filter_path=indices.count

# total shard number in cluster
GET _cluster/stats?filter_path=indices.shards.total
# explain a shard's current allocation
# can see the ongoing allocation explanation
GET _cluster/allocation/explain
# current settings
# defaults, transient and persistent (usually made by dynamic APIs)
GET _cluster/settings?flat_settings&include_defaults
# set persistent/transient setting
PUT _cluster/settings
{
  "<transient or persistent>" : {
    "cluster.routing.allocation.disk.watermark.flood_stage": "90%",
    "cluster.routing.allocation.disk.watermark.high": "75%",
    "cluster.routing.allocation.disk.watermark.low": "70%",
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "6"
  }
}
# setting cluster_concurrent_rebalance and node_concurrent_recoveries bigger can
# speed up rebalancing
PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "10",
    "indices.recovery.max_bytes_per_sec" : "250mb",
    "cluster.routing.allocation.node_concurrent_recoveries": "10"
  }
}
# remove persistent/transient setting
PUT _cluster/settings
{
  "<transient or persistent>" : {
    "cluster.routing.allocation.disk.watermark.low": null
  }
}
# super helpful!
# show allocation details on each node
# shards, disk.percent columns can help observe the rebalancing
GET _cat/allocation?v&s=shards:desc
GET _cat/allocation?v&s=disk.percent:desc
Pending Task
# pending tasks list
# usually grows when the cluster is yellow or under heavy load
GET _cluster/pending_tasks?pretty
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty" > pending_tasks
# then analyze the "source" field to see what kind of pending tasks dominate
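A small sketch for that analysis, assuming jq is installed and using the pending_tasks dump saved above:

# count pending tasks grouped by their "source" (what produced the task)
cat pending_tasks | jq -r '.tasks[].source' | sort | uniq -c | sort -rn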
curl -s "http://localhost:9200/_cat/nodes?pretty" # list all header parameters GET _cat/nodes?v
# check master and data role are properly set GET _cat/nodes?help
# check node ES version, useful in upgrade
GET _cat/nodes?v&h=ip,v
# check node heap used, ram.percent
# ram.percent: used + cached!!
GET _cat/nodes?v&h=ip,heap.current,heap.percent,ram.percent,ram.current,name,role,master
# node metrics
# desc sort by used disk space percent
# as well as show the total indexing operations on each node
GET _cat/nodes?h=ip,disk.total,disk.used_percent,indexing.index_total&s=disk.used_percent:desc
# list custom node attributes
GET _cat/nodeattrs?v
Index Check
Deleting indices can be performed in the Kibana Index Management console.
# list all header parameters
# h=xx,xx,xx
GET _cat/indices?help
curl -s "http://localhost:9200/_cat/indices?help"
# check index mapping and setting
GET <index name>?pretty
# view 2 docs in this index
# so you can have a glimpse of the doc content
GET <index name>/_search?pretty
{
  "size": 2
}

# when you know the doc id
GET <index name>/_doc/<unique id>
# sort by creation date
# creation.date.string: human-readable
# creation.date: Epoch & Unix Timestamp
# sort to see the creation time window for a specific index pattern
GET _cat/indices/<*-index-pattern-2021.11>?h=i,creation.date.string&s=creation.date

# you can use a converter here just in case
# https://www.epochconverter.com/
# pri.store.size: combined size of all primary shards of an index
GET _cat/indices?h=i,pri.store.size
# delete index
curl -XDELETE 'localhost:9200/<index name>/'
Index Template
# cat
GET _cat/templates/<template name>?v
curl -s "http://localhost:9200/_cat/templates/<template name>?v"
# display template definition, for example
# index-patterns field
# alias field
GET _template/<template name>
Index Alias
# get index alias
GET <index name>/_alias

# get available alias list
GET _alias/*

# get list of aliases for all indices, empty ones are shown too
GET */_alias
# list primary/replica shards of a specific index
# show doc number in each and its host node
GET _cat/shards/<index name>?v
curl -s "http://localhost:9200/_cat/shards/<index name>?v"
# check shard allocation per node, disk usage
# useful for distribution/balance assessment
GET _cat/allocation?v
# unassigned reason
# relocating direction
GET _cat/shards?h=index,state,prirep,unassigned.reason
# list relocating shards
# show relocating shard source -> target node
GET _cat/shards?v&h=index,shard,state,node&s=st:desc
# sort shards by size
# s=sto:desc, descending order
GET _cat/shards?h=i,shard,p,ip,st,sto&s=sto:desc,ip:desc
# list all shards on a specified node and sort by shard size desc
curl "localhost:9200/_cat/shards?h=i,shard,ip,prirep,st,store&s=store:desc" \
  | grep "<node ip or node name>" > shards.txt
# query hot shards distribution on data nodes
# sort node ip with order: desc or asc
curl -s "localhost:9200/_cat/shards/<hot index pattern>?s=node:asc" > shards \
  && cat shards | awk '{print $8}' | uniq -c | sort -rn

# get hot shard total based on shards per node
cat shards | awk '{print $8}' | uniq -c | sort -rn | \
  awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }'

# get hot shard number average
cat shards | awk '{print $8}' | uniq -c | sort -rn | \
  awk 'BEGIN { sum = 0; count = 0 } { sum += $1; count += 1 } END { print sum / count }'
# reroute shard pri/rep
# try dry run first, the output can be big
# the dry run output contains the reasons for success or failure
curl -XPOST "localhost:9200/_cluster/reroute?dry_run" \
  -H 'Content-Type: application/json' \
  -d \
'{
  "commands": [
    {
      "move": {
        "index": "<index name>",
        "shard": "<shard number>",
        "from_node": "<ip or node name>",
        "to_node": "<ip or node name>"
      }
    }
  ]
}'

# or run in Dev Tools
POST /_cluster/reroute?dry_run
{
  "commands": [
    {
      "move": {
        "index": "<index name>",
        "shard": "<shard number>",
        "from_node": "<ip or node name>",
        "to_node": "<ip or node name>"
      }
    }
  ]
}
# sometimes a shard stays unassigned because the max retries have been exceeded
# https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html#cluster-reroute-api-request-body
curl -XPOST "localhost:9200/_cluster/reroute?retry_failed=true" \
  -H 'Content-Type: application/json' \
  -d \
'{
  "commands" : [
    {
      "allocate_replica": {
        "index" : "<index name>",
        "shard" : "<shard number>",
        "node" : "<target node>"
      }
    }
  ]
}'

# or in Dev Tools
POST /_cluster/reroute?retry_failed=true
{
  "commands" : [
    {
      "allocate_replica": {
        "index" : "<index name>",
        "shard" : "<shard number>",
        "node" : "<target node>"
      }
    }
  ]
}
# attempt a single retry
# if many otherwise-good unassigned shards are blocked, retry all of them
# without a request body (if shards are stale/corrupted, this will not work)
# you may need to run this multiple times to clean the backlog
curl -XPOST "localhost:9200/_cluster/reroute?retry_failed=true"

# or in Dev Tools
POST /_cluster/reroute?retry_failed=true
# if a shard gets stuck in INITIALIZING status during recovery,
# the reason could be
# 1. the shard is big
# 2. a lot of replicas, which prolongs the initialization process
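To observe such a recovery while it runs, the cat recovery API can help; a small sketch (restricting to active recoveries keeps the output short, trim the columns as needed):

# show ongoing shard recoveries with stage and progress
GET _cat/recovery?v&active_only=true
curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent"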
Data Stream
It is easy to examine data streams in the Kibana Index Management console.
# get a specified data stream's backing indices, template and ILM policy
GET _data_stream/<data stream name>
# rollover a data stream
POST <data stream name>/_rollover
# delete a data stream and all its backing indices
DELETE _data_stream/<data stream name>
What I care about is the current write index of DS:
# list all non-system, non-hidden data streams
curl -s "http://localhost:9200/_data_stream/*?format=json&pretty" \
  | jq -r '.data_streams[].name' | sort -r
# find the current write index (the last one), health status, template and policy
curl -s "http://localhost:9200/_data_stream/<data-stream-name>?format=json&pretty" \
  | jq -r '.data_streams[].indices[-1].index_name'
# data stream stats
# total shards, total backing indices, total storage size
curl -s "http://localhost:9200/_data_stream/<data-stream-name>/_stats?pretty&format=json"
Another important statistic is the distribution of data-stream-based hot shards; it is not straightforward and needs some calculation, so I have written a script to display it.
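The script itself is not reproduced here, but a rough sketch of the idea (assuming jq is available and that the hot shards live in the current write index of each data stream) could look like this:

#!/bin/bash
# sketch: count shards of each data stream's current write index per node
ES="http://localhost:9200"
for ds in $(curl -s "$ES/_data_stream/*" | jq -r '.data_streams[].name'); do
  # the last backing index is the current write index
  wi=$(curl -s "$ES/_data_stream/$ds" | jq -r '.data_streams[].indices[-1].index_name')
  curl -s "$ES/_cat/shards/$wi?h=node"
done | sort | uniq -c | sort -rn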
ILM
The index lifecycle management APIs.
I have observed huge numbers of pending tasks from ILM operations that slow down the whole cluster (cs, traffic, usage, etc.).
# examine index age and phase state
# and any ILM error
GET <index name>/_ilm/explain
# remove ILM from a ds or alias
POST <ds or alias>/_ilm/remove

# need to check if any index is closed by forcemerge, if yes, open it
GET <ds or alias>
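If a backing index did end up closed, reopening it is just the standard open index API:

# open a closed index
POST <index name>/_open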
# retry after the ILM policy gets fixed (updated)
POST <index name>/_ilm/retry
Stop and start the ILM system; this is used when performing scheduled maintenance on cluster nodes that could impact ILM actions.
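A sketch of that flow with the standard ILM operation-mode APIs:

# check the current ILM operation mode: RUNNING, STOPPING or STOPPED
GET _ilm/status

# halt ILM background operations before the maintenance window
POST _ilm/stop

# resume ILM once maintenance is done
POST _ilm/start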