//TODO:
- [ ] elasticsearch certificate
- [ ] ES ML node for anomaly detection (this is interesting, useful for analyzing data)
- [ ] logstash input data for testing, movielens
- [x] linkedin learning
- [x] JKSJ elasticsearch training and git repo
- [x] cerebro, tutorial
ELK Stack
The Elastic Stack is one of the most effective ways to leverage open source technology to build a central logging, monitoring, and alerting system for servers and applications.
Elasticsearch
: distributed, fast, highly scalable document database.

Logstash
: aggregates, filters and supplements log data, then forwards it to Elasticsearch or other backends.

Kibana
: web-based front-end to visualize and analyze log data.

Beats
: lightweight utilities that read logs from a variety of sources and send data to Logstash or other backends.

Alerting
: sends notifications to email, Slack, PagerDuty, and so on.
ELK Stack vs Prometheus: ELK is a general-purpose NoSQL stack that can be used for monitoring by aggregating all the logs and shipping them to Elasticsearch for easy browsing and analysis. Prometheus is a dedicated monitoring system, usually deployed alongside service discovery (Consul) and Alertmanager.
My Vagrant Elasticsearch cluster setup. A Java runtime and a /bin/bash that supports arrays are required; also note that Elasticsearch cannot be booted by the root user. Another option is to use Docker Compose to create a testing Elasticsearch cluster, see my repo here.
Elasticsearch
Version Rolling Upgrade, some highlights:
- Set `index.unassigned.node_left.delayed_timeout` to hours
- Start from the data nodes, then the master nodes, one by one; ensure the config yaml file is correct for each role
- Wait for recovery with big retries
- Revert `index.unassigned.node_left.delayed_timeout`
- Upgrade the Kibana version
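A minimal sketch of toggling that delayed-timeout setting around a node restart; the host (localhost:9200) and the 5h value are my placeholders:

```bash
# delay shard reallocation before taking a node down for the upgrade
curl -X PUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '
{ "settings": { "index.unassigned.node_left.delayed_timeout": "5h" } }'

# ... stop the node, upgrade the package, start it again, wait for recovery ...

# revert to the default (1m) once the cluster is green again
curl -X PUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '
{ "settings": { "index.unassigned.node_left.delayed_timeout": null } }'
```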
This blog series talks about the Kafka + Elasticsearch architecture.

This blog shares ways to achieve high data reliability as well as to scale out resources. As we can see, the Kafka message queue also improves data reliability, in addition to providing throttling.
There are several ways to install: binary archive, RPM, or on Kubernetes. The package is self-contained with its own Java; you can also specify `ES_JAVA_HOME` to use an external Java.

Install using the archive. The Elasticsearch `.tar.gz` package does not include a systemd unit (you have to create it yourself). To manage Elasticsearch as a service easily, use the Debian or RPM package instead.
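If you stick with the archive, a hand-rolled unit file is enough for a lab setup. A minimal sketch, assuming the archive is unpacked to /opt/elasticsearch and a non-root elasticsearch user already exists (both are my assumptions):

```bash
# Elasticsearch refuses to run as root, hence the dedicated user/group
sudo tee /etc/systemd/system/elasticsearch.service <<'EOF'
[Unit]
Description=Elasticsearch
After=network.target

[Service]
User=elasticsearch
Group=elasticsearch
LimitNOFILE=65535
ExecStart=/opt/elasticsearch/bin/elasticsearch
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now elasticsearch
```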
It is advisable to change the default locations of the config directory, the data directory, and the logs directory. Usually the data directory is mounted on a separate disk.
Before launching, edit `$ES_HOME/config/elasticsearch.yml`. The configuration file should contain settings which are node-specific (such as `node.name`, node roles and storage paths), or settings which a node requires in order to join a cluster, such as `cluster.name` and `network.host`. The config path can be changed with the `ES_PATH_CONF` environment variable.
- `elasticsearch.yml` for configuring Elasticsearch
- `jvm.options` for configuring Elasticsearch JVM settings; you can refer to the jvm.options on GitHub for the corresponding release branch: https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/jvm.options
- `log4j2.properties` for configuring Elasticsearch logging, for example the logs path `/var/log/elasticsearch`
Important elasticsearch settings
For example, the cluster name and the node roles. To form a production cluster, you need to specify `cluster.name` and the node roles in `elasticsearch.yml`; see this document about how to statically specify master and data nodes.
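A sketch of what those node-specific settings might look like, written as a shell heredoc; the cluster name, node names, IPs and paths below are placeholders I picked for a three-node Vagrant-style cluster:

```bash
cat >> "$ES_HOME/config/elasticsearch.yml" <<'EOF'
cluster.name: demo-cluster
node.name: node-1
node.roles: [ master, data ]
network.host: 172.20.21.10
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
discovery.seed_hosts: ["172.20.21.10", "172.20.21.11", "172.20.21.12"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
EOF
```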
The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes.
High availability (HA) clusters require at least three master-eligible nodes, at least two of which are not voting-only nodes. Such a cluster will be able to elect a master node even if one of the nodes fails.
Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.
Important system settings:
- disable swapping
- increase file descriptors
- ensure sufficient virtual memory
- JVM DNS cache settings
- temporary directory not mounted with noexec
- TCP retransmission timeout
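The usual host-level commands for those items, with values following the Elasticsearch docs' recommendations (make them persistent via /etc/security/limits.conf and /etc/sysctl.conf):

```bash
sudo swapoff -a                          # disable swapping
sudo sysctl -w vm.max_map_count=262144   # ensure sufficient virtual memory (mmap counts)
sudo sysctl -w net.ipv4.tcp_retries2=5   # lower the TCP retransmission timeout
ulimit -n 65535                          # raise file descriptors for the Elasticsearch process
```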
Regarding JVM settings in production environment, see this blog:
- set Xmx and Xms the same
- java heap size <= 50% host memory capacity
- heap <= 30GB
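For example, on a 32GB host you might pin the heap to 16GB. A quick way for a manual start is the `ES_JAVA_OPTS` environment variable (normally you would put these flags in a jvm.options.d file instead):

```bash
# same Xms/Xmx, <= 50% of host RAM, <= ~30GB
ES_JAVA_OPTS="-Xms16g -Xmx16g" ./bin/elasticsearch
```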
Check the Elasticsearch version:

```bash
# 9200 is the default port
```
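A minimal check, assuming the node listens on localhost; the root endpoint reports the version in its JSON response:

```bash
curl -X GET "http://localhost:9200"
```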
Check cluster health and number of master and data nodes:
```bash
# include the port (default 9200) explicitly
curl -X GET "http://<master/data bind IP>:9200/_cat/health?v=true&format=json&pretty"
```
Besides the master and data node roles, there are also ingest nodes, remote-eligible nodes, and coordinating nodes (which can be dedicated).
Index
- [x] how to create and configure an index
- [x] how to check an index, its status and config
- [x] how to create actual document data: check the document API
- [x] how to reroute shards, through the cluster APIs
ES 10 concepts: understand what indices, documents, fields, mappings, shards, primary and replica shards, data nodes, master nodes, and so on are.
First, an index is some type of data organization mechanism, allowing the user to partition data a certain way. The second concept relates to replicas and shards, the mechanism Elasticsearch uses to distribute data around the cluster.
```
Schema - MySQL => Databases => Tables => Rows => Columns (transactional, joins)
```
So just remember, Indices organize data logically, but they also organize data physically through the underlying shards. When you create an index, you can define how many shards you want. Each shard is an independent Lucene index that can be hosted anywhere in your cluster.
Index module
: the settings for an index that control all aspects related to it, for example:

```
index.number_of_shards: The number of primary shards that an index should have
```
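For instance, a sketch of creating an index with explicit shard and replica counts; the index name and host are placeholders:

```bash
curl -X PUT "http://localhost:9200/test-index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}'
```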
Index template
: tells Elasticsearch how to configure an index when it is created. Elasticsearch applies templates to new indices based on an index pattern that matches the index name. Templates are configured prior to index creation; when an index is created, either manually or by indexing a document, the template settings are used as a basis for creating the index. If a new data stream or index matches more than one index template, the index template with the highest priority is used.
There are two types of templates, index templates and component templates (note that old versions only have index templates, see this legacy index template); a template actually contains the index module settings.

Get index template details through the API.
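A sketch of the relevant endpoints; the template name is a placeholder, and legacy installs use `_template` instead of `_index_template`:

```bash
curl "http://localhost:9200/_index_template"                  # list composable index templates
curl "http://localhost:9200/_index_template/<template-name>"  # details of one template
curl "http://localhost:9200/_component_template"              # list component templates
```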
Elasticsearch APIs, for example the cluster status, index, document, shard and reroute APIs, can be used to examine the node types (master and data nodes), shard distribution and CPU load statistics.
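For instance, the `_cat` APIs give a quick view of node roles, load and shard placement (host assumed to be localhost:9200):

```bash
curl "http://localhost:9200/_cat/nodes?v=true&h=name,node.role,master,cpu,load_1m,heap.percent"
curl "http://localhost:9200/_cat/shards?v=true"
```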
Let’s see an example to display doc content:
```bash
# cat indices
```
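A sketch of what that looks like end to end; the index name "bank" is a placeholder for whatever index you have:

```bash
curl "http://localhost:9200/_cat/indices?v=true"   # list indices with health and size
curl "http://localhost:9200/bank/_search?pretty"   # display documents (match_all by default)
```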
To upload single or bulk document data to Elasticsearch, see the document API. You can download samples here: sample data; let's try the accounts data:
```bash
# the accounts.json does not have index and type in it, so specify them in the curl command
```
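A sketch of the bulk upload; the index name "bank" is my choice, and accounts.json is assumed to be in the current directory:

```bash
curl -H 'Content-Type: application/x-ndjson' -X POST \
  "http://localhost:9200/bank/_bulk?pretty&refresh" \
  --data-binary "@accounts.json"
curl "http://localhost:9200/_cat/indices/bank?v=true"   # verify the documents landed
```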
Query data, run on kibana dev console:
```bash
# query account from CA
```
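In the Dev Console this would be a `GET bank/_search` with a match query; the same request via curl, assuming the accounts data was indexed into "bank" and has a `state` field:

```bash
curl "http://localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": { "state": "CA" }
  }
}'
```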
In the response message, each match has a `_score` field, which tells you how relevant the match is.
Plugins
Elasticsearch provides a variety of plugins to extend the system, for example, snapshot plugin, see here.
```bash
# list all installed plug-ins
```
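The bundled CLI handles this; repository-s3 below is just an example plugin used for snapshots:

```bash
bin/elasticsearch-plugin list
bin/elasticsearch-plugin install repository-s3
```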
Kibana
Kibana getting started and download; it is written in Node.js, no other dependencies are needed.
Do a quick configuration, check my Vagrant Kibana provision file, and start it in the background:

```yaml
server.port: 5601
```
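Expanding that into a minimal sketch; the Elasticsearch host and the log path are placeholders for my Vagrant layout:

```bash
cat >> config/kibana.yml <<'EOF'
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://172.20.21.10:9200"]
EOF
nohup bin/kibana >> /var/log/kibana.log 2>&1 &
```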
Access it at http://172.20.21.30:5601 in the Firefox browser.
Kibana has built-in sample data that you can play with. Go to add sample data, then move to Analytics -> Discover to query and analyze the data. You need to know KQL to query documents; Dashboard is also helpful.
The Kibana Dev Console can issue HTTP requests to explore the ES APIs; `command + enter` runs the current request (more shortcuts are in the help menu, helpful!).
```bash
# each command has a play button
```
Alternatively, the data can be ingested from Logstash, see below and my Vagrant demo. You need to create an Index Pattern to load and query the data.
You can also install plug-ins for Kibana:

```bash
bin/kibana-plugin install <plugin>
```
Discover
I usually use Discover to filter and check log messages, and use Dashboard to make graphs, such as Area, Bar, etc., to extract data patterns.

How to draw a graph easily from Discover:
- In Discover, query and filter to get the target log records.
- In the left-side panel, right click one of the selected fields -> Visualize.
I want to highlight the Area graph: it shows you the proportion of the target field value along the timeline. For example, in the graph settings, the horizontal axis is `@timestamp`, the vertical axis uses `count`, and it breaks down by the selected field of the message.
There is also saved object management: in the Discover, Dashboard and Index pattern sections you can save items and manage them, as well as export/import them from another Kibana instance.
Logstash
Logstash ingests data into Elasticsearch or other downstream consumers, introduction. It is usually paired with Beats.
Logstash offers self-contained architecture-specific downloads that include AdoptOpenJDK 11,
the latest long term support (LTS) release of JDK. Use the JAVA_HOME
environment variable if you want to use a JDK other than the version that is bundled.
Logstash configuration files
There are two types. Pipeline config:
- /etc/logstash/conf.d
and Logstash settings:
- startup.options
- logstash.yml
- jvm.options
- log4j2.properties
- pipelines.yml
There is a `pipeline.workers` setting in the `logstash.yml` file, and some input plugins such as UDP have their own `workers` setting; what's the difference? Read this post: A History of Logstash Output Workers. The input and the (filter + output) stages are separate pools with separate worker thread settings; `pipeline.workers` applies to the (filter + output) part, and its default value equals the number of CPU cores.
```bash
bin/logstash --help
```
Ad-hoc pipeline config:
```bash
cd /usr/share/logstash
```
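For a quick ad-hoc test you can pass the pipeline inline with `-e`; a minimal sketch that reads events from stdin and prints them to stdout:

```bash
bin/logstash -e 'input { stdin {} } output { stdout { codec => rubydebug } }'
```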
There are 4 important parts to configuring a processing pipeline:
- Logstash input plugin
- Logstash filter plugin
- Logstash output plugin
- Condition and reference in filter and output
```
# this output can be used for testing
```
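A sketch tying the four parts together (input, filter, output plus a conditional); the log path, grok pattern and file name are placeholders, and stdout is the testing output mentioned above:

```bash
cat > /etc/logstash/conf.d/test.conf <<'EOF'
input {
  file {
    path => "/var/log/syslog"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }
  }
}
output {
  if "_grokparsefailure" not in [tags] {
    stdout { codec => rubydebug }
  }
}
EOF
```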
For the `stdout` plugin, the output can be examined with `journalctl -ex -u logstash -f`.
Read Beats Data
To read data from Beats:
- Input: where is the data from? logs? beats?
- Filter: how should we parse the data? grok filters, geoip filters, etc.
- Output: where should we store the logs? backend? Elasticsearch?
Go to `/etc/logstash/conf.d` and create a new file, for example `beats.conf`:
```
input {
```
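A typical beats.conf looks roughly like this; port 5044, the index naming pattern and the Elasticsearch host are the common defaults and my assumptions here, not taken from the original snippet:

```bash
cat > /etc/logstash/conf.d/beats.conf <<'EOF'
input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}
EOF
```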
In the above output, `%{[@metadata][beat]}` is a field reference; please see Accessing event data and fields in Logstash, and the Logstash data types.

Then run a command to check the file's validity:
```bash
# --config.test_and_exit: parses configuration
```
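A sketch of the check plus the restart that usually follows; the paths assume the package install layout:

```bash
bin/logstash --path.settings /etc/logstash -f /etc/logstash/conf.d/beats.conf --config.test_and_exit
sudo systemctl restart logstash   # reload the pipeline once the config parses cleanly
```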
Beats
Beats (https://www.elastic.co/beats/), written in Golang, can output data to Elasticsearch, Logstash and Redis. But usually we send data to Logstash (for pre-processing) and then forward it to Elasticsearch.

Each Beat has a config yaml file with detailed configuration guidelines. For example, in the config yaml file, comment out the Elasticsearch output and use the Logstash output instead (a Filebeat example is sketched after the list below).
- Filebeat: text log files
- Heartbeat: uptime
- Metricbeat: OS and applications
- Packetbeat: network monitoring
- Winlogbeat: windows event log
- Libbeat: write your own
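A sketch of that switch for Filebeat: leave the output.elasticsearch section commented out and point output.logstash at your Logstash host (the host:port below is a placeholder):

```bash
sudo tee -a /etc/filebeat/filebeat.yml <<'EOF'
output.logstash:
  hosts: ["172.20.21.20:5044"]
EOF
sudo systemctl restart filebeat
```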
X-Pack
X-Pack is an Elastic Stack extension that provides security, alerting, monitoring, reporting, machine learning, and many other capabilities. By default, when you install Elasticsearch, X-Pack is installed; it is open source now.

Check X-Pack; you will see the availability and status of each component:
```bash
curl -XGET http://9.30.94.85:9200/_xpack
```