Update Elastic Stack automatically using Ansible Playbooks

Analyzing logs for a network with thousands of devices was a complicated, lengthy, and tedious job before I switched to the Elastic Stack as my centralized logging platform. It turned out to be a smart choice. Now, I have a single location to search all my logs, get near-instant search results, benefit from powerful visualizations for analysis and troubleshooting, and utilize dashboards that provide a helpful network overview.

The Elastic Stack is continuously evolving, introducing impressive features at a rapid pace with two new releases almost every month. I prioritize keeping my environment current to leverage these advancements and address any bugs or security vulnerabilities, which requires frequent updates.

Despite the Elastic website’s clear and detailed documentation, including guidance on product upgrades, manually upgrading, particularly the Elasticsearch cluster, is intricate. It involves numerous steps in a specific order. That’s why I chose to automate the process using Ansible Playbooks.

This Ansible tutorial will guide you through the Ansible Playbooks I created to automate upgrading my Elastic Stack installation.

Understanding the Elastic Stack

Previously known as the ELK stack, the Elastic Stack, comprising Elasticsearch, Logstash, and Kibana from Elastic, delivers a robust platform for indexing, searching, and analyzing data. Its applications are wide-ranging, spanning logging, security analysis, application performance monitoring, and site search.

  • Elasticsearch, the heart of the stack, is a distributed search and analytics engine enabling near real-time search results even with massive data volumes.

  • Logstash functions as a processing pipeline, ingesting data from diverse sources (currently 50 official input plugins), parsing, filtering, transforming, and transmitting it to various outputs, including the crucial Elasticsearch output plugin.

  • Kibana acts as the user and operations interface, facilitating data visualization, search, navigation, and the creation of insightful dashboards.

Introducing Ansible

Ansible, an IT automation platform, streamlines system configuration, software deployment and upgrades, and complex IT task orchestration. Its simplicity and ease of use are key strengths. One standout feature is its agentless nature, eliminating the need for additional software installation and management on managed hosts and devices. We’ll leverage Ansible’s automation capabilities to upgrade our Elastic Stack.
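If you are new to Ansible, a quick way to confirm it can manage your hosts is a minimal connectivity-check playbook. The following is a sketch (the filename ping.yml is hypothetical), assuming SSH access and Python are already set up on the managed hosts:

```yaml
# ping.yml -- confirm Ansible can reach and manage every inventory host.
# Run with: ansible-playbook -i inventory ping.yml
- name: Verify connectivity to all hosts
  hosts: all
  gather_facts: no

  tasks:
  - name: Ping every managed host
    ping:
```

If this playbook fails for any host, fix connectivity before attempting the upgrade playbooks below.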

Disclaimer and a Note of Caution

The playbooks shared here are based on the official product documentation and intended for upgrades within the same major version, such as 5.x→5.y or 6.x→6.y (where y>x). Cross-major version upgrades often require additional steps not covered in these playbooks.

Always review the release notes, particularly the breaking changes section, before using these playbooks for upgrades. Fully understand the tasks within the playbooks and cross-reference the upgrade instructions to ensure no crucial changes are missed.

It’s worth noting that I’ve employed these playbooks (or earlier versions) since Elasticsearch version 2.2 without encountering issues. Initially, I had separate playbooks for each product due to differing version numbers.

That said, I assume no responsibility for any consequences arising from using the information presented in this article.

Our Hypothetical Environment

Our playbooks will operate within a simulated environment consisting of six CentOS 7 servers:

  • 1 x Logstash Server
  • 1 x Kibana Server
  • 4 x Elasticsearch nodes

The number of servers in your environment is inconsequential. Simply reflect the actual count in your inventory file, and the playbooks should execute smoothly. If you’re not using an RHEL-based distribution, adapt the few distribution-specific tasks (primarily package manager related) accordingly.
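As an illustration of such an adaptation, here is a hedged, untested sketch of how the yum pre-download task might look on a Debian/Ubuntu host, assuming the Elastic APT repository is already configured (note that apt pins versions with = rather than -; the version-detection task would likewise need dpkg-query instead of rpm):

```yaml
# Hypothetical Debian/Ubuntu variant of the pre-download task.
# The apt module's download_only option fetches the package without installing it.
- name: Pre-download logstash install package
  apt:
    name: logstash={{ elk_version }}
    download_only: yes
  when: version_found.stdout is version_compare(elk_version, '<')
```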

The example Elastic Stack environment our Ansible Playbooks will upgrade

The Inventory

Ansible relies on an inventory file to identify the target hosts for playbook execution. In our hypothetical setup, we’ll use the following inventory file:

[logstash]
server01 ansible_host=10.0.0.1

[kibana]
server02 ansible_host=10.0.0.2

[elasticsearch]
server03 ansible_host=10.0.0.3
server04 ansible_host=10.0.0.4
server05 ansible_host=10.0.0.5
server06 ansible_host=10.0.0.6

In this Ansible inventory file, sections enclosed in [] represent host groups. Our inventory includes three host groups: logstash, kibana, and elasticsearch. The playbooks utilize these group names, making the number of hosts in each group irrelevant as long as the group definitions are accurate.
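To sanity-check that Ansible parses the inventory the way you expect before touching any server, you can print each group's membership using the built-in groups variable. A throwaway playbook sketch (the filename show-groups.yml is hypothetical):

```yaml
# show-groups.yml -- print the hosts Ansible resolved for each group.
# Run with: ansible-playbook -i inventory show-groups.yml
- name: Show resolved host groups
  hosts: localhost
  gather_facts: no

  tasks:
  - name: Print group membership from the inventory
    debug:
      msg: "logstash={{ groups.logstash }} kibana={{ groups.kibana }} elasticsearch={{ groups.elasticsearch }}"
```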

The Upgrade Procedure

The upgrade process will adhere to these steps:

  1. Pre-download the packages

  2. Logstash Upgrade

  3. Rolling Upgrade of the Elasticsearch cluster

  4. Kibana Upgrade

The primary objective is to minimize downtime. Users should ideally experience no interruptions, though Kibana might be briefly unavailable, which is generally acceptable.

Main Ansible Playbook

The upgrade process is orchestrated through a series of playbooks. I’ll utilize Ansible’s import_playbook feature to consolidate all playbooks into a single main playbook for streamlined execution.

- name: pre-download
  import_playbook: pre-download.yml

- name: logstash-upgrade
  import_playbook: logstash-upgrade.yml

- name: elasticsearch-rolling-upgrade
  import_playbook: elasticsearch-rolling-upgrade.yml

- name: kibana-upgrade
  import_playbook: kibana-upgrade.yml

This straightforward approach ensures the playbooks are executed in the correct sequence.

To illustrate the usage of this Ansible playbook, consider the following command, which upgrades the Elastic Stack to version 6.5.4:

$ ansible-playbook -i inventory -e elk_version=6.5.4 main.yml

Pre-downloading the Packages

This initial step, while optional, is a recommended practice. Upgrading a package requires stopping its service first, and while a fast internet connection minimizes package download time, that's not always guaranteed. To keep each service's downtime as short as possible, the first playbook uses yum to download all the packages beforehand.

- hosts: logstash
  gather_facts: no

  tasks:
  - name: Validate logstash Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")
  - name: Get logstash current version
    command: rpm -q logstash --qf %{VERSION}
    args:
      warn: no
    changed_when: False
    register: version_found

  - name: Pre-download logstash install package
    yum:
      name: logstash-{{ elk_version }}
      download_only: yes
    when: version_found.stdout is version_compare(elk_version, '<')

The first line restricts this play to hosts within the logstash group, while the second line disables fact gathering, enhancing speed as long as no tasks depend on host facts.

The first task verifies the elk_version variable, representing the target Elastic Stack version passed during playbook invocation. An invalid variable format will halt the play. This validation is present in all plays for potential isolated execution.
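The same guard can also be expressed positively with the assert module; this is an equivalent alternative, not what the playbooks in this article use:

```yaml
# Alternative version guard using assert instead of fail.
- name: Validate ELK Version
  assert:
    that:
      - elk_version is defined
      - elk_version is match("\d+\.\d+\.\d+")
    msg: "Invalid ELK Version"
```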

The second task retrieves the current Logstash version using the rpm command and stores it in the version_found variable for subsequent use. The lines args:, warn: no, and changed_when: False primarily satisfy ansible-lint requirements but aren’t strictly necessary.

The final task executes the package pre-download command only if the installed Logstash version is older than the target, avoiding redundant downloads.

The other two plays are nearly identical, pre-downloading Elasticsearch and Kibana instead of Logstash:

- hosts: elasticsearch
  gather_facts: no

  tasks:
  - name: Validate elasticsearch Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")

  - name: Get elasticsearch current version
    command: rpm -q elasticsearch --qf %{VERSION}
    args:
      warn: no
    changed_when: False
    register: version_found

  - name: Pre-download elasticsearch install package
    yum:
      name: elasticsearch-{{ elk_version }}
      download_only: yes
    when: version_found.stdout is version_compare(elk_version, '<')

- hosts: kibana
  gather_facts: no

  tasks:
  - name: Validate kibana Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")

  - name: Get kibana current version
    command: rpm -q kibana --qf %{VERSION}
    args:
      warn: no
    changed_when: False
    register: version_found

  - name: Pre-download kibana install package
    yum:
      name: kibana-{{ elk_version }}
      download_only: yes
    when: version_found.stdout is version_compare(elk_version, '<')

Upgrading Logstash

Logstash should be upgraded first due to its backward compatibility with older Elasticsearch versions.

The initial tasks in this play mirror those in the pre-download play:

- name: Upgrade logstash
  hosts: logstash
  gather_facts: no

  tasks:
  - name: Validate ELK Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")

  - name: Get logstash current version
    command: rpm -q logstash --qf %{VERSION}
    changed_when: False
    register: version_found

The final two tasks reside within a block:

  - block:
    - name: Update logstash
      yum:
        name: logstash-{{ elk_version }}
        state: present

    - name: Restart logstash
      systemd:
        name: logstash
        state: restarted
        enabled: yes
        daemon_reload: yes
    when: version_found.stdout is version_compare(elk_version, '<')

The when conditional ensures these tasks execute only if the target version surpasses the current version. The first task within the block handles the Logstash upgrade, while the second restarts the service.

Performing a Rolling Upgrade of the Elasticsearch Cluster

To prevent Elasticsearch cluster downtime, we’ll perform a rolling upgrade, upgrading one node at a time after verifying the cluster’s health (green state).

The start of this play introduces a new element:

- name: Elasticsearch rolling upgrade
  hosts: elasticsearch
  gather_facts: no
  serial: 1

While Ansible typically executes plays against multiple hosts concurrently, the serial: 1 line enforces sequential execution, one host at a time.
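For the Elasticsearch play, serial: 1 is deliberate: the rolling upgrade depends on taking down exactly one node per pass. For plays where partial concurrency is safe, serial also accepts a count or a percentage; a hypothetical example:

```yaml
# Hypothetical: process at most 25% of the group's hosts per batch.
# Do NOT relax serial: 1 in the Elasticsearch rolling-upgrade play.
- name: Example of batched execution
  hosts: logstash
  gather_facts: no
  serial: "25%"
```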

Next, we define variables for use within the play:

  vars:
    es_disable_allocation: '{"transient":{"cluster.routing.allocation.enable":"none"}}'
    es_enable_allocation: '{"transient":{"cluster.routing.allocation.enable": "all","cluster.routing.allocation.node_concurrent_recoveries": 5,"indices.recovery.max_bytes_per_sec": "500mb"}}'
    es_http_port: 9200
    es_transport_port: 9300

The purpose of each variable will become apparent as they are used.

As always, the first task validates the target version:

  tasks:
  - name: Validate ELK Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")

Many subsequent tasks involve REST calls to the Elasticsearch cluster, which can target any node. While the current host seems logical, some commands execute while its Elasticsearch service is down. Therefore, we’ll dynamically choose a different host for REST calls using the set_fact module and Ansible inventory’s groups variable.

  - name: Set the es_host for the first host
    set_fact:
      es_host: "{{ groups.elasticsearch[1] }}"
    when: "inventory_hostname == groups.elasticsearch[0]"

  - name: Set the es_host for the remaining hosts
    set_fact:
      es_host: "{{ groups.elasticsearch[0] }}"
    when: "inventory_hostname != groups.elasticsearch[0]"

Before proceeding, we ensure the current node’s service is running:

  - name: Ensure elasticsearch service is running
    systemd:
      name: elasticsearch
      enabled: yes
      state: started
    register: response

  - name: Wait for elasticsearch node to come back up if it was stopped
    wait_for:
      port: "{{ es_transport_port }}"
      delay: 45
    when: response.changed == true

Similar to previous plays, we check the current version, opting for the Elasticsearch REST API instead of rpm to showcase an alternative.

  - name: Check current version
    uri:
      url: http://localhost:{{ es_http_port }}
      method: GET
    register: version_found
    until: version_found.status == 200
    retries: 10
    delay: 10

The remaining tasks are enclosed in a block that executes only if the current version is outdated:

  - block:
    - name: Enable shard allocation for the cluster
      uri:
        url: http://localhost:{{ es_http_port }}/_cluster/settings
        method: PUT
        body_format: json
        body: "{{ es_enable_allocation }}"

Contrary to the documentation, this step enables shard allocation. This is intentional to counteract any prior shard allocation disablement and prevent the next task (waiting for a green cluster state) from hanging indefinitely.

With shard allocation confirmed, we wait for the cluster to achieve a green state:

    - name: Wait for cluster health to return to green
      uri:
        url: http://localhost:{{ es_http_port }}/_cluster/health
        method: GET
      register: response
      until: "response.json.status == 'green'"
      retries: 500
      delay: 15

The cluster might take a while to regain its green state after a node service restart. The retries: 500 and delay: 15 settings translate to a wait time of 125 minutes (500 x 15 seconds), generally sufficient but adjustable based on data volume.

Next, we disable shard allocation:

    - name: Disable shard allocation for the cluster
      uri:
        url: http://localhost:{{ es_http_port }}/_cluster/settings
        method: PUT
        body_format: json
        body: "{{ es_disable_allocation }}"

Before stopping the service, we perform the optional yet recommended synced flush. Any 409 errors it returns are harmless, which is why 409 is added to the list of accepted status codes.

    - name: Perform a synced flush
      uri:
        url: http://localhost:{{ es_http_port }}/_flush/synced
        method: POST
        status_code: "200, 409"

The node is now ready for the upgrade:

    - name: Shutdown elasticsearch node
      systemd:
        name: elasticsearch
        state: stopped

    - name: Update elasticsearch
      yum:
        name: elasticsearch-{{ elk_version }}
        state: present

While the service is stopped, we query another node (the one stored in es_host) and wait for any shard relocations to finish:

    - name: Wait for all shards to be reallocated
      uri:
        url: http://{{ es_host }}:{{ es_http_port }}/_cluster/health
        method: GET
      register: response
      until: "response.json.relocating_shards == 0"
      retries: 20
      delay: 15

Once shard reallocation is complete, the Elasticsearch service is restarted, and we wait for it to become fully operational:

    - name: Start elasticsearch
      systemd:
        name: elasticsearch
        state: restarted
        enabled: yes
        daemon_reload: yes

    - name: Wait for elasticsearch node to come back up
      wait_for:
        port: "{{ es_transport_port }}"
        delay: 35

    - name: Wait for elasticsearch http to come back up
      wait_for:
        port: "{{ es_http_port }}"
        delay: 5

Before reenabling shard allocation, we ensure the cluster is either yellow or green:

    - name: Wait for cluster health to return to yellow or green
      uri:
        url: http://localhost:{{ es_http_port }}/_cluster/health
        method: GET
      register: response
      until: "response.json.status == 'yellow' or response.json.status == 'green'"
      retries: 500
      delay: 15

    - name: Enable shard allocation for the cluster
      uri:
        url: http://localhost:{{ es_http_port }}/_cluster/settings
        method: PUT
        body_format: json
        body: "{{ es_enable_allocation }}"
      register: response
      until: "response.json.acknowledged == true"
      retries: 10
      delay: 15

Finally, we wait for the node to fully recover before moving on to the next one:

    - name: Wait for the node to recover
      uri:
        url: http://localhost:{{ es_http_port }}/_cat/health
        method: GET
        return_content: yes
      register: response
      until: "'green' in response.content"
      retries: 500
      delay: 15

As emphasized, this entire block executes only during actual version upgrades:

  when: version_found.json.version.number is version_compare(elk_version, '<')

Upgrading Kibana

Kibana is the final component to be upgraded.

As expected, the first tasks resemble those in the Logstash upgrade and pre-download plays, with the addition of a variable definition:

- name: Upgrade kibana
  hosts: kibana
  gather_facts: no
  vars:
    set_default_index: '{"changes":{"defaultIndex":"syslog"}}'

  tasks:
  - name: Validate ELK Version
    fail: msg="Invalid ELK Version"
    when: elk_version is undefined or not elk_version is match("\d+\.\d+\.\d+")

  - name: Get kibana current version
    command: rpm -q kibana --qf %{VERSION}
    args:
      warn: no
    changed_when: False
    register: version_found

The set_default_index variable will be explained later.

The remaining tasks, executed only if the installed Kibana version is outdated, reside within a block. The first two tasks handle the Kibana update and restart:

    - name: Update kibana
      yum:
        name: kibana-{{ elk_version }}
        state: present

    - name: Restart kibana
      systemd:
        name: kibana
        state: restarted
        enabled: yes
        daemon_reload: yes

Ideally, this should suffice for Kibana. However, upgrades sometimes result in Kibana losing its default index pattern reference, prompting the first post-upgrade user to define it, leading to potential confusion. To prevent this, a task resets the default index pattern, which is syslog in this example but should be adjusted to match your setup. Before setting the index, we ensure Kibana is up and ready:

    - name: Wait for kibana to start listening
      wait_for:
        port: 5601
        delay: 5

    - name: Wait for kibana to be ready
      uri:
        url: http://localhost:5601/api/kibana/settings
        method: GET
      register: response
      until: "'kbn_name' in response and response.status == 200"
      retries: 30
      delay: 5

    - name: Set Default Index
      uri:
        url: http://localhost:5601/api/kibana/settings
        method: POST
        body_format: json
        body: "{{ set_default_index }}"
        headers:
          "kbn-version": "{{ elk_version }}"

Conclusion

The Elastic Stack is an invaluable tool worthy of exploration. Its continuous improvement can be challenging to keep up with. I hope these Ansible Playbooks prove as beneficial to you as they are to me.

They are available on GitHub at https://github.com/orgito/elk-upgrade. I recommend thorough testing in a non-production environment before deployment.

If you happen to be a Ruby on Rails developer interested in integrating Elasticsearch into your application, I suggest checking out Elasticsearch for Ruby on Rails: A Tutorial to the Chewy Gem by Core Toptal Software Engineer Arkadiy Zabazhanov.

Licensed under CC BY-NC-SA 4.0