Elasticsearch Top 10 Monitoring Indicators

Introduction

Monitoring Elasticsearch clusters is crucial for ensuring the performance, health, and reliability of the system. In this article, we will explore the top 10 monitoring indicators for Elasticsearch clusters, covering various dimensions such as cluster health, search performance, indexing performance, node health, and JVM running status.

1. Cluster Health Dimensions: Shards and Nodes

A cluster's health is determined by the state of its shards and nodes. The number of shards in the cluster can significantly affect performance: too many shards can lead to increased query rejection rates, while too few can leave node resources underutilized.

To monitor cluster health, use the GET _cluster/health command:

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 127,
  "active_shards": 127,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 120,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 51.417004048582996
}

Key indicators include:

  • status: The overall status of the cluster (green, yellow, or red).
  • number_of_nodes: The total number of nodes in the cluster.
  • number_of_data_nodes: The total number of data nodes in the cluster.
  • active_primary_shards: The number of active primary shards.
  • unassigned_shards: The number of unassigned shards.
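
As a rough illustration of how these indicators might be turned into alerts, the following Python sketch checks a parsed _cluster/health response. It assumes the response has already been loaded into a dict using the lowercase field names that Elasticsearch returns in its JSON. The function name check_cluster_health and the expected_nodes parameter are hypothetical, introduced only for this example.

```python
def check_cluster_health(health, expected_nodes):
    """Return a list of warning strings for a parsed _cluster/health response."""
    warnings = []
    # Anything other than green means some shards are not fully allocated.
    if health["status"] != "green":
        warnings.append(f"cluster status is {health['status']}")
    # A missing node usually shows up here before it shows up elsewhere.
    if health["number_of_nodes"] < expected_nodes:
        warnings.append(
            f"only {health['number_of_nodes']} of {expected_nodes} nodes present"
        )
    if health["unassigned_shards"] > 0:
        warnings.append(f"{health['unassigned_shards']} unassigned shards")
    return warnings


# Values taken from the sample response shown earlier in this section.
sample = {
    "status": "yellow",
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 127,
    "unassigned_shards": 120,
}

for warning in check_cluster_health(sample, expected_nodes=1):
    print(warning)
```

On a single-node cluster such as the one in the sample, replica shards cannot be allocated, so a yellow status with unassigned shards is expected rather than alarming.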

2. Search Performance Dimensions: Request Rate and Latency

Search performance is described by two measurements: request rate and latency. The request rate is the number of search requests the cluster processes per unit of time, while latency is the time taken to process each request.

To monitor search performance, use the GET index_a/_stats command. The search section of the response looks like this:

{
  "open_contexts": 0,
  "query_total": 10,
  "query_time_in_millis": 0,
  "query_current": 0,
  "fetch_total": 1,
  "fetch_time_in_millis": 0,
  "fetch_current": 0,
  "scroll_total": 5,
  "scroll_time_in_millis": 15850,
  "scroll_current": 0,
  "suggest_total": 0,
  "suggest_time_in_millis": 0,
  "suggest_current": 0
}

Key indicators include:

  • query_total: The total number of queries.
  • query_time_in_millis: The total time taken to process all queries.
  • fetch_total: The total number of fetch requests.
  • fetch_time_in_millis: The total time taken to process all fetch requests.
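
Because these counters are cumulative, request rate and latency have to be derived from the difference between two samples. The Python sketch below shows one way to do that. The function name search_rate_and_latency, the 60-second interval, and the sample numbers are all hypothetical; only the field names query_total and query_time_in_millis come from the stats response.

```python
def search_rate_and_latency(prev, curr, interval_seconds):
    """Derive queries per second and average ms per query from two samples."""
    queries = curr["query_total"] - prev["query_total"]
    millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    rate = queries / interval_seconds
    # Avoid dividing by zero when no queries ran during the interval.
    avg_latency_ms = millis / queries if queries > 0 else 0.0
    return rate, avg_latency_ms


# Hypothetical samples taken 60 seconds apart (not real measurements).
earlier = {"query_total": 1000, "query_time_in_millis": 5000}
later = {"query_total": 1600, "query_time_in_millis": 9800}

rate, latency = search_rate_and_latency(earlier, later, interval_seconds=60)
print(f"{rate:.1f} queries/s, {latency:.1f} ms/query")
# prints "10.0 queries/s, 8.0 ms/query"
```

The same calculation applies to fetch_total and fetch_time_in_millis for the fetch phase.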

3. Indexing Performance Dimensions: Refresh and Merge Time

Indexing performance is reflected in refresh and merge times. Refresh time is the time taken to refresh an index so that newly indexed documents become searchable, while merge time is the time taken to merge segments.

To monitor indexing performance, use the GET /_nodes/stats command. The merges, refresh, and flush sections of each node's index statistics look like this:

{
  "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 245,
    "total_time_in_millis": 58332,
    "total_docs": 1351279,
    "total_size_in_bytes": 640703378,
    "total_stopped_time_in_millis": 0,
    "total_throttled_time_in_millis": 0,
    "total_auto_throttle_in_bytes": 2663383040
  },
  "refresh": {
    "total": 2955,
    "total_time_in_millis": 244217,
    "listeners": 0
  },
  "flush": {
    "total": 127,
    "periodic": 0,
    "total_time_in_millis": 13137
  }
}

Key indicators include:

  • merges.total: The total number of merges.
  • merges.total_time_in_millis: The total time taken to process all merges.
  • refresh.total: The total number of refreshes.
  • refresh.total_time_in_millis: The total time taken to process all refreshes.
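
The totals above are easier to interpret as an average time per operation. The following Python sketch divides each cumulative time by its operation count, using the numbers from the sample response. The helper name average_ms_per_operation is hypothetical, and the dict keys assume the lowercase field names that Elasticsearch returns.

```python
def average_ms_per_operation(section):
    """Average time per operation for a merges, refresh, or flush section."""
    total = section["total"]
    # Guard against a freshly started node that has not run the operation yet.
    return section["total_time_in_millis"] / total if total else 0.0


# Counts and times copied from the sample response in this section.
indices_stats = {
    "merges": {"total": 245, "total_time_in_millis": 58332},
    "refresh": {"total": 2955, "total_time_in_millis": 244217},
    "flush": {"total": 127, "total_time_in_millis": 13137},
}

for name in ("merges", "refresh", "flush"):
    average = average_ms_per_operation(indices_stats[name])
    print(f"{name}: {average:.1f} ms average")
```

These are lifetime averages since the node started. For trend monitoring, the same division can be applied to the difference between two samples.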

4. Node Health Dimensions: Memory, Disk, and CPU Indicators

Node health is critical for Elasticsearch clusters, and monitoring memory, disk, and CPU indicators is essential. Memory usage measures the amount of memory used by each node, while disk usage measures the amount of disk space used by each node. CPU usage measures the percentage of CPU used by each node.

To monitor node health, use the GET /_cat/nodes?v&h=id,disk.total,disk.used,disk.avail,disk.used_percent,ram.current,ram.percent,ram.max,cpu command:

id    disk.total  disk.used  disk.avail  disk.used_percent  ram.current  ram.percent  ram.max  cpu
Hk9w  931.3gb     472.5gb    458.8gb     50.73              6.1gb        78           7.8gb    14

Key indicators include:

  • disk.total: The total disk capacity.
  • disk.used: The amount of disk space used.
  • disk.avail: The total amount of available disk space.
  • ram.current: The current memory usage.
  • ram.percent: The percentage of memory used.
  • cpu: The percentage of CPU used.
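
The tabular _cat output can be split on whitespace and compared against thresholds. The Python sketch below does this for the sample row shown above. The function names parse_cat_nodes and node_warnings are hypothetical, and the threshold values (85, 90, and 80 percent) are illustrative assumptions, not recommendations from Elasticsearch.

```python
def parse_cat_nodes(text):
    """Parse whitespace-separated _cat/nodes output (with ?v header) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def node_warnings(node, disk_limit=85.0, ram_limit=90.0, cpu_limit=80.0):
    """Return warnings for columns that exceed the (assumed) percentage limits."""
    limits = {
        "disk.used_percent": disk_limit,
        "ram.percent": ram_limit,
        "cpu": cpu_limit,
    }
    warnings = []
    for column, limit in limits.items():
        value = float(node[column])
        if value > limit:
            warnings.append(f"{column} {value} exceeds {limit}")
    return warnings


# The sample output from this section.
cat_output = """
id   disk.total disk.used disk.avail disk.used_percent ram.current ram.percent ram.max cpu
Hk9w 931.3gb 472.5gb 458.8gb 50.73 6.1gb 78 7.8gb 14
"""

nodes = parse_cat_nodes(cat_output)
for node in nodes:
    print(node["id"], node_warnings(node))
```

With the default limits the sample node produces no warnings. Lowering ram_limit to 75 would flag its 78 percent memory usage.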

5. JVM Running Status Dimensions: Heap, GC, and Pool Size

Elasticsearch runs on the JVM, so heap, GC, and memory pool indicators directly affect cluster stability. Heap usage measures the amount of memory used by the JVM, while the GC indicators measure how often the JVM performs garbage collection and how long it takes. Pool size measures the amount of memory used by and available to each memory pool.

To monitor JVM running status, use the GET /_nodes/stats command. The jvm section for each node looks like this:

{
  "jvm": {
    "timestamp": 1557588707194,
    "uptime_in_millis": 22970151,
    "mem": {
      "heap_used_in_bytes": 843509048,
      "heap_used_percent": 40,
      "heap_committed_in_bytes": 2077753344,
      "heap_max_in_bytes": 2077753344,
      "non_heap_used_in_bytes": 156752056,
      "non_heap_committed_in_bytes": 167890944,
      "pools": {
        "young": {
          "used_in_bytes": 415298464,
          "max_in_bytes": 558432256,
          "peak_used_in_bytes": 558432256,
          "peak_max_in_bytes": 558432256
        },
        "survivor": {
          "used_in_bytes": 12178632,
          "max_in_bytes": 69730304,
          "peak_used_in_bytes": 69730304,
          "peak_max_in_bytes": 69730304
        },
        "old": {
          "used_in_bytes": 416031952,
          "max_in_bytes": 1449590784,
          "peak_used_in_bytes": 416031952,
          "peak_max_in_bytes": 1449590784
        }
      }
    },
    "threads": {
      "count": 116,
      "peak_count": 119
    },
    "gc": {
      "collectors": {
        "young": {
          "collection_count": 260,
          "collection_time_in_millis": 3463
        },
        "old": {
          "collection_count": 2,
          "collection_time_in_millis": 125
        }
      }
    }
  }
}

Key indicators include:

  • heap_used_in_bytes: The amount of heap memory used.
  • heap_used_percent: The percentage of heap memory used.
  • non_heap_used_in_bytes: The amount of non-heap memory used.
  • pools.young.used_in_bytes: The amount of memory used by the young pool.
  • pools.survivor.used_in_bytes: The amount of memory used by the survivor pool.
  • pools.old.used_in_bytes: The amount of memory used by the old pool.
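
To show how these values relate to each other, the Python sketch below derives heap usage, per-pool usage, and average GC pause time from a trimmed copy of the sample response. The function name jvm_summary and the keys of the returned dict are hypothetical; the input keys assume the lowercase field names that Elasticsearch returns.

```python
def jvm_summary(jvm):
    """Derive usage percentages and average GC pause times from a jvm section."""
    mem = jvm["mem"]
    summary = {
        "heap_used_percent": 100 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    }
    # Usage of each memory pool relative to its maximum size.
    for name, pool in mem["pools"].items():
        summary[f"{name}_pool_percent"] = 100 * pool["used_in_bytes"] / pool["max_in_bytes"]
    # Average pause per collection for each garbage collector.
    for name, gc in jvm["gc"]["collectors"].items():
        count = gc["collection_count"]
        summary[f"{name}_gc_avg_ms"] = (
            gc["collection_time_in_millis"] / count if count else 0.0
        )
    return summary


# Trimmed copy of the sample response in this section.
jvm = {
    "mem": {
        "heap_used_in_bytes": 843509048,
        "heap_max_in_bytes": 2077753344,
        "pools": {
            "young": {"used_in_bytes": 415298464, "max_in_bytes": 558432256},
            "survivor": {"used_in_bytes": 12178632, "max_in_bytes": 69730304},
            "old": {"used_in_bytes": 416031952, "max_in_bytes": 1449590784},
        },
    },
    "gc": {
        "collectors": {
            "young": {"collection_count": 260, "collection_time_in_millis": 3463},
            "old": {"collection_count": 2, "collection_time_in_millis": 125},
        }
    },
}

summary = jvm_summary(jvm)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

A steadily growing old pool combined with frequent or long old-generation collections is usually the pattern worth alerting on, more so than young-generation activity.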

6. Elasticsearch Top 10 Monitoring Indicators

After analyzing the various dimensions, the top 10 monitoring indicators for Elasticsearch clusters are:

  1. Cluster Health - Nodes and Shards
  2. Search Performance - Request Latency
  3. Search Performance - Request Rate
  4. Indexing Performance - Refresh Times
  5. Indexing Performance - Merge Times
  6. Node Health - Memory Usage
  7. Node Health - Disk I/O
  8. Node Health - CPU
  9. JVM Health - Heap Usage and Garbage Collection
  10. JVM Health - JVM Pool Size

Conclusion

Monitoring Elasticsearch clusters is crucial for ensuring the performance, health, and reliability of the system. By monitoring the top 10 indicators, you can identify potential problems and take corrective action to prevent downtime and economic losses. Effective monitoring can save companies significant costs and improve overall system performance.