Elasticsearch Shard Relocation and Rebalancing Monitoring Guide

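Overview

When nodes are added to or removed from an Elasticsearch cluster, the cluster automatically triggers shard rebalancing in order to achieve:

Load balancing - shards are spread evenly across the nodes
High availability - primary and replica shards are placed on different nodes
Performance - the storage and compute resources of new nodes are put to use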

Environment

Current cluster state

Cluster name	Status	Nodes	Shards	Notes
es-dcdc4a67	green	2	20	es-data-2 Pending

Kubernetes cluster nodes

    NAME       STATUS   ROLES                  AGE   VERSION
    qfusion1   Ready    control-plane,master   21d   v1.24.10
    qfusion2   Ready    control-plane,master   21d   v1.24.10
    qfusion3   Ready    control-plane,master   21d   v1.24.10
    qfusion4   Ready    <none>                 21d   v1.24.10

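Shard rebalancing mechanism

Default ES balancing strategy

    ┌──────────────────────────────────────────────────────────────────────────┐
    │                     Elasticsearch balancing decision                     │
    ├──────────────────────────────────────────────────────────────────────────┤
    │                                                                          │
    │  1. Node join detected    2. Balance computation   3. Shard relocation   │
    │  ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐  │
    │  │ New node joins   │ ──> │ Compute imbalance│ ──> │ Relocate shards  │  │
    │  │ Rebalance starts │     │ Schedule moves   │     │ Recover replicas │  │
    │  └──────────────────┘     └──────────────────┘     └──────────────────┘  │
    │                                                                          │
    │  Balancing factors:                                                      │
    │  • Even shard count per node                                             │
    │  • Even disk usage                                                       │
    │  • A replica never shares a node with its primary                        │
    │  • Shard data size (avoid moving large shards)                           │
    │                                                                          │
    └──────────────────────────────────────────────────────────────────────────┘

Shard state transitions

    INITIALIZING ──> RELOCATING ──> STARTED
         ↑               ↑             ↑
         │               │             │
    Initializing     Relocating     Running

During a relocation the source copy is reported as RELOCATING, while the new copy on the target node goes through INITIALIZING before it becomes STARTED.

State	Description
STARTED	Running normally and able to serve requests
RELOCATING	Being moved to another node
INITIALIZING	Being initialized (a newly created shard copy)
UNASSIGNED	Not assigned to any node (possibly because of insufficient resources)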

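Commands for inspecting shard relocation

1. Check cluster health

    curl -u elastic:password "localhost:9200/_cluster/health?pretty"

Sample output (the `//` comments are annotations, not part of the response):

    {
      "cluster_name": "es-dcdc4a67",
      "status": "green",
      "number_of_nodes": 2,
      "number_of_data_nodes": 2,
      "active_primary_shards": 10,
      "active_shards": 20,
      "relocating_shards": 0,      // shards currently relocating
      "initializing_shards": 0,    // shards currently initializing
      "unassigned_shards": 0       // shards not assigned to any node
    }

Key fields:

Field	Description
`relocating_shards`	A value above 0 means a relocation is in progress
`initializing_shards`	Number of shards currently initializing
`unassigned_shards`	Number of unassigned shards (abnormal when above 0)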
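
For scripting, the same three counters are also available as the `relo`, `init` and `unassign` columns of the `_cat/health` API. The following is a minimal sketch, not an official tool: `classify_health` is a hypothetical helper name, and the host and credentials in the usage comment are the placeholders used throughout this guide.

```shell
# classify_health: read one "status relo init unassign" line on stdin
# and print a short verdict. Hypothetical helper, not part of Elasticsearch.
classify_health() {
  local status relo init unassign
  read -r status relo init unassign
  if [ "${unassign:-0}" -gt 0 ]; then
    echo "unassigned: $unassign shard(s) have no node (status=$status)"
  elif [ "${relo:-0}" -gt 0 ]; then
    echo "relocating: $relo shard(s) moving (status=$status)"
  elif [ "${init:-0}" -gt 0 ]; then
    echo "initializing: $init shard(s) starting (status=$status)"
  else
    echo "stable: no shard movement (status=$status)"
  fi
}

# Assumed usage against a live cluster:
#   curl -s -u elastic:password \
#     "localhost:9200/_cat/health?h=status,relo,init,unassign" | classify_health
```

Unassigned shards are checked first because they are the only one of the three conditions that the table above marks as abnormal.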

2. List the state of every shard

    curl -u elastic:password "localhost:9200/_cat/shards?v"

Sample output:

    index         shard prirep state      docs store ip            node
    test_index_01 0     p      STARTED    9000 345kb 10.255.253.41 es-data-0
    test_index_01 0     r      STARTED    9000 345kb 10.255.253.67 es-data-1
    test_index_02 0     p      RELOCATING 9000 230kb 10.255.253.41 es-data-0
    test_index_02 0     r      STARTED    9000 345kb 10.255.253.67 es-data-1
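
On a cluster with many indices the full listing is long, so a per-state tally is often easier to read. The sketch below is one way to do that with `awk`; `count_by_state` is a hypothetical helper name, and it assumes the output was requested with `?v` so that the header row is present.

```shell
# count_by_state: tally shard copies per state from "_cat/shards?v" output
# on stdin. The state column is located by its header name.
count_by_state() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "state") col = i; next }
    col && NF >= col { count[$col]++ }
    END { for (s in count) print s, count[s] }
  ' | sort
}

# Assumed usage against a live cluster:
#   curl -s -u elastic:password "localhost:9200/_cat/shards?v" | count_by_state
```

Fed the sample output above, it reports one RELOCATING copy and three STARTED copies.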

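3. Check relocation progress (most detailed)

    curl -u elastic:password "localhost:9200/_cat/recovery?v"

Sample output:

    index   shard time type       stage    source_node target_node bytes   bytes_pct files files_pct
    myindex 0     45s  RELOCATION finalize node-0      node-3      150.2mb 100.0%    12    100.0%
    myindex 1     30s  RELOCATION translog node-1      node-3      89.5mb  85.3%     8     90.0%

Field	Description
`type`	RELOCATION = relocation, REPLICA = replica recovery (many Elasticsearch versions instead report any copy from another node as `peer`)
`stage`	init (initialization) → index (file copy) → translog (transaction log replay) → finalize (finishing) → done
`bytes_pct`	Percentage of the data transferred so far
`source_node`	Source node
`target_node`	Target node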
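
When only the unfinished recoveries matter, the table can be filtered on the `stage` column. This is a minimal sketch under a few assumptions: `active_recoveries` is a hypothetical helper name, the input includes the `?v` header row, and the progress column is named either `bytes_pct` (as in the sample above) or `bytes_percent`.

```shell
# active_recoveries: from "_cat/recovery?v" output on stdin, print
# "index shard stage bytes%" for every recovery whose stage is not "done".
active_recoveries() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "index") idx = i
        if ($i == "shard") sh = i
        if ($i == "stage") st = i
        if ($i ~ /^bytes_(pct|percent)$/) bp = i
      }
      next
    }
    st && $st != "done" { print $idx, $sh, $st, $bp }
  '
}

# Assumed usage against a live cluster:
#   curl -s -u elastic:password "localhost:9200/_cat/recovery?v" | active_recoveries
```

Looking columns up by header name keeps the filter working when a particular Elasticsearch version prints additional columns.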

4. Show only the relocations in progress

    curl -u elastic:password "localhost:9200/_cat/recovery?v&active_only=true"

5. Show per-node allocation statistics

    curl -u elastic:password "localhost:9200/_cat/allocation?v"

Sample output:

    shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
    12     83.3mb       102.1gb   97.7gb     199.9gb    51           245.0.2.130    245.0.2.130    es-data-3
    11     42.5mb       110.2gb   79.7gb     189.9gb    58           245.0.3.73     245.0.3.73     es-data-1
    12     2.4mb        22.6gb    177.2gb    199.9gb    11           10.255.254.189 10.255.254.189 es-data-0
    11     134.9mb      91.4gb    108.4gb    199.9gb    45           245.0.1.150    245.0.1.150    es-data-2
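
A quick way to judge whether the cluster is balanced is the gap between the fullest and the emptiest node. The sketch below computes that gap from the allocation table; `shard_spread` is a hypothetical helper name, and it assumes the `?v` header row is present. Any UNASSIGNED summary row is skipped.

```shell
# shard_spread: from "_cat/allocation?v" output on stdin, print the
# difference between the highest and lowest per-node shard count.
shard_spread() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "shards") col = i; next }
    col && $NF != "UNASSIGNED" {
      n = $col + 0
      if (!seen || n > max) max = n
      if (!seen || n < min) min = n
      seen = 1
    }
    END { if (seen) print max - min }
  '
}

# Assumed usage against a live cluster:
#   curl -s -u elastic:password "localhost:9200/_cat/allocation?v" | shard_spread
```

For the sample output above the per-node counts are 12, 11, 12 and 11, so the spread is 1.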

6. Show the cluster balancing settings

    curl -u elastic:password "localhost:9200/_cluster/settings?flat_settings=true"

Only explicitly configured values are returned; add `include_defaults=true` to see the defaults as well.

7. Running the commands in a Kubernetes environment

    # Set the kubeconfig
    export KUBECONFIG=/bpx/.148-admin.conf

    # Fetch the password
    PASSWORD=$(kubectl get secret -n qfusion-admin es-dcdc4a67-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)

    # Check cluster health
    kubectl exec -n qfusion-admin es-dcdc4a67-es-data-0 -- curl -s -u "elastic:$PASSWORD" "localhost:9200/_cluster/health?pretty"

    # Check shard relocation
    kubectl exec -n qfusion-admin es-dcdc4a67-es-data-0 -- curl -s -u "elastic:$PASSWORD" "localhost:9200/_cat/recovery?v"

Case study

Case: node es-data-3 joins the cluster

Timeline

Time	Event
02:08:32	es-data-3 joins the cluster
02:08:34	Cluster state updated, version 2220
02:08:35	New indices start being allocated to the new node

Log records

    [es-data-3] master node changed {previous [], current [...es-data-2...]}
    [es-data-2] node-join[{es-data-3}... join existing leader]
    [es-data-0] added {es-data-3}, term: 5, version: 2220

Change in shard distribution

Before the join (3 nodes):

Node	Shards
es-data-0	15
es-data-1	15
es-data-2	16

After the join (4 nodes):

Node	Shards
es-data-0	12
es-data-1	11
es-data-2	11
es-data-3	12

Relocation details

Shard	Before	After
filebeat primary	es-data-2	es-data-3
test_index_02 primary	(newly created)	es-data-3
test_index_08 primary	(newly created)	es-data-3

Monitoring relocation progress in real time

Continuous monitoring commands

    # Method 1: watch
    watch -n 1 'curl -s -u elastic:pass "localhost:9200/_cat/recovery?v&active_only=true"'

    # Method 2: while loop
    while true; do
      clear
      echo "=== $(date) ==="
      curl -s -u elastic:pass "localhost:9200/_cat/recovery?v"
      sleep 2
    done

    # Method 3: inside Kubernetes
    kubectl exec -n qfusion-admin <pod> -- bash -c '
      while true; do
        curl -s -u elastic:pass "localhost:9200/_cat/recovery?v&active_only=true"
        sleep 1
      done
    '
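
The loops above run until interrupted. For automation (for example, before a rolling restart) it can be more convenient to block until no shard is relocating and then continue. The following is a minimal sketch of such a bounded polling loop; `wait_for_zero` and `relocating_count` are hypothetical helper names, and the fetcher assumes the `relo` column of `_cat/health` together with the placeholder credentials used in this guide.

```shell
# wait_for_zero FETCH [INTERVAL] [ATTEMPTS]
# Call FETCH (a command that prints a number) every INTERVAL seconds,
# at most ATTEMPTS times, until it prints 0.
# Returns 0 once the count reaches zero, 1 if the attempts run out.
wait_for_zero() {
  local fetch="$1" interval="${2:-2}" attempts="${3:-30}" i=1 n
  while [ "$i" -le "$attempts" ]; do
    n=$("$fetch")
    if [ "${n:-1}" -eq 0 ] 2>/dev/null; then
      echo "done after $i check(s)"
      return 0
    fi
    echo "check $i: ${n:-?} shard(s) still relocating"
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Assumed fetcher for a live cluster: prints the number of relocating shards.
relocating_count() {
  curl -s -u elastic:pass "localhost:9200/_cat/health?h=relo" | tr -d '[:space:]'
}

# Usage: wait_for_zero relocating_count 2 150
```

Passing the fetcher as an argument keeps the loop independent of how the count is obtained, so the same function also works with a `kubectl exec` based fetcher. An empty or non-numeric reply is treated as "not finished yet" rather than as success.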

Viewing relocation logs

    # Search the Pod logs for relocation records
    kubectl logs -n qfusion-admin <pod-name> -c elasticsearch --tail=500 | grep -i "relocat"

    # Follow the ES log file
    kubectl exec -n qfusion-admin <pod-name> -- tail -f /usr/share/elasticsearch/logs/es-dcdc4a67.log

Manually triggering shard relocation

Method 1: Exclude a node

    # Exclude a specific node (its shards are moved off it)
    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.exclude._name": "es-data-0"
      }
    }'

    # Remove the exclusion
    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.exclude._name": ""
      }
    }'
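
After a node has been excluded, it is worth confirming that it is actually empty before taking it down. One way, sketched below, is to count how many shard copies `_cat/shards` still places on that node; `shards_on_node` is a hypothetical helper name and the node name in the usage comment is the one from the example above.

```shell
# shards_on_node NODE: count the shard copies that "_cat/shards?h=node"
# output (one node name per line, on stdin) still places on NODE.
shards_on_node() {
  awk -v target="$1" '$1 == target { n++ } END { print n + 0 }'
}

# Assumed usage against a live cluster; the node is drained when this prints 0:
#   curl -s -u elastic:pass "localhost:9200/_cat/shards?h=node" | shards_on_node es-data-0
```

Only the first field of each line is compared, so a shard that is still relocating away from the node is counted until its move completes.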

Method 2: Adjust the balancing configuration

    # Enable automatic rebalancing
    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.rebalance.enable": "all"
      }
    }'

    # Adjust the number of concurrent rebalancing relocations
    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 3
      }
    }'

Method 3: Move a shard manually

    # Move a specific shard to a given node
    curl -u elastic:pass -X POST "localhost:9200/_cluster/reroute" \
      -H 'Content-Type: application/json' -d '
    {
      "commands": [
        {
          "move": {
            "index": "test_index",
            "shard": 0,
            "from_node": "node-1",
            "to_node": "node-2"
          }
        }
      ]
    }'

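Frequently asked questions

Q1: Why doesn't relocation start immediately after a new node joins?

A: ES decides whether to relocate shards based on the following factors:

Difference in disk usage between nodes
Difference in shard count between nodes
Shard data size (to avoid the performance impact of moving large shards)

The decision can be tuned with the following settings (the values shown are the defaults):

    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.balance.shard": 0.45,
        "cluster.routing.allocation.balance.index": 0.55,
        "cluster.routing.allocation.balance.threshold": 1.0
      }
    }'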
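
To get a feel for how these three settings interact, the sketch below evaluates a simplified model of the classic weight-based balancer: each node gets a weight from its total shard count and its shard count for one index, each relative to the cluster average, and a move is considered only when the gap between two nodes exceeds the threshold. This is an illustration under stated assumptions, not the exact implementation of any particular release (newer versions may weigh additional signals). `node_weight` and `should_rebalance` are hypothetical helper names, and the per-index numbers in the usage note are assumed for the example.

```shell
# node_weight SHARD_W INDEX_W NODE_SHARDS AVG_SHARDS NODE_INDEX_SHARDS AVG_INDEX_SHARDS
# Simplified model (an assumption, see the text above):
#   weight = t0 * (node_shards - avg_shards)
#          + t1 * (node_index_shards - avg_index_shards)
# with t0 = shard_w / (shard_w + index_w) and t1 = index_w / (shard_w + index_w).
node_weight() {
  awk -v sw="$1" -v iw="$2" -v ns="$3" -v avgs="$4" -v ni="$5" -v avgi="$6" 'BEGIN {
    t0 = sw / (sw + iw); t1 = iw / (sw + iw)
    printf "%.3f\n", t0 * (ns - avgs) + t1 * (ni - avgi)
  }'
}

# should_rebalance HEAVY_WEIGHT LIGHT_WEIGHT THRESHOLD
# Print "rebalance" when the weight gap exceeds the threshold, else "keep".
should_rebalance() {
  awk -v hi="$1" -v lo="$2" -v th="$3" 'BEGIN {
    if (hi - lo > th) print "rebalance"; else print "keep"
  }'
}
```

Take the case study's 46 shards on 4 nodes (an average of 11.5 per node) and assume an index with two copies (an average of 0.5 per node). Right after the join, `node_weight 0.45 0.55 16 11.5 1 0.5` prints 2.300 for es-data-2 and `node_weight 0.45 0.55 0 11.5 0 0.5` prints -5.450 for the empty es-data-3; `should_rebalance 2.300 -5.450 1.0` then prints `rebalance`. For the final 12 versus 11 distribution the weights are 0.500 and -0.500, the gap equals the threshold without exceeding it, and the result is `keep`.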

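Q2: Why does es-data-4 stay in the Pending state?

A: Scheduling fails because of Pod anti-affinity:

The Kubernetes cluster has only 4 nodes
ES is configured with Pod anti-affinity (each ES node must run on a different K8s node)
The fifth ES node therefore cannot be scheduled

Solutions:

Add a new Kubernetes node
Or relax the Pod anti-affinity configuration
Or reduce the number of ES data nodes

Q3: How can shard relocation be sped up?

A: Raise the concurrent recovery limits and the recovery bandwidth cap:

    curl -u elastic:pass -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 6,
        "cluster.routing.allocation.node_initial_primaries_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "100mb"
      }
    }'

Q4: Does relocation affect the service?

A: It depends on the type of shard being moved:

Replica shard relocation: no impact, the service keeps running normally
Primary shard relocation: a brief impact (on the order of milliseconds) while traffic switches over automatically

Q5: How do I cancel a relocation that is in progress?

    # Cancel the relocation of a specific shard
    curl -u elastic:pass -X POST "localhost:9200/_cluster/reroute" \
      -H 'Content-Type: application/json' -d '
    {
      "commands": [
        {
          "cancel": {
            "index": "test_index",
            "shard": 0,
            "node": "target_node",
            "allow_primary": false
          }
        }
      ]
    }'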

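Appendix

Quick reference command table

Operation	Command
Check cluster health	`curl -u elastic:pass "localhost:9200/_cluster/health?pretty"`
List shards	`curl -u elastic:pass "localhost:9200/_cat/shards?v"`
Check relocation progress	`curl -u elastic:pass "localhost:9200/_cat/recovery?v"`
Show node allocation	`curl -u elastic:pass "localhost:9200/_cat/allocation?v"`
List nodes	`curl -u elastic:pass "localhost:9200/_cat/nodes?v"`
Show cluster settings	`curl -u elastic:pass "localhost:9200/_cluster/settings?flat_settings=true"`

Related configuration parameters

Parameter	Default	Description
`cluster.routing.rebalance.enable`	all	Which shard types may be rebalanced
`cluster.routing.allocation.cluster_concurrent_rebalance`	2	Concurrent rebalancing relocations across the cluster
`cluster.routing.allocation.node_concurrent_recoveries`	2	Concurrent recoveries per node
`indices.recovery.max_bytes_per_sec`	40mb	Recovery bandwidth limit
`cluster.routing.allocation.balance.shard`	0.45	Weight of the per-node shard count
`cluster.routing.allocation.balance.index`	0.55	Weight of the per-index shard count
`cluster.routing.allocation.balance.threshold`	1.0	Balancing threshold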
