Recovering from inconsistencies in Rook/Ceph
Introduction
While checking the state of the Kubernetes cluster after it came back from a power outage, I found a single Pod (gitbucket) stuck in ContainerCreating, and it had been in that state for quite a while.
The immediate symptom was that its PVC could not be mounted, but the real cause was that the power outage had left unreplayed data in the ext4 journal.
I quickly realized that running fsck.ext4 would fix it, but it was not obvious which volume to run it against, so I left this memo.
Environment
- Kubernetes v1.16.9 deployed by Kubespray
- Rook/Ceph v1.1.9 with FlexVolumes (BlockStorage using ext4)
The data damage caused by partition corruption, covered later in this memo, occurred on the following version:
- Rook/Ceph v1.6.7 with CSI BlueStore (with HDD 4TBx2)
Situation
Checking with kubectl showed the following state.
$ kubectl -n gitbucket get all
NAME READY STATUS RESTARTS AGE
pod/bucket-9b8d9799-khsm8 0/1 Init:1/2 1 27h
pod/mysql-84d459c89c-w8pb6 0/1 ContainerCreating 0 15m
The bucket Pod waits for mysql to start in one of its initContainers, so it stays stuck in the Init state. Once the mysql container is up, the bucket container should recover on its own, so the first thing to restore is the mysql container.
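The wait inside that initContainer is presumably a small polling loop along the following lines (a hypothetical sketch only; the actual manifest may use a different command or Service name):
## hypothetical initContainer command: block until the mysql Service answers
until mysqladmin ping -h mysql --silent; do
  echo "waiting for mysql..."
  sleep 5
done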
BlockStorage failure
Error details
Querying the Pod with kubectl describe showed the following.
$ kubectl -n gitbucket describe pod/mysql-84d459c89c-w8pb6
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 16m default-scheduler Successfully assigned gitbucket/mysql-84d459c89c-w8pb6 to node04
Warning FailedMount 15m (x5 over 16m) kubelet, node04 MountVolume.SetUp failed for volume "pvc-8aaa318c-1212-415d-b00f-e88b9c3c3087" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-8aaa318c-1212-415d-b00f-e88b9c3c3087 for pod gitbucket/mysql-84d459c89c-w8pb6. Volume is already attached by pod gitbucket/mysql-84d459c89c-rllg6. Status Pending
Warning FailedMount 2m46s (x3 over 11m) kubelet, node04 Unable to attach or mount volumes: unmounted volumes=[mysql-persistent-storage], unattached volumes=[database-auth-conf default-token-txprc mysql-persistent-storage]: timed out waiting for the condition
Warning FailedMount 92s (x10 over 15m) kubelet, node04 MountVolume.SetUp failed for volume "pvc-8aaa318c-1212-415d-b00f-e88b9c3c3087" : mount command failed, status: Failure, reason: failed to mount volume /dev/rbd2 [ext4] to /var/lib/kubelet/plugins/ceph.rook.io/rook-ceph/mounts/pvc-8aaa318c-1212-415d-b00f-e88b9c3c3087, error 'fsck' found errors on device /dev/rbd2 but could not correct them: fsck from util-linux 2.31.1
/dev/rbd2: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd2: Run journal anyway
/dev/rbd2: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 32s (x4 over 14m) kubelet, node04 Unable to attach or mount volumes: unmounted volumes=[mysql-persistent-storage], unattached volumes=[mysql-persistent-storage database-auth-conf default-token-txprc]: timed out waiting for the condition
This error message makes it clear that the problem lies on node04, on the device **/dev/rbd2**.
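Before running fsck it is worth confirming which image a /dev/rbdN device actually belongs to. On the node, the rbd kernel driver exposes this through sysfs (a sketch; I am assuming the usual /sys/bus/rbd layout):
$ ssh node04
$ lsblk | grep rbd                    ## list the RBD block devices currently mapped on this node
$ cat /sys/bus/rbd/devices/2/pool     ## pool backing /dev/rbd2 (e.g. replicapool)
$ cat /sys/bus/rbd/devices/2/name     ## image name, which includes the pvc-... identifier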
Fixing it on node04
At first I had no idea what to do, but after logging in to node04 and looking under /dev/ I found that rbd2 existed there, so I decided to try passing it to fsck.ext4.
$ ssh node04
$ sudo fsck.ext4 /dev/rbd2
e2fsck 1.44.1 (24-Mar-2018)
Superblock needs_recovery flag is clear, but journal has data.
Run journal anyway<y>? yes
/dev/rbd2: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (5067650, counted=5067649).
Fix<y>? yes
Free inodes count wrong (1310519, counted=1310518).
Fix<y>? yes
/dev/rbd2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/rbd2: 202/1310720 files (14.9% non-contiguous), 175231/5242880 blocks
After a little while, the Pod came up and ran without any problem.
$ kubectl -n gitbucket get all
NAME READY STATUS RESTARTS AGE
pod/bucket-9b8d9799-khsm8 0/1 Init:1/2 1 27h
pod/mysql-84d459c89c-w8pb6 0/1 Running 0 27m
I somewhat regret not having built this on XFS instead, although problems might well have occurred even with XFS.
That's all for this case.
A subsequent failure
This time a PVC used by Harbor could not be mounted, but the cause was different.
$ kubectl -n harbor describe pod/my-harbor-harbor-redis-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 12m (x228 over 7h42m) kubelet, node01 MountVolume.SetUp failed for volume "pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 for pod harbor/my-harbor-harbor-redis-0. Volume is already attached by pod harbor/my-harbor-harbor-redis-0. Status Pending
Warning FailedMount 7m18s (x151 over 7h40m) kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data default-token-nnqmn]: timed out waiting for the condition
Warning FailedMount 2m47s (x51 over 7h38m) kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-nnqmn data]: timed out waiting for the condition
Let's check the state of the PV pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 at this point.
$ kubectl -n harbor get pv pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 1Gi RWO Delete Bound harbor/data-my-harbor-harbor-redis-0 rook-ceph-block 232d
From the Rook/Ceph side, nothing looked particularly wrong.
$ kubectl -n harbor get pvc data-my-harbor-harbor-redis-0
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-my-harbor-harbor-redis-0 Bound pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 1Gi RWO rook-ceph-block 232d
$ kubectl -n rook-ceph exec -it rook-ceph-tools-7764cdbbc4-wkc64 ceph status
cluster:
id: 89ddf8a3-e7ce-41e6-a081-b6b5fab8b8e0
health: HEALTH_OK
services:
mon: 3 daemons, quorum v,w,y (age 7h)
mgr: a(active, since 7h)
mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
osd: 4 osds: 4 up (since 7h), 4 in (since 7h)
data:
pools: 13 pools, 104 pgs
objects: 2.64k objects, 6.1 GiB
usage: 567 GiB used, 6.6 TiB / 7.2 TiB avail
pgs: 104 active+clean
io:
client: 938 B/s rd, 9.5 KiB/s wr, 1 op/s rd, 0 op/s wr
$ kubectl -n rook-ceph logs -l rook-ceph-agent
...
2020-09-04 01:59:27.976756 I | flexdriver: calling agent to attach volume replicapool/pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7
2020-09-04 01:59:27.979645 I | flexvolume: volume attachment record rook-ceph/pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 exists for pod: harbor/my-harbor-harbor-redis-0
...
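The "volume attachment record ... exists" line points to a stale attachment object left over from the previous pod. With the flex driver these records live as custom resources in the rook-ceph namespace, so inspecting (and, as a last resort, deleting) the stale record might have been a lighter-weight fix than reinstalling; a sketch, assuming the volumes.rook.io CRD used by the flex driver of that era:
$ kubectl -n rook-ceph get volumes.rook.io                                                      ## attachment records kept by the rook agent
$ kubectl -n rook-ceph get volumes.rook.io pvc-d37fd3f0-22ac-42c2-a90d-7427c2fa91d7 -o yaml     ## shows which pod/node the record still points at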
Workaround
The only approach I could come up with was removing it from the StatefulSet definition, and since it had been deployed with helm, I deleted the release with helm and reinstalled it.
$ (cd harbor-helm; helm delete my-harbor)
$ (cd harbor-helm; helm install --replace --name my-harbor --namespace harbor .)
Sure enough, after this the PVC mounted without any problem.
Looking into it, the nightly reboot times of the nodes turned out to be very close to one another, so I made sure they are properly spread apart to reduce the chance of this kind of inconsistency, and decided to keep an eye on things for a while.
Filesystem failure
Another Pod ran into a state where it could not mount its filesystem.
Warning FailedMount 13m kubelet, node04 MountVolume.SetUp failed for volume "rook-data" : mount command failed, status: Failure, reason: failed to mount filesystem myfs to /var/lib/kubelet/pods/297a0883-0d02-41c9-855b-b193f02210ec/volumes/ceph.rook.io~rook/rook-data with monitor 10.233.32.156:6789,10.233.62.103:6789,10.233.34.233:6789:/ and options [name=admin secret=AQBuk+pctG/7JBAATogIHHCDNN7W7u4zppFtaw== mds_namespace=myfs]: mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/297a0883-0d02-41c9-855b-b193f02210ec/volumes/ceph.rook.io~rook/rook-data --scope -- mount -t ceph -o name=admin,secret=AQBuk+pctG/7JBAATogIHHCDNN7W7u4zppFtaw==,mds_namespace=myfs 10.233.32.156:6789,10.233.62.103:6789,10.233.34.233:6789:/ /var/lib/kubelet/pods/297a0883-0d02-41c9-855b-b193f02210ec/volumes/ceph.rook.io~rook/rook-data
Output: Running scope as unit: run-r35a1db0e9e564106b94b7ea5d1973fb3.scope
mount: /var/lib/kubelet/pods/297a0883-0d02-41c9-855b-b193f02210ec/volumes/ceph.rook.io~rook/rook-data: mount(2) system call failed: No route to host.
Warning FailedMount 13m kubelet, node04 MountVolume.SetUp failed for volume "rook-data" : mount command failed, status: Failure, reason: failed to mount filesystem myfs to /var/lib/kubelet/pods/297a0883-0d02-41c9-855b-b193f02210ec/volumes/ceph.rook.io~rook/rook-data with monitor 10.233.32.156:6789,10.233.62.103:6789,10.233.34.233:6789:/ and options [name=admin secret=AQBuk+pctG/7JBAATogIHHCDNN7W7u4zppFtaw== mds_namespace=myfs]: mount failed: exit status 32
Fix
There was no particular error message on the MDS side, but I decided to try restarting the MDS pods.
To be safe, I restarted the standby side first.
$ kubectl -n rook-ceph exec -it rook-ceph-tools-6dd7c595bc-xvk92 ceph status
...
services:
mon: 3 daemons, quorum bm,bs,bu (age 15m)
mgr: a(active, since 13m)
mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
osd: 4 osds: 4 up (since 60m), 4 in (since 5w)
...
## "a=up:active"の表示から、b側をまず再起動しておく
$ kubectl -n rook-ceph delete pod rook-ceph-mds-myfs-b-78ccfc6754-jff2f
## after confirming with kubectl -n rook-ceph get pod that myfs-b is running normally, restart the a side
$ kubectl -n rook-ceph delete pod rook-ceph-mds-myfs-a-87448c57b-lp7hl
With this procedure the MDS daemons restarted cleanly, and the filesystem became accessible again.
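The active/standby-replay roles can also be checked directly before deleting any pods; a minimal check from the ToolBox would be something like:
$ kubectl -n rook-ceph exec -it rook-ceph-tools-6dd7c595bc-xvk92 -- ceph mds stat
## expected output along the lines of: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay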
Dealing with HEALTH_ERR
By the time I noticed, a whole node had stopped working and various Pods were stuck Pending or Terminating; after sorting that out, ceph status was reporting an error.
First, I rebooted the whole machine with "shutdown -r now" to clear the initial node failure. I have some guesses about the root cause of that part, but did not dig into it.
This is a memo about the error that kept showing up afterwards.
Error details
Before the reboot there was an odd mix of Terminating and Pending Pods, but after the reboot I noticed that Rook/Ceph was reporting something strange.
$ kubectl -n rook-ceph exec -it ... -- ceph status
cluster:
id: 3cedbf9c-82a4-4ce8-988b-9e64b1b3d97a
health: HEALTH_ERR
Module 'rook' has failed: HTTPSConnectionPool(host='10.233.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/rook-ceph/pods (Caused by ProtocolError('Connection aborted.', error(99, 'Cannot assign requested address')))
services:
mon: 3 daemons, quorum bm,bs,bu (age 104s)
mgr: a(active, since 5m)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 4 osds: 4 up (since 112m), 4 in (since 8w)
data:
pools: 3 pools, 300 pgs
objects: 2.01M objects, 11 GiB
usage: 970 GiB used, 13 TiB / 14 TiB avail
pgs: 300 active+clean
io:
client: 1.2 KiB/s rd, 46 KiB/s wr, 2 op/s rd, 2 op/s wr
In other words, HEALTH_ERR was being reported even though the MDS/OSD daemons themselves were running fine.
After some trial and error, I restarted the mgr pod.
$ kubectl -n rook-ceph delete replicaset.apps/rook-ceph-mgr-a-7644684465
replicaset.apps "rook-ceph-mgr-a-7644684465" deleted
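Deleting the mgr pod by its label has the same effect and avoids looking up the ReplicaSet name; a possible one-liner, assuming the standard app=rook-ceph-mgr label:
$ kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr   ## the Deployment recreates the mgr pod immediately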
After a while I checked ceph status again and confirmed that the problem had been corrected.
$ kubectl -n rook-ceph exec -it ... -- ceph status
cluster:
id: 3cedbf9c-82a4-4ce8-988b-9e64b1b3d97a
health: HEALTH_OK
...
In fact the mds pods had also logged some errors, so after checking the logs of all the running Pods I restarted those too, starting from the standby side.
Since this does not address the root cause, I cannot say whether disabling and re-enabling the failed module, or some other approach, would have been the right response; still, the issue is closed for now, which is good enough.
Memo on a race condition triggered by shutting down the entire cluster
This memo is not a recovery record. It only documents what I tried on a test cluster and what happened. If you need to preserve data, look for another approach.
[Update 2021/08/18]
The root cause of this phenomenon is that /dev/sdb2, /dev/sdc2 and so on were turned into bluestore partitions. The official documentation has since been updated to say that this occurs on v1.6.0 and later and to recommend upgrading to v1.6.8 or later. The fix is to upgrade to v1.6.8+, remove the affected OSDs, wipe the problem disks, and have them re-detected to get back to a clean state.
https://rook.github.io/docs/rook/v1.6/ceph-common-issues.html#unexpected-partitions-created
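For reference, wiping a disk so that Rook rebuilds the OSD cleanly generally follows the "zapping devices" steps from the Rook cleanup documentation; roughly the following, run on the node that owns the disk after the OSD has been removed (adjust the device name; this is a sketch, not the exact commands I ran):
$ DISK=/dev/sdb                                                        ## the disk whose stray partitions became bluestore
$ sudo sgdisk --zap-all "$DISK"                                        ## wipe the partition table
$ sudo dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync   ## clear leftover bluestore signatures at the start of the disk
$ sudo partprobe "$DISK"                                               ## have the kernel re-read the now-empty partition table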
In July 2021, after shutting down the entire test k8s cluster for maintenance and then starting it back up, the following situation occurred.
It is a bit verbose, but I have also included the BASH helper functions I use for this kind of maintenance.
Race conditions of this kind have been reported before and are tracked in the issues below. There are also notes saying the problem does not occur with cri-o, but within the scope of my tests on VMware I did not observe the libceph errors either.
- https://github.com/rook/rook/issues/7734
- https://github.com/rook/rook/issues/2517
What happened this time is the raw-device problem described in the first issue.
Across the cluster, each node has two 4TB HDDs, for a total of 32TB of storage.
$ lsblk -l
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 55.4M 1 loop /snap/core18/1944
loop1 7:1 0 55.5M 1 loop /snap/core18/2074
loop2 7:2 0 67.6M 1 loop /snap/lxd/20326
loop3 7:3 0 32.3M 1 loop /snap/snapd/12159
loop5 7:5 0 32.3M 1 loop /snap/snapd/12398
loop6 7:6 0 70.3M 1 loop /snap/lxd/21029
sda 8:0 0 465.8G 0 disk
sda1 8:1 0 512M 0 part
sda2 8:2 0 465.3G 0 part /
sdb 8:16 0 3.7T 0 disk
sdb2 8:18 0 48G 0 part
sdb3 8:19 0 6.7M 0 part
sdc 8:32 0 3.7T 0 disk
sdc2 8:34 0 48G 0 part
sdc3 8:35 0 6.2M 0 part
sr0 11:0 1 1024M 0 rom
$ lsblk -f
NAME FSTYPE LABEL UUID FSAVAIL FSUSE% MOUNTPOINT
loop0 squashfs 0 100% /snap/core18/1944
loop1 squashfs 0 100% /snap/core18/2074
loop2 squashfs 0 100% /snap/lxd/20326
loop3 squashfs 0 100% /snap/snapd/12159
loop5 squashfs 0 100% /snap/snapd/12398
loop6 squashfs 0 100% /snap/lxd/21029
sda
├─sda1 vfat 5AD4-66D1
└─sda2 ext4 42db9b49-9bb0-4590-a0f5-846f1a74058b 420.2G 3% /
sdb ceph_bluestore
├─sdb2 ceph_bluestore
└─sdb3
sdc ceph_bluestore
├─sdc2 ceph_bluestore
└─sdc3
sr0
The intent is for /dev/sdb as a whole to be a ceph_bluestore device, so the partitions /dev/sdb{2,3} should not exist; they are unexpected.
$ type get_toolbox_podname
get_toolbox_podname is a function
get_toolbox_podname ()
{
sudo kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}'
}
$ type ceph_status
ceph_status is a function
ceph_status ()
{
name=$(get_toolbox_podname);
sudo kubectl -n rook-ceph exec -it "${name}" -- ceph status
}
## the log starts here
$ ceph_status
cluster:
id: 454d7e30-a1f5-4369-b1e5-6b9b2700054c
health: HEALTH_ERR
4/507 objects unfound (0.789%)
1487 scrub errors
Too many repaired reads on 5 OSDs
Possible data damage: 4 pgs recovery_unfound, 26 pgs inconsistent
Degraded data redundancy: 12/1521 objects degraded (0.789%), 4 pgs degraded
3 pgs not deep-scrubbed in time
4 pgs not scrubbed in time
6 slow ops, oldest one blocked for 1584994 sec, daemons [osd.0,osd.1,osd.2,osd.4,osd.5] have slow ops.
services:
mon: 3 daemons, quorum a,b,c (age 2w)
mgr: a(active, since 2w)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 16 osds: 16 up (since 2w), 16 in (since 2w)
rgw: 1 daemon active (my.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 507 objects, 686 MiB
usage: 19 GiB used, 29 TiB / 29 TiB avail
pgs: 12/1521 objects degraded (0.789%)
4/507 objects unfound (0.789%)
148 active+clean
13 active+clean+inconsistent
12 active+clean+inconsistent+failed_repair
3 active+recovery_unfound+degraded
1 active+recovery_unfound+degraded+inconsistent+failed_repair
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
This log shows the state several days after the failure occurred.
To clean up, I deleted all the PVCs, but even after about eight hours the deletion had not completed, and the log changed to the following.
$ ceph_status
cluster:
id: 454d7e30-a1f5-4369-b1e5-6b9b2700054c
health: HEALTH_ERR
6/507 objects unfound (1.183%)
1213 scrub errors
Too many repaired reads on 5 OSDs
Possible data damage: 6 pgs recovery_unfound, 20 pgs inconsistent
Degraded data redundancy: 18/1521 objects degraded (1.183%), 6 pgs degraded
3 pgs not deep-scrubbed in time
4 pgs not scrubbed in time
8 slow ops, oldest one blocked for 1614550 sec, daemons [osd.0,osd.1,osd.2,osd.4,osd.5] have slow ops.
services:
mon: 3 daemons, quorum a,b,c (age 2w)
mgr: a(active, since 2w)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 16 osds: 16 up (since 2w), 16 in (since 2w)
rgw: 1 daemon active (my.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 507 objects, 686 MiB
usage: 19 GiB used, 29 TiB / 29 TiB avail
pgs: 18/1521 objects degraded (1.183%)
6/507 objects unfound (1.183%)
154 active+clean
10 active+clean+inconsistent+failed_repair
7 active+clean+inconsistent
3 active+recovery_unfound+degraded
3 active+recovery_unfound+degraded+inconsistent+failed_repair
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
The PVC state at that point was as follows: only the PVCs used by Rook/Cassandra, which happened to be deleted last, were left. The deletion procedure itself followed the documentation.
$ sudo kubectl get pvc --all-namespaces
NAMESPACE        NAME                                                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
rook-cassandra   rook-cassandra-data-rook-cassandra-us-east-1-us-east-1a-0   Terminating   pvc-77d3d697-edd5-4633-8b81-a8474eb94043   5Gi        RWO            rook-ceph-block   26d
rook-cassandra   rook-cassandra-data-rook-cassandra-us-east-1-us-east-1a-1   Terminating   pvc-c6bc0341-ba1d-4419-b280-c143edf240bb   5Gi        RWO            rook-ceph-block   26d
rook-cassandra   rook-cassandra-data-rook-cassandra-us-east-1-us-east-1a-2   Terminating   pvc-c4261458-7d46-433a-a05f-7c94e93b7034   5Gi        RWO            rook-ceph-block   26d
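For PVCs stuck in Terminating like these, checking the finalizers shows what is still blocking deletion; a sketch (forcibly removing finalizers orphans the backing image, so it is only reasonable on a throwaway cluster like this one):
$ sudo kubectl -n rook-cassandra get pvc rook-cassandra-data-rook-cassandra-us-east-1-us-east-1a-0 -o jsonpath='{.metadata.finalizers}{"\n"}'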
At this point the output of ceph health detail was a veritable parade of errors.
Since all the PVs had been deleted anyway, I decided to try a few different things.
# ceph health detail
HEALTH_ERR 1/397 objects unfound (0.252%); 920 scrub errors; Too many repaired reads on 5 OSDs; Possible data damage: 1 pg recovery_unfound, 19 pgs inconsistent; Degraded data redundancy: 3/1191 objects degraded (0.252%), 1 pg degraded; 1 pgs not deep-scrubbed in time; 1 pgs not scrubbed in time; 4 slow ops, oldest one blocked for 1626835 sec, daemons [osd.1,osd.2,osd.5] have slow ops.
[WRN] OBJECT_UNFOUND: 1/397 objects unfound (0.252%)
pg 2.1a has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 920 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 5 OSDs
osd.3 had 16 reads repaired
osd.1 had 18 reads repaired
osd.7 had 31 reads repaired
osd.2 had 32 reads repaired
osd.4 had 34 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 19 pgs inconsistent
pg 2.1 is active+clean+inconsistent+failed_repair, acting [7,0,6]
pg 2.3 is active+clean+inconsistent+failed_repair, acting [3,5,6]
pg 2.5 is active+clean+inconsistent, acting [6,4,1]
pg 2.7 is active+clean+inconsistent+failed_repair, acting [1,7,4]
pg 2.8 is active+clean+inconsistent+failed_repair, acting [3,2,0]
pg 2.a is active+clean+inconsistent+failed_repair, acting [2,4,7]
pg 2.11 is active+clean+inconsistent+failed_repair, acting [7,1,4]
pg 2.18 is active+clean+inconsistent+failed_repair, acting [3,1,2]
pg 2.1a is active+recovery_unfound+degraded, acting [5,6,4], 1 unfound
pg 2.1b is active+clean+inconsistent+failed_repair, acting [6,5,3]
pg 2.1c is active+clean+inconsistent+failed_repair, acting [2,1,0]
pg 2.1d is active+clean+inconsistent+failed_repair, acting [3,6,5]
pg 2.1e is active+clean+inconsistent+failed_repair, acting [4,6,3]
pg 2.1f is active+clean+inconsistent+failed_repair, acting [4,1,2]
pg 11.6 is active+clean+inconsistent, acting [3,0,1]
pg 11.9 is active+clean+inconsistent, acting [4,1,7]
pg 11.c is active+clean+inconsistent, acting [7,4,1]
pg 11.14 is active+clean+inconsistent, acting [1,4,3]
pg 11.1b is active+clean+inconsistent, acting [1,2,0]
pg 11.1d is active+clean+inconsistent, acting [1,11,3]
[WRN] PG_DEGRADED: Degraded data redundancy: 3/1191 objects degraded (0.252%), 1 pg degraded
pg 2.1a is active+recovery_unfound+degraded, acting [5,6,4], 1 unfound
[WRN] PG_NOT_DEEP_SCRUBBED: 1 pgs not deep-scrubbed in time
pg 2.1a not deep-scrubbed since 2021-06-24T13:46:01.226873+0000
[WRN] PG_NOT_SCRUBBED: 1 pgs not scrubbed in time
pg 2.1a not scrubbed since 2021-06-30T19:52:31.104830+0000
[WRN] SLOW_OPS: 4 slow ops, oldest one blocked for 1626835 sec, daemons [osd.1,osd.2,osd.5] have slow ops.
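For the individual active+clean+inconsistent PGs, a targeted repair can also be requested instead of repairing every OSD; a minimal example for one of the PGs listed above:
# ceph pg repair 2.5      ## ask the primary OSD of pg 2.5 to repair its inconsistent replicas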
For the PG with unfound objects, I dealt with it using mark_unfound_lost.
# ceph pg 2.1a mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking
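For the record, the objects being given up on can be listed before marking them lost; something like:
# ceph pg 2.1a list_unfound      ## show which objects in pg 2.1a are unfound and which OSDs might still have them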
For OSD_TOO_MANY_REPAIRS, the Red Hat documentation describes an option called clear_shards_repaired that can apparently force-clear the message, but it does not seem to be available in plain upstream ceph.
I then tried repairing the OSDs.
# ceph osd repair all
instructed osd(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 to repair
With that, the number of errors decreased and the state became the following.
# ceph health detail
HEALTH_ERR 864 scrub errors; Too many repaired reads on 5 OSDs; Possible data damage: 6 pgs inconsistent; 3 slow ops, oldest one blocked for 1627269 sec, daemons [osd.1,osd.2] have slow ops.
[ERR] OSD_SCRUB_ERRORS: 864 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 5 OSDs
osd.3 had 16 reads repaired
osd.1 had 18 reads repaired
osd.7 had 31 reads repaired
osd.2 had 32 reads repaired
osd.4 had 35 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
pg 11.6 is active+clean+inconsistent, acting [3,0,1]
pg 11.9 is active+clean+inconsistent, acting [4,1,7]
pg 11.c is active+clean+inconsistent, acting [7,4,1]
pg 11.14 is active+clean+inconsistent, acting [1,4,3]
pg 11.1b is active+clean+inconsistent, acting [1,2,0]
pg 11.1d is active+clean+inconsistent, acting [1,11,3]
[WRN] SLOW_OPS: 3 slow ops, oldest one blocked for 1627269 sec, daemons [osd.1,osd.2] have slow ops.
After a while all of the PG_DAMAGED errors disappeared, leaving only OSD_TOO_MANY_REPAIRS.
# ceph health detail
HEALTH_WARN Too many repaired reads on 5 OSDs; 3 slow ops, oldest one blocked for 1627335 sec, daemons [osd.1,osd.2] have slow ops.
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 5 OSDs
osd.3 had 16 reads repaired
osd.1 had 18 reads repaired
osd.7 had 31 reads repaired
osd.2 had 32 reads repaired
osd.4 had 35 reads repaired
[WRN] SLOW_OPS: 3 slow ops, oldest one blocked for 1627335 sec, daemons [osd.1,osd.2] have slow ops.
# ceph status
cluster:
id: 454d7e30-a1f5-4369-b1e5-6b9b2700054c
health: HEALTH_WARN
Too many repaired reads on 5 OSDs
3 slow ops, oldest one blocked for 1627340 sec, daemons [osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum a,b,c (age 2w)
mgr: a(active, since 2w)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 16 osds: 16 up (since 2w), 16 in (since 2w)
rgw: 1 daemon active (my.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 343 objects, 184 MiB
usage: 18 GiB used, 29 TiB / 29 TiB avail
pgs: 177 active+clean
io:
client: 5.7 KiB/s rd, 426 B/s wr, 6 op/s rd, 2 op/s wr
I decided to try cleaning this up with "osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]".
# ceph osd down osd.1
marked down osd.1.
# ceph osd down osd.2
marked down osd.2.
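For reference, the purge form mentioned above, which was not actually run in this session, would look roughly like this:
# ceph osd purge 1 --yes-i-really-mean-it     ## removes the OSD from the CRUSH map and deletes its auth key and OSD entry in one step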
# ceph status
cluster:
id: 454d7e30-a1f5-4369-b1e5-6b9b2700054c
health: HEALTH_WARN
Too many repaired reads on 4 OSDs
2 slow ops, oldest one blocked for 1627677 sec, osd.2 has slow ops
services:
mon: 3 daemons, quorum a,b,c (age 2w)
mgr: a(active, since 2w)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 16 osds: 16 up (since 76s), 16 in (since 2w)
rgw: 1 daemon active (my.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 343 objects, 184 MiB
usage: 18 GiB used, 29 TiB / 29 TiB avail
pgs: 177 active+clean
io:
client: 3.5 KiB/s rd, 511 B/s wr, 4 op/s rd, 3 op/s wr
When installing Rook/Ceph created a large number of OSDs
When Rook/Ceph was introduced on a newly built k8s cluster, each server was supposed to end up with a single BlueStore OSD (one HDD), but six OSDs were created.
The USB drive used for the OS AutoInstall had been left plugged in when the machines booted. I had assumed the ISO9660 image would be treated as a CD-ROM and ignored, but it was recognized as /dev/sdc and an OSD happily started on it.
Naturally, once the USB stick was reclaimed, /dev/sdc itself no longer existed, so those OSDs failed to start and went into error.
$ sudo kubectl -n rook-ceph get pod -l osd
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-5557b5c888-lj9x8 1/1 Running 2 (16h ago) 6d15h
rook-ceph-osd-1-66bcb89669-ch5bx 1/1 Running 2 (16h ago) 6d15h
rook-ceph-osd-2-9b8b5ff5b-44kbz 1/1 Running 2 (16h ago) 6d15h
rook-ceph-osd-3-59bcb68f97-g9nl9 0/1 Init:CrashLoopBackOff 177 (24s ago) 14h
rook-ceph-osd-4-6745cf4d86-7vnht 0/1 Init:CrashLoopBackOff 192 (3m4s ago) 6d15h
rook-ceph-osd-5-f6cc674f4-6sp5q 1/1 Running 2 (16h ago) 6d15h
rook-ceph-osd-6-f776bf965-xqbrg 0/1 Init:CrashLoopBackOff 175 (3m20s ago) 14h
Environment
- Rook v1.9.10 (ceph 16.2.10)
- Kubernetes v1.23.7
Procedure
Since these OSDs are known to be unnecessary, hold no BlueStore data, and never had PGs built on them, the work proceeds on the assumption that no PVs are affected.
Following the official guides "Ceph OSD Management" and "Ceph Configuration", I simply removed the unneeded OSDs (IDs 3, 4, 6) with the steps below:
- Stop the operator
- Run ceph osd down from the ToolBox
- Run ceph osd out / crush remove / auth del / rm from the ToolBox
- Delete the Deployment objects corresponding to the OSDs (if removeOSDsIfOutAndSafeToRemove is not set to true)
- Restore the operator
Stopping the operator
$ sudo kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Running ceph osd down from the ToolBox
$ sudo kubectl -n rook-ceph exec -it rook-ceph-tools-7d59b6df78-tlqrb -- bash
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd down osd.3
osd.3 is already down.
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd down osd.4
osd.4 is already down.
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd down osd.6
osd.6 is already down
Running ceph osd out / crush remove / auth del / rm from the ToolBox
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd out osd.6
marked out osd.6.
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd out osd.4
osd.4 is already out.
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd out osd.3
osd.3 is already out.
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd crush remove osd.4
removed item id 4 name 'osd.4' from crush map
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd crush remove osd.6
removed item id 6 name 'osd.6' from crush map
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph auth del osd.3
updated
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph auth del osd.4
updated
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph auth del osd.6
updated
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd rm osd.3
removed osd.3
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd rm osd.4
removed osd.4
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ ceph osd rm osd.6
removed osd.6
[rook@rook-ceph-tools-7d59b6df78-tlqrb /]$ exit
exit
$
Deleting the corresponding Deployment objects
$ sudo kubectl -n rook-ceph delete deploy rook-ceph-osd-3
deployment.apps "rook-ceph-osd-3" deleted
$ sudo kubectl -n rook-ceph delete deploy rook-ceph-osd-4
deployment.apps "rook-ceph-osd-4" deleted
$ sudo kubectl -n rook-ceph delete deploy rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
Restoring the operator
$ sudo kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
deployment.apps/rook-ceph-operator scaled
Checking the result
Entering the ToolBox again and checking ceph status:
cluster:
id: 8c9c83ef-4694-42b9-8cc7-cbb4a529e384
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 17h)
mgr: a(active, since 17h), standbys: b
osd: 4 osds: 4 up (since 17h), 4 in (since 41m); 2 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 9 pools, 113 pgs
objects: 2.92k objects, 7.1 GiB
usage: 16 GiB used, 3.6 TiB / 3.6 TiB avail
pgs: 125/8769 objects misplaced (1.425%)
111 active+clean
1 active+remapped+backfilling
1 active+remapped+backfill_wait
io:
client: 7.5 KiB/s rd, 204 KiB/s wr, 1 op/s rd, 1 op/s wr
recovery: 6.5 MiB/s, 6 objects/s
The output of ceph osd tree is as follows.
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 3.63879 root default
-5 0.90970 host s10rx61
1 hdd 0.90970 osd.1 up 1.00000 1.00000
-9 0.90970 host s10rx62
5 hdd 0.90970 osd.5 up 1.00000 1.00000
-7 0.90970 host s10rx63
2 hdd 0.90970 osd.2 up 1.00000 1.00000
-3 0.90970 host s10rx64
0 hdd 0.90970 osd.0 up 1.00000 1.00000