partially removed pve node / proxmox cluster
The case of the stale (removed but not removed) PVE node in our Proxmox cluster.
On one of our virtual machine clusters, a node, pve3, had been removed on purpose, yet it was still visible in the GUI with a big red cross (because it was unavailable). This was not only ugly, but also caused problems for the node enumeration done by proxmove.
The node had been properly removed, according to the removing a cluster node documentation. Yet it was apparently still there.
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1 (local)
         2          1 pve2
         3          1 pve4
         5          1 pve5
This listing looked fine: pve3 (nodeid 4) was absent. And all remaining nodes showed the same info.
But a quick grep through /etc did turn up some references to pve3:
# grep pve3 /etc/* -rl
/etc/corosync/corosync.conf
/etc/pve/.version
/etc/pve/.members
/etc/pve/corosync.conf
Those two corosync.conf config files were in sync, both with each other and with the copies on the other three nodes. But they still contained a reference to the removed node:
nodelist {
  ...
  node {
    name: pve3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.x.x.x
  }
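Whether those two copies really are in sync can be confirmed with a plain diff. A minimal sketch, run here against throwaway sample files rather than the live configs:

```shell
# On an actual node you would compare the two live copies:
#   diff /etc/corosync/corosync.conf /etc/pve/corosync.conf
# Demonstrated here on sample files carrying the same stale entry:
printf 'node {\n  name: pve3\n  nodeid: 4\n}\n' > /tmp/corosync-a.conf
cp /tmp/corosync-a.conf /tmp/corosync-b.conf
# diff is silent and exits 0 when the files are identical.
diff /tmp/corosync-a.conf /tmp/corosync-b.conf && echo 'in sync'
```

An empty diff (exit status 0) means the pmxcfs-managed copy and the local corosync copy agree, so any stale entry is at least consistent across both.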
The .version and .members JSON files were different, albeit similar, on all nodes. They all included 5 nodes (one too many):
# cat /etc/pve/.members
{
"nodename": "pve1",
"version": 77,
"cluster": { "name": "my-clustername", "version": 6, "nodes": 5, "quorate": 1 },
"nodelist": {
  "pve1": { "id": 1, "online": 1, "ip": "10.x.x.x"},
  "pve2": { "id": 2, "online": 1, "ip": "10.x.x.x"},
  "pve3": { "id": 4, "online": 0, "ip": "10.x.x.x"},
  "pve4": { "id": 3, "online": 1, "ip": "10.x.x.x"},
  "pve5": { "id": 5, "online": 1, "ip": "10.x.x.x"}
}
}
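Stale entries in that file can be picked out mechanically: the nodes marked "online": 0 are the suspects. A minimal sketch using grep on a sample copy of the JSON (hypothetical data; on a live node you would read /etc/pve/.members directly):

```shell
# Sample copy of the .members JSON (made-up names and IPs);
# on a real node, read /etc/pve/.members instead.
cat > /tmp/members.json <<'EOF'
{
"nodelist": {
  "pve1": { "id": 1, "online": 1, "ip": "10.0.0.1"},
  "pve3": { "id": 4, "online": 0, "ip": "10.0.0.3"}
}
}
EOF
# Lines with "online": 0 are the unreachable/stale nodes.
grep '"online": 0' /tmp/members.json
```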
The document versions were all a bit different, but the cluster versions were the same between the nodes. Except for one node, on which the cluster version was 5 instead of 6.
Restarting corosync on that node fixed that problem: the cluster versions were now 6 everywhere.
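The odd node out can be found by comparing the cluster "version" field from each node's .members file. A sketch on sample data (hypothetical files; on real nodes you would fetch /etc/pve/.members from each host, e.g. over ssh):

```shell
# Hypothetical .members excerpts from two nodes; note the differing
# cluster "version" fields (6 vs 5).
echo '"cluster": { "name": "c", "version": 6, "nodes": 5, "quorate": 1 }' > /tmp/pve1.members
echo '"cluster": { "name": "c", "version": 5, "nodes": 5, "quorate": 1 }' > /tmp/pve2.members
for f in /tmp/pve1.members /tmp/pve2.members; do
  # Extract the numeric cluster version from the JSON line.
  ver=$(sed -n 's/.*"version": \([0-9]*\),.*/\1/p' "$f")
  echo "$f: cluster version $ver"
done
# The node reporting the lower version is the one that needed:
#   systemctl restart corosync
```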
With that problem tackled, it was a matter of:
# pvecm expected 4
# pvecm delnode pve3
Killing node 4
All right! Even though pvecm nodes did not list nodeid 4 in its output, delnode did find the right one. And this properly removed all traces of pve3 from the remaining files, making the cluster happy again.
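As a final sanity check, the grep from earlier should now come up empty. A sketch on a throwaway directory standing in for /etc after the cleanup:

```shell
# Stand-in for /etc once pve3 is gone: no file mentions it anymore.
mkdir -p /tmp/etc-clean
printf 'nodelist {\n}\n' > /tmp/etc-clean/corosync.conf
# grep -rl exits non-zero when nothing matches, so the echo fires.
grep -rl pve3 /tmp/etc-clean || echo 'no traces left'
```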