umount -l / needs --make-slave
The other day I learned, the hard way, that umount -l can be dangerous. Using the --make-slave mount option makes it safer.
The scenario went like this:
A virtual machine on our Proxmox VE cluster wouldn't boot. No biggie, I thought. Just mount the filesystem on the host and do a proper grub-install from a chroot:
# fdisk -l /dev/zvol/zl-pve2-ssd1/vm-215-disk-3
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 * 2048 124999679 124997632 59.6G 83 Linux
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p2 124999680 125827071 827392 404M 82 Linux swap / Solaris
# mount /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 /mnt/root
# cd /mnt/root
# for x in dev proc sys; do mount --rbind /$x $x; done
# chroot /mnt/root
There I could run the commands needed to repair the boot loader.
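In this case that meant, roughly, reinstalling grub and regenerating its config. The exact commands below are a sketch: the grub-install target assumes the guest boots from the whole-disk zvol, and update-grub assumes a Debian-style guest:
# grub-install /dev/zvol/zl-pve2-ssd1/vm-215-disk-3
# update-grub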
All done? Exit the chroot, unmount and start the VM:
# logout
# umount -l /mnt/root
# qm start 215
And at that point, things started failing miserably.
You see, in my laziness, I used umount -l instead of four separate umounts: /mnt/root/dev, /mnt/root/proc, /mnt/root/sys and lastly /mnt/root. What I was unaware of, was that there were mounts inside dev, proc and sys too, and that those also got unmounted. Worse: because --rbind had left those copies shared with the originals, the umounts propagated back to the host and took the real mounts under /dev, /proc and /sys with them.
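In hindsight, findmnt could have warned me. Its PROPAGATION column (available in current util-linux) shows whether a mount is still shared with its peers; for the rbind copies it would have looked something like this:
# findmnt -o TARGET,PROPAGATION /mnt/root/dev/pts
TARGET             PROPAGATION
/mnt/root/dev/pts  shared
Anything marked shared passes mount and umount events back to its peer group, in this case the host's own /dev/pts.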
And that led to an array of failures:
systemd complained about binfmt_misc.automount breakage:
systemd[1]: proc-sys-fs-binfmt_misc.automount: Got invalid poll event 16 on pipe (fd=44)
systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed with result 'resources'.
pvedaemon could not bring up any VMs:
pvedaemon[32825]: <root@pam> starting task qmstart:215:root@pam:
pvedaemon[46905]: start VM 215: UPID:pve2:ID:qmstart:215:root@pam:
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
No such file or directory
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
No such file or directory
systemd[1]: 215.scope: Failed to add PIDs to scope's control group:
No such file or directory
systemd[1]: 215.scope: Failed with result 'resources'.
systemd[1]: Failed to start 215.scope.
pvedaemon[46905]: start failed: systemd job failed
pvedaemon[32825]: <root@pam> end task qmstart:215:root@pam:
start failed: systemd job failed
The root user's runtime directory could not be auto-created:
systemd[1]: user-0.slice: Failed to create cgroup
/user.slice/user-0.slice: No such file or directory
systemd[1]: Created slice User Slice of UID 0.
systemd[1]: user-0.slice: Failed to create cgroup
/user.slice/user-0.slice: No such file or directory
systemd[1]: Starting User Runtime Directory /run/user/0...
systemd[4139]: user-runtime-dir@0.service: Failed to attach to
cgroup /user.slice/user-0.slice/user-runtime-dir@0.service:
No such file or directory
systemd[4139]: user-runtime-dir@0.service:
Failed at step CGROUP spawning /lib/systemd/systemd-user-runtime-dir:
No such file or directory
The Proxmox VE replication runner failed to start:
systemd[1]: pvesr.service: Failed to create cgroup
/system.slice/pvesr.service: No such file or directory
systemd[1]: Starting Proxmox VE replication runner...
systemd[5538]: pvesr.service: Failed to attach to cgroup
/system.slice/pvesr.service: No such file or directory
systemd[5538]: pvesr.service: Failed at step CGROUP spawning
/usr/bin/pvesr: No such file or directory
systemd[1]: pvesr.service: Main process exited, code=exited,
status=219/CGROUP
systemd[1]: pvesr.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Proxmox VE replication runner.
And, worst of all, new ssh logins to the host machine failed:
sshd[24551]: pam_systemd(sshd:session):
Failed to create session: Connection timed out
sshd[24551]: error: openpty: No such file or directory
sshd[31553]: error: session_pty_req: session 0 alloc failed
sshd[31553]: Received disconnect from 10.x.x.x port 55190:11:
disconnected by user
As you understand by now, this was my own doing, and it was caused by various missing mount points.
The failing ssh? A missing /dev/pts. Most of the other failures? Mounts missing under /sys/fs/cgroup.
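Both were easy to confirm with mountpoint(1), the same check the repair commands below lean on; on the broken host it reported along the lines of:
# mountpoint /dev/pts
/dev/pts is not a mountpoint
# mountpoint /sys/fs/cgroup
/sys/fs/cgroup is not a mountpoint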
Fixing
First order of business was to get this machine to behave again. Luckily I had a different machine where I could take a peek at what was supposed to be mounted.
On the other machine, I ran this one-liner:
$ mount | sed -e '/ on \/\(dev\|proc\|sys\)\//!d
s#^\([^ ]*\) on \([^ ]*\) type \([^ ]*\) (\([^)]*\)).*#'\
'mountpoint -q \2 || '\
'( mkdir -p \2; mount -n -t \3 \1 -o \4 \2 || rmdir \2 )#' |
sort -V
That turned each matching line of mount output into a guarded mount command, giving output I could paste straight into the one ssh session I still had at my disposal:
mountpoint -q /dev/hugepages || ( mkdir -p /dev/hugepages; mount -n -t hugetlbfs hugetlbfs -o rw,relatime,pagesize=2M /dev/hugepages || rmdir /dev/hugepages )
mountpoint -q /dev/mqueue || ( mkdir -p /dev/mqueue; mount -n -t mqueue mqueue -o rw,relatime /dev/mqueue || rmdir /dev/mqueue )
mountpoint -q /dev/pts || ( mkdir -p /dev/pts; mount -n -t devpts devpts -o rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 /dev/pts || rmdir /dev/pts )
mountpoint -q /dev/shm || ( mkdir -p /dev/shm; mount -n -t tmpfs tmpfs -o rw,nosuid,nodev,inode64 /dev/shm || rmdir /dev/shm )
mountpoint -q /proc/sys/fs/binfmt_misc || ( mkdir -p /proc/sys/fs/binfmt_misc; mount -n -t autofs systemd-1 -o rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=45161 /proc/sys/fs/binfmt_misc || rmdir /proc/sys/fs/binfmt_misc )
mountpoint -q /sys/fs/bpf || ( mkdir -p /sys/fs/bpf; mount -n -t bpf none -o rw,nosuid,nodev,noexec,relatime,mode=700 /sys/fs/bpf || rmdir /sys/fs/bpf )
mountpoint -q /sys/fs/cgroup || ( mkdir -p /sys/fs/cgroup; mount -n -t tmpfs tmpfs -o ro,nosuid,nodev,noexec,mode=755,inode64 /sys/fs/cgroup || rmdir /sys/fs/cgroup )
mountpoint -q /sys/fs/cgroup/blkio || ( mkdir -p /sys/fs/cgroup/blkio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,blkio /sys/fs/cgroup/blkio || rmdir /sys/fs/cgroup/blkio )
mountpoint -q /sys/fs/cgroup/cpuset || ( mkdir -p /sys/fs/cgroup/cpuset; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset /sys/fs/cgroup/cpuset || rmdir /sys/fs/cgroup/cpuset )
mountpoint -q /sys/fs/cgroup/cpu,cpuacct || ( mkdir -p /sys/fs/cgroup/cpu,cpuacct; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct || rmdir /sys/fs/cgroup/cpu,cpuacct )
mountpoint -q /sys/fs/cgroup/devices || ( mkdir -p /sys/fs/cgroup/devices; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,devices /sys/fs/cgroup/devices || rmdir /sys/fs/cgroup/devices )
mountpoint -q /sys/fs/cgroup/freezer || ( mkdir -p /sys/fs/cgroup/freezer; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,freezer /sys/fs/cgroup/freezer || rmdir /sys/fs/cgroup/freezer )
mountpoint -q /sys/fs/cgroup/hugetlb || ( mkdir -p /sys/fs/cgroup/hugetlb; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,hugetlb /sys/fs/cgroup/hugetlb || rmdir /sys/fs/cgroup/hugetlb )
mountpoint -q /sys/fs/cgroup/memory || ( mkdir -p /sys/fs/cgroup/memory; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,memory /sys/fs/cgroup/memory || rmdir /sys/fs/cgroup/memory )
mountpoint -q /sys/fs/cgroup/net_cls,net_prio || ( mkdir -p /sys/fs/cgroup/net_cls,net_prio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio || rmdir /sys/fs/cgroup/net_cls,net_prio )
mountpoint -q /sys/fs/cgroup/perf_event || ( mkdir -p /sys/fs/cgroup/perf_event; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,perf_event /sys/fs/cgroup/perf_event || rmdir /sys/fs/cgroup/perf_event )
mountpoint -q /sys/fs/cgroup/pids || ( mkdir -p /sys/fs/cgroup/pids; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,pids /sys/fs/cgroup/pids || rmdir /sys/fs/cgroup/pids )
mountpoint -q /sys/fs/cgroup/rdma || ( mkdir -p /sys/fs/cgroup/rdma; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,rdma /sys/fs/cgroup/rdma || rmdir /sys/fs/cgroup/rdma )
mountpoint -q /sys/fs/cgroup/systemd || ( mkdir -p /sys/fs/cgroup/systemd; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd /sys/fs/cgroup/systemd || rmdir /sys/fs/cgroup/systemd )
mountpoint -q /sys/fs/cgroup/unified || ( mkdir -p /sys/fs/cgroup/unified; mount -n -t cgroup2 cgroup2 -o rw,nosuid,nodev,noexec,relatime /sys/fs/cgroup/unified || rmdir /sys/fs/cgroup/unified )
mountpoint -q /sys/fs/fuse/connections || ( mkdir -p /sys/fs/fuse/connections; mount -n -t fusectl fusectl -o rw,relatime /sys/fs/fuse/connections || rmdir /sys/fs/fuse/connections )
mountpoint -q /sys/fs/pstore || ( mkdir -p /sys/fs/pstore; mount -n -t pstore pstore -o rw,nosuid,nodev,noexec,relatime /sys/fs/pstore || rmdir /sys/fs/pstore )
mountpoint -q /sys/kernel/config || ( mkdir -p /sys/kernel/config; mount -n -t configfs configfs -o rw,relatime /sys/kernel/config || rmdir /sys/kernel/config )
mountpoint -q /sys/kernel/debug || ( mkdir -p /sys/kernel/debug; mount -n -t debugfs debugfs -o rw,relatime /sys/kernel/debug || rmdir /sys/kernel/debug )
mountpoint -q /sys/kernel/debug/tracing || ( mkdir -p /sys/kernel/debug/tracing; mount -n -t tracefs tracefs -o rw,relatime /sys/kernel/debug/tracing || rmdir /sys/kernel/debug/tracing )
mountpoint -q /sys/kernel/security || ( mkdir -p /sys/kernel/security; mount -n -t securityfs securityfs -o rw,nosuid,nodev,noexec,relatime /sys/kernel/security || rmdir /sys/kernel/security )
Finishing touches: a few cgroup controller paths are just symlinks, which the healthy machine could list for me as well:
$ for x in /sys/fs/cgroup/*; do
test -L $x &&
echo ln -s $(readlink $x) $x
done
ln -s cpu,cpuacct /sys/fs/cgroup/cpu
ln -s cpu,cpuacct /sys/fs/cgroup/cpuacct
ln -s net_cls,net_prio /sys/fs/cgroup/net_cls
ln -s net_cls,net_prio /sys/fs/cgroup/net_prio
Running those commands returned the system to a usable state.
The real fix
Next time, I shall refrain from doing the lazy -l umount.
But, as a better solution, I'll also be adding --make-slave to the rbind mount command. Doing that ensures that an unmount in the bound locations does not unmount the original mount points:
# for x in dev proc sys; do
mount --rbind --make-slave /$x $x
done
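To double-check, the same findmnt columns from before should now report slave instead of shared for the bound copies; and if your mount version applies the flag only to the top of each tree, the recursive --make-rslave variant extends it to the submounts:
# findmnt -o TARGET,PROPAGATION /mnt/root/dev/pts
TARGET             PROPAGATION
/mnt/root/dev/pts  slave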
With --make-slave, a umount -l of your chroot path does not break your system.