A few days ago, while sitting outside drinking coffee and minding my own business, I received an email from my UPS:
Utility power not available.
2021/06/01 18:38:27
My homelab had lost both power sources 😮 The whole area had a power outage — a good opportunity for a UPS and power-down test 😛
Shutdown
26 minutes and 18 seconds after the power loss, I received another email:
The UPS batteries backup time is below the setting limit. [250 sec < 300 sec]
2021/06/01 19:04:46
The UPS had less than 5 minutes of runtime left, and a shutdown of all servers was initiated.
I’m using a Raspberry Pi running a NUT (Network UPS Tools) server to communicate with my UPS and send shutdown commands to all my servers using upsmon.
More information about the setup can be found on my homelab page.
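For reference, the glue is just NUT’s standard MONITOR directive in upsmon.conf: the Pi runs upsmon as the master, and each server runs upsmon as a slave pointing back at it. A minimal sketch (the UPS name, hostname, and credentials below are placeholders, not my actual values):
# /etc/nut/upsmon.conf on the Raspberry Pi (NUT master) -- UPS name and credentials are placeholders
MONITOR myups@localhost 1 upsmon secretpass master

# /etc/nut/upsmon.conf on each server (NUT slave), pointing at the Pi
MONITOR myups@nut-pi.lan 1 upsmon secretpass slave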
Shutting down the file server and Docker host was quick, but the two hypervisors took a long time. Epsilon shut down just in time before the UPS turned off — Alpha did not.
I’m pretty sure the running KVM virtual machines are to blame. I’ve seen before that they take a long time to stop when shutting down the host. I usually run a script to stop all running VMs before shutting the servers down; I think I need to add that to the upsmon.conf.
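The script itself is simple. Roughly something like this sketch, which asks libvirt to shut down every running guest and then waits for them to stop (the filename and timeout are my own choices, not the exact script I use):
#!/bin/bash
# stop-all-vms.sh -- rough sketch of a "stop all running KVM guests" script
TIMEOUT=120

# Ask every running libvirt guest to shut down gracefully
for vm in $(virsh list --name); do
    virsh shutdown "$vm"
done

# Wait until no guests are left running, or give up after $TIMEOUT seconds
for i in $(seq "$TIMEOUT"); do
    [ -z "$(virsh list --name)" ] && break
    sleep 1
done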
Apart from that, the shutdown went well 🙂
Power up
The power was out for about an hour. When it returned, the UPS powered up, and so did all the equipment and servers (thanks to the power on after AC loss BIOS setting). Getting back up after a complete shutdown is always interesting, and this time was no different.
- Raspberry Pi nut server didn’t establish connection with the UPS
- Servers and containers booted up before the core switch was ready
- NFS mounting failed
- Home Assistant was unable to initialize devices
- Two disks on my main ZFS pool were faulted 😮
Fixing the ZFS pool
  pool: tank0
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub in progress since Tue Jun 1 19:30:49 2021
        7.78T scanned at 891M/s, 7.32T issued at 838M/s, 39.0T total
        0B repaired, 18.78% done, 0 days 11:00:34 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        tank0                     DEGRADED     0     0     0
          raidz2-0                ONLINE       0     0     0
            sdi                   ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdh                   ONLINE       0     0     0
            sdg                   ONLINE       0     0     0
            sdd                   ONLINE       0     0     0
            sdf                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
            sde                   ONLINE       0     0     0
          raidz2-1                DEGRADED     0     0     0
            sdm                   ONLINE       0     0     0
            6835734791109053608   FAULTED      0     0     0  was /dev/sdk1
            16423333236828266746  FAULTED      0     0     0  was /dev/sdj1
            sdn                   ONLINE       0     0     0
            sdl                   ONLINE       0     0     0
            sdo                   ONLINE       0     0     0
            sdp                   ONLINE       0     0     0
            sdq                   ONLINE       0     0     0
I suspect sdk and sdj had been swapped. This is why it’s not a good idea to use /dev/sdX to identify drives in the pool.
I initially created the pool with /dev/disk/by-id, but it got switched to /dev/sdX names after I did an import. Time to switch it back.
I first added the following configuration to /etc/default/zfs:
ZPOOL_IMPORT_PATH="/dev/disk/by-id"
This tells ZFS to use /dev/disk/by-id when importing.
Then I exported and reimported the pool:
$ sudo zpool export tank0
$ sudo zpool import -d /dev/disk/by-id tank0
Voilà!
  pool: tank0
 state: ONLINE
  scan: resilvered 3.08M in 0 days 00:00:02 with 0 errors on Tue Jun 1 22:10:56 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        tank0                       ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000c500afxxxxxx  ONLINE       0     0     0
            wwn-0x5000c500b0xxxxxx  ONLINE       0     0     0
            wwn-0x5000c500b3xxxxxx  ONLINE       0     0     0
            wwn-0x50014ee265xxxxxx  ONLINE       0     0     0
            wwn-0x50014ee2b9xxxxxx  ONLINE       0     0     0
            wwn-0x50014ee211xxxxxx  ONLINE       0     0     0
            wwn-0x50014ee2b9xxxxxx  ONLINE       0     0     0
            wwn-0x50014ee265xxxxxx  ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0
            wwn-0x500003999xxxxxxx  ONLINE       0     0     0
            wwn-0x500003999xxxxxxx  ONLINE       0     0     0
            wwn-0x500003999xxxxxxx  ONLINE       0     0     0
            wwn-0x5000cca26xxxxxxx  ONLINE       0     0     0
            wwn-0x5000039a5xxxxxxx  ONLINE       0     0     0
            wwn-0x5000039a5xxxxxxx  ONLINE       0     0     0
            wwn-0x5000039a5xxxxxxx  ONLINE       0     0     0
            wwn-0x5000c500bxxxxxxx  ONLINE       0     0     0

errors: No known data errors
Those are some weird-looking IDs, but I found this reply on the ServeTheHome forums:
[…] think of wwn like a mac address (it basically is the equivalent in storage land). I’ve never seen issues using wwn. […]
[…] I find WWN more reliable because some manufacturers don’t code serial numbers properly (e.g. those white labeled Seagate Exos X16 drives) and wwns are still unique and fully serviceable. […]
Seems fine — better than /dev/sdX 🙂 I’ll be adding these WWN IDs to my disk/bay/serial sheet.
Improving power up
It’s easy to forget about optimizing for shutdown recovery because it happens so rarely. But there are some simple things I can do to improve it.
My PDU has a configurable delay between starting up each of the 8 outputs. This is to prevent everything coming on at once, causing a current spike. The delay is 1 second by default, but can be set up to 240 seconds.
By increasing this delay and reorganizing the power outputs, I can make sure that the most critical equipment starts first:
- Network (router, switches)
- File server
- Primary hypervisor
- Secondary hypervisor
- Docker host
This may also solve the issue with the NUT server not being able to communicate with the UPS.
I will also be adding the KVM shutdown script to the upsmon.conf configuration on the hypervisors:
# SHUTDOWNCMD "<command>"
#
# upsmon runs this command when the system needs to be brought down.
SHUTDOWNCMD "/sbin/shutdown -h +0"
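Something along these lines should do it: run the VM-stop script first, then power off. (The script path is just a placeholder for wherever I end up putting it.)
# Stop all running VMs before powering the host off -- script path is a placeholder
SHUTDOWNCMD "/usr/local/bin/stop-all-vms.sh; /sbin/shutdown -h +0"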
I’m hoping to get this done before the next unexpected test 😛
Kids’ reaction
Power outages are very rare where I live. So rare, in fact, that my kids, age 7, can’t remember experiencing one and don’t understand what it means.
During the outage they wanted to watch TV. I explained that the TV doesn’t work because it needs electricity, and we don’t have that right now. So they wanted to play video games — again I explained why that wasn’t possible.
They got really upset and asked: “Does everything need electricity? Is it going to be like this forever?” 😛