oVirt Virtualization Cluster – UPS with LiFePO4 Batteries and Automatic Shutdown

Preface

We have had an oVirt virtualization cluster running for quite a while on HP ProLiant hardware. Along with 2 other servers, it was powered by an APC SmartUPS 3000 with an additional battery pack. Some time ago I noticed that the battery level indicator on the UPS never reached 100%, while the add-on battery pack showed 100%. The APC software reported that the stored energy was still enough for 17 min of runtime, which seemed sufficient, since all my servers are configured to shut down 5 min after loss of the 220V power line. We have a backup diesel generator which starts within 1 min, so I assumed we still had some time to replace the UPS. I ordered a replacement, and just when it arrived, we had an accidental power outage. The whole server rack went down instantly. The APC SmartUPS log showed “too low battery voltage”. You have been warned: UPS software reports and numbers may not be correct at all.

UPS Consideration

The old APC SmartUPS 3000 was a very good device for its time and very reliable, yet it used lead-acid batteries, which need to be replaced every 4 years and have limited energy capacity. I decided to buy another online UPS, this time with LiFePO4 batteries, which have a lifespan of 10 years and a much higher energy capacity in the same server rack space compared to a lead-acid battery pack.
At the end of 2024, well-known branded UPS units with LiFePO4 batteries from APC or Orvaldi were priced at an insane level. I opted for a Chinese supplier listed on Alibaba. We have had another 15 kW UPS, made by the Chinese company East Group, running at another location since 2016. While it is quite noisy, we have had no other quality-related complaints or issues. The only service it has required is replacing the lead-acid batteries every 4 years.

New UPS

I ordered a 5 kVA (~4.5 kW) UPS and a battery pack from a company called Henan Lithium Power Source (Alibaba page, review list on Alibaba), which sells its products under the brand name “Green Batt”. The 2U 5 kVA UPS cost 386 USD and the 3U LiFePO4 battery (51.2V, 100Ah) cost 548 USD, plus shipping to Shenzhen port at 143 USD per box. Transport by sea to Europe cost 750 Euro plus customs and took 2.5 months. The battery is classified as dangerous goods, its MSDS must be approved by the transport company, and the shipping cost is generally very high. The supplier was almost one month late with delivery, so as compensation they delivered a 6 kW unit instead of the 4.5 kW (~5 kVA) one.
When the goods arrived there was good news, bad news, and great news. The build quality is good; at this price level nothing comparable was available through standard distribution channels.
Next, the bad or not-so-good news. The UPS has NO power outlet sockets at all, only M6/M8 screw terminals, so you have to do all the cabling yourself. The UPS has no web interface, only an LED display with an all-Chinese button interface (the supplier e-mailed a manual with a translation). The battery level indicator is just several green LEDs on the front panel. The UPS has several fans, but only 1 rear fan, which I don’t consider that good. Both the UPS and the battery have an RS485 interface, which I assume is used by some monitoring software, yet I suspect this software is available only in Chinese.
And now the great news. These UPS/battery modules can be stacked as parallel units, up to 6 of them (each unit needs the correct DIP switch settings on its front panel). So if you need more power later, you won’t need to replace the whole UPS system. The same applies if one of the stacked units fails for some reason. BTW, the UPS was labeled as a sine-wave inverter. I disconnected it from the 220V power line, and the server rack continued to function as if nothing had happened.
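
As a rough sanity check on what the 51.2V / 100Ah pack can deliver: nominal energy is voltage times capacity, i.e. about 5.1 kWh. A minimal sketch of a runtime estimate follows. The load, usable-capacity and inverter-efficiency figures are assumptions for illustration, not measurements from this rack.

```shell
#!/bin/bash
# Rough battery runtime estimate: energy (Wh) = voltage (V) x capacity (Ah).
# Integer arithmetic only, so the voltage is passed scaled by 10 (51.2 V -> 512).

BATT_V_X10=512        # 51.2 V, from the battery label
BATT_AH=100           # 100 Ah, from the battery label
LOAD_W=1500           # ASSUMPTION: average rack load in watts
USABLE_PCT=90         # ASSUMPTION: share of capacity the BMS lets you use
INVERTER_EFF_PCT=90   # ASSUMPTION: inverter efficiency

# estimate_runtime_min V_X10 AH LOAD_W USABLE_PCT EFF_PCT -> minutes (rounded down)
estimate_runtime_min() {
    local v_x10="$1" ah="$2" load_w="$3" usable="$4" eff="$5"
    local wh=$(( v_x10 * ah / 10 ))                       # nominal energy in Wh
    local usable_wh=$(( wh * usable / 100 * eff / 100 ))  # after derating
    echo $(( usable_wh * 60 / load_w ))
}

echo "Nominal energy: $(( BATT_V_X10 * BATT_AH / 10 )) Wh"
echo "Estimated runtime: $(estimate_runtime_min "$BATT_V_X10" "$BATT_AH" "$LOAD_W" "$USABLE_PCT" "$INVERTER_EFF_PCT") min"
```

Replace LOAD_W with the load you actually measure. The preface shows why a figure like this should be treated as an upper bound rather than a promise.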

Parallel single phase connection diagram

Automatic shutdown in case of power loss

As I mentioned before, we have a diesel generator which turns on after a power outage and needs about 1 minute to start. Since I now have a battery with much larger capacity, the updated server shutdown script waits for 30 minutes before doing the actual job (the old version waited 5 minutes). This is enough to fix simple problems with the diesel generator or its control panel module. The cron auto-shutdown script (installed on each physical server, incl. each oVirt node) pings two (not just one!) power loss detectors. I made mine with a Raspberry Pi Zero and a PoE adapter. This solution does not depend on the UPS manufacturer, model or software. In fact, it doesn’t require the UPS to have any built-in or agent software at all.
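
The idea of treating power as lost only when both detectors stop answering can be sketched as a small helper. This is an illustrative sketch, not the script used here: the function names and the 2-second ping timeout (-W 2) are my own assumptions. Requiring every detector to fail means a single crashed Pi or unplugged cable does not shut the servers down.

```shell
#!/bin/bash
# probe HOST: succeeds if HOST answers a single ping.
# ASSUMPTION: the 2-second reply timeout (-W 2) is not part of the original script.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# detector_down HOST MAX_FAIL PAUSE
# Returns 0 only after MAX_FAIL consecutive failed probes,
# returns 1 as soon as one probe succeeds.
detector_down() {
    local host="$1" max_fail="${2:-5}" pause="${3:-2}" fails=0
    while [ "$fails" -lt "$max_fail" ]; do
        if probe "$host"; then
            return 1
        fi
        fails=$(( fails + 1 ))
        if [ "$fails" -lt "$max_fail" ]; then
            sleep "$pause"   # wait before the next attempt
        fi
    done
    return 0
}

# all_detectors_down MAX_FAIL PAUSE HOST...
# Returns 0 only if every listed detector is down; an empty list counts as "not down".
all_detectors_down() {
    local max_fail="$1" pause="$2" host
    shift 2
    [ "$#" -gt 0 ] || return 1
    for host in "$@"; do
        detector_down "$host" "$max_fail" "$pause" || return 1
    done
    return 0
}

# Example call (detector addresses as used later in this post):
#   all_detectors_down 5 2 192.168.0.25 192.168.0.27 && echo "mains power appears to be lost"
```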

Auto-shutdown script

Add it as a cron job running every 5 or 10 min, depending on the capacity of your power bank.

#!/bin/bash

# nano /etc/crontab
# run every 10 minutes, change dir of "/home/andrei/scripts/" to your location.
# */10 * * * * root /home/andrei/scripts/1anvpoweroff.sh

# Power line failure detectors IPs.
GW1="192.168.0.25"
GW2="192.168.0.27"

CONFUSRD='/var/run/'
LOGFILE='/var/log/anvpoweroff.log'
GWALLDOWNFL=$CONFUSRD"gw-all-down"
SCRACTIVEFL=$CONFUSRD"anvpoweroff.run"

FAILCOUNT1=0
FAILCOUNT2=0
MAXFAILCOUNT=5
PAUSEBWPINGS=2
PAUSEBWATTMP=300 # Adjust this according to your power bank energy capacity.

NOW=$(date +'%Y-%m-%d %T')

# Create log file if it doesn't exist.
if [ ! -f $LOGFILE ]; then
    touch $LOGFILE
fi

# Check detector #1
while true; do
    if ping -c 1 "$GW1" >/dev/null 2>&1; then
        FAILCOUNT1=0 # Reset: detector answered, even if earlier pings failed.
        break
    fi
    FAILCOUNT1=$((FAILCOUNT1 + 1)) # ping exited nonzero
    if [ "$FAILCOUNT1" -ge "$MAXFAILCOUNT" ]; then
        break
    fi
    sleep "$PAUSEBWPINGS" # check again after PAUSEBWPINGS seconds
done

# Check detector #2
while true; do
    if ping -c 1 "$GW2" >/dev/null 2>&1; then
        FAILCOUNT2=0 # Reset: detector answered, even if earlier pings failed.
        break
    fi
    FAILCOUNT2=$((FAILCOUNT2 + 1)) # ping exited nonzero
    if [ "$FAILCOUNT2" -ge "$MAXFAILCOUNT" ]; then
        break
    fi
    sleep "$PAUSEBWPINGS" # check again after PAUSEBWPINGS seconds
done

if [ "$FAILCOUNT1" -ge "$MAXFAILCOUNT" ] && [ "$FAILCOUNT2" -ge "$MAXFAILCOUNT" ]; then
    echo "Both detectors are down $NOW, will check again after a pause: $PAUSEBWATTMP sec" >> "$LOGFILE"
else
    exit 0
fi

# Make 2nd ping attempt after a pause.
sleep $PAUSEBWATTMP

NOW=$(date +'%Y-%m-%d %T')
FAILCOUNT1=0
FAILCOUNT2=0

# Check detector #1
while true; do
    if ping -c 1 "$GW1" >/dev/null 2>&1; then
        FAILCOUNT1=0 # Reset: detector answered, even if earlier pings failed.
        break
    fi
    FAILCOUNT1=$((FAILCOUNT1 + 1)) # ping exited nonzero
    if [ "$FAILCOUNT1" -ge "$MAXFAILCOUNT" ]; then
        break
    fi
    sleep "$PAUSEBWPINGS" # check again after PAUSEBWPINGS seconds
done

# Check detector #2
while true; do
    if ping -c 1 "$GW2" >/dev/null 2>&1; then
        FAILCOUNT2=0 # Reset: detector answered, even if earlier pings failed.
        break
    fi
    FAILCOUNT2=$((FAILCOUNT2 + 1)) # ping exited nonzero
    if [ "$FAILCOUNT2" -ge "$MAXFAILCOUNT" ]; then
        break
    fi
    sleep "$PAUSEBWPINGS" # check again after PAUSEBWPINGS seconds
done

if [ $FAILCOUNT1 -ge $MAXFAILCOUNT ] && [ $FAILCOUNT2 -ge $MAXFAILCOUNT ]; then
    echo "Finally both detectors are down $NOW, shutting down the system" >> "$LOGFILE"
    /home/andrei/scripts/2shutdownvms.sh
    sleep 3m
    sync; echo 3 > /proc/sys/vm/drop_caches

    service sanlock stop
    service supervdsmd stop
    service vdsmd stop
    service ovirt-ha-broker stop
    service ovirt-ha-agent stop
    service nfs-client.target stop
    
    sync; echo 3 > /proc/sys/vm/drop_caches
    shutdown -P
else
    exit 0
fi

# End of 1anvpoweroff.sh
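
One thing to watch: a single run can last longer than the cron interval (it pauses between the two ping rounds and then waits for the VMs), so two copies may end up running at once. The script defines SCRACTIVEFL but never uses it; whether it was meant as a lock file is my guess. A minimal sketch of such a guard using flock(1) from util-linux, with a hypothetical helper name and an arbitrarily chosen file descriptor:

```shell
#!/bin/bash
# acquire_lock FILE: take an exclusive, non-blocking lock on FILE.
# Returns 0 if this process now holds the lock, nonzero if another run holds it
# or the file cannot be opened. File descriptor 9 is an arbitrary choice.
# The lock is released automatically when the process exits.
acquire_lock() {
    local lockfile="$1"
    exec 9>"$lockfile" || return 1
    flock -n 9
}

# Example use near the top of 1anvpoweroff.sh (ASSUMPTION: SCRACTIVEFL as lock file):
#   acquire_lock "$SCRACTIVEFL" || exit 0
```

With this in place, a second cron invocation would exit immediately while the first one is still sleeping or shutting down VMs.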

Shutdown all virtual machines script for each oVirt node.

It may need plenty of runtime if you have a large number of VMs running on each node.

#!/bin/bash
#2shutdownvms.sh

# Names of all running VMs; the URI is quoted so the shell does not treat "?" as a glob.
LIST_VM=$(virsh -c 'qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf' list | grep running | awk '{print $2}')
TIMEOUT=90
DATE=`date -R`
LOGFILE="/var/log/anvshutdownkvm.log"

# Nothing to do if no VM is running.
if [ -z "$LIST_VM" ]
	then
	exit 0
fi

for activevm in $LIST_VM
do
	PIDNO=`ps ax | grep $activevm | grep kvm | cut -c 1-6 | head -n1`
	echo "$DATE : Shutdown : $activevm : $PIDNO" >> $LOGFILE
	virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf shutdown $activevm > /dev/null
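	COUNT=0

	# Poll every 5 seconds until the VM process exits or TIMEOUT is reached.
	# COUNT=110 is a sentinel (any value above TIMEOUT) meaning "VM has stopped".
	while [ "$COUNT" -lt "$TIMEOUT" ]
	do
		ps --pid $PIDNO > /dev/null
		if [ "$?" -eq "1" ]
		then
			COUNT=110
		else
			sleep 5
			COUNT=$(($COUNT+5))
		fi
	done

	# Sentinel not set: the VM is still running after TIMEOUT seconds, so force it off.
	if [ $COUNT -lt 110 ]
	then
		echo "$DATE : $activevm : graceful shutdown failed, forcing power off" >> $LOGFILE
		virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf destroy $activevm > /dev/null
	fi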
done
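
Because the VMs are shut down one after another, the worst case for the VM script is roughly the number of VMs times TIMEOUT. Here is a back-of-the-envelope sketch of the whole budget, from mains loss to the point where the host starts powering off, using the values from the two scripts above. N_VMS is an assumed figure; the ping retries and any delay built into the shutdown command itself are ignored.

```shell
#!/bin/bash
# Worst-case time budget from mains loss to host power-off, in seconds.
CRON_INTERVAL=600     # script runs every 10 minutes (see the crontab comment)
PAUSEBWATTMP=300      # pause between the two ping rounds in 1anvpoweroff.sh
VM_TIMEOUT=90         # TIMEOUT in 2shutdownvms.sh, applied per VM, sequentially
POST_VM_SLEEP=180     # "sleep 3m" after the VM script returns
N_VMS=10              # ASSUMPTION: number of VMs running on this node

# worst_case_seconds N_VMS -> seconds, assuming every VM needs the full timeout
worst_case_seconds() {
    local n_vms="$1"
    echo $(( CRON_INTERVAL + PAUSEBWATTMP + n_vms * VM_TIMEOUT + POST_VM_SLEEP ))
}

total=$(worst_case_seconds "$N_VMS")
echo "Worst case for $N_VMS VMs: $total s (about $(( (total + 59) / 60 )) min)"
```

Compare the result with the runtime your battery can really sustain, and shorten the cron interval or PAUSEBWATTMP if the margin is thin.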
