As we know, with Babylon the process for managing validators changed significantly from Olympia. Gone are the node hot wallets in favour of on-ledger validator components that can be managed using the transaction manifest and the owner badge.
One such method on the validator component is “update_key”. It lets a validator update the public key associated with their validator component, meaning a backup node can be synced to the network and, once update_key is called, take over consensus at the next epoch.
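For anyone who hasn’t seen it, the call is made through a transaction manifest presenting the owner badge. The sketch below is only the rough shape, assuming a non-fungible owner badge held in an account: the addresses, badge local id, key bytes, and even the exact method signatures are placeholders I haven’t verified against the current docs.

```
CALL_METHOD
    Address("account_rdx1...")
    "create_proof_of_non_fungibles"
    Address("resource_rdx1...")
    Array<NonFungibleLocalId>(NonFungibleLocalId("[...]"));
CALL_METHOD
    Address("validator_rdx1...")
    "update_key"
    Bytes("02...");
```

The first instruction proves ownership of the validator; the second swaps in the backup node’s public key. Check the official manifest reference before using anything like this.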
With this in mind, I have created a script that can be run as a systemd service and automatically fails over using the update_key method. It periodically queries the Gateway for missed proposals and, if a threshold of misses is reached within the last 3 epochs, calls the method and signs the transaction.
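The core logic is simple enough to sketch. This is a minimal illustration, not the actual script: the function names, the `fetch_missed` callable, and the threshold values are all placeholders I’ve made up for the example.

```python
import time

MISSED_THRESHOLD = 4   # missed proposals before failing over (illustrative)
EPOCH_WINDOW = 3       # look-back window in epochs

def should_failover(missed_by_epoch, threshold=MISSED_THRESHOLD, window=EPOCH_WINDOW):
    """True when missed proposals over the last `window` epochs reach
    `threshold`. `missed_by_epoch` is a list of counts, newest last."""
    return sum(missed_by_epoch[-window:]) >= threshold

def monitor(fetch_missed, trigger_failover, poll_seconds=300):
    """Poll the Gateway (via the caller-supplied `fetch_missed`) and run
    `trigger_failover` once the threshold is met, e.g. to submit the
    signed update_key transaction."""
    while True:
        if should_failover(fetch_missed()):
            trigger_failover()
            return
        time.sleep(poll_seconds)
```

Injecting `fetch_missed` as a callable keeps the threshold logic testable without hitting a live Gateway.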
For further details and instructions, check out my GitHub repo here:
Hey Faraz, great job on the script and documentation!
Just a question: querying the gateway, I’ve noticed that the missed proposal metric only gets updated at the end of each epoch. Correct me if I’m wrong, but won’t this mean at least 2 epochs will be lost in case of downtime? One during the outage (until the gateway updates its missed proposal counter), and another waiting for the backup node to become the validating one.
Thanks for the feedback - that’s a misunderstanding on my part if that’s the case. I thought the missed proposals metric was updated within the same epoch.
In that case, it’s only necessary to poll the gateway once every 5 minutes, but you’re right that it would take at least 2 epochs to trigger a failover.
I’ll update the docs and the polling frequency, and would advise validators to set the miss threshold to match a full epoch’s worth of proposals for their stake weight (4 in my case).
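The back-of-envelope for that threshold: a validator leads a share of proposals roughly proportional to its share of active stake. A hypothetical helper, where `proposals_per_epoch_total` is whatever you observe on the network rather than a protocol constant I’m asserting:

```python
import math

def epoch_threshold(stake_fraction, proposals_per_epoch_total):
    """Estimate one epoch's worth of proposals for a validator holding
    `stake_fraction` of the total active stake. Illustrative only:
    feed in observed network numbers, and never go below 1."""
    return max(1, math.floor(stake_fraction * proposals_per_epoch_total))
```

So a validator leading roughly 4 proposals per epoch would set its miss threshold to 4, matching the advice above.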
Hi guys, I’ve just updated the auto-failover script following Mattia’s feedback. It now only polls the Gateway once every 4 minutes. More than that is overkill really. Also, Dark on Discord highlighted that the data from one of the requests was not being refreshed inside the while loop, so this has now been fixed too.