Validator Node Software Update Best Practices/Process?

ZeroPointThree · 24 March 2022 15:57

Hello all…

In preparation for the eventual upgrade of node software, I was wondering if you guys could help verify the high-level steps of a validator node software update process, or point out what may be unnecessary, missing, or just plain wrong.

The way I see it, one can upgrade the software within the same server… or one can use a backup server to aid in the switch to the upgraded version. What are the pros and cons of each option? I am thinking that using a backup is superior because it allows roll-back to the primary if something goes wrong in the software upgrade process.

Also, should the validators always be un-registered and re-registered within this software upgrade process? I am thinking if they are not, proposals will be missed if a validator is stopped/restarted while it is still registered, so un-register should always be done before stop/restart and re-register after the stop/restart, correct? (assuming all is working as expected) This would introduce a small period of time when the validator is not generating rewards, but that should be a relatively short amount of time, correct? Or is there a way to do this without any downtime at all? (without risk of missed proposals)

Below are the high-level steps I’ve drafted. Any feedback for refinement is welcome.

Same/single server

1. Unregister node as validator
2. Wait until the current epoch ends
3. Stop the node
4. Upgrade the node software
5. Start the node
6. Ensure node is up without error and is syncing correctly. Wait until sync’d.
7. Register node as validator again
8. Ensure validator is processing proposals without issues
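
The single-server sequence above can be sketched as an ordered runbook that stops at the first failing step, plus a polling helper for the "wait until" steps. This is only a minimal sketch: `wait_until` and `run_steps` are hypothetical helper names, and the callables passed in (unregistering, checking sync status, and so on) are placeholders for whatever commands or API calls your own node setup uses.

```python
import time
from typing import Callable, List, Tuple


def wait_until(condition: Callable[[], bool],
               timeout_seconds: float = 600.0,
               poll_seconds: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `condition` until it returns True or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


def run_steps(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run steps in order; raise at the first step that reports failure."""
    done: List[str] = []
    for name, step in steps:
        if not step():
            raise RuntimeError(f"stopped at '{name}'; completed: {done}")
        done.append(name)
    return done
```

Each step would be a `(name, callable)` pair in the order drafted above, for example `("wait for sync", lambda: wait_until(node_is_synced))`, where `node_is_synced` is a function you supply yourself. Stopping at the first failure keeps the node unregistered if anything goes wrong before the re-register step.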

Using backup server

0. Pre-requisite step: Ensure a backup server is already running as a non-validator node using a non-validator keystore (to ensure it is always in sync, and no lengthy catch-up sync times are ever needed)
1. BACKUP SERVER: Stop the node
2. BACKUP SERVER: Upgrade the node software
3. BACKUP SERVER: Start the node
4. BACKUP SERVER: Ensure node is up without error and is syncing correctly. Wait until sync’d.
5. BACKUP SERVER: Move the validator keystore into the secrets directory in preparation for node restart as validator
6. PRIMARY SERVER: Unregister node as validator
7. PRIMARY SERVER: Wait until the current epoch ends
8. PRIMARY SERVER: Stop the node
9. BACKUP SERVER: Restart the backup node, which now has validator keystore and is as sync’d up as possible
10. BACKUP SERVER: Ensure node is up without error, is syncing correctly. Wait until sync’d. BACKUP SERVER is now the new primary validator node server. Old primary server is now a backup.
11. BACKUP SERVER (now new PRIMARY): Register node as validator again
12. BACKUP SERVER (now new PRIMARY): Ensure validator is processing proposals without issues
13. PRIMARY SERVER (now new BACKUP): Switch to non-validator keystore, upgrade software, start node, and ensure it is up and syncing without error. PRIMARY is now the new BACKUP that is syncing continuously in case a failover from PRIMARY is ever needed.
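
In the drafted order, the primary is stopped before the backup is restarted with the validator keystore, so the two servers never run with that keystore at the same time. A rough way to sanity-check a written plan for that ordering is sketched below. The server names and the action labels (`"start_with_validator_keystore"`, `"stop"`) are made up for illustration and are not commands of any real node software.

```python
from typing import List, Tuple

# (server, action) pairs; both are hypothetical labels used only in this sketch.
Event = Tuple[str, str]


def validator_key_overlaps(plan: List[Event]) -> bool:
    """Return True if the plan ever has two servers running with the validator keystore."""
    running_as_validator = set()
    for server, action in plan:
        if action == "start_with_validator_keystore":
            running_as_validator.add(server)
            if len(running_as_validator) > 1:
                return True
        elif action == "stop":
            running_as_validator.discard(server)
    return False


drafted_plan = [
    ("primary", "start_with_validator_keystore"),  # state before the upgrade
    ("primary", "stop"),
    ("backup", "start_with_validator_keystore"),
]
```

Calling `validator_key_overlaps(drafted_plan)` returns `False` for the order drafted above. Swapping the last two entries, so that the backup starts before the primary stops, makes it return `True`.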

Faraz · 24 March 2022 16:24

Hi,

You’re pretty much spot on - there are 2 main options, unregister or live-failover.

Your steps for the single server are correct. The caveat to this approach is that while you are unregistered, your stakers lose all rewards for those epochs. However, it preserves your uptime and is the safest option in the event something goes wrong (you will be unregistered, so there is no effect on the network as a whole).

Many of us prefer the live failover option, which is similar to your ‘backup server’ approach. The main difference is at step 6: rather than unregistering, you simply stop the node on the primary and immediately afterwards restart the backup with the primary keystore. The benefit of this approach is that, if your proposal rate is not too high and you time it correctly, you can fail over in the same epoch with no loss of rewards or uptime.

I know that Florian has created a script which waits for the next proposal to be made before performing the swap. You can find this in the Discord node-runner channel if you’re interested (and using systemd). I use Docker and similarly monitor the proposals completed on the node before I perform the failover. With 0.5% network stake I’m completing 1 proposal every 35 seconds, so there’s plenty of time to switch without missing a proposal. For the more heavily staked it becomes a bit trickier, but for the benefit of stakers it is probably preferable to miss a proposal or two than to lose an entire epoch by unregistering.
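
To illustrate the timing idea (this is not Florian’s script, just a minimal sketch of the same approach), the swap can be triggered right after the proposal counter increases, which leaves roughly one full proposal interval for the switch. `get_proposal_count`, `stop_primary` and `start_backup` are hypothetical callables you would implement yourself for your systemd or Docker setup.

```python
import time
from typing import Callable


def wait_for_next_proposal(get_proposal_count: Callable[[], int],
                           poll_seconds: float = 1.0,
                           timeout_seconds: float = 300.0,
                           sleep: Callable[[float], None] = time.sleep,
                           clock: Callable[[], float] = time.monotonic) -> int:
    """Block until the proposal counter increases, then return the new count."""
    start_count = get_proposal_count()
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        current = get_proposal_count()
        if current > start_count:
            return current
        sleep(poll_seconds)
    raise TimeoutError("no new proposal observed before timeout")


def failover_after_proposal(get_proposal_count: Callable[[], int],
                            stop_primary: Callable[[], None],
                            start_backup: Callable[[], None],
                            **wait_options) -> None:
    """Wait for a fresh proposal, then stop the primary and start the backup."""
    wait_for_next_proposal(get_proposal_count, **wait_options)
    stop_primary()   # primary goes down first
    start_backup()   # backup comes up with the validator keystore
```

With a proposal every 35 seconds or so, the poll interval should be small compared to that gap so the swap starts soon after the proposal lands. At higher stake the gap shrinks and the same logic leaves less room.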

ZeroPointThree · 24 March 2022 18:45

Thanks very much for this useful response. I’ll look more into those scripts from Florian that you mentioned.