In preparation for the eventual upgrade of node software, I was wondering if you guys could help verify the high-level steps of a validator node software update process, or point out what may be unnecessary, missing, or just plain wrong.
The way I see it, one can upgrade the software in place on the same server, or one can use a backup server to aid in the switch to the upgraded version. What are the pros and cons of each option? I am thinking that using a backup is superior because it allows a roll-back to the primary if something goes wrong in the software upgrade process.
Also, should the validator always be un-registered and re-registered as part of this software upgrade process? My thinking is that if a validator is stopped/restarted while it is still registered, proposals will be missed, so un-register should always be done before the stop/restart and re-register after it, correct? (assuming all is working as expected) This would introduce a period of time when the validator is not generating rewards, but that should be relatively short, correct? Or is there a way to do this without any downtime at all? (without risk of missed proposals)
Using same server
- Unregister node as validator
- Wait until the current epoch ends
- Stop the node
- Upgrade the node software
- Start the node
- Ensure node is up without error and is syncing correctly. Wait until sync’d.
- Register node as validator again
- Ensure validator is processing proposals without issues
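To make the ordering above concrete, here is a rough Python sketch of how the same-server steps could be driven as a checklist that stops at the first failure, plus a polling helper for the "wait until the epoch ends" and "wait until sync'd" steps. Everything in it is an assumption for illustration: `run_steps`, `wait_until`, and the placeholder actions are hypothetical names, not commands from any node software. The real register/unregister, stop/start, and sync-status calls would have to be filled in from the actual node tooling.

```python
import time
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def wait_until(check: Callable[[], bool], timeout_s: float,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    done: List[str] = []
    for name, action in steps:
        if not action():
            break
        done.append(name)
    return done


if __name__ == "__main__":
    # Placeholder actions only: each lambda stands in for a real command
    # or status check that is NOT shown here (hypothetical).
    plan: List[Step] = [
        ("unregister validator", lambda: True),
        ("wait for epoch end", lambda: wait_until(lambda: True, timeout_s=60)),
        ("stop node", lambda: True),
        ("upgrade software", lambda: True),
        ("start node", lambda: True),
        ("wait until synced", lambda: wait_until(lambda: True, timeout_s=60)),
        ("register validator", lambda: True),
        ("verify proposals", lambda: True),
    ]
    print(run_steps(plan))
```

The point of returning the completed step names is that if, say, the sync check times out, the list shows the node was never re-registered, so the operator knows exactly where the process stopped.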
Using backup server
- Step 0… pre-requisite step: Ensure a backup server is already running as a non-validator node using a non-validator keystore (to ensure it is always in sync, and no lengthy catch-up sync times are ever needed)
- BACKUP SERVER: Stop the node
- BACKUP SERVER: Upgrade the node software
- BACKUP SERVER: Start the node
- BACKUP SERVER: Ensure node is up without error and is syncing correctly. Wait until sync’d.
- BACKUP SERVER: Move the validator keystore into the secrets directory in preparation for node restart as validator
- PRIMARY SERVER: Unregister node as validator
- PRIMARY SERVER: Wait until the current epoch ends
- PRIMARY SERVER: Stop the node
- BACKUP SERVER: Restart the backup node, which now has validator keystore and is as sync’d up as possible
- BACKUP SERVER: Ensure node is up without error and is syncing correctly. Wait until sync’d. BACKUP SERVER is now the new primary validator node server. Old primary server is now a backup.
- BACKUP SERVER (now new PRIMARY): Register node as validator again
- BACKUP SERVER (now new PRIMARY): Ensure validator is processing proposals without issues
- PRIMARY SERVER (now new BACKUP): Switch to non-validator keystore, upgrade software, start node, and ensure it is up and syncing without error. PRIMARY is now the new BACKUP that is syncing continuously in case a failover from PRIMARY is ever needed.
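As a sanity check on the backup-server ordering, here is a small Python model of the two servers that walks through the switch-over and asserts after every step that at most one running node has the validator keystore loaded. That "only one at a time" rule is my own assumption, read off the fact that the primary is stopped before the backup is restarted with the validator keystore. The `Server` class, its fields, and `switch_over` are hypothetical names for illustration, and the register/unregister and sync-wait steps are left out because they depend on the actual node tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    version: str
    running: bool
    loaded_keystore: str  # keystore the running process is using
    staged_keystore: str  # keystore on disk, picked up at the next start

    def stop(self) -> None:
        self.running = False

    def start(self) -> None:
        # A (re)start loads whatever keystore is currently staged on disk.
        self.loaded_keystore = self.staged_keystore
        self.running = True


def active_validators(servers: List[Server]) -> List[str]:
    """Names of running servers that have the validator keystore loaded."""
    return [s.name for s in servers
            if s.running and s.loaded_keystore == "validator"]


def check(servers: List[Server]) -> None:
    holders = active_validators(servers)
    assert len(holders) <= 1, f"validator keystore active on: {holders}"


def switch_over(primary: Server, backup: Server, new_version: str) -> None:
    pair = [primary, backup]
    # BACKUP: stop, upgrade, start (still on the non-validator keystore).
    backup.stop()
    backup.version = new_version
    backup.start()
    check(pair)
    # BACKUP: stage the validator keystore; not loaded until a restart.
    backup.staged_keystore = "validator"
    check(pair)
    # PRIMARY: (unregister and wait for epoch end happen here), then stop.
    primary.stop()
    check(pair)
    # BACKUP: restart, which now loads the validator keystore.
    backup.stop()
    backup.start()
    check(pair)
    # Old PRIMARY: switch keystore, upgrade, start as the new backup.
    primary.staged_keystore = "non-validator"
    primary.version = new_version
    primary.start()
    check(pair)
```

For example, starting from a primary created as `Server("primary", "v1", True, "validator", "validator")` and a backup created as `Server("backup", "v1", True, "non-validator", "non-validator")`, calling `switch_over(primary, backup, "v2")` ends with the backup as the only active validator and both servers on the new version. Swapping the "PRIMARY: stop" and "BACKUP: restart" steps in the model would trip the assertion, which is the ordering mistake I would most want to avoid.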