Overview
A major incident is one in which your validator node becomes unavailable on the Radix Public Network and you are unable to fail over to a backup node in a timely manner.
You should strongly consider unregistering your node as a validator if the outage lasts longer than 2 or 3 epochs, because your outage will affect the performance of the entire Radix Public Network - especially if your node has a large delegated stake.
1 Preparation
An effective disaster recovery strategy requires some preparation in advance. A few steps now will save you a lot of pain later when you are under pressure dealing with a major incident.
1.1 Keep multiple copies of your Node Keystore File
Take a copy of your validator’s node-keystore.ks file and save at least 2 copies: one on your local computer and another on a removable USB thumb drive. Even better, keep a 3rd copy on a USB thumb drive stored at another location. Dropbox/Google Drive are other strategies for keeping multiple copies - but it might be worth encrypting the file using GPG or similar as an additional security measure.
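One way to do this (assuming GPG is already installed) is to create a passphrase-protected copy with symmetric encryption and store the resulting node-keystore.ks.gpg file remotely instead of the original:
gpg --symmetric --cipher-algo AES256 node-keystore.ks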
1.2 Keep multiple copies of your Keystore File Password
Keep a local copy of the node-keystore.ks password. The file is useless without the corresponding password. Save the password in your password manager or in an encrypted archive file with the keystore file.
1.3 Install Python 3
Install Python 3 on your local computer if not already installed and ensure that it works.
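A quick way to confirm that a working Python 3 interpreter is on your path (on some systems the command is python rather than python3):
python3 --version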
1.4 Save a copy of the Python Validator Unregister script
Save a copy of the Validator Unregister Python script, which allows you to unregister your validator node when you don’t have access to the node or it isn’t fully synced.
Script available here:
https://github.com/radixpool/validator-tools
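If you have git installed, one convenient way to fetch the script together with the requirements.txt file used in the next step is to clone the repository:
git clone https://github.com/radixpool/validator-tools.git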
Usage: unregister.py [OPTIONS]

  Unregisters a Validator node using a Keystore file and password

Options:
  -f, --filename FILE               Keystore filename  [default: node-keystore.ks]
  -p, --password TEXT               Keystore password. Will be prompted if not
                                    provided as an option.
  -n, --network [mainnet|stokenet]  Radix Network  [default: mainnet]
  -d, --dry-run                     Do not make any changes
  -v, --verbose                     Show details of api calls and responses
  --yes                             Confirm the action without prompting.
  -h, --help                        Show this message and exit.
Note: The script is still under active development (it will be broken up into functions, honest!) and currently uses the old Archive API calls. The new Gateway API is missing the ability to unregister/register validator nodes; an issue has been raised with the Dev team.
1.5 Install Python Dependencies
Fetch the requirements.txt file and install the Python dependencies using the following command:
pip install -r requirements.txt
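If pip on your machine is linked to Python 2, running pip through the Python 3 interpreter ensures the dependencies are installed into the correct environment:
python3 -m pip install -r requirements.txt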
1.6 Test the Validator Unregister script
Do a “dry-run” of the validator unregister script to ensure that Python, the script dependencies and the script itself are all working correctly. The --dry-run (or -d) option performs every step in the script except making any changes to the Radix ledger.
For example, a dry run on Mainnet using the node-keystore.ks file in the current directory, showing all request and response JSON messages:
python unregister.py --dry-run --verbose
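The same check can be run against Stokenet, or with a keystore file stored outside the current directory, using the options listed above (the path below is just a placeholder):
python unregister.py --dry-run --verbose --network stokenet --filename /path/to/node-keystore.ks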
1.7 Keep at least 12 XRD in your Validator Node Wallet
Keep a balance of at least 12 XRD in your Validator Node wallet so that you can unregister the node (~5 XRD) and re-register it later (~5 XRD).
1.8 Bookmark Service Providers Status/Maintenance Pages
Find and bookmark your service provider’s status page and maintenance page. Also bookmark any social media accounts run by your service provider as these will often be updated before any status page updates (Amazon is notorious for not updating their status pages in a timely manner to reflect major incidents).
This will help you to quickly assess whether an incident is local to you or a more general issue.
2 When Disaster Strikes
[TODO]
2.1 Assess Incident Severity
Make a quick assessment of the severity of the incident by considering the following:
- How long has your validator node already been unavailable?
- Is the cause of the issue known?
- Does the incident just affect you or is it more general?
Sources of this information may come from:
- Service provider’s Status Page
- Service provider’s Planned Maintenance Page
- Service provider’s social media accounts (or searching for “ProviderName down”)
Your node will probably have already fallen below the 98% uptime threshold to receive rewards for the current epoch - so give yourself a short, fixed period of time to assess the severity of the issue. Try to keep this assessment within the current epoch if possible, or within one full epoch at the very most.
2.2 Notify the Community
As soon as you are able, notify the community on Discord and Telegram that you are aware of a problem and that you are taking action to resolve the issue.
An example message could be:
ValidatorName is currently down. We are investigating the problem and will provide further updates of the actions we are taking to resolve the issue.
This very simple, but important step notifies the community that you are aware of the issue and that you are taking measures to resolve it.
2.3 Choose a Recovery Strategy
There are broadly 3 different recovery strategies:
2.3.1 Do Nothing
In the case of a temporary, localised and known issue, the appropriate strategy may be to do nothing and wait. Examples include planned maintenance/upgrades by the service provider, or an application or operating system issue that can be resolved on the server.
The expectation is that the issue will resolve itself (wait for router to reboot) or can be easily fixed through operator action (reboot, restart service, clear disk space, etc.)
- Set a fixed time limit on how long you are prepared to wait before taking further action - ideally before the end of the epoch.
- Continue to make preparations for further recovery strategies while you wait in case the issue does not resolve itself.
2.3.2 Fail over to Backup Node
[DRAFT] Do this if you have a synced backup node in another location and can easily do so.
2.3.3 Unregister your Node
[DRAFT] Do this if you are not able to fail over to a synced backup node. Unregistering takes the immediate pressure off, stops the disruption to the Radix Public Network, and stops your downtime from continuing to mount. Once unregistered, apply one of the other Recovery Strategies.
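If you prepared the script and keystore as described in section 1, the unregistration itself is the same command used for the dry-run test, without the --dry-run option. The script will prompt for the keystore password and for confirmation unless --password and --yes are supplied:
python unregister.py --network mainnet --filename node-keystore.ks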
2.4 Update the Community
Provide an update to the community to let them know whether:
a) The issue has been resolved
or
b) Recovery is in progress and, if possible, an estimated time to resolution (eg. Radix node database is syncing from scratch and will take about 12 hours to complete)
3 Post Incident Analysis
A Post Incident Analysis (also known as a Post-Mortem) is a valuable exercise to learn from an incident and, ideally, share your findings with other Node Runners so that we may all benefit from them.
3.1 Internal Investigation
After the incident is resolved and is still fresh in your mind, take a moment to make notes about what happened and how you resolved it.
Consider:
- What worked well? (eg. Monitoring system promptly generated alerts about the issue)
- What didn’t work well? (eg. Phone was set to silent so missed the alerts)
- Is there anything I can do to prevent the issue from occurring again? (eg. Monitor additional metrics, create alerts for low disk space, automate some process, pay more attention to planned maintenance notifications)
- Are there opportunities to automate some of the recovery steps using tested scripts that reduce operator error when under stress?
- [TODO]
3.2 Share your Findings
[DRAFT] The node runner community benefits hugely when we share our expertise and experiences with one another.