Overview
A major incident is one in which your validator node becomes unavailable on the Radix Public Network and you are unable to fail over to a backup node in a timely manner.
You should strongly consider unregistering your node as a validator if the outage lasts longer than 2 or 3 epochs, because your outage will affect the performance of the entire Radix Public Network - especially if your node has a large delegated stake.
1 Preparation
An effective disaster recovery strategy requires some preparation in advance. A few steps now will save you a lot of pain later when you are under pressure dealing with a major incident.
1.1 Keep multiple copies of your Node Keystore File
Take a copy of your validator’s node-keystore.ks file and save at least 2 copies: one on your local computer and another on a removable USB thumb drive. Even better, keep a 3rd copy on a USB thumb drive stored at another location. Dropbox/Google Drive are other strategies for keeping multiple copies - but it might be worth encrypting the file using GPG or similar as an additional security measure.
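One way to do this (assuming GPG is already installed) is to create a passphrase-protected copy with symmetric encryption and store the resulting node-keystore.ks.gpg file remotely instead of the original:
gpg --symmetric --cipher-algo AES256 node-keystore.ks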
1.2 Keep multiple copies of your Keystore File Password
Keep a local copy of the node-keystore.ks password. The file is useless without the corresponding password. Save the password in your password manager or in an encrypted archive file with the keystore file.
1.3 Install Python 3
Install Python 3 on your local computer if not already installed and ensure that it works.
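A quick way to confirm that a working Python 3 interpreter is on your path (on some systems the command is python rather than python3):
python3 --version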
1.4 Save a copy of the Python Validator Unregister script
Save a copy of the Validator Unregister Python script, which allows you to unregister your validator node when you don’t have access to the node or it isn’t fully synced.
Script available here:
https://github.com/radixpool/validator-tools
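If you have git installed, one convenient way to fetch the script together with the requirements.txt file used in the next step is to clone the repository:
git clone https://github.com/radixpool/validator-tools.git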
Usage: unregister.py [OPTIONS]

  Unregisters a Validator node using a Keystore file and password

Options:
  -f, --filename FILE               Keystore filename  [default: node-keystore.ks]
  -p, --password TEXT               Keystore password. Will be prompted if not
                                    provided as an option.
  -n, --network [mainnet|stokenet]  Radix Network  [default: mainnet]
  -d, --dry-run                     Do not make any changes
  -v, --verbose                     Show details of api calls and responses
  --yes                             Confirm the action without prompting.
  -h, --help                        Show this message and exit.
Note: The script is still under active development (it will be broken up into functions, honest!) and currently uses the old Archive API calls. The new Gateway API is missing the ability to unregister/register validator nodes; an issue has been raised with the Dev team.
1.5 Install Python Dependencies
Fetch the requirements.txt file and install the Python dependencies using the following command:
pip install -r requirements.txt
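If pip on your machine is linked to Python 2, running pip through the Python 3 interpreter ensures the dependencies are installed into the correct environment:
python3 -m pip install -r requirements.txt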
1.6 Test the Validator Unregister script
Do a “dry-run” of the validator unregister script to ensure that Python, the script dependencies and the script itself are all working correctly. The --dry-run (or -d) option performs every step in the script except making any changes to the Radix ledger.
For example, a dry run on Mainnet using the node-keystore.ks file in the current directory, showing all request and response JSON messages:
python unregister.py --dry-run --verbose
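The same check can be run against Stokenet, or with a keystore file stored outside the current directory, using the options listed above (the path below is just a placeholder):
python unregister.py --dry-run --verbose --network stokenet --filename /path/to/node-keystore.ks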
1.7 Keep at least 12 XRD in your Validator Node Wallet
Keep a balance of at least 12 XRD in your Validator Node wallet so that you can unregister the node (~5 XRD) and re-register it later (~5 XRD).
1.8 Bookmark Service Providers Status/Maintenance Pages
Find and bookmark your service provider’s status page and maintenance page. Also bookmark any social media accounts run by your service provider as these will often be updated before any status page updates (Amazon is notorious for not updating their status pages in a timely manner to reflect major incidents).
This will help you to quickly assess whether an incident is local to you or a more general issue.
2 When Disaster Strikes
[TODO]
2.1 Assess Incident Severity
Make a quick assessment of the severity of the incident by considering the following:
- How long has your validator node already been unavailable?
- Is the cause of the issue known?
- Does the incident just affect you or is it more general?
Sources of this information may come from:
- Service provider’s Status Page
- Service provider’s Planned Maintenance Page
- Service provider’s social media accounts (or searching for “ProviderName down”)
Your node will probably have already fallen below the 98% uptime threshold to receive rewards for the current epoch - so give yourself a short, fixed period of time to assess the severity of the issue. Try to keep this assessment within the current epoch if possible, or within one full epoch at the very most.
2.2 Notify the Community
As soon as you are able, notify the community on Discord and Telegram that you are aware of a problem and that you are taking action to resolve the issue.
An example message could be:
ValidatorName is currently down. We are investigating the problem and will provide further updates of the actions we are taking to resolve the issue.
This very simple, but important step notifies the community that you are aware of the issue and that you are taking measures to resolve it.
2.3 Choose a Recovery Strategy
There are broadly 3 different recovery strategies:
2.3.1 Do Nothing
In the case of a temporary, localised and known issue, the appropriate strategy may be to do nothing and wait. Examples include planned maintenance/upgrades by the service provider, or an application or operating system issue that can be resolved on the server.
The expectation is that the issue will resolve itself (wait for router to reboot) or can be easily fixed through operator action (reboot, restart service, clear disk space, etc.)
- Set a fixed time limit on how long you are prepared to wait before taking further action - ideally before the end of the epoch.
- Continue to make preparations for further recovery strategies while you wait in case the issue does not resolve itself.
2.3.2 Fail over to Backup Node
[DRAFT] Do this if you have a synced backup node in another location and can easily do so.
2.3.3 Unregister your Node
[DRAFT] Do this if you are not able to fail over to a synced backup node. Unregistering takes the immediate pressure off, stops the disruption to the Radix Public Network, and stops your downtime from continuing to mount. Once unregistered, apply one of the other Recovery Strategies.
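If you prepared the script and keystore as described in section 1, the unregistration itself is the same command used for the dry-run test, without the --dry-run option. The script will prompt for the keystore password and for confirmation unless --password and --yes are supplied:
python unregister.py --network mainnet --filename node-keystore.ks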
2.4 Update the Community
Provide an update to the community to let them know whether:
a) The issue has been resolved
or
b) Recovery is in progress and, if possible, an estimated time to resolution (eg. Radix node database is syncing from scratch and will take about 12 hours to complete)
3 Post Incident Analysis
A Post Incident Analysis (also known as a Post-Mortem) is a valuable exercise to learn from an incident and, ideally, share your findings with other Node Runners so that we may all benefit from them.
3.1 Internal Investigation
After the incident is resolved and is still fresh in your mind, take a moment to make notes about what happened and how you resolved it.
Consider:
- What worked well? (eg. Monitoring system promptly generated alerts about the issue)
- What didn’t work well? (eg. Phone was set to silent so missed the alerts)
- Is there anything I can do to prevent the issue from occurring again? (eg. Monitor additional metrics, create alerts for low disk space, automate some process, pay more attention to planned maintenance notifications)
- Are there opportunities to automate some of the recovery steps using tested scripts that reduce operator error when under stress?
- [TODO]
3.2 Share your Findings
[DRAFT] The node runner community benefits hugely when we share our expertise and experiences with one another.