RFC: Dry Run Protocol Upgrade

Problem:

Before making any changes to the Radix Engine, we need to ensure that nothing breaks once the changes are deployed and live on mainnet. Most of the developers who built the protocol are long gone, and recovering their knowledge would be very challenging. The developers still interested in the protocol layer need confidence that they can test their code changes on a running, live network.

Solution:

The solution proposed here is to explore, analyse and complete a protocol upgrade, end to end. This would mean relearning the entire stack, back to front, until there is a better, leaner stack altogether. The main components (I believe, and correct me if I am wrong) are:

  • Babylon node
  • Babylon Gateway
  • Scrypto
  • Mobile (iOS/Android)

The protocol upgrade itself would simply introduce a one-line variable change, or equivalent, that does not affect any core functionality. This way, we can work up to a live dry-run upgrade on mainnet that the developers executing it can carry out with confidence. The main things we want to learn are:

  • Order of components to upgrade
  • How to monitor for changes
  • How to rollback changes
  • How to confirm a liveness break
  • How to fix a purposely broken protocol upgrade
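
To make "confirm a liveness break" a bit more concrete, here is a minimal sketch in Python. It assumes a monitor periodically records some counter that only grows while the network is live (for example, the latest committed ledger height) together with a timestamp. How those samples are fetched from a node or gateway is deliberately left out, and the names `Sample` and `is_stalled`, as well as the 60-second threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of network progress (hypothetical shape)."""
    timestamp: float  # seconds since some fixed reference point
    height: int       # any counter that only grows while the network is live


def is_stalled(samples: Sequence[Sample], max_silence: float = 60.0) -> bool:
    """Return True if the counter has not advanced for more than max_silence seconds.

    Samples must be ordered by timestamp, oldest first.
    """
    if len(samples) < 2:
        return False  # not enough data to judge either way
    latest = samples[-1]
    # Walk backwards to find when the latest height was first observed.
    first_seen = latest.timestamp
    for sample in reversed(samples):
        if sample.height != latest.height:
            break
        first_seen = sample.timestamp
    return latest.timestamp - first_seen > max_silence
```

A real monitor would also need to handle gaps in its own sampling, so a check like this is only a starting point for the dry run.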

As for the change itself, here are some suggestions from the community so far:

  • Variable name change upgrade
  • Protocol upgrade name change

Thoughts and comments on the main goal, “Learn how to do a protocol upgrade”, would be appreciated.

Edits:

  1. As @projectShift suggested, added a section on which code changes to make specifically, and added a point about doing a purposely broken protocol upgrade

Small suggestion for this dry run:

Official change: change the protocol’s moniker/name and nothing else; the goal is to try the process.

Helper: introduce two lines of code, somewhere:

1 - # This is a comment - YES, THE BUG IS HERE
2 - do something stupid that breaks the network, like halting it

That way, we can test the process AND the debugging as well, including rolling back to restore stability until the bug is fixed, re-deploying, and so forth.
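
The apply, check, roll back loop described here could be rehearsed with something like the sketch below. It is only an illustration under assumptions: the step names, the `run_drill` helper, and the idea of rolling back in reverse order are hypothetical, and the right order of components is exactly what the dry run is meant to discover. Each `apply` and `rollback` callable stands in for a real deployment action.

```python
from typing import Callable, List, Sequence, Tuple

# (name, apply, rollback); each callable is a stand-in for a real deployment action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_drill(steps: Sequence[Step], healthy: Callable[[], bool]) -> List[str]:
    """Apply steps in order; on the first failed health check, undo them in reverse.

    Returns a log of the actions taken, which can serve as a record of the drill.
    """
    log: List[str] = []
    applied: List[Step] = []
    for step in steps:
        name, apply, _ = step
        apply()
        applied.append(step)
        log.append(f"apply {name}")
        if not healthy():
            log.append(f"unhealthy after {name}")
            # Undo everything applied so far, most recent first.
            for done_name, _, rollback in reversed(applied):
                rollback()
                log.append(f"rollback {done_name}")
            break
    return log
```

For the purposely broken upgrade, `healthy` would be expected to fail right after the faulty step, and the log then shows whether the rollback path actually ran.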

The only thing not actually being tested is the real-life ability to debug an unknown bug, but that’s a risk that can’t be avoided.

Really good points, will update this to include monitoring for and fixing a liveness break/halted network.
