IBC Instability Post Mortem of December 2021

Summary

Last week was a tough one for the relayer teams, with many networks coming online and several others upgrading at the same time.

As a relay runner myself, I want to thank @CryptoCrewValidators and @BlockNgine for their diligence. Both are small teams that are incurring significant costs for the sole benefit of the network. Please provide them support by delegating to them.

I also want to thank SCRT Labs (@guy, @reuven, @assafmo, @Cashmaney, et al.) for recognizing the difficulty, cost, and effort put forth by the relayers by outlining (and acting on!) a delegation program.

Finally, I want to thank the support team: @reversesigh and @MrGarbonzo, as well as @SecretSkrillah, for holding down the fort while this was happening. I saw how many tickets there were and… wow.

What happened?

As many of you have seen over the last week, IBC connections between Cosmos, Terra, and to a lesser extent Osmosis have been sporadic at best. There were many contributing factors here, but the following are the most significant:

  1. An emergency, critical bugfix on Luna resulted in a chain halt and a closed-source upgrade. When a bugfix is kept closed source, it generally means the bug was dangerous and could have led to significant losses if discovered by nefarious actors. The fix will be open sourced after the soft fork at height 5900000, which gives validators plenty of time to upgrade and mitigates that risk.

  2. Following the Vega upgrade on Cosmos, an issue with relayer software compatibility arose. The issue is ongoing and still being investigated.

  3. The upgrade to Osmosis v5.0.0 introduced a critical bug that prevented new channels from being opened, which led to an emergency upgrade to v6.0.0 a week later.

The behavior above is visible on Map of Zones [0]. IBC transfers on many networks still have not recovered to their pre-upgrade state, which is unfortunately to be expected. The instability that followed the Cosmos upgrade persists, its root cause has yet to be identified, and it propagates throughout the system.

What Does This Mean?

I’m outlining all of this because running a relayer is unique: to make it as stable as possible, all of the nodes need to run on the same server. When a network upgrades, the server/relayer needs to be temporarily disabled to perform the upgrade. When there are back-to-back upgrades, this results in significant downtime.

Furthermore, relaying costs are paid directly by the team running the relayer, making it an extremely expensive, and generally thankless, endeavor. As a rule of thumb, a relaying team will prioritize the networks they validate on in order to help recoup the costs [1]. I’ll come back to this later.

Why This is Unlikely to Happen Again

The last 2 weeks were unique in that 3 new networks came online and there were 5 network upgrades within Cosmos that I’m aware of. All of this was done to beat the Christmas/holiday season, and it resulted in a rough release cycle. To be clear, I don’t mean this as a slight to the developers whose releases led to the instability, just a recognition of the difficulty and time constraints they faced. As far as I’m aware, there hasn’t been a release cycle before that resulted in this much instability.

In addition, SCRT Labs has outlined a team delegation program for relayer teams to help cover costs and recognize the amount of effort required to run reliable relayers. This is a massive boon, and will undoubtedly lead to even greater stability, as more teams will join and there’ll be incentives to keep Secret as stable as possible.

Closing Thoughts

Relaying is a tough business. Right now Secret Network has 3 relayers: Lavender.Five Nodes, BlockNgine, and CryptoCrew. Note that you don’t see CryptoCrew in the validator set: https://secretnodes.com/secret/chains/secret-4/validators. That means they are paying all relaying costs out of pocket, with no way to recoup them. They stepped up to help because I reached out to them when trying to assemble a stable and experienced relayer team to meet the Supernova upgrade in full force.

We, the entire Secret Network, are extremely fortunate to have both of these teams to support the network. @BlockNgine is already a validator, and deserves far more support. @CryptoCrewValidators will be spinning theirs up soon, and also deserves all the support they can get.

Thanks @dylanschultzie for the post mortem. Will reach out to the affected users with this.

You guys at Lavender.Five deserve a massive shout-out as well for everything you did!

I’m really impressed by how quickly you all got this fixed, given the out-of-office hours and the technical complexity. ’Tis a pleasure to be on the frontier with you all!

Thank you for the informative review of the entire situation. Many new users experienced difficulties but couldn’t even tell where they were coming from. I hope the public gets to know more about the relayer structure bit by bit, and that their delegations and the upcoming relayer DAO will keep IBC flowing! You are doing such an important job, thank you for that!

Interesting information and a good question, but sorry, I have no idea.