Improving Secret Network Scalability by Phasing Out Older Hardware

The Secret Network’s reliance on Intel SGX for its privacy-preserving smart contracts is currently constrained by the use of SMB (small and medium-sized business) grade hardware. With use cases emerging that demand higher performance, such as Satoshis Palace, Silent Swap, and AI, the limitations of older-generation machines are becoming increasingly apparent. Even with the applications we have today, such as Shade Protocol, a very small handful of transactions or queries per block (fewer than 10) can cause many validators to miss blocks and API nodes to crumble.

Technical Information

  1. Memory Constraints on Older Machines

    • Older-generation machines such as the Xeon E-21xx, E-22xx, and E-23xx have severe limitations in enclave memory, ranging from a minimum of 96 MB to a maximum of 512 MB. In stark contrast, enterprise servers equipped with Xeon Scalable CPUs start at 8 GB and can reach up to 512 GB of enclave memory, with that memory drawn from system RAM.
    • Secret Network’s secretd allocates 4 GB of memory at startup for enclave operations. On machines where the SGX enclave memory is less than 4 GB, the system must constantly page enclave memory in and out, creating significant performance bottlenecks. This paging delay becomes a critical issue during high-load consensus operations on older nodes (validators missing blocks).
    • The older machines are designed for small-business applications, while the Xeon Scalable line is designed for enterprise applications.
  2. Enclave Memory Utilization

    • Smart contract executions typically require around 12 MB of memory but can theoretically demand up to 128 MB or 256 MB. This further exacerbates the memory constraints on older machines, leading to increased paging operations and reduced performance (see the rough capacity sketch after this list).
  3. Scaling Improvements from the Roadmap

    • Even with scaling improvements or a more efficient engine, the constraints of older hardware interfere with significant performance gains. For example, concurrent execution is on the roadmap for this year; even if it’s possible to add it and have it technically function on the old hardware, it will likely exacerbate the performance issues given the constrained memory.
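
To make the memory math above concrete, here is a rough back-of-the-envelope sketch in plain Python. The EPC sizes and the 4 GB secretd allocation are the figures quoted above; the 128 MB "old machine" EPC is one illustrative point within the 96-512 MB range, and the per-contract numbers are the typical/worst-case estimates from point 2, not measured values:

```python
# Back-of-the-envelope EPC math using the figures from the points above.
# Illustrative numbers only, not benchmarks.

MB = 1
GB = 1024 * MB

SECRETD_ALLOCATION = 4 * GB   # memory secretd reserves at startup
TYPICAL_EXEC = 12 * MB        # typical contract execution
WORST_CASE_EXEC = 256 * MB    # theoretical worst-case execution

epc_sizes_mb = {
    "Old SMB-grade (Xeon E-21xx/22xx/23xx)": 128 * MB,  # one point in the 96-512 MB range
    "Xeon Scalable (enterprise)": 8 * GB,               # 8 GB minimum, up to 512 GB
}

for name, epc in epc_sizes_mb.items():
    print(f"{name}: {epc} MB of EPC")
    print(f"  holds secretd's 4 GB allocation without paging: {epc >= SECRETD_ALLOCATION}")
    print(f"  ~{epc // TYPICAL_EXEC} typical 12 MB executions fit in the EPC at once")
    print(f"  ~{epc // WORST_CASE_EXEC} worst-case 256 MB executions fit in the EPC at once")
```

On the old-generation figures, the 4 GB allocation does not even fit in the EPC, which is exactly where the paging pressure described above comes from.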

Proposed Path Forward

To prioritize the performance and scalability of the Secret Network, a discussion on phasing out older machines is essential. This transition would involve formally upgrading the network’s hardware requirements, validators provisioning newer machines, and rendering non-Scalable (older-generation) CPUs incompatible via a network upgrade.

Proposed Timeline:

  • Early 2025: I think this should happen in early 2025. This provides ample lead time for node operators to upgrade their hardware and align with the new requirements.
  • Communication from Leadership: Secret Labs and the Secret Network Foundation (SNF) should work closely with the community to ensure a smooth upgrade process, minimizing disruption.

Economic Implications

Transitioning to enterprise equipment comes with financial implications for API operators and validators:

  • Cost of New Machines: Renting or purchasing next-generation enterprise servers will increase operational costs. While older machines can be rented for $80-$170 per month, next-gen machines start at roughly $200-$350 per month.
  • Long-Term Benefits: Without moving to better hardware, we will continue to run into the same hardware-based bottlenecks.

Final Thoughts

We really need to get this discussion going in the open, so I’ve put out some technical information and a suggestion for how we can proceed. This topic has been lightly discussed in various channels, such as the Secret Network validator room and Secret governance, and was touched upon in the validator reduction proposal, though there have been no formal communications from Secret Labs or SNF on the topic. If we don’t start this discussion now, many will be surprised when the day comes that we need to do it swiftly, and participants surely would not be prepared to make this move in a timely manner.

Cc @LisaIsLoud @alexz

Source to back up claims: https://www.cse.iitd.ac.in/~srsarangi/files/papers/sgxgauge.pdf


My personal opinion, not from SNF:

Good points all around, +1


Good points, thanks for bringing this up.

This is only one possible solution though. We will be working in parallel to improve the WASM runtime, and potentially other things.

If someone can do some benchmarking of various machines’ performance, that would be helpful.

Labs will get to this later this year, doing comprehensive work on several fronts.


Any possible improvements you could make to the engine would still put more stress on older machines. Again, they have a constrained amount of memory, so more throughput would mean more paging operations on constrained hardware. Is Secret Labs planning to do comprehensive testing to ensure that alternative solutions, short of removing hardware with known bottlenecks, won’t exacerbate the problem?

Many of the machines on the network already have trouble keeping up with blocks when the network is under high load. One could simply look at historical uptime from an archive node and check the percentage of validators that miss blocks to see that a majority of the hardware struggles.

We will do benchmarking, yes; not necessarily on all possible configurations, but on most available ones. It would be great to have some help from the community. And I do agree that eventually we will have to get rid of the old machines; we just need to make sure we are not risking too much decentralization.

Thank you for being forward-looking and having foresight into the bottlenecks to come. I agree that we need to upgrade to prevent them. I do think Alex brings up a good point about the new machines potentially being cost-prohibitive and centralizing the network. 2025 seems like a good timeline, and hopefully by then the price of SCRT will have risen enough that the cost of the machines becomes less of an issue.


Based on the research paper, any scenario where we increase the amount of data that goes through the constrained enclaves will exacerbate the performance problems. We can certainly prioritize maintaining the exact level of decentralization we have over being scalable, if we collectively, or SLabs, want to do that. Unfortunately, it largely seems to be an either-or scenario.

I think this is a good opportunity to position Secret as a confidential chain with performance closer to Solana’s but still meaningfully decentralized. Most of the delegations come from SLabs, so I’m not sure the decentralization profile changes much anyway. Even with the new barrier to entry, I think we can still have enough nodes if SLabs/SNF commits to prioritizing performance and communicates well ahead of time with the validator set.

Correct me if I’m mistaken, but what we actually need is to enforce hardware requirements for validators while still allowing query nodes to operate on the network.

If nodes are excluded from the network based on hardware requirements, this could actually hurt the economic scalability of API node providers. Older machines, being cheaper, are better bang for your buck when it comes to serving a finite load of querying clients.


This is fundamentally untrue. Newer machines offer better bang for your buck if you pack nodes onto them via Kubernetes, Docker, or other methods. I know this for sure from purchasing next-gen hardware, deploying it into production, and having plenty of older machines as well.

Does that hold for VPS too? In other words, is renting and operating next-gen hardware from the few providers that offer it definitely more cost-effective than their older CPUs?

If that’s the case then I’m all for it.


When renting an old-gen node for, say, $100 per month, these nodes often have less than 100 MB of memory for the enclave. Under heavy load, old-gen nodes are for sure less performant and more prone to app-hash errors. So one next-gen node is worth more than one of the worst-performing nodes for API use, and that gets better if the hardware requirements for validators are higher than they are today. Cost-wise, assuming you are used to $100 per node, next-gen starts to get cheaper once you find offerings that allow a density of around 6 nodes, and it gets cheaper per node from there, from what I’ve seen. This can vary depending on how much memory you allocate per node; there are certainly configurations that make it cheaper even at a density of three nodes, though there are some nuances to making that run smoothly.
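
A minimal sketch of that density math, assuming a hypothetical $600/month next-gen machine (a placeholder chosen only so the break-even lands near the ~6-node density mentioned above; real offerings will differ):

```python
# Per-node cost as a function of how many nodes you pack on one machine.
# The $600/month next-gen rental is a hypothetical placeholder, not a quote;
# $100/month is the old-gen example used in the post above.

OLD_GEN_PER_NODE = 100   # USD per month, example old-gen rental
NEXT_GEN_MACHINE = 600   # USD per month, hypothetical next-gen rental

for density in (1, 3, 6, 10):
    per_node = NEXT_GEN_MACHINE / density
    label = ("at or below the old-gen per-node price"
             if per_node <= OLD_GEN_PER_NODE
             else "above the old-gen per-node price")
    print(f"{density:>2} nodes -> ${per_node:,.0f}/node/month ({label})")
```

The point is only that per-node cost falls linearly with density; the break-even density depends entirely on what you pay for the machine and how much memory you give each node.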

Another thing to consider is… if validators upgrade and APIs don’t, then the APIs on older hardware would hardly be able to keep up with what the chain would process under true load, true load being a volume of accepted and processed transactions per block higher than was previously possible.

Increasing the performance of the chain could be a huge game changer for being able to deploy many of our games. We expect an increase in sustained chain usage with the release of our first game.

Additionally, being able to lower block times to 2-3 seconds would be a huge quality-of-life improvement for some of our planned games. Turn-based games with multiple players will be much smoother with faster block finality, something that goes hand in hand with more performance.


I did a deeper dive and gathered some reference data on pricing for nodes. Here’s a summary of my findings:

Old Machines:

  • Cheapest per-node cost: $74 for older machines, as noted by @SecretSaturn in the governance room.
  • Real-World Reality: Limited supply and one of the lowest-tier EPC sizes make these some of the lowest-quality nodes a person can get.
  • Broader Reality: Across offerings, the lower-priced old machines generally have smaller EPC sizes, while the higher-priced ones have more EPC and are better.

Community Findings on Old Machine Pricing (from the Secret Network docs):
According to the Secret docs, the community has found the following pricing for old machines:

Average Price for Renting an Old Node: $137.62

Next-Generation Machines:
As an example of next-generation machine pricing, here’s the best offering I found with a quick search:

  • Per-node price for next-generation machines: The example I found works out to $69-$46 per node. Over time, more next-gen rentals will become an option and prices will get more competitive, whereas old machines will simply be decommissioned by providers, as that is standard practice among VPS providers.
    • Term Lengths: These prices typically require entering longer-term agreements to get good pricing.

Can you explain how you arrive at the $69-$46 cost per node? In the screenshot, it looks like the rental is as much as $1125 per month.

Because you would run multiple nodes on the host in a cluster, using the same approach that people use to scale traditional applications: Docker, Kubernetes, alternative orchestration methods, or the method shown in the Secret docs. People have been scaling clusters of load-balanced applications for decades; our use case isn’t special.
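
To make the arithmetic explicit, here’s a minimal sketch. The $1125/month figure is the one quoted above; the node densities are illustrative assumptions, since the post doesn’t state the exact counts behind the $69-$46 range:

```python
# Per-node cost is just the machine rental divided by how many nodes you run
# on it. $1125/month is the quoted rental; the densities below are illustrative
# assumptions, not the actual node counts behind the $69-$46 figures.

MACHINE_RENTAL = 1125  # USD per month

for density in (8, 16, 24):
    print(f"{density:>2} nodes -> ${MACHINE_RENTAL / density:,.2f}/node/month")
```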

That’s true for API clusters, not so much for validators that aren’t serving queries. For people running validators, additional nodes are only used for sentries or backups, and neither of those is a good idea to run on the same physical machine.

You are correct. The numbers I am showing are specifically to demonstrate that it is cheaper to run API nodes on next-generation equipment versus old, and I believe I clarified this in various parts of this thread. The cost of removing the technical debt we have from old machines and allowing the network to scale is that a validator would cost more to run. However, what Intel has decided to do with the Scalable CPU line is to push the SKUs in demand from hyperscalers first. Once more boards become available for the Silver Scalable lines, which have as few as 8 cores, and once bare-metal providers start offering those, the costs for people who want to run a single node will go down and be more on par with old-generation pricing. I’ve already seen various motherboards targeted at lower density; they typically do not support the higher wattage of the Platinum-series CPUs and are limited to eight DIMM slots. These will be significantly less expensive than the higher-density machines.


Since not many people will run API clusters, and validator performance is what determines the performance of the network in general, which is the crux of this thread, I think any mention of price for next-gen should be presented as single-node-per-physical-machine costs, since that is how validators will be running. Otherwise it gives a false impression of what validators will actually be paying. In deciding how they feel about a forced upgrade, people should be considering the actual costs to validators as opposed to the costs to a very limited number of API cluster operators, whose clusters won’t have any impact on block production (network performance).


When speaking of cost per node for API clusters, I clarified that. When speaking of cost for a validator, I said it would be higher and pointed out that over time it will come down. I’ve also said that if people want to hold back the performance of the network because of the cost of running a validator, we can collectively decide that, or Secret Labs can decide that. No false impression has been given. Starting the thread is not an issue; this transition will take time if we decide to do it. And if we don’t decide to do it, then it’s our own fault that we can’t scale.

I wasn’t implying that the false impression was intended, just stating that the single post that lays out costs for old-gen and next-gen doesn’t clearly specify that the next-gen prices listed do not apply to validators, who would be at $1125 per month using the same figures. I’m mostly trying to clarify for anyone reading who might not have been able to make that cost adjustment intuitively.

And regarding using a lower figure based on long-contract prices: you can’t really enter long contracts for Secret nodes. There have been multiple times when a mainnet upgrade has made certain hardware unusable for up to 6 months, so locking yourself into specific hardware for years will likely force you to pay for unused nodes for some period while you provision new hardware. That might not be a risk a lot of validators that rent are willing to take.