The API team has outlined several critical issues relating to the scalability, performance, and reliability of the current network infrastructure. These concerns are organized into various categories:
Load Testing: Simulating real-world traffic for load testing is difficult. Testing outside of mainnet cannot realistically be expected to reproduce real-world conditions well.
Scalability: Scaling Cosmos nodes is challenging, particularly with SGX, and nodes cannot easily be spun up and terminated on demand.
SGX Issues: SGX is cited as a bottleneck for autoscaling and possibly as the root cause of performance issues when multiple SGX calls occur in parallel.
Node Count & Over-Provisioning: There is a tension between running more nodes than day-to-day operations require and needing extra nodes to absorb “thundering herd” events or “problematic cross-contract calls” whose performance impact ripples across the network. Over-provisioning is acknowledged as intentional, in order to deliver on the promise of an always-available public good API, but it is problematic because of cost concerns in the bear market. In the past, when we did not over-provision, users would say the network was broken whenever certain usage patterns emerged and degraded UX.
User Experience & Affordability: The team aims to maintain a high-quality user experience and network stability. However, it is willing to compromise on its scope if it means resolving broader market and affordability issues.
Rate Limiting: A non-issue in the current setup.
Network Stability & Single Actor Vulnerability: The network is susceptible to performance issues caused by individual actors, particularly through cross-contract calls and arb bots.
Testing Recommendations: A suggestion to analyze archive node information to better understand network bottlenecks and performance issues.
The team identifies a clear need for improved load testing capabilities that can more accurately simulate real-world conditions, but also notes this is outside of the scope of the API team and would be uncompensated work. We are willing to advise and help where we can as time permits.
The challenge of scaling Cosmos nodes, especially with SGX, shows an inherent limitation in the existing architecture. This challenge extends to the ability to dynamically manage nodes based on demand, leading to inefficiencies in resource utilization.
The discussions around SGX underline its role as a bottleneck for system performance and scalability. It is also suspected to be the underlying cause of performance issues when multiple calls are handled in parallel, with the SGX driver singled out as a probable culprit.
Over-Provisioning and User Experience
The team acknowledges the over-provisioning of nodes as a conscious decision to ensure network stability. However, it also recognizes that this strategy is not sustainable and that its scope needs to be adjusted, from claiming to provide an always-available API to providing a public good that is available most of the time (SLA adjustments).
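To make the idea of an SLA adjustment concrete, the following is a minimal sketch of how an availability target translates into a downtime budget. The function name and the example targets are hypothetical and chosen only for illustration; the team has not stated any specific availability figure.

```python
def allowed_downtime_minutes(availability: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for a given availability target.

    `availability` is a fraction between 0 and 1 (for example 0.999).
    Both the target and the period are illustrative assumptions.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Hypothetical targets, not figures proposed by the API team.
    for target in (0.9999, 0.999, 0.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days allows {budget:.1f} minutes of downtime")
```

Relaxing the target gives a larger downtime budget, which is what would allow running fewer standby nodes.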
Affordability vs. Stability
While the API team is concerned with stability and user experience, it recognizes that affordability is the primary concern for the user base and that the testing requests are not compatible with reducing the team's scope. A fine balance needs to be struck between these conflicting requirements.
Rate limiting appears to be a well-handled aspect, and the team does not view it as an issue.
Network Stability Concerns
The vulnerability of the network to individual actors performing specific actions is a glaring issue that needs to be addressed, perhaps at an architectural level.
Future Testing Recommendations for Secret Network (not specific to API team)
The team suggests a more thorough analysis of archive node data as a viable way to better understand network bottlenecks. Finding patterns in which contracts are executed when many network validators miss many blocks could help identify those bottlenecks and potentially resolve them.
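The suggested analysis could be sketched roughly as follows. This is an illustration only: it assumes archive node data has already been exported into simple per-block records, and the `BlockRecord` type, its field names, and the `missed_threshold` parameter are all hypothetical rather than part of any Secret Network tooling. For each contract it computes how much more often the contract appears in blocks that many validators missed than in blocks overall.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BlockRecord:
    """Hypothetical per-block record exported from an archive node."""
    height: int
    missed_signatures: int  # number of validators that missed this block
    contracts: list[str]    # contract addresses executed in this block


def contract_lift(blocks: list[BlockRecord], missed_threshold: int) -> dict[str, float]:
    """Rank contracts by how over-represented they are in 'bad' blocks.

    A block is 'bad' when at least `missed_threshold` validators missed it.
    Lift = (share of bad blocks containing the contract)
         / (share of all blocks containing the contract).
    A lift well above 1.0 marks a contract worth a closer look; it is a
    correlation, not proof that the contract caused the missed blocks.
    """
    bad = [b for b in blocks if b.missed_signatures >= missed_threshold]
    if not bad:
        return {}
    # Count each contract at most once per block.
    all_counts = Counter(c for b in blocks for c in set(b.contracts))
    bad_counts = Counter(c for b in bad for c in set(b.contracts))
    lift = {
        contract: (n_bad / len(bad)) / (all_counts[contract] / len(blocks))
        for contract, n_bad in bad_counts.items()
    }
    # Highest lift first.
    return dict(sorted(lift.items(), key=lambda kv: kv[1], reverse=True))
```

Contracts that rank highest would be candidates for closer inspection, for example by replaying the affected blocks on a test node.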
In summary, the comments from the API team and other experienced individuals show that while there are concerns, there is and always has been a willingness to address them. The core issues revolve around the limitations of the existing architecture, particularly scalability and performance, and around affordability given the resources available in the community pool. There is a general consensus that these need to be addressed for long-term sustainability. Further testing from us was ultimately requested by only 3 individuals. Although this is out-of-scope, uncompensated work, members of the API team are happy to assist as time permits in an unofficial capacity.
Expect an adjusted proposal from the API team when the current one expires.
Special thanks to Gmail, Luigi, and Assaf for providing constructive input.