The API Proposal IV

This proposal is a continuation of the API Proposal and compensates teams for maintaining reliable, high-quality API nodes for public use in the Secret community.

Decentralized API

The provided nodes will be load balanced via DNS across two geographically resilient load balancers in two different data centers. Queries are only assigned to active, healthy nodes; unhealthy nodes are automatically removed and reintroduced once they become healthy again. Each team maintains and is responsible for its own geographically distributed nodes.
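
Below is a minimal, purely illustrative Python sketch of the health-check-and-route behaviour described above. The node URLs, the `/status` path, and the selection logic are assumptions for illustration only, not the actual Secret Express load balancer configuration.

```python
import random
import urllib.request

# Hypothetical node list: the real Secret Express endpoints and
# health-check path are not published in this post.
NODES = [
    "https://node-1.example.com",
    "https://node-2.example.com",
    "https://node-3.example.com",
]


def is_healthy(node: str, timeout: float = 2.0) -> bool:
    """A node counts as healthy if its status endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{node}/status", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def pick_backend() -> str:
    """Route only to nodes that currently pass the health check; unhealthy
    nodes drop out automatically and return once they recover."""
    healthy = [n for n in NODES if is_healthy(n)]
    if not healthy:
        raise RuntimeError("no healthy API nodes available")
    return random.choice(healthy)
```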

Reporting & Endpoints

Endpoints can be found here: endpoints. A detailed report is available at Server Statistics. However, given the large data volume generated daily, only a partial report showing traffic for a certain period is available. Be sure to check the date range for a comprehensive picture. Additional reporting can be found at Additional Stats.

Teams

The teams involved and their node budgets are Secret Saturn (16 nodes), Delta Flyer (16 nodes), Trivium (14 nodes), Consensus One (16 nodes), and Quiet Monkey Mind (10 nodes). Payments will be awarded based on the monthly provision of nodes only. Where a provider does not maintain a node, no payment will be made for the unprovided nodes, and leftover funds will carry forward to the succeeding month. This also includes the load balancer nodes: Delta Flyer (1 LB node) and Consensus One (1 LB node).

Amounts

With 5 teams, this proposal will cost $33,300 (74 nodes x $150 x 3 months) and provide 74 total API nodes to the community. These numbers do not include the 5% volatility buffer or the rolled-over funds from the previous period, but the final ask will include them.
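
For readers who want to sanity-check the numbers, here is a quick back-of-the-envelope calculation in Python. Exactly how the 5% volatility buffer and the rollover are applied in the final ask is my assumption; the proposal only states they will be included.

```python
# Per-team node budgets listed above (they sum to 72; adding the two LB nodes gives 74).
node_counts = {
    "Secret Saturn": 16,
    "Delta Flyer": 16,
    "Trivium": 14,
    "Consensus One": 16,
    "Quiet Monkey Mind": 10,
}
lb_nodes = 2  # Delta Flyer (1 LB node) + Consensus One (1 LB node)

total_nodes = sum(node_counts.values()) + lb_nodes  # 74
price_per_node_per_month = 150  # USD
months = 3

base_ask = total_nodes * price_per_node_per_month * months  # 74 * 150 * 3 = 33,300
buffered_ask = base_ask * 1.05  # assumed: 5% volatility buffer applied on top

print(f"base ask:        ${base_ask:,}")
print(f"with 5% buffer:  ${buffered_ask:,.0f}  (rollover from the previous period not included)")
```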

Secret Express Features

1. Secret Autoheal
The Autoheal feature ensures that only healthy nodes serve your requests. This enhances the reliability and robustness of our API.

2. Intelligent Caching (Alpha Release)
Our custom caching system is designed to optimize response times. It stores responses to frequent requests and serves cached responses immediately when available. If a cached response isn't available, the request is forwarded to a healthy node in the API cluster and the response is cached for future use. The cache is automatically invalidated once a new block is committed to the chain. This system significantly increases the rate limit compared to standard endpoints, though we're still refining it to provide the best performance. In tests, which are not the same as production, we are seeing 98%+ of queries served from cache. I'm hopeful that production will still see at least 90% of queries served from cache, but we'll have to see how it goes.
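
For those curious about the mechanics, here is a minimal, purely illustrative sketch of the idea described above: responses are cached per block height and the cache is cleared when a new block is committed. The class and method names are mine, not the actual Secret Express implementation.

```python
class BlockCache:
    """Illustrative response cache that is cleared whenever a new block
    height is observed, mirroring the invalidation rule described above."""

    def __init__(self):
        self._height = None   # last block height seen
        self._store = {}      # query -> cached response
        self.hits = 0
        self.misses = 0

    def get_or_fetch(self, query: str, current_height: int, fetch):
        """Serve from cache when possible; otherwise forward `fetch` to a
        healthy upstream node and cache the result for this block."""
        if current_height != self._height:
            # A newly committed block invalidates everything cached so far.
            self._store.clear()
            self._height = current_height
        if query in self._store:
            self.hits += 1
            return self._store[query]
        self.misses += 1
        result = fetch(query)  # request forwarded to a healthy API node
        self._store[query] = result
        return result

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```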

⚠️ Please note that our cached endpoints are in alpha testing and are not yet production-ready. However, we welcome testing and feedback in the meantime.

Alpha endpoints:

Monitor the efficiency of our caching system with live statistics. These stats provide insights into the performance of our caching system.

Cache Statistics:

3. Real-time Status Reports
Stay updated with real-time status reports of our system at Secret Express Status. This feature provides transparency and allows you to monitor the health of our services.

4. Production Endpoints

We've made a minor adjustment to remove 2 regular nodes and replace them with the 2 LB nodes mentioned in the proposal, to ensure there is incentive for that portion of the infrastructure. We'll also be working to set up health checks between the LBs so that if one of the two LBs fails, there is still no impact on users.

This proposal is live on chain now.

https://www.mintscan.io/secret/proposals/259

Sorry for responding late to this.

General points:

  1. This API is better than anything we have had in the past and works great for accommodating many dApps through this bear market and for getting started with building dApps on Secret.
  2. The teams in the proposal all have their own way of providing nodes, which is healthy for redundancy, and they have a track record of deeply understanding Secret Network infra.
  3. The cost of running the nodes is not too high compared to many other high-uptime providers, but I would still love for it to be slightly cheaper.
  4. Archive access is a problem, with many nodes state-syncing to get back up, so the state isn't preserved. I would like to see archive nodes added again when possible, or even a separate API consisting of nodes that keep state for longer (maybe ~20 days), which could just be a subset of all the nodes available.
  5. Glad to see the team keep innovating with a cache system on top of the autoheal LBs.

You've got mine and L5's support.

Curious if the API team can do a trial to reduce the number of nodes behind the LB. Based on usage numbers from other providers in the network, I have the assumption that the Secret.Express cluster is currently up to twice as big as needed.

Would love to receive some numbers from a potential test like this to better understand whether or not we are paying for a service that's more than required.

We've actually done this test before (by accident, due to hardware compliance issues from upgrades). We have found, for example, that when the API is at half count and high usage comes through (even from a single script/user, such as arb bots or Stashh of the day, as two examples), the API needs to add very restrictive rate limits to prevent the whole cluster from going down and making things like Keplr unusable.

What do you expect the average requests/node/s (or per min) to be?

And

Can you share some data about the current rate-limiting settings (forgive me if this can be found somewhere) and which API users are currently whitelisted to perform non-rate-limited actions?

It depends on the load, due to the load balancing algorithm.

We adjust it from time to time, but I would say restrictive would be 20 rps, and my understanding is that no one is currently without a rate limit on express endpoints.
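
For context on what a figure like 20 rps means in practice, here is a generic sliding-window limiter sketch in Python. It is only an illustration of per-client request limiting, not the actual gateway configuration used by Secret Express.

```python
import time
from collections import deque


class RateLimiter:
    """Illustrative per-client limiter: at most `max_requests` within each
    rolling window. 20 requests/second is the 'restrictive' figure above."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._events = {}  # client id -> deque of recent request timestamps

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        q = self._events.setdefault(client_id, deque())
        # Drop timestamps that have fallen outside the rolling window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # the gateway would answer with HTTP 429 here
        q.append(now)
        return True
```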

Sorry, I should reword: what do you expect the average and max requests/node/s (or per minute) that the express endpoint can handle to be?

Edit:

Shouldn't Keplr get whitelisted access, at minimum?

Due to the recurrent challenges we've observed over the years, including issues with single scripts (like arb bots doing 2 txs per block...) disrupting nodes, finding a definitive answer to your question about network capacity is elusive.

If we define 'normal usage' as operations that don't negatively impact performance, then express can indeed accommodate much more traffic than it currently does. However, the problem arises when a minority of users submit problematic transactions or combinations of transactions, thereby straining all network nodes (in and out of express) and causing issues like app hashes.

To put it simply, there are two options for API capacity: it can either be robust enough to withstand these stress scenarios, or it can fall short, causing services like Keplr to become non-functional during such events. While it seems we have identified a capacity threshold that keeps Keplr operational, the long-term solution should focus on eliminating the network bottlenecks that enable certain actors, such as arbitrage bots, to create these issues in the first place.

Historically, the API team has taken the approach that it is better for us to have enough capacity that users do not end up saying the network/Keplr is unusable.

Maybe being more generous with the whitelist and harder on the non-whitelisted rate limit could result in a situation where we improve performance per dollar invested.

I'd still like to see secret.express try the above-mentioned test, see how the endpoint performs, and report back if possible.

As discussed in Telegram, the Secret.Express team will take some time to make a better assessment of some of the questions asked; I personally hope some data will be added there as well.

Delta also made me aware that the current transparency page only reflects 1 of the 2 load balancers, meaning the traffic is roughly 2x as much as I had thought. Although this slightly lightens my concerns, I still think optimizing for this much potential peak load will cause a lot of costs that are not needed in this market.

Let's give the team some time to come back with some data from the tests, and we can take it from there.

We have already performed the tests that you requested. They were not performed intentionally, but we know the answer nonetheless. The best option for making it viable to lower the count, without impacting availability of the public good during certain usage events that are known to occur, is improving the underlying network performance to ensure things like arb bots (as one example) don't cause so many app hashes and general degradation. Otherwise, you are suggesting we lower the count to a point where users would say the network was unusable when these events occurred.

So the team is not intending to do such tests again, monitor and report back with data?

No offense, but I have a hard time relying on just these relatively arbitrary potential results and arguments for why lowering isn't justified.

@SecretSaturn do you want to add to this? I'm not sure what to add other than what we've already discussed internally and found through our experience.

Note: our argument is not arbitrary and is based on facts.

Summary

The API team has outlined several critical issues relating to the scalability, performance, and reliability of the current network infrastructure. These concerns are organized into various categories:

Load Testing: Difficulty in simulating real-world traffic for load testing; it is not realistic to test outside of mainnet and expect to simulate real-world situations well.

Scalability: Challenges around scaling Cosmos nodes, particularly with SGX, as well as the inability to easily spin up and terminate nodes on demand.

SGX Issues: SGX is cited as a bottleneck for autoscaling and possibly as the root cause of performance issues when multiple SGX calls occur in parallel.

Node Count & Over-Provisioning: A conflict exists between having too many nodes for day-to-day operations and the need for extra nodes to handle 'thundering herd' events or "problematic cross contract calls" that ripple performance issues across the network. Over-provisioning is acknowledged as intentional in order to deliver on the promise of an always available public good API, but it is problematic specifically due to cost concerns in the bear market. In the past, when we did not over-provision, users would say the network was broken when certain usage patterns emerged, impacting UX.

User Experience & Affordability: The team aims to maintain a high-quality user experience and network stability. However, it is willing to compromise on its scope if it means resolving broader market and affordability issues.

Rate Limiting: A non-issue in the current setup.

Network Stability & Single Actor Vulnerability: The network's susceptibility to performance issues from individual actors, particularly through cross-contract calls / arb bots.

Testing Recommendations: A suggestion to analyze archive node information to better understand network bottlenecks and performance issues.

Analysis

Load Testing

The team identifies a clear need for improved load testing capabilities that can more accurately simulate real-world conditions, but also notes this is outside of the scope of the API team and would be uncompensated work. We are willing to advise and help where we can as time permits.

Scalability

The challenge of scaling Cosmos nodes, especially with SGX, shows an inherent limitation in the existing architecture. This challenge extends to the ability to dynamically manage nodes based on demand, leading to inefficiencies in resource utilization.

SGX Limitations

The discussions around SGX underline its role as a bottleneck in system performance and scalability. It is also suspected to be the underlying cause of performance issues, particularly when handling multiple calls in parallel, with the SGX driver pointed out as a probable culprit.

Over-Provisioning and User Experience

The team acknowledges the over-provisioning of nodes as a conscious decision to ensure network stability. However, it also recognizes that this strategy is not sustainable and that our scope needs to be adjusted from claiming to provide an always-available API to providing a public good that is available most of the time (SLA adjustments).

Affordability vs. Stability

While the API team is concerned with stability and user experience, it recognizes that affordability is the primary concern for the user base and the testing requests are not compatible with reducing our scope. A fine balance needs to be struck between these conflicting requirements.

Rate Limiting

Rate limiting appears to be a well-handled aspect, and the team does not view it as an issue.

Network Stability Concerns

The vulnerability of the network to individual actors performing specific actions is a glaring issue that needs to be addressed, perhaps at an architectural level.

Future Testing Recommendations for Secret Network (not specific to API team)

The team suggests a more thorough analysis of archive node information as a viable way to better understand network bottlenecks and potentially resolve them, by finding patterns in which contracts are executed when many network validators miss many blocks.

In summary, the comments from the API team and other experienced individuals show that, while there are concerns, there is and always has been a willingness to address them. The core issues seem to revolve around the limitations of the existing architecture, particularly concerning scalability and performance, and around affordability given the resources available in the community pool, and there is a general consensus that these need to be addressed for long-term sustainability. Further testing from us was ultimately requested by only 3 individuals; while this is out-of-scope, uncompensated work, members of the API team are happy to assist as time permits in an unofficial capacity.

Expect an adjusted proposal from the API team when the current one expires.

Special thanks to Gmail, Luigi, and Assaf for providing constructive input.

I will post here a response I posted in Telegram as well.

Again, I am looking for more clarity on usage and stability, which can lead to us assessing whether or not this API is over-provisioned by too much, even for serving peak loads.

RE the above Analysis:
I don't think it tells us anything, though; the last time, afaik, that there were significant problems with over-usage of contract calls and nodes being hammered was pre-6m gas, or at 6m gas before the gas re-metering done by Labs.

Although nodes still apphash at certain points, I am not convinced the network is as unstable anymore as it is being made out to be here. If express thinks it is, then at least the liveness logs don't seem to agree with that.

I've never asked to conduct network load tests, only to do API-specific tests like adjusting rate limits and removing nodes. I don't consider that out of scope, and the fact that there is not a single mention of a number in this whole reaction post is telling, tbh. It seems to be based mainly on outdated opinions, and I don't think that's something to make current decisions on.

We don't understand why Erteman is asking us to test running the API at half capacity when we have already acknowledged that we over-provision to ensure smooth operation of the API always, instead of just sometimes. Additionally, this test is useless without also seeing what the results would be during other stresses on the network that would impact the API, so we can see how it handles edge cases.

Examples

  1. Many times during upgrades, various providers have to wait for patches, and the total API capacity is reduced. The buffer gives us breathing room in these events.
  2. In cases where problematic cross-contract calls cause wider network performance issues such as apphashes, this can impact the reliability of APIs on the network and reduce capacity significantly. The buffer also helps ensure that we can provide a stable API during these occurrences.

We are willing to adjust our scope in future proposals to account for the valid concern about affordability that some brought up. People reading should understand that we already know what the outcome of such a test would be, as we've had to run the API at reduced capacity many times for various reasons.

1 is fair, but you keep coming back to 2, something that, as I have stated many times now, hasn't happened for a long time afaik, as network stability has increased. When was the last time Secret.Express had a significant number of node failures due to network strain? The logs go back a little over a week and show close to no out-of-sync notices.
