I’m starting this thread as a main thread for infrastructure needs of the network.
We’ve run into issues time and time again during peak traffic from dapp launches and it seems to me that we are still not planning aggressively enough as a network on infrastructure.
We were recently funded with a $64,600 hardware budget (which has increased in value and will all be spent on infra) to add 40 secretnodes to the community API. Speaking with suppliers, we are quoted 4-6 week wait times, with no assurances there won’t be further delays (due to a silicon supply shortage). Because of this, anything we purchase needs to be planned further in advance than with the previous generation of hardware.
It seems to me this won’t be enough to handle much more than what our current network’s top APIs can handle. Considering we still have several product launches upcoming and are also onboarding a lot more users in general (which is generating more network activity and load on APIs), I’d like to discuss expanding this budget to a more ambitious number so dapps could handle far larger peak traffic. The number I think we should target is 300 nodes, but we could do it in batches larger than 40 (perhaps batches of 100 or 150). Please share your feedback / thoughts on this topic and ask any questions.
I’d like to see other people get involved in running APIs, like Dan from CoS. I’d also like to see the full SCRT budget that was issued from the first spend used toward its original purpose.
2 things prevent Dan from getting involved with secret infrastructure proposals. 1) Dan keeps saying no to collaborating. And 2) Would be better discussed on calls. But I have concerns related to past experiences. We each seem to have our conditions for working together and we have not found a mutually agreeable middle ground. That will not change unless he agrees to discuss and we mutually agree on something.
Furthermore, the more decentralized an API is, the less performant it is. So personally I’m not interested in being a part of a very distributed API. Scaling here makes more sense with Dan scaling out his own API, Figment growing theirs, and the community API also growing. Each needs to be robust, because if one goes down, we need backups that can handle peak traffic.
Then please comment here so we can all decide how much flexibility there is. I clearly shared my preference, and will state again, I will abide by any standards that the community sets in the charter.
I like that you’re looking forward. In my 4 years of experience here one thing that has been frustrating is that time and time again we only grow once we’ve reached a barrier, often times one that could have been foreseen and avoided. If we don’t fully expect a massive surge in usage then what are we doing here? The motto has always been to bring dApps that support millions of users. If we can’t handle the traffic millions of users will bring, then how will we achieve our vision?
I’m sure this would be very expensive… but not as expensive as having our infrastructure crack every time a new dApp comes out! I’m tired of turning users away. Let’s be precautionary instead of reactionary, and try our best to make that a habit going forward in all facets of the ecosystem.
OK, so here’s how I’d like to proceed with this.
Tomorrow I will check in with my supplier on wait times for an upgraded / increased unit order. Then I will come back and post a budget for a 300 node secret cluster. Then discuss from there.
Okay, here’s my 2 cents.
First of all, it’s important to realize that performance issues are normal. Look at any half-decent successful launch and you’ll see performance issues. The reason is that it doesn’t make economic sense to buy 10 or 100x the capacity if you’re only going to use it for a couple of hours. Also, sometimes we just aren’t that good at estimating the number of users we will see, or how our applications will behave at such scales.
The best example of this is Secretswap. It was wonky on launch because we didn’t optimize it enough, and we didn’t predict performance impact of specific features at scale. Once we got those figured out it was mostly smooth sailing, even though we didn’t do too much work on the infrastructure.
Now, on to the SN community infrastructure itself. Firstly, I’m not a huge fan of buying more servers. I think we’re at a point where for a fraction of the upfront cost we can create a scalable cloud cluster (e.g. k8s) which is open-sourced and public so it can be deployed by anyone (including as a public or community cluster), and will be more robust and require less maintenance than what we can do with bare metal.
That aside, I think right now (especially with supernova around the corner) we should wait and focus on benchmarking & improving the robustness of what we currently have, rather than committing to purchasing and maintaining more hardware.
For example, do we have numbers on how many requests/second the currently deployed hardware can handle? What do we expect the new hardware (the 40 nodes) to be able to support? Are we using caching, rate limiting, and implementing DR best practices? Is recovering nodes from crashes automated? Is there automatic failure detection for out-of-sync nodes? Are requests load balanced between the nodes? These are all questions that I would tackle before scaling to even more hardware, since then those issues become even more complex. There is more to the issue than simply adding raw horsepower.
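To make two of those questions concrete (out-of-sync detection and load balancing), here is a minimal Python sketch. It is not how the community API works today; it only assumes each node can report its latest block height and whether it is still catching up. The names `NodeStatus`, `healthy_nodes`, `round_robin`, the node labels, and the `max_lag` threshold of 5 blocks are all made up for illustration.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class NodeStatus:
    name: str           # hypothetical node label
    block_height: int   # latest block height the node reports
    catching_up: bool   # True while the node is still syncing


def healthy_nodes(statuses: List[NodeStatus], max_lag: int = 5) -> List[NodeStatus]:
    """Keep nodes that finished syncing and are within max_lag blocks of the tallest node."""
    if not statuses:
        return []
    # Use the highest reported height as the reference chain tip.
    tip = max(s.block_height for s in statuses)
    return [
        s for s in statuses
        if not s.catching_up and tip - s.block_height <= max_lag
    ]


def round_robin(nodes: List[NodeStatus]) -> Iterator[NodeStatus]:
    """Cycle through the given nodes so requests are spread evenly."""
    return cycle(nodes)


# Illustrative data only; these are not real measurements.
statuses = [
    NodeStatus("node-a", 1_000_000, False),
    NodeStatus("node-b", 999_998, False),
    NodeStatus("node-c", 999_100, False),   # far behind the tip
    NodeStatus("node-d", 1_000_000, True),  # still syncing
]

pool = healthy_nodes(statuses)
picker = round_robin(pool)
print([s.name for s in pool])                 # ['node-a', 'node-b']
print([next(picker).name for _ in range(3)])  # ['node-a', 'node-b', 'node-a']
```

A real setup would poll the nodes on a timer and rebuild the pool each time, so a node that falls behind drops out of rotation without anyone stepping in. The same check becomes more important, not less, as the node count grows.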
+1 to autoscaling k8s
+1 to assessing what the current 40 nodes can do, + you have the other part of the budget for k8s forever <3
If people want to scale in the cloud I respect that. But we will not be using our budget for cloud machines.
Timing has been an open question of mine (if people want to wait or not). I’ll do whatever people want us to do timing wise, so if now is not the right time or if you think we shouldn’t get more physical hardware then those are things I want to hear.
Thanks for sharing your feedback @Cashmaney
I have answers to these questions, and I’ll share more information after we deploy the hardware.
From this juncture I am opting to stick to the original plan, which is to deliver exactly 40 secret nodes on 3rd gen scalable hardware (no more, no less) and relay infrastructure. Due to the silicon shortage, we expect to be able to deliver this sometime in Q1 2022, but we will let people know if we are able to before then. Thanks to everyone who chimed in 🙂
After further discussions with Cashmaney, he indicated he doesn’t care if we do cloud or bare-metal physical hardware and that there are pros and cons to each. I’m not claiming this as an endorsement from him, but we are looking to proceed with our plans after this input.
I’m assuming that the current state of this has moved over to the “spartan proposal” thread.
I would like to suggest that if the community moves forward, new nodes should be launched in phases. Proof of the first 20 nodes being successfully launched should be given to the community before new nodes are approved.
A phased solution with proof of successful launch and uptime for existing nodes is ideal.