Over the last 4 weeks, I've been rigorously testing various scenarios in an attempt to improve NEAR chunk performance for validator nodes. Here are the tests I performed:

# Testing on NEAR Validator Pools:

- `stardust.poolv1.near` - block and chunk producer, ~2M stake
- `stardust.pool.near` - chunk producer, ~200K stake
- `galactic.poolv1.near` - block and chunk producer, ~900K stake

## 1st Set of Tests: Hardware, CPU Type, RAID, etc.

### 1a. Scaling Hardware Up Slightly:

- **Original Server:** Intel Xeon-E 2386G - 6c/12t, 32GB RAM, 1TB NVMe drive
- **New Server:** AMD Ryzen 3900, 64GB RAM, 2 x 0.5TB NVMe in RAID-0

**Result:** No performance improvement noticed

### 1b. Scaling Hardware Significantly:

- **Original Server:** Intel Xeon-E 2386G - 6c/12t, 32GB RAM, 1TB NVMe drive
- **New Server:** AMD EPYC 64-core, 512GB RAM, 2 x 3.84TB NVMe in RAID-0

**Result:** No performance improvement noticed

## 2. Operating System Comparison: Ubuntu 20.04 vs Ubuntu 22.04 vs NIX

- Some validators reported better uptime on Ubuntu 20.04; in my case, it didn't show any improvement.

## 3. CPU Comparison: Intel vs AMD

- Slightly better performance was noticed on AMD processors.

## 4. RAID Level Comparison

- RAID-0 across 2 disks performed better than a single disk with no RAID.

**Note:** Most tests were run for at least 5 epochs (about 2.5 days), with some running a week or more.

# Conclusion

Oversized hardware, the version of the OS, CPU type, and RAID level do not significantly change chunk performance. A setup slightly better than the minimum requirements should be sufficient to run a NEAR Validator node comfortably. I would recommend:

- AMD Ryzen 5900 or EPYC processors with 12+ threads, 32GB RAM, and 2 x 1TB NVMe drives (to allow for growth) as a minimum spec.

# 2nd Set of Tests: Network, Node Location, Peering

## Tests Performed

### 1. Node Location with Latency Measured:

Results didn't show a significant performance difference between a node in the US and a node in the EU.
According to the attached map/image, even nodes in Japan perform well, so the higher latency (100-200ms) between the validator node and most other nodes didn't cause any problems.

### 2. Peering

This is where changes were visible straight away, with a clear improvement. Here's what I changed:

1. **Initial Observations**:
   - On average, about 20-25% of the nodes on a validator's peer list are validator nodes and 75-80% are RPC nodes.
   - Considering there are only 35 peers in the default config, that leaves only 8-10 validator nodes as peers.
   - The goal was to increase the number of validator nodes in the peer list.

2. **Possible Strategies**:
   1. **Blacklist RPC nodes**: Not ideal. They need to connect to the network too, as they are also an important part of the chain.
   2. **Use Boot Nodes**: Add all or most validator nodes to the boot nodes to raise the number of validator peers to 15-20, or 50% of all peers. This is also not ideal, because validators may switch servers, so the boot nodes list would need constant updating.
   3. **Increase the Number of Peers**: Chosen for its simplicity and lower maintenance requirement. I also wanted to activate Tier1 network outgoing peers to achieve the same goal. Tier1 is a network containing only validator nodes; RPC nodes are not included.

3. **Configuration Changes**:

   ```json
   "network": {
     "addr": "0.0.0.0:24567",
     "boot_nodes": "",
     "whitelist_nodes": "",
     "max_num_peers": 300,
     "minimum_outbound_peers": 5,
     "ideal_connections_lo": 100,
     "ideal_connections_hi": 190,
     "peer_recent_time_window": {
       "secs": 600,
       "nanos": 0
     }
   },
   "experimental": {
     "inbound_disabled": false,
     "connect_only_to_boot_nodes": false,
     "skip_sending_tombstones_seconds": 0,
     "tier1_enable_inbound": true,
     "tier1_enable_outbound": true,
     "tier1_connect_interval": {
       "secs": 60,
       "nanos": 0
     }
   }
   ```

   Changes:
   - `"max_num_peers": 300`
   - `"ideal_connections_lo": 100`
   - `"ideal_connections_hi": 190`
   - `"tier1_enable_outbound": true`

4. **Results**:
   - **Immediate improvement observed**: before the change, our Stardust v1 node missed 4-10 chunks per epoch; after the change, it misses 0 to a maximum of 1-2.
   - To test further, I asked one of the top validators, Zavodil (@vadim.near), to apply the same changes so we could see the result on a node with a high number of chunk assignments. The results were the same: the node now misses no more than 2-3 chunks per epoch, which works out to roughly 99.98% chunk uptime and 100% block uptime.

## Further Observations

- If a significant number of the nodes in your peer list restart, it's highly likely your node will miss chunks while they do, especially during network upgrades when many nodes restart at once.
- A 0% uptime node among your peers can affect chunk performance. If one of the validator nodes is constantly running at 0% uptime, it's possible to blacklist it:

  ```json
  "blacklist": ["23.88.65.34"]
  ```

- Even after raising `max_num_peers` to 300, which results in about 100 peers on the node, more than 50% of them are still RPC nodes. That leaves about 20-30 validator peers, which seems enough to achieve good chunk performance.
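
For anyone who prefers to script the four changes listed under Configuration Changes rather than edit the file by hand, here is a minimal Python sketch. It is not a NEAR tool: `apply_peering_changes`, `NETWORK_CHANGES`, and `EXPERIMENTAL_CHANGES` are hypothetical names, and the sample values in the demo are illustrative only. It also assumes the `experimental` block may sit either at the top level (as in the fragment above) or inside `network`, so check your own `config.json` before relying on it.

```python
import copy

# The four values changed in this post; every other key is left untouched.
NETWORK_CHANGES = {
    "max_num_peers": 300,
    "ideal_connections_lo": 100,
    "ideal_connections_hi": 190,
}
EXPERIMENTAL_CHANGES = {"tier1_enable_outbound": True}


def apply_peering_changes(config):
    """Return a copy of a parsed config dict with the peering changes applied."""
    updated = copy.deepcopy(config)  # never mutate the caller's dict
    network = updated.setdefault("network", {})
    network.update(NETWORK_CHANGES)
    # Assumption: "experimental" is either nested in "network" or at the top
    # level; update whichever already exists, else create it at the top level.
    if "experimental" in network:
        experimental = network["experimental"]
    else:
        experimental = updated.setdefault("experimental", {})
    experimental.update(EXPERIMENTAL_CHANGES)
    return updated


if __name__ == "__main__":
    # Illustrative sample, not a complete node config.
    sample = {
        "network": {"max_num_peers": 35, "minimum_outbound_peers": 5},
        "experimental": {"tier1_enable_outbound": False},
    }
    print(apply_peering_changes(sample))
```

In practice you would load your node's config with `json.load`, pass it through the function, write the result back with `json.dump`, and restart the node. Keeping a backup of the original file makes it easy to roll back if the larger peer count doesn't help.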