Skip to content

Sudden loss of connectvity with peers and missing blocks (mainnet) #4899

@b0dski

Description

@b0dski

My mainnet validator node (v101.1.2) has experienced bouts of peer connection failures this evening. The logs show lots of connection failures/errors coming from CometBFT p2p module. The node was unable to keep up, missing a high proportion of blocks for some time. Checking the system load showed the namadan node process sustaining 100% use of a single core.

The host system internet is via VPN and cannot receive incoming connection requests. During the incident the host system otherwise retained normal internet connectivity via the VPN.

Checking an explorer for mainnet showed a few other validators missing a high number of blocks at the time.

Restarting the node service and the host's network interface did not appear to help. There are several service restarts in the attached logs, please note the systemd[1]: namadad.service... messages to identify the different service instances of namadad.service.


Full log:
bodski-mainnet-2025-10-05T20_20_00Z.log

Created with: journalctl -u namadad.service --since 2025-10-05T20:20Z --utc --no-hostname -o short-iso-precise

CometBFT logging config: CMT_LOG_LEVEL=*:info (includes p2p and pex entries).


From the logs everything appears normal up execution of block height=3802320:

2025-10-05T20:23:24.302006+00:00 namadan[1421207]: I[2025-10-05|21:23:24.301] Timed out                                    module=consensus dur=3s height=3802320 round=0 step=RoundStepPropose
2025-10-05T20:23:24.330446+00:00 namadan[1421207]: I[2025-10-05|21:23:24.330] executed block                               module=state height=3802320 num_valid_txs=0 num_invalid_txs=0

The next log entries relating to block height=3802320 are delayed somewhat:

2025-10-05T20:23:42.390481+00:00 namadan[1421207]: I[2025-10-05|21:23:42.390] Ensure peers                                 module=pex numOutPeers=16 numInPeers=0 numDialing=0 numToDial=-6
2025-10-05T20:23:54.713226+00:00 namadan[1421163]: 2025-10-05T20:23:54.713150Z  INFO namada_node::shell: Committed block hash: 142562317ab701d0e5c426c0776e814e57f2f5a3db35f3e5c024f4f41ed28e1b, height: 3802320
2025-10-05T20:23:54.753721+00:00 namadan[1421207]: I[2025-10-05|21:23:54.753] committed state                              module=state height=3802320 num_txs=0 app_hash=142562317AB701D0E5C426C0776E814E57F2F5A3DB35F3E5C024F4F41ED28E1B
2025-10-05T20:23:54.882773+00:00 namadan[1421207]: I[2025-10-05|21:23:54.882] indexed block events                         module=txindex height=3802320

Following that for height=3802321 it complains about a commit for an unknown block:

2025-10-05T20:23:54.884157+00:00 namadan[1421207]: I[2025-10-05|21:23:54.883] Timed out                                    module=consensus dur=-26.700205485s height=3802321 round=0 step=RoundStepNewHeight
2025-10-05T20:23:55.227169+00:00 namadan[1421207]: I[2025-10-05|21:23:55.227] commit is for a block we do not know about; set ProposalBlock=nil module=consensus height=3802321 commit_round=0 proposal= commit=48536E401179FFC967A0E9C3FD52E682EA20D3055B426ACF4BF7E65B26180CC6
2025-10-05T20:23:55.459375+00:00 namadan[1421207]: I[2025-10-05|21:23:55.459] received complete proposal block             module=consensus height=3802321 hash=48536E401179FFC967A0E9C3FD52E682EA20D3055B426ACF4BF7E65B26180CC6
2025-10-05T20:23:55.474917+00:00 namadan[1421207]: I[2025-10-05|21:23:55.474] finalizing commit of block                   module=consensus height=3802321 hash=48536E401179FFC967A0E9C3FD52E682EA20D3055B426ACF4BF7E65B26180CC6 root=142562317AB701D0E5C426C0776E814E57F2F5A3DB35F3E5C024F4F41ED28E1B num_txs=0
2025-10-05T20:23:55.507403+00:00 namadan[1421163]: 2025-10-05T20:23:55.507279Z  INFO namada_node::shell::process_proposal: Received block proposal proposer="64F0B257817C936B5053EECA8125C8A18E979643" height=3802321 hash="48536E401179FFC967A0E9C3FD52E682EA20D3055B426ACF4BF7E65B26180CC6" n_txs=0
2025-10-05T20:23:55.508505+00:00 namadan[1421163]: 2025-10-05T20:23:55.508451Z  INFO namada_node::shell::finalize_block: Block height: 3802321, epoch: 1223, is new epoch: false, is masp new epoch: false.
2025-10-05T20:23:57.464793+00:00 namadan[1421163]: 2025-10-05T20:23:57.464718Z  INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 2 - 2, tx c>
2025-10-05T20:23:57.464793+00:00 namadan[1421163]: 2025-10-05T20:23:57.464747Z  INFO namada_node::shell::finalize_block: txs executed: 0
2025-10-05T20:23:57.505512+00:00 namadan[1421207]: I[2025-10-05|21:23:57.505] executed block                               module=state height=3802321 num_valid_txs=0 num_invalid_txs=0

And then connection failures appear:

2025-10-05T20:23:57.886845+00:00 namadan[1421207]: I[2025-10-05|21:23:57.886] Timed out                                    module=consensus dur=3s height=3802321 round=0 step=RoundStepPropose
2025-10-05T20:24:12.390030+00:00 namadan[1421207]: I[2025-10-05|21:24:12.389] Ensure peers                                 module=pex numOutPeers=16 numInPeers=0 numDialing=0 numToDial=-6
2025-10-05T20:24:42.389660+00:00 namadan[1421207]: I[2025-10-05|21:24:42.389] Saving AddrBook to file                      module=p2p book=/home/bod/.local/share/namada/namada.5f5de2dd1b88cba30586420/cometbft/config/addrbook.json size=1002
2025-10-05T20:24:42.390736+00:00 namadan[1421207]: I[2025-10-05|21:24:42.390] Ensure peers                                 module=pex numOutPeers=16 numInPeers=0 numDialing=0 numToDial=-6
2025-10-05T20:24:46.607117+00:00 namadan[1421207]: E[2025-10-05|21:24:46.607] Connection failed @ sendRoutine              module=p2p [email protected]:26655 conn=MConn{148.113.9.92:26655} err="pong timeout"
2025-10-05T20:24:46.607117+00:00 namadan[1421207]: I[2025-10-05|21:24:46.607] service stop                                 module=p2p [email protected]:26655 msg="Stopping MConnection service" impl=MConn{148.113.9.92:26655}
2025-10-05T20:24:46.607651+00:00 namadan[1421207]: E[2025-10-05|21:24:46.607] Stopping peer for error                      module=p2p peer="Peer{MConn{148.113.9.92:26655} c8f1e8090686493e947c84e26ee124c6b1d855b0 out}" err="pong timeout"
2025-10-05T20:24:46.607651+00:00 namadan[1421207]: I[2025-10-05|21:24:46.607] service stop                                 module=p2p [email protected]:26655 msg="Stopping Peer service" impl="Peer{MConn{148.113.9.92:26655} c8f1e8090686493e947c84e26ee124c6b1d855b0 out}"
2025-10-05T20:24:46.607651+00:00 namadan[1421207]: I[2025-10-05|21:24:46.607] Reconnecting to peer                         module=p2p [email protected]:26655

...etc

Something weird I noticed when restarting namadad.service, connection attempts to invalid peer addresses:

2025-10-05T20:39:47.894452+00:00 namadan[1458379]: E[2025-10-05|21:39:47.894] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.894674+00:00 namadan[1458379]: E[2025-10-05|21:39:47.894] Can't add peer's address to addrbook         module=p2p err="Cannot add invalid address [email protected]:26670: invalid IP"
2025-10-05T20:39:47.894810+00:00 namadan[1458379]: E[2025-10-05|21:39:47.894] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.894959+00:00 namadan[1458379]: E[2025-10-05|21:39:47.894] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.895052+00:00 namadan[1458379]: E[2025-10-05|21:39:47.894] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.895124+00:00 namadan[1458379]: E[2025-10-05|21:39:47.895] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.895196+00:00 namadan[1458379]: E[2025-10-05|21:39:47.895] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.895196+00:00 namadan[1458379]: E[2025-10-05|21:39:47.895] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"
2025-10-05T20:39:47.895347+00:00 namadan[1458379]: E[2025-10-05|21:39:47.895] Can't add peer's address to addrbook         module=p2p err="Cannot add non-routable address [email protected]:26656"

And also entries like this where we're rejecting inbound peer connections from localhost (???):

2025-10-05T20:39:48.723910+00:00 namadan[1458379]: I[2025-10-05|21:39:48.723] Inbound Peer rejected                        module=p2p err="filtered CONN<127.0.0.1:49830>: duplicate CONN<127.0.0.1:49830>" numPeers=3


Eventually, the node recovered by being left to just run. It's hard to pinpoint where recovery began, being a gradual process and p2p/pex errors seem to be normal at node startup when comet p2p logging is enabled.

There is panic in the logs that occurs in tower_abci::v037::server::Server::listen_tcp during node shutdown. I've seen this before and suspect it's an unrelated bug triggered by unlucky timing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions