remove-brick randomly fails due to unanswered request - "bailing out frame" #4620

@flesniak

I'm running a GlusterFS distribute volume on two peers with 10 bricks. I want to remove one of these bricks, but the remove-brick operation repeatedly fails at a random point, anywhere from 30 minutes to 9 hours after it starts.

Status after failure:

$ gluster volume remove-brick tank node1:/mnt/md5/data status
     Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
    node1            11276         2.7GB        136050             8             0               failed        3:31:10

tank-rebalance.log on the peer running the remove-brick operation:

[2025-09-22 01:00:06.894942 +0000] E [rpc-clnt.c:167:call_bail] 0-tank-client-28: bailing out frame type(GlusterFS 4.x v1), op(ENTRYLK(31)), xid = 0x82f9d, unique = 1528357, sent = 2025-09-22 00:30:01 +0000, timeout = 1800 for 192.168.42.2:51408
[2025-09-22 01:00:06.895106 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1418:client4_0_entrylk_cbk] 0-tank-client-28: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}] 
[2025-09-22 01:00:06.924366 +0000] E [MSGID: 109023] [dht-rebalance.c:2872:gf_defrag_migrate_single_file] 0-tank-dht: migrate-data failed for /backups/cam/20210818/100828.jpg [Transport endpoint is not connected]
[2025-09-22 01:00:36.895197 +0000] E [rpc-clnt.c:167:call_bail] 0-tank-client-28: bailing out frame type(GlusterFS 4.x v1), op(ENTRYLK(31)), xid = 0x8330e, unique = 1533932, sent = 2025-09-22 00:30:32 +0000, timeout = 1800 for 192.168.42.2:51408
[2025-09-22 01:00:36.906739 +0000] E [MSGID: 109023] [dht-rebalance.c:2872:gf_defrag_migrate_single_file] 0-tank-dht: migrate-data failed for /backups/cam/20210818/163125.jpg [Transport endpoint is not connected]
[2025-09-22 01:00:36.895247 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1418:client4_0_entrylk_cbk] 0-tank-client-28: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}] 

From the "bailing out frame" message, I assume a brick did not answer an operation within 1800 seconds. I captured the network traffic between the two nodes with Wireshark, and the original request is visible in the capture. No reply is recorded, however, so the rebalance process aborts 1800 seconds after the request was sent. Note that Wireshark appends the reply packet ID to the description of each call, but no reply is found for the unanswered call. The network connection seems fine: there are no dropped packets, and the request shows up in packet captures on both peers.

[Image: Wireshark capture of the traffic between the two nodes]
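
The timing argument above can be checked directly from the rebalance log. The following is a minimal sketch, assuming the `call_bail` line layout seen in the two log lines quoted earlier (the regular expression is derived from those lines only, not from the GlusterFS source). The helper name `parse_bail` is hypothetical. It extracts the xid, computes `sent + timeout`, and reports how long after that deadline the message was logged.

```python
import re
from datetime import datetime, timedelta

# Pattern for the call_bail lines quoted above; the field layout is an
# assumption based on those two lines only.
BAIL_RE = re.compile(
    r"^\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ [+-]\d{4}\] E "
    r"\[rpc-clnt\.c:\d+:call_bail\] (?P<client>\S+): bailing out frame .*?"
    r"op\((?P<op>\w+)\(\d+\)\), xid = (?P<xid>0x[0-9a-f]+), .*?"
    r"sent = (?P<sent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}, "
    r"timeout = (?P<timeout>\d+) for (?P<peer>\S+)$"
)

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_bail(line):
    """Return the fields of one call_bail line, or None if it does not match.

    Both timestamps are assumed to carry the same UTC offset, so the
    offset itself is ignored when computing the difference.
    """
    m = BAIL_RE.match(line.strip())
    if m is None:
        return None
    sent = datetime.strptime(m["sent"], TS_FORMAT)
    logged = datetime.strptime(m["logged"], TS_FORMAT)
    deadline = sent + timedelta(seconds=int(m["timeout"]))
    return {
        "client": m["client"],
        "op": m["op"],
        "xid": m["xid"],
        "peer": m["peer"],
        "sent": sent,
        "deadline": deadline,
        # Whole seconds between the computed deadline and the log entry.
        "lag_seconds": int((logged - deadline).total_seconds()),
    }
```

For the first bailed frame (xid 0x82f9d), this gives a deadline of 01:00:01, a few seconds before the message was logged at 01:00:06, which is consistent with the frame simply timing out. The xid can then be used to look for the matching request in the packet capture.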

I ran another remove-brick operation with diagnostics.brick-log-level set to DEBUG, but no suspicious message appears around the time the operation is lost. I'm not entirely sure the log snippet below belongs to the right ENTRYLK call, since about 5 calls are received within the second given by the rebalance log timestamp, but the most interesting one is where 3 calls are scheduled at once:

[2025-09-22 00:30:01.601104 +0000] D [MSGID: 0] [server-rpc-fops_v2.c:2638:server4_inodelk_resume] 0-/mnt/md0/data: frame 0x60947771b538, xlator 0x609476b77618 
[2025-09-22 00:30:01.601104 +0000] D [MSGID: 0] [io-threads.c:365:iot_schedule] 0-tank-io-threads: ENTRYLK scheduled as least priority fop 
[2025-09-22 00:30:01.601194 +0000] D [MSGID: 0] [io-threads.c:365:iot_schedule] 0-tank-io-threads: SETATTR scheduled as normal priority fop 
[2025-09-22 00:30:01.601199 +0000] D [MSGID: 0] [io-threads.c:365:iot_schedule] 0-tank-io-threads: INODELK scheduled as least priority fop 
[2025-09-22 00:30:01.601272 +0000] D [MSGID: 0] [posix-metadata.c:127:posix_fetch_mdata_xattr] 0-tank-posix: No such attribute:trusted.glusterfs.mdata for file /mnt/md0/data/.glusterfs/13/c7/13c7ec74-5fa9-4ac4-bf97-948410ff7225 gfid: 13c7ec74-5fa9-4ac4-bf97-948410ff7225
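
Because several calls arrive within the same second, it can help to group the `iot_schedule` debug lines by the priority they were scheduled with. This is a rough sketch, assuming the message wording shown in the snippet above; `fops_by_priority` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the io-threads debug message as it appears in the brick log above.
SCHEDULE_RE = re.compile(
    r"\[io-threads\.c:\d+:iot_schedule\] \S+: "
    r"(?P<fop>\w+) scheduled as (?P<prio>\w+) priority fop"
)


def fops_by_priority(lines):
    """Group scheduled fops by priority, keeping them in log order."""
    grouped = defaultdict(list)
    for line in lines:
        m = SCHEDULE_RE.search(line)
        if m:
            grouped[m["prio"]].append(m["fop"])
    return dict(grouped)
```

Run over the five lines above, this groups ENTRYLK and INODELK under "least" and SETATTR under "normal". Lines that are not `iot_schedule` messages are skipped.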

Some observations I made across six attempts:

  • it happens with any brick on the remote node, but not with any brick local to the node running the remove-brick operation
  • the unanswered call is always an ENTRYLK operation
  • once, multiple ENTRYLK operations issued within a 3-second window remained unanswered, and each was reported with a "bailing out frame" message

Any suggestions as to what might be the cause, or how to debug this further?

Full volume settings
Option                                   Value
------                                   -----
auth.allow                               *
auth.reject                              (null) (DEFAULT)
auth.ssl-allow                           *
changelog.capture-del-path               off (DEFAULT)
changelog.changelog-barrier-timeout      120
changelog.changelog-dir                  {{ brick.path }}/.glusterfs/changelogs (DEFAULT)
changelog.changelog                      off (DEFAULT)
changelog.encoding                       ascii (DEFAULT)
changelog.fsync-interval                 5 (DEFAULT)
changelog.rollover-time                  15 (DEFAULT)
client.bind-insecure                     (null) (DEFAULT)
client.event-threads                     8
client.keepalive-count                   9
client.keepalive-interval                2
client.keepalive-time                    20
client.send-gids                         on (DEFAULT)
client.ssl                               off
client.strict-locks                      off
client.tcp-user-timeout                  0
cluster.background-self-heal-count       8 (DEFAULT)
cluster.brick-graceful-cleanup           disable
cluster.brick-multiplex                  disable
cluster.choose-local                     true (DEFAULT)
cluster.consistent-metadata              no (DEFAULT)
cluster.daemon-log-level                 INFO
cluster.data-change-log                  on (DEFAULT)
cluster.data-self-heal-algorithm         (null) (DEFAULT)
cluster.data-self-heal                   off (DEFAULT)
cluster.dht-xattr-name                   trusted.glusterfs.dht (DEFAULT)
cluster.disperse-self-heal-daemon        enable (DEFAULT)
cluster.eager-lock                       on (DEFAULT)
cluster.enable-shared-storage            disable
cluster.ensure-durability                on (DEFAULT)
cluster.entry-change-log                 on (DEFAULT)
cluster.entry-self-heal                  off (DEFAULT)
cluster.extra-hash-regex                 (null) (DEFAULT)
cluster.favorite-child-policy            none (DEFAULT)
cluster.force-migration                  off
cluster.full-lock                        yes (DEFAULT)
cluster.granular-entry-heal              no (DEFAULT)
cluster.halo-enabled                     False (DEFAULT)
cluster.halo-max-latency                 5 (DEFAULT)
cluster.halo-max-replicas                99999 (DEFAULT)
cluster.halo-min-replicas                2 (DEFAULT)
cluster.halo-nfsd-max-latency            5 (DEFAULT)
cluster.halo-shd-max-latency             99999 (DEFAULT)
cluster.heal-timeout                     600 (DEFAULT)
cluster.heal-timeout                     600 (DEFAULT)
cluster.heal-wait-queue-length           128 (DEFAULT)
cluster.local-volume-name                (null) (DEFAULT)
cluster.locking-scheme                   full (DEFAULT)
cluster.lock-migration                   off
cluster.lookup-optimize                  on (DEFAULT)
cluster.lookup-unhashed                  on (DEFAULT)
cluster.max-bricks-per-process           250
cluster.metadata-change-log              on (DEFAULT)
cluster.metadata-self-heal               off (DEFAULT)
cluster.min-free-disk                    1%
cluster.min-free-inodes                  5% (DEFAULT)
cluster.optimistic-change-log            on (DEFAULT)
cluster.post-op-delay-secs               1 (DEFAULT)
cluster.quorum-count                     (null) (DEFAULT)
cluster.quorum-reads                     no (DEFAULT)
cluster.quorum-type                      none (DEFAULT)
cluster.randomize-hash-range-by-gfid     off (DEFAULT)
cluster.readdir-optimize                 on
cluster.read-hash-mode                   1 (DEFAULT)
cluster.read-subvolume-index             -1 (DEFAULT)
cluster.read-subvolume                   (null) (DEFAULT)
cluster.rebalance-stats                  off (DEFAULT)
cluster.rebal-throttle                   normal
cluster.rmdir-optimize                   on (DEFAULT)
cluster.rsync-hash-regex                 (null) (DEFAULT)
cluster.self-heal-daemon                 on (DEFAULT)
cluster.self-heal-readdir-size           1KB (DEFAULT)
cluster.self-heal-window-size            8 (DEFAULT)
cluster.server-quorum-ratio              51
cluster.server-quorum-type               off
cluster.shd-max-threads                  1 (DEFAULT)
cluster.shd-wait-qlength                 1024 (DEFAULT)
cluster.subvols-per-directory            (null) (DEFAULT)
cluster.switch-pattern                   (null) (DEFAULT)
cluster.use-anonymous-inode              yes
cluster.use-compound-fops                off
cluster.weighted-rebalance               on (DEFAULT)
config.brick-threads                     16
config.client-threads                    16
config.gfproxyd                          off
config.global-threading                  off
ctime.noatime                            on
debug.delay-gen                          off
debug.error-failure                      (null) (DEFAULT)
debug.error-fops                         (null) (DEFAULT)
debug.error-gen                          off
debug.error-number                       (null) (DEFAULT)
debug.exclude-ops                        (null) (DEFAULT)
debug.include-ops                        (null) (DEFAULT)
debug.log-file                           no (DEFAULT)
debug.log-history                        no (DEFAULT)
debug.random-failure                     off (DEFAULT)
debug.trace                              off
delay-gen.delay-duration                 100000 (DEFAULT)
delay-gen.delay-percentage               10% (DEFAULT)
delay-gen.enable                          (DEFAULT)
dht.force-readdirp                       on (DEFAULT)
diagnostics.brick-log-buf-size           5 (DEFAULT)
diagnostics.brick-log-flush-timeout      120 (DEFAULT)
diagnostics.brick-log-format             (null) (DEFAULT)
diagnostics.brick-logger                 (null) (DEFAULT)
diagnostics.brick-log-level              DEBUG
diagnostics.brick-sys-log-level          CRITICAL (DEFAULT)
diagnostics.client-log-buf-size          5 (DEFAULT)
diagnostics.client-log-flush-timeout     120 (DEFAULT)
diagnostics.client-log-format            (null) (DEFAULT)
diagnostics.client-logger                (null) (DEFAULT)
diagnostics.client-log-level             ERROR
diagnostics.client-sys-log-level         CRITICAL (DEFAULT)
diagnostics.count-fop-hits               off
diagnostics.dump-fd-stats                off (DEFAULT)
diagnostics.fop-sample-buf-size          65535 (DEFAULT)
diagnostics.fop-sample-interval          0 (DEFAULT)
diagnostics.latency-measurement          off
diagnostics.stats-dnscache-ttl-sec       86400 (DEFAULT)
diagnostics.stats-dump-format            json (DEFAULT)
diagnostics.stats-dump-interval          0 (DEFAULT)
disperse.background-heals                8 (DEFAULT)
disperse.cpu-extensions                  auto (DEFAULT)
disperse.eager-lock                      on (DEFAULT)
disperse.eager-lock-timeout              1 (DEFAULT)
disperse.heal-wait-qlength               128 (DEFAULT)
disperse.optimistic-change-log           on (DEFAULT)
disperse.other-eager-lock                on (DEFAULT)
disperse.other-eager-lock-timeout        1 (DEFAULT)
disperse.parallel-writes                 on (DEFAULT)
disperse.quorum-count                    0 (DEFAULT)
disperse.read-policy                     gfid-hash (DEFAULT)
disperse.self-heal-window-size           32 (DEFAULT)
disperse.shd-max-threads                 1 (DEFAULT)
disperse.shd-wait-qlength                1024 (DEFAULT)
disperse.stripe-cache                    4 (DEFAULT)
features.acl                             enable
features.alert-time                      86400 (DEFAULT)
features.auto-commit-period              180 (DEFAULT)
features.barrier                         disable
features.barrier-timeout                 120
features.bitrot                          disable
features.cache-invalidation              on
features.cache-invalidation-timeout      1800
features.cloudsync                       off
features.cloudsync-product-id            (null) (DEFAULT)
features.cloudsync-remote-read           off
features.cloudsync-store-id              (null) (DEFAULT)
features.cloudsync-storetype             (null) (DEFAULT)
features.ctime                           on
features.ctime                           on (DEFAULT)
features.default-retention-period        120 (DEFAULT)
features.default-soft-limit              80% (DEFAULT)
features.enforce-mandatory-lock          off
features.expiry-time                     120
features.failover-hosts                  (null) (DEFAULT)
features.hard-timeout                    5 (DEFAULT)
feature.simple-quota-pass-through        true
feature.simple-quota.use-backend         false
features.inode-quota                     off
features.lease-lock-recall-timeout       60 (DEFAULT)
features.leases                          off
features.locks-monkey-unlocking          false (DEFAULT)
features.locks-notify-contention-delay   5 (DEFAULT)
features.locks-notify-contention         yes (DEFAULT)
features.locks-revocation-clear-all      false (DEFAULT)
features.locks-revocation-max-blocked    0 (DEFAULT)
features.locks-revocation-secs           0 (DEFAULT)
features.quota-deem-statfs               off
features.quota                           off
features.read-only                       off (DEFAULT)
features.retention-mode                  relax (DEFAULT)
features.scrub                           false (DEFAULT)
features.scrub-freq                      biweekly
features.scrub-throttle                  lazy
features.sdfs                            off
features.selinux                         on
features.shard-block-size                64MB (DEFAULT)
features.shard-deletion-rate             100 (DEFAULT)
features.shard-lru-limit                 16384 (DEFAULT)
features.shard                           off
features.show-snapshot-directory         off
features.signer-threads                  4
features.snapshot-directory              .snaps
features.soft-timeout                    60 (DEFAULT)
features.tag-namespaces                  off
features.timeout                         45 (DEFAULT)
features.trash-dir                       .trashcan (DEFAULT)
features.trash-eliminate-path            (null) (DEFAULT)
features.trash-internal-op               off (DEFAULT)
features.trash-max-filesize              5MB (DEFAULT)
features.trash                           off (DEFAULT)
features.uss                             off
features.worm-file-level                 off
features.worm-files-deletable            on
features.worm                            off
ganesha.enable                           off
geo-replication.ignore-pid-check         off
geo-replication.ignore-pid-check         off
geo-replication.indexing                 off
geo-replication.indexing                 off
glusterd.vol_count_per_thread            100
locks.mandatory-locking                  off (DEFAULT)
locks.trace                              off (DEFAULT)
network.compression.compression-level    1 (DEFAULT)
network.compression.debug                false (DEFAULT)
network.compression.mem-level            8 (DEFAULT)
network.compression.min-size             1024 (DEFAULT)
network.compression                      off
network.compression.window-size          -15 (DEFAULT)
network.frame-timeout                    1800 (DEFAULT)
network.inode-lru-limit                  1000000
network.ping-timeout                     42 (DEFAULT)
network.remote-dio                       disable (DEFAULT)
network.tcp-window-size                  (null) (DEFAULT)
network.tcp-window-size                  (null) (DEFAULT)
nfs.acl                                  on (DEFAULT)
nfs.addr-namelookup                      off (DEFAULT)
nfs.auth-cache-ttl-sec                   (null) (DEFAULT)
nfs.auth-refresh-interval-sec            (null) (DEFAULT)
nfs.disable                              on
nfs.drc                                  off (DEFAULT)
nfs.drc-size                             0x20000 (DEFAULT)
nfs.dynamic-volumes                      off (DEFAULT)
nfs.enable-ino32                         no (DEFAULT)
nfs.event-threads                        2 (DEFAULT)
nfs.export-dir                            (DEFAULT)
nfs.export-dirs                          on (DEFAULT)
nfs.exports-auth-enable                  (null) (DEFAULT)
nfs.export-volumes                       on (DEFAULT)
nfs.mem-factor                           15 (DEFAULT)
nfs.mount-rmtab                          /var/lib/glusterd/nfs/rmtab (DEFAULT)
nfs.mount-udp                            off (DEFAULT)
nfs.nlm                                  on (DEFAULT)
nfs.outstanding-rpc-limit                16 (DEFAULT)
nfs.port                                 2049 (DEFAULT)
nfs.ports-insecure                       off (DEFAULT)
nfs.rdirplus                             on (DEFAULT)
nfs.readdir-size                         (1 * 1048576ULL) (DEFAULT)
nfs.read-size                            (1 * 1048576ULL) (DEFAULT)
nfs.register-with-portmap                on (DEFAULT)
nfs.rpc-auth-allow                       all (DEFAULT)
nfs.rpc-auth-null                        on (DEFAULT)
nfs.rpc-auth-reject                      none (DEFAULT)
nfs.rpc-auth-unix                        on (DEFAULT)
nfs.rpc-statd                            /sbin/rpc.statd (DEFAULT)
nfs.server-aux-gids                      off (DEFAULT)
nfs.trusted-sync                         off (DEFAULT)
nfs.trusted-write                        off (DEFAULT)
nfs.volume-access                        read-write (DEFAULT)
nfs.write-size                           (1 * 1048576ULL) (DEFAULT)
performance.aggregate-size               128KB (DEFAULT)
performance.cache-capability-xattrs      true (DEFAULT)
performance.cache-ima-xattrs             true (DEFAULT)
performance.cache-invalidation           on
performance.cache-max-file-size          0 (DEFAULT)
performance.cache-min-file-size          0 (DEFAULT)
performance.cache-priority                (DEFAULT)
performance.cache-refresh-timeout        1 (DEFAULT)
performance.cache-samba-metadata         false (DEFAULT)
performance.cache-size                   1GB
performance.cache-size                   1GB
performance.cache-swift-metadata         false (DEFAULT)
performance.client-io-threads            on
performance.ctime-invalidation           false (DEFAULT)
performance.enable-least-priority        on (DEFAULT)
performance.flush-behind                 on (DEFAULT)
performance.force-readdirp               true (DEFAULT)
performance.global-cache-invalidation    true (DEFAULT)
performance.high-prio-threads            16 (DEFAULT)
performance.io-cache                     on
performance.io-cache-pass-through        false (DEFAULT)
performance.io-cache-size                32MB (DEFAULT)
performance.iot-cleanup-disconnected-reqs off (DEFAULT)
performance.io-thread-count              16
performance.iot-pass-through             false (DEFAULT)
performance.iot-watchdog-secs            (null) (DEFAULT)
performance.lazy-open                    yes (DEFAULT)
performance.least-prio-threads           1 (DEFAULT)
performance.low-prio-threads             16 (DEFAULT)
performance.md-cache-pass-through        false (DEFAULT)
performance.md-cache-statfs              off (DEFAULT)
performance.md-cache-timeout             600
performance.nfs.flush-behind             on (DEFAULT)
performance.nfs.io-cache                 off
performance.nfs.io-threads               off
performance.nfs.quick-read               off
performance.nfs.read-ahead               off
performance.nfs.stat-prefetch            off
performance.nfs.strict-o-direct          off (DEFAULT)
performance.nfs.strict-write-ordering    off (DEFAULT)
performance.nfs.write-behind             on
performance.nfs.write-behind-trickling-writes on (DEFAULT)
performance.nfs.write-behind-window-size 1MB (DEFAULT)
performance.nl-cache-limit               10MB
performance.nl-cache                     on
performance.nl-cache-pass-through        false (DEFAULT)
performance.nl-cache-positive-entry      on
performance.nl-cache-timeout             1800
performance.normal-prio-threads          16 (DEFAULT)
performance.open-behind                  on
performance.open-behind-pass-through     false (DEFAULT)
performance.parallel-readdir             on
performance.qr-cache-timeout             1 (DEFAULT)
performance.quick-read-cache-invalidation false (DEFAULT)
performance.quick-read-cache-size        128MB (DEFAULT)
performance.quick-read-cache-timeout     1 (DEFAULT)
performance.quick-read                   on
performance.rda-cache-limit              512MB
performance.rda-high-wmark               128KB (DEFAULT)
performance.rda-low-wmark                4096 (DEFAULT)
performance.rda-request-size             131072
performance.read-after-open              yes (DEFAULT)
performance.read-ahead                   off
performance.read-ahead-page-count        4 (DEFAULT)
performance.read-ahead-pass-through      false (DEFAULT)
performance.readdir-ahead                on
performance.readdir-ahead-pass-through   false (DEFAULT)
performance.resync-failed-syncs-after-fsync off (DEFAULT)
performance.stat-prefetch                on
performance.strict-o-direct              off (DEFAULT)
performance.strict-write-ordering        off (DEFAULT)
performance.write-behind                 on
performance.write-behind-pass-through    false (DEFAULT)
performance.write-behind-trickling-writes on (DEFAULT)
performance.write-behind-window-size     1MB (DEFAULT)
performance.xattr-cache-list              (DEFAULT)
rebalance.ensure-durability              on (DEFAULT)
server.allow-insecure                    on (DEFAULT)
server.all-squash                        off (DEFAULT)
server.anongid                           65534 (DEFAULT)
server.anonuid                           65534 (DEFAULT)
server.dynamic-auth                      on (DEFAULT)
server.event-threads                     8
server.gid-timeout                       300 (DEFAULT)
server.keepalive-count                   9
server.keepalive-interval                2
server.keepalive-time                    20
server.manage-gids                       off (DEFAULT)
server.outstanding-rpc-limit             64 (DEFAULT)
server.own-thread                        (null) (DEFAULT)
server.root-squash                       off (DEFAULT)
server.ssl                               off
server.statedump-path                    /var/run/gluster (DEFAULT)
server.tcp-user-timeout                  42 (DEFAULT)
ssl.ca-list                              (null) (DEFAULT)
ssl.certificate-depth                    (null) (DEFAULT)
ssl.cipher-list                          (null) (DEFAULT)
ssl.crl-path                             (null) (DEFAULT)
ssl.dh-param                             (null) (DEFAULT)
ssl.ec-curve                             (null) (DEFAULT)
ssl.own-cert                             (null) (DEFAULT)
ssl.private-key                          (null) (DEFAULT)
storage.batch-fsync-delay-usec           0 (DEFAULT)
storage.batch-fsync-mode                 reverse-fsync (DEFAULT)
storage.build-pgfid                      off (DEFAULT)
storage.create-directory-mask            0777 (DEFAULT)
storage.create-mask                      0777 (DEFAULT)
storage.fips-mode-rchecksum              off (DEFAULT)
storage.force-create-mode                0000 (DEFAULT)
storage.force-directory-mode             0000 (DEFAULT)
storage.gfid2path                        on
storage.gfid2path-separator              : (DEFAULT)
storage.health-check-interval            600
storage.health-check-timeout             120
storage.linux-aio                        off (DEFAULT)
storage.linux-io_uring                   off (DEFAULT)
storage.max-hardlinks                    100 (DEFAULT)
storage.node-uuid-pathinfo               off (DEFAULT)
storage.owner-gid                        -1 (DEFAULT)
storage.owner-uid                        -1 (DEFAULT)
storage.reserve                          3
transport.address-family                 inet
transport.keepalive                      1
transport.listen-backlog                 1024
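
Since the dump above is long, a small filter that keeps only the options not marked `(DEFAULT)` may make it easier to review. This is a sketch under the assumption that each line is an option name followed by whitespace and a value, with an optional trailing `(DEFAULT)` marker, as in the listing above. The function names are hypothetical.

```python
def parse_settings(text):
    """Parse the option dump into (name, value, is_default) tuples.

    Duplicated option names are kept as separate rows, because the dump
    above lists some options twice.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and the "Option"/"------" header rows.
        if not parts or parts[0] == "Option" or parts[0].startswith("-"):
            continue
        name = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        is_default = rest.endswith("(DEFAULT)")
        value = rest[: -len("(DEFAULT)")].strip() if is_default else rest
        rows.append((name, value, is_default))
    return rows


def non_default(rows):
    """Return (name, value) pairs for options not marked as default."""
    return [(name, value) for name, value, is_default in rows if not is_default]
```

Options with an empty value, such as `delay-gen.enable`, are parsed with an empty string as the value.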
gluster volume status
Status of volume: tank
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/mnt/md1/data                  53879     0          Y       2278940
Brick node1:/mnt/md0/data                  50861     0          Y       2278978
Brick node2:/mnt/md1/data                  53946     0          Y       1363542
Brick node1:/mnt/md5/data                  51931     0          Y       2279047
Brick node2:/mnt/md0/data                  51408     0          Y       1363580
Brick node1:/mnt/md6/data                  57994     0          Y       2279087
Brick node2:/mnt/md3/data                  57386     0          Y       1363620
Brick node2:/mnt/md5/data                  56964     0          Y       1363658
Brick node2:/mnt/md2/data                  56353     0          Y       1363696
Brick node1:/mnt/md2/data                  58383     0          Y       2279132
Brick node1:/mnt/md3/data                  60034     0          Y       2279186
Brick node2:/mnt/md4/data                  52472     0          Y       1363734

Task Status of Volume tank
------------------------------------------------------------------------------
Task                 : Remove brick
ID                   : 879e440d-ecd9-4054-9259-cfad4f5b3d51
Removed bricks:
node1:/mnt/md5/data
Status               : failed
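
The peer address in the "bailing out frame" messages (192.168.42.2:51408) can be matched against the TCP ports in the status output to find the brick that did not answer. This is a minimal sketch, assuming the column layout shown above; `bricks_on_port` is a hypothetical helper. Ports are only unique per node, so the optional `host` filter is needed whenever two nodes happen to use the same port.

```python
import re

# One "Brick host:/path  tcp_port  rdma_port  online  pid" row as shown above.
# Rows for offline bricks (non-numeric ports) will not match this pattern.
BRICK_RE = re.compile(
    r"^Brick (?P<host>[^:\s]+):(?P<path>\S+)\s+(?P<tcp_port>\d+)\s+"
    r"(?P<rdma_port>\d+)\s+(?P<online>[YN])\s+(?P<pid>\d+)\s*$"
)


def bricks_on_port(status_text, port, host=None):
    """Return "host:path" for every brick listening on the given TCP port."""
    matches = []
    for line in status_text.splitlines():
        m = BRICK_RE.match(line.strip())
        if m is None:
            continue
        if int(m["tcp_port"]) != port:
            continue
        if host is not None and m["host"] != host:
            continue
        matches.append(f'{m["host"]}:{m["path"]}')
    return matches
```

With the status output above, port 51408 resolves to node2:/mnt/md0/data, which matches the brick path in the DEBUG log snippet.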
