
Bug Report/RFC: lagging tablet(s) can cause EmergencyReparentShard to fail #18529

@timvaillancourt


Overview of the Issue

Somewhat related to #18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessarily long time, or to time out entirely if the lag does not recover within the timeout period.

Today, the ERS code attempts to, for EVERY tablet:

  1. Stop Replication and get GTID positions
  2. Wait for relay logs to apply
  3. Pick a most advanced candidate

Let's imagine we have a four-tablet shard:

  1. PRIMARY
  2. REPLICA with negligible lag ✅
  3. REPLICA with negligible lag ✅
  4. REPLICA with 180 seconds of IO-thread lag
    • Example scenarios: a tablet that just finished a restore, a tablet that "can't keep up", etc.

Today, using the example shard above and barring unrelated failures, the ERS code will:

  1. Successfully execute the StopReplicationAndGetStatus RPC on all tablets
  2. Wait for all tablets in the "wait for relay logs" phase; tablet number 4 (with 180 seconds of lag) will take a long time to apply all of its relay logs
  3. Likely fail after exceeding --wait-replicas-timeout (default 15s), due to the single lagging replica being so far behind

Solution

The solution I'd like to propose is that we don't wait for outlier candidates, in terms of replication lag. Using the known GTID positions from StopReplicationAndGetStatus, we should be able to be more clever while still ensuring the most-advanced candidate(s) are waited for.

Proposed solution:

  1. Before we reach the "wait for relay logs" phase of the ERS, determine what replicas are the most advanced using the After relaylog positions from StopReplicationAndGetStatus
    • By the time this RPC returns, replication is stopped everywhere - no moving target
  2. Filter out a minority of least-advanced candidates
    • The least-advanced minority will still apply their relay logs asynchronously, but we won't wait for them
  3. Only wait for a majority of most-advanced candidates in the "wait for relay logs" phase
    • Today ERS will wait for "all tablets" no matter what
  4. Pick a new Primary (unchanged)
  5. Reparent all tablets (unchanged)
    • The least-advanced candidates should catch up asynchronously post-ERS
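Steps 1–3 above could be sketched roughly as follows in Go. The `candidate` type, the integer `position`, and `majorityMostAdvanced` are hypothetical simplifications: real Vitess compares GTID sets rather than integers, but the ordering-and-filtering idea is the same.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for a tablet and the "After" relay-log
// position reported by StopReplicationAndGetStatus. An int stands in for the
// GTID-set comparison that Vitess would actually perform.
type candidate struct {
	alias    string
	position int
}

// majorityMostAdvanced returns the majority of candidates with the highest
// positions. The least-advanced minority is filtered out; those tablets would
// apply their relay logs asynchronously without the ERS waiting on them.
func majorityMostAdvanced(cands []candidate) []candidate {
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].position > sorted[j].position // most advanced first
	})
	majority := len(sorted)/2 + 1
	return sorted[:majority]
}

func main() {
	// The example shard: two healthy replicas and one with 180s of lag.
	shard := []candidate{
		{"replica-2", 1000}, // negligible lag
		{"replica-3", 998},  // negligible lag
		{"replica-4", 400},  // 180 seconds of IO-thread lag
	}
	for _, c := range majorityMostAdvanced(shard) {
		fmt.Println("wait for:", c.alias)
	}
	// replica-4 is excluded, so the "wait for relay logs" phase
	// no longer blocks on the outlier.
}
```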

Your thoughts are appreciated, especially on blind spots in this approach!

Reproduction Steps

  1. Create a shard with many possible candidates for reparent
  2. Introduce long-lived IO-thread lag, much larger than --wait-replicas-timeout (default 15s), on one replica
  3. Run an EmergencyReparentShard on the test shard
  4. Notice the ERS times out, or is significantly more likely to time out

Binary Version

v19+

Operating System and Environment details

Linux

Log Fragments
