Overview of the Issue
Somewhat related to #18528, this issue discusses the problem of a single lagging tablet causing ERS either to take an unnecessarily long time, or to time out entirely if the lag does not recover within the allotted period.
The ERS code today attempts, for EVERY tablet, to:
- Stop Replication and get GTID positions
- Wait for relay logs to apply
- Pick a most advanced candidate
Let's imagine we have a 4 x tablet shard:
- PRIMARY ✅
- REPLICA with negligible lag ✅
- REPLICA with negligible lag ✅
- REPLICA with 180 seconds of IO-thread lag
  - Example scenario: a tablet that just finished a restore, a tablet that "can't keep up", etc.
Today, using the example shard above and barring unrelated failures, the ERS code will:
- Succeed in executing the `StopReplicationAndGetStatus` RPC on all tablets
- Wait for all tablets in the "wait for relay logs" phase; tablet number 4 (with 180 seconds of lag) will take a long time to execute all of its relay logs
- Potentially/likely fail after exceeding `--wait-replicas-timeout` (default 15s), due to the single lagging replica being so far behind
Solution
The solution I'd like to propose is that we don't wait for outlier candidates in terms of replication lag. Using the known GTID positions from `StopReplicationAndGetStatus`, we should be able to be more clever while still ensuring the most-advanced candidate(s) are waited for.
Proposed solution:
- Before we reach the "wait for relay logs" phase of the ERS, determine which replicas are the most advanced using the `After` relay-log positions from `StopReplicationAndGetStatus`
  - By the time this RPC returns, replication is stopped everywhere, so there is no moving target
- Filter out a minority of least-advanced candidates
  - The least-advanced minority will still apply their relay logs asynchronously, but we won't "wait" for them
- Only wait for a majority of most-advanced candidates in the "wait for relay logs" phase
  - Today ERS will wait for "all tablets" no matter what
- Pick a new primary (unchanged)
- Reparent all tablets (unchanged)
- The least-advanced candidates should catch up asynchronously post-ERS
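The filtering step above could look roughly like the sketch below. This is a hypothetical illustration, not the Vitess code: real ERS would compare GTID sets, whereas here a single monotonically increasing sequence number stands in for the `After` relay-log position, and the names are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate pairs a tablet with the After relay-log position it reported
// in StopReplicationAndGetStatus (simplified here to one int64).
type candidate struct {
	alias    string
	afterPos int64
}

// splitWaitSet sorts candidates by their After relay-log position and
// returns the most-advanced majority (which ERS waits for) and the
// least-advanced minority (left to apply relay logs asynchronously).
func splitWaitSet(cands []candidate) (wait, skip []candidate) {
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].afterPos > sorted[j].afterPos
	})
	majority := len(sorted)/2 + 1
	return sorted[:majority], sorted[majority:]
}

func main() {
	wait, skip := splitWaitSet([]candidate{
		{"zone1-101", 500},
		{"zone1-102", 498},
		{"zone1-103", 120}, // far behind: don't block the ERS on it
	})
	fmt.Println("wait for:", wait)
	fmt.Println("apply async:", skip)
}
```

Because the wait set still contains a majority of the most-advanced candidates, the eventual winner of the "pick the most advanced" step is guaranteed to be in the waited-for group.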
Your thoughts are appreciated, especially blind-spots in this approach!
Reproduction Steps
- Create a shard with many possible candidates for reparent
- Introduce long-lived IO-thread lag on a replica that is much larger than `--wait-replicas-timeout` (default 15s)
- Run an `EmergencyReparentShard` on the test shard
- Notice that the ERS times out, or is significantly more likely to time out
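The final step could be driven with `vtctldclient` roughly as below. The keyspace/shard name is a placeholder, and exact flag spellings may vary by version; this is a sketch of the invocation, not a verified command line.

```shell
# Run an emergency reparent with the default 15s replica wait timeout.
# With a replica carrying ~180s of IO-thread lag, this is expected to
# time out in the "wait for relay logs" phase.
vtctldclient EmergencyReparentShard commerce/0 \
  --wait-replicas-timeout=15s
```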
Binary Version
v19+

Operating System and Environment details
Linux

Log Fragments