
Bug Report/RFC: lagging tablet(s) can cause EmergencyReparentShard to fail #18529

@timvaillancourt


Overview of the Issue

Somewhat related to #18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessarily long time, or to time out entirely if the lag does not recover within the timeout period.

Today, the ERS code attempts to, for EVERY tablet:

  1. Stop Replication and get GTID positions
  2. Wait for relay logs to apply
  3. Pick a most advanced candidate

Let's imagine we have a four-tablet shard:

  1. PRIMARY
  2. REPLICA with negligible lag ✅
  3. REPLICA with negligible lag ✅
  4. REPLICA with 180 seconds of IO-thread lag
    • Example scenarios: a tablet that just finished a restore, a tablet that "can't keep up", etc.

Today, using the example shard above and barring unrelated failures, the ERS code will:

  1. Successfully execute the StopReplicationAndGetStatus RPC on all tablets
  2. Wait for all tablets in the "wait for relay logs" phase; tablet number 4 (with 180 seconds of lag) will take a long time to apply all of its relay logs
  3. Likely fail after exceeding --wait-replicas-timeout (default 15s), due to the single lagging replica being so far behind

Solution

The solution I'd like to propose is that we don't wait for outlier candidates, in terms of replication lag. Using the known GTID positions from StopReplicationAndGetStatus, we should be able to be more clever while still ensuring the most-advanced candidate(s) are waited for.

Proposed solution:

  1. Before we reach the "wait for relay logs" phase of the ERS, determine what replicas are the most advanced using the After relaylog positions from StopReplicationAndGetStatus
    • By the time this RPC returns, replication is stopped everywhere - no moving target
  2. Filter out a minority of least-advanced candidates
    • The least-advanced minority will still apply their relay logs asynchronously, but we won't wait for them
  3. Only wait for a majority of most-advanced candidates in the "wait for relay logs" phase
    • Today ERS will wait for "all tablets" no matter what
  4. Pick a new Primary (unchanged)
  5. Reparent all tablets (unchanged)
    • The least-advanced candidates should catch up asynchronously post-ERS
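Steps 1–3 above could be sketched roughly as follows in Go. The `candidate` type, the integer `position`, and `majorityMostAdvanced` are hypothetical simplifications: real Vitess compares GTID sets rather than integers, but the ordering-and-filtering idea is the same.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for a tablet and the "After" relay-log
// position reported by StopReplicationAndGetStatus. An int stands in for the
// GTID-set comparison that Vitess would actually perform.
type candidate struct {
	alias    string
	position int
}

// majorityMostAdvanced returns the majority of candidates with the highest
// positions. The least-advanced minority is filtered out; those tablets would
// apply their relay logs asynchronously without the ERS waiting on them.
func majorityMostAdvanced(cands []candidate) []candidate {
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].position > sorted[j].position // most advanced first
	})
	majority := len(sorted)/2 + 1
	return sorted[:majority]
}

func main() {
	// The example shard: two healthy replicas and one with 180s of lag.
	shard := []candidate{
		{"replica-2", 1000}, // negligible lag
		{"replica-3", 998},  // negligible lag
		{"replica-4", 400},  // 180 seconds of IO-thread lag
	}
	for _, c := range majorityMostAdvanced(shard) {
		fmt.Println("wait for:", c.alias)
	}
	// replica-4 is excluded, so the "wait for relay logs" phase
	// no longer blocks on the outlier.
}
```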

Your thoughts are appreciated, especially on blind spots in this approach!

Reproduction Steps

  1. Create a shard with many possible candidates for reparent
  2. Introduce long-lived IO-thread lag, much larger than --wait-replicas-timeout (default 15s), on one replica
  3. Run an EmergencyReparentShard on the test shard
  4. Notice the ERS times out, or is significantly more likely to time out

Binary Version

v19+

Operating System and Environment details

Linux

Log Fragments
