EmergencyReparentShard: support ignoring minority/count of lagging tablets
#18707
base: main
Conversation
…tablets Signed-off-by: Tim Vaillancourt <[email protected]>
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified
- Backward compatibility
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
## main #18707 +/- ##
=======================================
Coverage 67.51% 67.51%
=======================================
Files 1606 1606
Lines 263588 263681 +93
=======================================
+ Hits 177953 178021 +68
- Misses 85635 85660 +25
Copying some conclusions from an offline discussion with @arthurschreiber:
The reason, AFAIUI, for the current behavior is to prevent any of the healthy tablets from becoming forever unhealthy/unusable due to the new primary not having binary logs covering/containing the GTIDs that the lagging replica(s) may still need (the new primary may have recently been restored from a backup and have minimal binary logs). Have you already thought about that in this context?
@mattlord I don't think that has changed, but I would appreciate you double-checking my assumption because that is very important functionality.

One part of the code that could affect that was actually changed in #18531. Previous to this PR, the code called … The TL;DR on that wrapper func: we do the same sort, but now prioritise positions with the most advanced SQL thread if two combined sets are equal. In the end the same …

And in terms of being certain we're ignoring the right tablets: after …
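To make that tie-break concrete, here is a minimal, self-contained Go sketch. It is not the Vitess implementation: `gtidSet`, `candidate`, and `sortCandidates` are invented names, and a simple map stands in for a real GTID set; only the idea of preferring the most advanced SQL thread when the combined (relay-log) sets are equal comes from the comment above.

```go
package main

import (
	"fmt"
	"sort"
)

// gtidSet is a toy stand-in for a real GTID set: server UUID -> highest sequence seen.
type gtidSet map[string]int64

// contains reports whether a covers every transaction in b.
func (a gtidSet) contains(b gtidSet) bool {
	for uuid, seq := range b {
		if a[uuid] < seq {
			return false
		}
	}
	return true
}

// candidate pairs what a tablet has received (combined/IO-thread position)
// with what it has already applied (executed/SQL-thread position).
type candidate struct {
	alias    string
	combined gtidSet
	executed gtidSet
}

// sortCandidates orders candidates most-advanced first by combined set,
// breaking ties with the more advanced SQL-thread (executed) position.
func sortCandidates(cands []candidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		aAhead := a.combined.contains(b.combined) && !b.combined.contains(a.combined)
		bAhead := b.combined.contains(a.combined) && !a.combined.contains(b.combined)
		if aAhead || bAhead {
			return aAhead
		}
		// Combined sets are equal (or incomparable): prefer the more advanced SQL thread.
		return a.executed.contains(b.executed) && !b.executed.contains(a.executed)
	})
}

func main() {
	cands := []candidate{
		{"zone1-101", gtidSet{"s1": 100}, gtidSet{"s1": 80}},
		{"zone1-102", gtidSet{"s1": 100}, gtidSet{"s1": 95}},
		{"zone1-103", gtidSet{"s1": 90}, gtidSet{"s1": 90}},
	}
	sortCandidates(cands)
	for _, c := range cands {
		fmt.Println(c.alias)
	}
	// Prints zone1-102 first: same combined set as zone1-101, but a more advanced SQL thread.
}
```

The practical effect is that when two candidates have received the same transactions, the one that has already applied more of them is preferred, presumably so the eventual winner spends less time catching up on relay logs.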
Description
Building on top of #18531, this PR addresses #18529 by adding support for EmergencyReparentShard to ignore a minority of lagging tablets ahead of the wait-for-relay-logs phase.

Why? Severely lagging tablets can cause the ERS to fail due to a timeout while waiting for relay logs to apply, which before this PR must happen on ALL tablets 😬

Is it safe to ignore some candidates? Yes. Before we start the wait-for-relay-log phase we issue a StopReplicationAndGetStatus RPC, which stops the IO thread and returns the now-static GTID sets of each candidate. This gives us an opportunity to "ignore" some candidates we know will never be most-advanced, ahead of actually waiting for them to apply logs.
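To illustrate why that is safe, here is a hedged, self-contained Go sketch of the pruning idea. All names (`received`, `tabletStatus`, `pruneLagging`, `maxIgnore`) are invented for illustration and this is not the PR's actual code; the only assumption carried over from the description is that once replication is stopped, each candidate's received GTID set is frozen, so a candidate that is strictly behind another can never end up most-advanced.

```go
package main

import "fmt"

// received maps a toy server UUID to the highest transaction sequence the
// candidate has pulled into its relay logs (its now-static combined set).
type received map[string]int64

// atLeast reports whether a covers everything in b.
func atLeast(a, b received) bool {
	for uuid, seq := range b {
		if a[uuid] < seq {
			return false
		}
	}
	return true
}

type tabletStatus struct {
	alias string
	recv  received
}

// pruneLagging splits candidates into those we still wait on and those we may
// ignore: a candidate is ignorable only if it is strictly behind the most
// advanced received set, and we never ignore more than maxIgnore of them.
func pruneLagging(cands []tabletStatus, maxIgnore int) (wait, ignored []tabletStatus) {
	// Find the most-advanced received set (the toy sets here are comparable).
	best := cands[0].recv
	for _, c := range cands[1:] {
		if atLeast(c.recv, best) {
			best = c.recv
		}
	}
	for _, c := range cands {
		strictlyBehind := atLeast(best, c.recv) && !atLeast(c.recv, best)
		if strictlyBehind && len(ignored) < maxIgnore {
			ignored = append(ignored, c)
			continue
		}
		wait = append(wait, c)
	}
	return wait, ignored
}

func main() {
	cands := []tabletStatus{
		{"zone1-101", received{"s1": 500}},
		{"zone1-102", received{"s1": 500}},
		{"zone1-103", received{"s1": 120}}, // severely lagging
	}
	wait, ignored := pruneLagging(cands, len(cands)/2) // ignore at most a minority
	fmt.Println("wait on:", len(wait), "ignored:", len(ignored))
	// Output: wait on: 2 ignored: 1
}
```

Capping the ignored set at a minority (here `len(cands)/2`) mirrors the PR's intent that only a minority of lagging tablets is ever skipped.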
This PR adds "modes" to ERS to support the existing and future-desired behaviours, and to allow the existing "all tablets" behaviour to remain the default. The modes:

- ALL (default) - wait for all tablets to apply relay logs, like <= v22
- MAJORITY - wait for only a majority of most-advanced tablets
- COUNT - wait for an exact number of tablets. Count specified by an additional flag/RPC field

I'm on the fence if that is the best idea, because the existing ALL behaviour is risky in itself. I think the MAJORITY behaviour is probably safer/an improvement, but I suppose I'm hesitant to make it the default just yet.

If we decide to make ALL the default in v23, we should have a plan for MAJORITY becoming the default. I've added a DEFAULT (0) mode to the protobuf to potentially support this. Thoughts appreciated!
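As a rough illustration of how these modes could map onto the number of candidates ERS waits for, here is a small Go sketch. The `waitMode` type, constant names, and `requiredWaitCount` function are invented for this example; only the ALL / MAJORITY / COUNT semantics come from the description above.

```go
package main

import "fmt"

type waitMode int

const (
	modeAll      waitMode = iota // default: wait for every candidate (<= v22 behaviour)
	modeMajority                 // wait for a majority of the most-advanced candidates
	modeCount                    // wait for an exact number, supplied via a flag/RPC field
)

// requiredWaitCount returns how many of the total candidates must finish
// applying relay logs, falling back to the safest answer (all of them) for
// ALL or any unknown mode.
func requiredWaitCount(mode waitMode, total, explicitCount int) int {
	switch mode {
	case modeMajority:
		return total/2 + 1
	case modeCount:
		if explicitCount < 1 || explicitCount > total {
			return total
		}
		return explicitCount
	default:
		return total
	}
}

func main() {
	fmt.Println(requiredWaitCount(modeAll, 5, 0))      // 5
	fmt.Println(requiredWaitCount(modeMajority, 5, 0)) // 3
	fmt.Println(requiredWaitCount(modeCount, 5, 2))    // 2
}
```

A separate DEFAULT (0) protobuf value, as mentioned above, could then let the server choose whichever of these a given release treats as its default without clients hard-coding it.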
Related Issue(s)

Resolves: #18529
Checklist
Deployment Notes
AI Disclosure
AI creeps me out