-
Notifications
You must be signed in to change notification settings - Fork 15
[TNTP-2109] Switch to safe mode on vshard rebalance #467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| pgroup_duplicates.test_duplicates = function(g) | ||
| t.skip_if( | ||
| not ( | ||
| utils.tarantool_version_at_least(3, 1) and (g.params.backend == "config") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: implement test for other backend types. 2.11 and cartridge
vakhov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I’ve gone through the changes and left a few small questions.
4a8e4d2 to
bb11b86
Compare
| local utils = require('crud.common.utils') | ||
| local stats = require('crud.stats') | ||
| local readview = require('crud.readview') | ||
| local schema = require('crud.schema') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys, the commits are absolutely chaotic, this is not the way, we develop open-source modules. We should rebase that branch in master to make the commit history linear, and not just merge it. So, all commits like Minor changes according to review comments will be visible to our users. You can check out, how Georgy Moiseev did it: proper commits, proper commit messages, tests in every commit (e.g. 8d7cae0).
Now, I'm forced to review more than 1k lines in one step, which is very inconvenient and increases the chanse of missed bugs. And our users won't be able to check the individual commits, if they want to. Of course, it's up to you, since I'm not the maintainer of that module, but it's just not nice to develop and merge code like that.
IMHO, the features should be properly split between commits, every commit must include the associated tests, proper commit messages and mentioning of the #448 ticket. Of course, refactoring or code moving should be in the separate commits. Between commits all of the test must pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys, the commits are absolutely chaotic
Of course these commits will not go to master, I will re-split them before merging the PR.
Now, I'm forced to review more than 1k lines in one step
My bad. Never thought of this PR as of multiple features that can be reviewed separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-split PR into 3 commits.
crud.lua
Outdated
|
|
||
| -- @refer rebalance.router_cache_length | ||
| -- @function router_cache_length | ||
| crud.rebalance.router_cache_length = rebalance.router.cache_length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we export router_cache_length and router_cache_last_clear_ts to public API? I understand that for router_cache_clear, which is supposed to be called by the TCM/cartridge/ATE, but I don't understand, why getters are needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they will be useful in TCM/Cartridge to indicate that cache has actually been cleared and when did it happen.
crud.lua
Outdated
| function crud.init_router() | ||
| rawset(_G, 'crud', crud) | ||
| rawset(_G, 'crud', crud) | ||
| rebalance.metrics.enable_router_metrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a crutch. Will it be disabled, when user disables metrics? It seems the whole metrics part should be done in the metrics_registry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics registry is a part of stats module, it provides statistics for crud requests, has a performance impact and can be enabled/disabled by user. I think these metrics should be independent from stats. I've simplified them and got rid of these enable_* functions.
crud/common/rebalance.lua
Outdated
| safe_mode = false, | ||
| safe_mode_enable_hooks = {}, | ||
| safe_mode_disable_hooks = {}, | ||
| _router_cache_last_clear_ts = fiber.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we initialize the timestamp with current time on module load? It may lead to bugs, we didn't really cleared the router`s cache on module load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, thanks! Changed initial value to zero.
crud/common/rebalance.lua
Outdated
| end) | ||
| end | ||
|
|
||
| M.init = rebalance_init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: let's make the code consistent with all other modules, it seems we tend to make the variable with proper name and assign methods to it .
crud/crud/stats/local_registry.lua
Line 39 in c52a6f5
| function registry.init(opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed M to a proper name.
| if not ref_ok then | ||
|
|
||
| table.insert(bucket_ref_errs, bucket_refrw_err.bucket_ref_errs[1]) | ||
| goto continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we continue on error and then just unref all of them? We should go to unref stright away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing, it has to be brake, not continue. Fixed.
| set_fast_mode() | ||
| end | ||
|
|
||
| if not hooks_registered then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You won't need it, if internal.trigger will be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| end | ||
| end | ||
|
|
||
| set_fast_mode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, and what if rebalance module is currently in the safe mode, then the trigger won't work and the bucket_ref_unref will remain in the fast. Or am I missing smth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Simplified this part, removed unnecessary init function, setting inital functions based on rebalance.safe_mode value.
crud/common/call.lua
Outdated
| end | ||
| end | ||
|
|
||
| if redirect_replicaset ~= nil then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: why can't we just change the replicaset variable, seems redirect_replicaset is not needed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, removed this extra var.
crud/insert.lua
Outdated
| noreturn = opts.noreturn, | ||
| fetch_latest_metadata = opts.fetch_latest_metadata, | ||
| }) | ||
| local function make_insert() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just make function wrappers for single bucket and multiple ones? Why do we always repeat the same code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of these local functions completely in all crud methods.
d25a4ff to
d25ad97
Compare
`crud.common.rebalance` module introduces on_safe_mode_toggle trigger that can be used to change the implementation of crud methods to work in safe mode during rebalance and in fast mode at other times. This mechanism automatically enables safe mode when vshard rebalance starts, but safe mode can only be disabled manually. Co-authored-by: Satbek Turganbayev <[email protected]>
For the most of crud methods, when safe mode is enabled, bucket_ref() for appropriate bucket is called before calling the method itself. This prevents from writing tuples to the wrong replicaset during or after vshard rebalance. Closes #448. Co-authored-by: Satbek Turganbayev <[email protected]>
d25ad97 to
30c22b1
Compare
I didn't forget about